Quick Start
A simple example of how to use cogdata.
Installation
pip install cogdata --index-url https://test.pypi.org/simple
sudo install_unrarlib.sh
Initialization
First, create a data folder and move the data files into it:
.
└── test_ds
├── infolist.json
└── n10148035.tar
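The layout above can also be scripted; a minimal Python sketch (it assumes the two sample files from this guide sit in the current directory, and skips any that are missing):

```python
import os
import shutil

# Create the raw-data folder; move the sample files in if they are present.
os.makedirs("test_ds", exist_ok=True)
for name in ("infolist.json", "n10148035.tar"):
    if os.path.exists(name):
        shutil.move(name, os.path.join("test_ds", name))
```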
Create Dataset
Use the cogdata create_dataset command to create a dataset:
cogdata create_dataset [-h], [--help] # show the help and exit
[--description DESCRIPTION] # description of the handling dataset.
[--text_files TEXT_FILES [TEXT_FILES ...]] # file names of the text.
[--text_format TEXT_FORMAT] # format of the text files.
--data_files DATA_FILES [DATA_FILES ...] # file names of the handling dataset.
--data_format DATA_FORMAT # format of the data files.
name # name of the handling dataset.
This command simply creates a cogdata_info.json file in the “name” folder. Here we make the dataset name the same as the data folder name.
Example:
cogdata create_dataset --text_files infolist.json --text_format dict --data_files n10148035.tar --data_format TarDataset test_ds
Directory structure:
.
└── test_ds
├── cogdata_info.json
├── infolist.json
└── n10148035.tar
Create Task
Use the cogdata create_task command to create a task:
cogdata create_task [-h], [--help] # show the help and exit
[--description DESCRIPTION] # description of the new task.
[--length_per_sample LENGTH_PER_SAMPLE] # data length of one sample (Bytes).
[--img_sizes IMG_SIZES [IMG_SIZES ...]] # sizes of a pre-tokenized image.
[--txt_len TXT_LEN] # length of text in one sample.
[--dtype {int32,int64,float32,uint8,bool}] # data type of samples.
[--model_path MODEL_PATH] # path of the image tokenizer.
--task_type TASK_TYPE # type of the handling task.
--saver_type SAVER_TYPE # saver mode.
task_id # id of the new task.
Example:
# Don't forget to modify "model_path"
cogdata create_task --description test --task_type ImageTextTokenizationTask --saver_type BinarySaver --length_per_sample 1088 --img_sizes 256 --txt_len 64 --dtype int32 --model_path='/dataset/fd5061f6/cogview/vqvae_hard_biggerset_011.pt' test_task
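The value length_per_sample=1088 in this example lines up with the other flags if the image tokenizer maps a 256x256 image to a 32x32 token grid. Both the 8x downsampling factor (as in CogView's VQ-VAE) and the unit (tokens rather than bytes) are assumptions here, not stated by the tool:

```python
# Assumption: the image tokenizer turns a 256x256 image into a 32x32 token
# grid (8x downsampling, as in CogView's VQ-VAE).
img_size = 256
downsample = 8
img_tokens = (img_size // downsample) ** 2  # 32 * 32 = 1024 image tokens
txt_len = 64                                # --txt_len from the example
length_per_sample = img_tokens + txt_len
print(length_per_sample)                    # 1088
```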
Directory structure:
.
├── cogdata_task_test_task
│ └── cogdata_config.json
└── test_ds
├── cogdata_info.json
├── infolist.json
└── n10148035.tar
Check Datasets and Tasks
Now we can use the cogdata list command to check them:
cogdata list [-h], [--help] # show the help and exit.
[-t TASK_ID], [--task_id TASK_ID] # id of the handling task.
Example: list datasets:
cogdata list
Expected Output:
--------------------------- All Raw Datasets --------------------------
test_ds(207.7MB)
------------------------------- Summary -------------------------------
Total 1 datasets
Total size: 207.7MB
Example: list a task:
cogdata list -t test_task
Expected Output:
--------------------------- All Raw Datasets --------------------------
test_ds(207.7MB)
------------------------------- Summary -------------------------------
Total 1 datasets
Total size: 207.7MB
------------------------------ Task Info ------------------------------
Task Id: test_task
Task Type: ImageTextTokenizationTask
Description: test
Processed: FORMAT: dataset_name(raw_size => processed_size)
Hanging: FORMAT: dataset_name(raw_size)[create_time]
Additional: FORMAT: dataset_name(processed_size)
Unprocessed: FORMAT: dataset_name(raw_size)
test_ds(207.7MB)
“test_ds” is in the Unprocessed group.
Process
Use the cogdata process command to process datasets:
cogdata process [-h], [--help] # show the help and exit
[--nproc NPROC] # number of processes to launch.
[--dataloader_num_workers DATALOADER_NUM_WORKERS] # number of processes for dataloader per computational process.
[--ratio RATIO] # ratio of data to process.
-t TASK_ID, --task_id TASK_ID # id of the handling task.
[datasets [datasets ...]] # dataset names, None means all possible datasets.
Example:
cogdata process --task_id test_task --nproc 2 --dataloader_num_workers 1 --ratio 1 test_ds
Expected Output:
All datasets: test_ds
Processing test_ds
dataset: test_ds, rank 0:[#########################] 100% Speed: 92.66 samples/s
dataset: test_ds, rank 1:[#########################] 100% Speed: 92.66 samples/s
Waiting torch.launch to terminate...
Now “test_task” is processed. This can be examined with cogdata list -t test_task:
------------------------------ Task Info ------------------------------
Task Id: test_task
Task Type: ImageTextTokenizationTask
Description: test
Processed: FORMAT: dataset_name(raw_size => processed_size)
test_ds(207.7MB => 5.4MB)
Hanging: FORMAT: dataset_name(raw_size)[create_time]
Additional: FORMAT: dataset_name(processed_size)
Unprocessed: FORMAT: dataset_name(raw_size)
Directory structure:
.
├── cogdata_task_test_task
│ ├── cogdata_config.json
│ ├── main_pid_35218.log
│ └── test_ds
│ ├── logs
│ │ ├── rank_0.log
│ │ ├── rank_0.progress
│ │ ├── rank_1.log
│ │ └── rank_1.progress
│ ├── meta_info.json
│ ├── test_ds.bin.part_0.cogdata
│ └── test_ds.bin.part_1.cogdata
└── test_ds
├── cogdata_info.json
├── infolist.json
└── n10148035.tar
Merge
There are now two processed files, test_ds.bin.part_0.cogdata and test_ds.bin.part_1.cogdata, because nproc=2 was used during processing. Merge them with cogdata merge:
:
cogdata merge [-h], [--help] # show the help message and exit
-t TASK_ID, --task_id TASK_ID # id of the handling task
Example:
cogdata merge -t test_task
Directory structure:
.
├── cogdata_task_test_task
│ ├── cogdata_config.json
│ ├── main_pid_35218.log
│ ├── merge.bin
│ └── test_ds
│ ├── logs
│ │ ├── rank_0.log
│ │ ├── rank_0.progress
│ │ ├── rank_1.log
│ │ └── rank_1.progress
│ ├── meta_info.json
│ ├── test_ds.bin.part_0.cogdata
│ └── test_ds.bin.part_1.cogdata
└── test_ds
├── cogdata_info.json
├── infolist.json
└── n10148035.tar
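Downstream code can read merge.bin directly. Below is a minimal NumPy sketch, under two assumptions not stated in this guide: that the file is a flat, headerless array of the task's dtype (int32), and that each sample occupies length_per_sample consecutive values:

```python
import numpy as np

def load_samples(path, length_per_sample=1088, dtype=np.int32):
    """Load a merged cogdata binary as a (num_samples, length_per_sample) array.

    Assumes a flat, headerless layout; the real on-disk format may differ.
    """
    data = np.fromfile(path, dtype=dtype)
    return data.reshape(-1, length_per_sample)
```

If the file size is not a multiple of length_per_sample times the dtype width, the reshape raises an error, which doubles as a quick sanity check on these assumptions.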
Split
Use cogdata split to randomly split the merged result into several equal-sized subsets:
cogdata split [-h], [--help] # show the help message and exit.
-t TASK_ID, --task_id TASK_ID # id of the handling task.
n # number of pieces to split the merged result into.
Example:
cogdata split -t test_task 3
Directory structure:
.
├── cogdata_task_test_task
│ ├── cogdata_config.json
│ ├── main_pid_40494.log
│ ├── merge.bin
│ ├── split_merged_files
│ │ ├── merge.bin.part0
│ │ ├── merge.bin.part1
│ │ └── merge.bin.part2
│ └── test_ds
│ ├── logs
│ │ ├── rank_0.log
│ │ ├── rank_0.progress
│ │ ├── rank_1.log
│ │ └── rank_1.progress
│ ├── meta_info.json
│ ├── test_ds.bin.part_0.cogdata
│ └── test_ds.bin.part_1.cogdata
└── test_ds
├── cogdata_info.json
├── infolist.json
└── n10148035.tar
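Since the split should yield roughly equal pieces, a quick size check is easy to script (a sketch; only the paths from the listing above are assumed):

```python
import glob
import os

def split_sizes(split_dir):
    """Map each merge.bin.part* piece in split_dir to its size in bytes."""
    parts = sorted(glob.glob(os.path.join(split_dir, "merge.bin.part*")))
    return {os.path.basename(p): os.path.getsize(p) for p in parts}
```

For this example, `split_sizes("cogdata_task_test_task/split_merged_files")` should report three pieces of similar size.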
Clean
If a task crashes or stays “Hanging” for too long, cogdata clean can help remove damaged files from the task folder:
cogdata clean [-h], [--help] # show the help message and exit
-t TASK_ID, --task_id TASK_ID # id of the handling task
Example:
cogdata clean -t test_task