Quick Start
===========

A simple example of how to use ``cogdata``.

Installation
------------

::

    pip install cogdata --index-url https://test.pypi.org/simple
    sudo install_unrarlib.sh

Initialization
--------------

First, create a data folder and move the data files into it::

    .
    └── test_ds
        ├── infolist.json
        └── n10148035.tar

Create Dataset
--------------

Use the ``cogdata create_dataset`` command to create a dataset::

    cogdata create_dataset
        [-h], [--help]                              # show the help and exit.
        [--description DESCRIPTION]                 # description of the handling dataset.
        [--text_files TEXT_FILES [TEXT_FILES ...]]  # file names of the text.
        [--text_format TEXT_FORMAT]                 # format of the text files.
        --data_files DATA_FILES [DATA_FILES ...]    # file names of the handling dataset.
        --data_format DATA_FORMAT                   # format of the data files.
        name                                        # name of the handling dataset.

This command just creates a ``cogdata_info.json`` in the "name" folder. Here we let the dataset's name match the data folder's name.

Example::

    cogdata create_dataset --text_files infolist.json --text_format dict --data_files n10148035.tar --data_format TarDataset test_ds

Directory structure::

    .
    └── test_ds
        ├── cogdata_info.json
        ├── infolist.json
        └── n10148035.tar

Create Task
-----------

Use the ``cogdata create_task`` command to create a task::

    cogdata create_task
        [-h], [--help]                              # show the help and exit.
        [--description DESCRIPTION]                 # description of the new task.
        [--length_per_sample LENGTH_PER_SAMPLE]     # data length of one sample (bytes).
        [--img_sizes IMG_SIZES [IMG_SIZES ...]]     # sizes of a pre-tokenized image.
        [--txt_len TXT_LEN]                         # length of text in one sample.
        [--dtype {int32,int64,float32,uint8,bool}]  # data type of samples.
        [--model_path MODEL_PATH]                   # path of the image tokenizer.
        --task_type TASK_TYPE                       # type of the handling task.
        --saver_type SAVER_TYPE                     # saver mode.
        task_id                                     # id of the new task.

Example::

    # Don't forget to modify "model_path"
    cogdata create_task --description test --task_type ImageTextTokenizationTask --saver_type BinarySaver --length_per_sample 1088 --img_sizes 256 --txt_len 64 --dtype int32 --model_path='/dataset/fd5061f6/cogview/vqvae_hard_biggerset_011.pt' test_task

Directory structure::

    .
    ├── cogdata_task_test_task
    │   └── cogdata_config.json
    └── test_ds
        ├── cogdata_info.json
        ├── infolist.json
        └── n10148035.tar

Check Datasets and Tasks
------------------------

Now we can use the ``cogdata list`` command to check them::

    cogdata list
        [-h], [--help]                      # show the help and exit.
        [-t TASK_ID], [--task_id TASK_ID]   # id of the handling task.

Example: list datasets::

    cogdata list

Expected output::

    --------------------------- All Raw Datasets --------------------------
    test_ds(207.7MB)
    ------------------------------- Summary -------------------------------
    Total 1 datasets
    Total size: 207.7MB

Example: list a task::

    cogdata list -t test_task

Expected output::

    --------------------------- All Raw Datasets --------------------------
    test_ds(207.7MB)
    ------------------------------- Summary -------------------------------
    Total 1 datasets
    Total size: 207.7MB
    ------------------------------ Task Info ------------------------------
    Task Id: test_task
    Task Type: ImageTextTokenizationTask
    Description: test
    Processed:
    FORMAT: dataset_name(raw_size => processed_size)
    Hanging:
    FORMAT: dataset_name(raw_size)[create_time]
    Additional:
    FORMAT: dataset_name(processed_size)
    Unprocessed:
    FORMAT: dataset_name(raw_size)
    test_ds(207.7MB)

``test_ds`` is in the "Unprocessed" group.
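The same metadata can also be checked programmatically. The following is only an illustrative sketch, not part of the ``cogdata`` API: it reads the JSON files shown in the directory trees above with the standard library, and the keys inside them depend on your ``cogdata`` version::

    import json
    from pathlib import Path

    # Inspect the metadata written by `create_dataset` and `create_task`.
    # The file locations come from the directory trees above; the keys
    # inside each file are whatever your cogdata version wrote.
    for meta in (Path("test_ds/cogdata_info.json"),
                 Path("cogdata_task_test_task/cogdata_config.json")):
        if meta.exists():
            print(f"--- {meta} ---")
            print(json.dumps(json.loads(meta.read_text()), indent=2))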
Process
-------

Use the ``cogdata process`` command to process datasets::

    cogdata process
        [-h], [--help]                                      # show the help and exit.
        [--nproc NPROC]                                     # number of processes to launch.
        [--dataloader_num_workers DATALOADER_NUM_WORKERS]   # number of dataloader workers per computational process.
        [--ratio RATIO]                                     # ratio of data to process.
        -t TASK_ID, --task_id TASK_ID                       # id of the handling task.
        [datasets [datasets ...]]                           # dataset names; None means all possible datasets.

Example::

    cogdata process --task_id test_task --nproc 2 --dataloader_num_workers 1 --ratio 1 test_ds

Expected output::

    All datasets: test_ds
    Processing test_ds
    dataset: test_ds, rank 0:[#########################] 100%  Speed: 92.66 samples/s
    dataset: test_ds, rank 1:[#########################] 100%  Speed: 92.66 samples/s
    Waiting torch.launch to terminate...

Now ``test_ds`` has been processed. This can be verified with ``cogdata list -t test_task``::

    ------------------------------ Task Info ------------------------------
    Task Id: test_task
    Task Type: ImageTextTokenizationTask
    Description: test
    Processed:
    FORMAT: dataset_name(raw_size => processed_size)
    test_ds(207.7MB => 5.4MB)
    Hanging:
    FORMAT: dataset_name(raw_size)[create_time]
    Additional:
    FORMAT: dataset_name(processed_size)
    Unprocessed:
    FORMAT: dataset_name(raw_size)

Directory structure::

    .
    ├── cogdata_task_test_task
    │   ├── cogdata_config.json
    │   ├── main_pid_35218.log
    │   └── test_ds
    │       ├── logs
    │       │   ├── rank_0.log
    │       │   ├── rank_0.progress
    │       │   ├── rank_1.log
    │       │   └── rank_1.progress
    │       ├── meta_info.json
    │       ├── test_ds.bin.part_0.cogdata
    │       └── test_ds.bin.part_1.cogdata
    └── test_ds
        ├── cogdata_info.json
        ├── infolist.json
        └── n10148035.tar

Merge
-----

There are now two processed files, ``test_ds.bin.part_0.cogdata`` and ``test_ds.bin.part_1.cogdata``, because ``nproc=2`` was used during processing. Merge them with ``cogdata merge``::

    cogdata merge
        [-h], [--help]                  # show the help message and exit.
        -t TASK_ID, --task_id TASK_ID   # id of the handling task.

Example::

    cogdata merge -t test_task

Directory structure::

    .
    ├── cogdata_task_test_task
    │   ├── cogdata_config.json
    │   ├── main_pid_35218.log
    │   ├── merge.bin
    │   └── test_ds
    │       ├── logs
    │       │   ├── rank_0.log
    │       │   ├── rank_0.progress
    │       │   ├── rank_1.log
    │       │   └── rank_1.progress
    │       ├── meta_info.json
    │       ├── test_ds.bin.part_0.cogdata
    │       └── test_ds.bin.part_1.cogdata
    └── test_ds
        ├── cogdata_info.json
        ├── infolist.json
        └── n10148035.tar

Split
-----

Use ``cogdata split`` to randomly split the merged result into equal-sized subsets::

    cogdata split
        [-h], [--help]                  # show the help message and exit.
        -t TASK_ID, --task_id TASK_ID   # id of the handling task.
        n                               # number of pieces to split the merged result into.

Example::

    cogdata split -t test_task 3

Directory structure::

    .
    ├── cogdata_task_test_task
    │   ├── cogdata_config.json
    │   ├── main_pid_40494.log
    │   ├── merge.bin
    │   ├── split_merged_files
    │   │   ├── merge.bin.part0
    │   │   ├── merge.bin.part1
    │   │   └── merge.bin.part2
    │   └── test_ds
    │       ├── logs
    │       │   ├── rank_0.log
    │       │   ├── rank_0.progress
    │       │   ├── rank_1.log
    │       │   └── rank_1.progress
    │       ├── meta_info.json
    │       ├── test_ds.bin.part_0.cogdata
    │       └── test_ds.bin.part_1.cogdata
    └── test_ds
        ├── cogdata_info.json
        ├── infolist.json
        └── n10148035.tar

Clean
-----

If a task crashes or stays "Hanging" for too long, ``cogdata clean`` can help remove damaged files in the task folder::

    cogdata clean
        [-h], [--help]                  # show the help message and exit.
        -t TASK_ID, --task_id TASK_ID   # id of the handling task.

Example::

    cogdata clean -t test_task
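As a final sanity check, the merged binary can be loaded back into Python. The layout below is an assumption inferred from the task parameters used above (``--length_per_sample 1088``, ``--dtype int32``), namely a flat file of fixed-length samples; verify it against your ``cogdata`` version before relying on it::

    import numpy as np

    # Assumed layout (not guaranteed by cogdata): merge.bin is a flat
    # stream of fixed-length samples of 1088 bytes each
    # (--length_per_sample 1088), stored as int32 (--dtype int32),
    # i.e. 1088 / 4 = 272 values per sample.
    SAMPLE_BYTES = 1088
    data = np.fromfile("cogdata_task_test_task/merge.bin", dtype=np.int32)
    samples = data.reshape(-1, SAMPLE_BYTES // 4)   # one row per sample
    print(f"{samples.shape[0]} samples, {samples.shape[1]} int32 values each")
    # If --txt_len 64 holds, the first 64 values of each row would be the
    # text tokens and the remainder the image tokens (assumption).
    print("first 10 values of sample 0:", samples[0, :10])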