Quick Start
===========

A simple example of how to use ``cogdata``.

Installation
------------

::

    pip install cogdata --index-url https://test.pypi.org/simple
    sudo install_unrarlib.sh

Initialization
--------------

First, create a data folder and move the data files into it::

    .
    └── test_ds
        ├── infolist.json
        └── n10148035.tar

Create Dataset
--------------

Use the ``cogdata create_dataset`` command to create a dataset::

    cogdata create_dataset
        [-h], [--help]                              # show the help and exit.
        [--description DESCRIPTION]                 # description of the handling dataset.
        [--text_files TEXT_FILES [TEXT_FILES ...]]  # file names of the text.
        [--text_format TEXT_FORMAT]                 # format of the text files.
        --data_files DATA_FILES [DATA_FILES ...]    # file names of the handling dataset.
        --data_format DATA_FORMAT                   # format of the data files.
        name                                        # name of the handling dataset.

This command just creates a ``cogdata_info.json`` in the "name" folder. Here we let the dataset's name match the data folder's name.

Example::

    cogdata create_dataset --text_files infolist.json --text_format dict --data_files n10148035.tar --data_format TarDataset test_ds

Directory structure::

    .
    └── test_ds
        ├── cogdata_info.json
        ├── infolist.json
        └── n10148035.tar

Create Task
-----------

Use the ``cogdata create_task`` command to create a task::

    cogdata create_task
        [-h], [--help]                              # show the help and exit.
        [--description DESCRIPTION]                 # description of the new task.
        [--length_per_sample LENGTH_PER_SAMPLE]     # data length of one sample (bytes).
        [--img_sizes IMG_SIZES [IMG_SIZES ...]]     # sizes of a pre-tokenized image.
        [--txt_len TXT_LEN]                         # length of text in one sample.
        [--dtype {int32,int64,float32,uint8,bool}]  # data type of samples.
        [--model_path MODEL_PATH]                   # path of the image tokenizer.
        --task_type TASK_TYPE                       # type of the handling task.
        --saver_type SAVER_TYPE                     # saver mode.
        task_id                                     # id of the new task.

Example::

    # Don't forget to modify "model_path"
    cogdata create_task --description test --task_type ImageTextTokenizationTask --saver_type BinarySaver --length_per_sample 1088 --img_sizes 256 --txt_len 64 --dtype int32 --model_path='/dataset/fd5061f6/cogview/vqvae_hard_biggerset_011.pt' test_task

Directory structure::

    .
    ├── cogdata_task_test_task
    │   └── cogdata_config.json
    └── test_ds
        ├── cogdata_info.json
        ├── infolist.json
        └── n10148035.tar

Check Datasets and Tasks
------------------------

Now we can use the ``cogdata list`` command to check them::

    cogdata list
        [-h], [--help]                      # show the help and exit.
        [-t TASK_ID], [--task_id TASK_ID]   # id of the handling task.

Example: list datasets::

    cogdata list

Expected output::

    --------------------------- All Raw Datasets --------------------------
    test_ds(207.7MB)
    ------------------------------- Summary -------------------------------
    Total 1 datasets
    Total size: 207.7MB

Example: list a task::

    cogdata list -t test_task

Expected output::

    --------------------------- All Raw Datasets --------------------------
    test_ds(207.7MB)
    ------------------------------- Summary -------------------------------
    Total 1 datasets
    Total size: 207.7MB
    ------------------------------ Task Info ------------------------------
    Task Id: test_task
    Task Type: ImageTextTokenizationTask
    Description: test
    Processed:
    FORMAT: dataset_name(raw_size => processed_size)
    Hanging:
    FORMAT: dataset_name(raw_size)[create_time]
    Additional:
    FORMAT: dataset_name(processed_size)
    Unprocessed:
    FORMAT: dataset_name(raw_size)
    test_ds(207.7MB)

``test_ds`` is in the "Unprocessed" group.
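The same metadata can also be checked programmatically. The following is only an illustrative sketch, not part of the ``cogdata`` API: it reads the JSON files shown in the directory trees above with the standard library, and the keys inside them depend on your ``cogdata`` version::

    import json
    from pathlib import Path

    # Inspect the metadata written by `create_dataset` and `create_task`.
    # The file locations come from the directory trees above; the keys
    # inside each file are whatever your cogdata version wrote.
    for meta in (Path("test_ds/cogdata_info.json"),
                 Path("cogdata_task_test_task/cogdata_config.json")):
        if meta.exists():
            print(f"--- {meta} ---")
            print(json.dumps(json.loads(meta.read_text()), indent=2))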
Process
-------

Use the ``cogdata process`` command to process datasets::

    cogdata process
        [-h], [--help]                                      # show the help and exit.
        [--nproc NPROC]                                     # number of processes to launch.
        [--dataloader_num_workers DATALOADER_NUM_WORKERS]   # number of dataloader workers per computational process.
        [--ratio RATIO]                                     # ratio of data to process.
        -t TASK_ID, --task_id TASK_ID                       # id of the handling task.
        [datasets [datasets ...]]                           # dataset names; None means all possible datasets.

Example::

    cogdata process --task_id test_task --nproc 2 --dataloader_num_workers 1 --ratio 1 test_ds

Expected output::

    All datasets: test_ds
    Processing test_ds
    dataset: test_ds, rank 0:[#########################] 100%  Speed: 92.66 samples/s
    dataset: test_ds, rank 1:[#########################] 100%  Speed: 92.66 samples/s
    Waiting torch.launch to terminate...

Now ``test_ds`` has been processed. This can be verified with ``cogdata list -t test_task``::

    ------------------------------ Task Info ------------------------------
    Task Id: test_task
    Task Type: ImageTextTokenizationTask
    Description: test
    Processed:
    FORMAT: dataset_name(raw_size => processed_size)
    test_ds(207.7MB => 5.4MB)
    Hanging:
    FORMAT: dataset_name(raw_size)[create_time]
    Additional:
    FORMAT: dataset_name(processed_size)
    Unprocessed:
    FORMAT: dataset_name(raw_size)

Directory structure::

    .
    ├── cogdata_task_test_task
    │   ├── cogdata_config.json
    │   ├── main_pid_35218.log
    │   └── test_ds
    │       ├── logs
    │       │   ├── rank_0.log
    │       │   ├── rank_0.progress
    │       │   ├── rank_1.log
    │       │   └── rank_1.progress
    │       ├── meta_info.json
    │       ├── test_ds.bin.part_0.cogdata
    │       └── test_ds.bin.part_1.cogdata
    └── test_ds
        ├── cogdata_info.json
        ├── infolist.json
        └── n10148035.tar

Merge
-----

There are now two processed files, ``test_ds.bin.part_0.cogdata`` and ``test_ds.bin.part_1.cogdata``, because ``nproc=2`` was used during processing. Merge them with ``cogdata merge``::

    cogdata merge
        [-h], [--help]                  # show the help message and exit.
        -t TASK_ID, --task_id TASK_ID   # id of the handling task.

Example::

    cogdata merge -t test_task

Directory structure::

    .
    ├── cogdata_task_test_task
    │   ├── cogdata_config.json
    │   ├── main_pid_35218.log
    │   ├── merge.bin
    │   └── test_ds
    │       ├── logs
    │       │   ├── rank_0.log
    │       │   ├── rank_0.progress
    │       │   ├── rank_1.log
    │       │   └── rank_1.progress
    │       ├── meta_info.json
    │       ├── test_ds.bin.part_0.cogdata
    │       └── test_ds.bin.part_1.cogdata
    └── test_ds
        ├── cogdata_info.json
        ├── infolist.json
        └── n10148035.tar

Split
-----

Use ``cogdata split`` to randomly split the merged result into equal-sized subsets::

    cogdata split
        [-h], [--help]                  # show the help message and exit.
        -t TASK_ID, --task_id TASK_ID   # id of the handling task.
        n                               # number of pieces to split the merged result into.

Example::

    cogdata split -t test_task 3

Directory structure::

    .
    ├── cogdata_task_test_task
    │   ├── cogdata_config.json
    │   ├── main_pid_40494.log
    │   ├── merge.bin
    │   ├── split_merged_files
    │   │   ├── merge.bin.part0
    │   │   ├── merge.bin.part1
    │   │   └── merge.bin.part2
    │   └── test_ds
    │       ├── logs
    │       │   ├── rank_0.log
    │       │   ├── rank_0.progress
    │       │   ├── rank_1.log
    │       │   └── rank_1.progress
    │       ├── meta_info.json
    │       ├── test_ds.bin.part_0.cogdata
    │       └── test_ds.bin.part_1.cogdata
    └── test_ds
        ├── cogdata_info.json
        ├── infolist.json
        └── n10148035.tar

Clean
-----

If a task crashes or stays "Hanging" for too long, ``cogdata clean`` can help remove damaged files in the task folder::

    cogdata clean
        [-h], [--help]                  # show the help message and exit.
        -t TASK_ID, --task_id TASK_ID   # id of the handling task.

Example::

    cogdata clean -t test_task
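As a final sanity check, the merged binary can be loaded back into Python. The layout below is an assumption inferred from the task parameters used above (``--length_per_sample 1088``, ``--dtype int32``), namely a flat file of fixed-length samples; verify it against your ``cogdata`` version before relying on it::

    import numpy as np

    # Assumed layout (not guaranteed by cogdata): merge.bin is a flat
    # stream of fixed-length samples of 1088 bytes each
    # (--length_per_sample 1088), stored as int32 (--dtype int32),
    # i.e. 1088 / 4 = 272 values per sample.
    SAMPLE_BYTES = 1088
    data = np.fromfile("cogdata_task_test_task/merge.bin", dtype=np.int32)
    samples = data.reshape(-1, SAMPLE_BYTES // 4)   # one row per sample
    print(f"{samples.shape[0]} samples, {samples.shape[1]} int32 values each")
    # If --txt_len 64 holds, the first 64 values of each row would be the
    # text tokens and the remainder the image tokens (assumption).
    print("first 10 values of sample 0:", samples[0, :10])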