cogdata

cogdata.arguments module

This module defines functions for parsing console arguments.

cogdata.cli module

This module is the entry point for the console command.

cogdata.data_manager module

This file defines DataManager.

class cogdata.data_manager.DataManager

Bases: object

A manager of all datasets.

static clean(args)

Clean all files in a task subfolder.

Parameters

args (argparse.Namespace) – Arguments provided by the console

static fetch_datasets(base_dir)

Get the names of all created datasets (datasets with a `config.json`).

Parameters

base_dir (str) – The root folder path

Returns

A list of created dataset names.

Return type

list[str]
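
The lookup described above can be sketched as follows. This is a minimal illustration, not the library's implementation: it assumes a created dataset is any direct subfolder of base_dir that contains a `config.json`, and the helper name `find_created_datasets` is hypothetical.

```python
import os


def find_created_datasets(base_dir):
    """Return the sorted names of subfolders of base_dir that hold a config.json.

    Hypothetical helper; the folder layout is an assumption.
    """
    names = []
    for entry in os.listdir(base_dir):
        folder = os.path.join(base_dir, entry)
        # A folder counts as a created dataset only if its config file exists.
        if os.path.isdir(folder) and os.path.isfile(os.path.join(folder, "config.json")):
            names.append(entry)
    return sorted(names)
```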

static fetch_datasets_states(base_dir, task_id)

Get the status of every dataset (datasets with a `config.json`) in the task whose ID is `task_id`.

Parameters
  • base_dir (str) – The root folder path

  • task_id (str) – The ID of an existing task.

Returns

A tuple containing:
  • all_datasets ([str]): A list of created dataset names.

  • processed ([str]): A list of processed dataset names.

  • hanging ([str]): A list of dataset names that are still being processed.

  • unprocessed ([str]): A list of unprocessed dataset names.

  • additional ([str]): A list of dataset names that exist only in processed form, from migration.

Return type

tuple
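
One way to picture the five groups is as set operations. The sketch below is an assumption about how the categories relate, not the library's logic: `classify_datasets` and its `started` and `finished` inputs are hypothetical, standing for datasets whose processing has begun and datasets whose processed output exists.

```python
def classify_datasets(all_datasets, started, finished):
    """Group dataset names into the five categories described above.

    Hypothetical helper: `started` and `finished` are assumed inputs.
    """
    created = set(all_datasets)
    started, finished = set(started), set(finished)

    processed = sorted(created & finished)              # created and fully processed
    hanging = sorted((created & started) - finished)    # started but not finished
    unprocessed = sorted(created - started - finished)  # never started
    additional = sorted(finished - created)             # processed output only
    return sorted(created), processed, hanging, unprocessed, additional
```

Under these assumptions, a dataset that was migrated in processed form and has no source folder ends up in the last group.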

static list(args)

List all datasets in the current directory.

Parameters

args (argparse.Namespace) – Arguments provided by the console

Note

dataset1(233 MB) rar json processed(10MB)
dataset2(10 GB) zip json_ks unprocessed

current taskname: image_text_tokenization
number(2) raw(10.23GB) processed(10MB)
unprocessed: dataset2

static load_task(base_dir, id)

Load the task config (JSON) by task ID.

Parameters
  • base_dir (str) – The root folder path

  • id (str) – The ID of an existing task.

Returns

The JSON config of the task.

Return type

dict
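
A minimal sketch of loading such a config into a dict. The path layout (`<base_dir>/cogdata_workspace/<task_id>/cogdata_config.json`) is an assumption inferred from the folder and file names mentioned for new_task, and `load_task_config` is a hypothetical name.

```python
import json
import os


def load_task_config(base_dir, task_id):
    """Read a task's JSON config and return it as a dict.

    Hypothetical helper; the path layout below is an assumption.
    """
    path = os.path.join(base_dir, "cogdata_workspace", task_id, "cogdata_config.json")
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```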

static merge(args)

Merge all currently processed datasets.

Parameters

args (argparse.Namespace) – Arguments provided by the console

static new_dataset(args)

Create a dataset subfolder and a template (cogdata_info.json) in it. One should manually handle the data files.

Parameters

args (argparse.Namespace) – Arguments provided by the console

static new_task(args)

Create a cogdata_workspace subfolder and a cogdata_config.json containing the configs in args.

Parameters

args (argparse.Namespace) – Arguments provided by the console

static process(args)

Process one or more detected unprocessed datasets, as specified in args.

Parameters

args (argparse.Namespace) – Arguments provided by the console

static process_single(args)

Parameters

args (argparse.Namespace) – Arguments provided by the console

static split(args)

Split the merged files into N parts.

Parameters

args (argparse.Namespace) – Arguments provided by the console
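
The arithmetic of an N-way split can be sketched as below. The on-disk format of the merged files is not described here, so this only shows how a total item count could be divided into N contiguous, nearly equal ranges; `split_ranges` is a hypothetical helper.

```python
def split_ranges(total, n_parts):
    """Divide `total` items into `n_parts` contiguous (start, end) ranges.

    Hypothetical helper. The first `total % n_parts` ranges get one extra
    item, so range sizes differ by at most one.
    """
    if n_parts <= 0:
        raise ValueError("n_parts must be positive")
    base, extra = divmod(total, n_parts)
    ranges = []
    start = 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

For example, `split_ranges(10, 3)` yields `[(0, 4), (4, 7), (7, 10)]`.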

cogdata.data_processor module

This file defines DataProcessor.

class cogdata.data_processor.DataProcessor

Bases: object

Multi-GPU processor.

run_monitor(current_dir, taskid, args)

Launch k run_single processes (as separate commands rather than via multiprocessing, for the sake of the dataloader).

First clean the tmp files left by previous runs, then monitor the progress of all processes through their output in tmp files (using utils.progress_record). Finally, wait for the processes to finish and merge the k files (using the helper in saver).

Parameters
  • current_dir (str) – Task folder created by DataManager.new_task

  • taskid (str) – ID of the ongoing task

  • args (argparse.Namespace) – Arguments provided by the console
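
The launch-and-wait pattern described for run_monitor can be sketched with the standard library. This is an illustration under stated assumptions, not the actual implementation: `launch_and_wait` is a hypothetical name, the commands are supplied by the caller, and the reading of progress from tmp files is only marked by a comment.

```python
import subprocess
import sys
import time


def launch_and_wait(commands, poll_interval=0.1):
    """Start one process per command, poll until all exit, return exit codes.

    Hypothetical helper illustrating the pattern only.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    while any(p.poll() is None for p in procs):
        # A real monitor would read each worker's progress file here.
        time.sleep(poll_interval)
    return [p.returncode for p in procs]


if __name__ == "__main__":
    # Two placeholder workers that exit immediately.
    workers = [[sys.executable, "-c", "print('worker done')"] for _ in range(2)]
    print(launch_and_wait(workers))
```

Launching workers as commands keeps each one in its own interpreter, which matches the note above about not using multiprocessing.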

run_single(local_rank, args_dict)

Perform the actual processing: create datasets with task.transform_fn, iterate over the dataloader, and run task.process.

Parameters
  • local_rank (int) – Local rank set by torch.distributed.launch

  • args_dict (dict) – Parsed from a string provided by the console

cogdata.data_processor.initialize_distributed(local_rank, world_size, rank=None, master_addr=None, master_port=None)

Initialize torch.distributed.

Parameters
  • local_rank (int) – Local rank set by torch.distributed.launch

  • world_size (int) – The number of available GPUs

  • rank (int) – The real rank, which takes both the local rank and the global rank into account

  • master_addr (str) – IP address for torch.distributed TCP connection

  • master_port (str) – Port for torch.distributed TCP connection
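
A hedged sketch of what such an initialization typically looks like with PyTorch. The function names, the NCCL backend, the fallback of rank to local_rank, and the default address and port are all assumptions; only `torch.cuda.set_device` and `torch.distributed.init_process_group` are real PyTorch calls.

```python
import os


def build_init_method(master_addr=None, master_port=None):
    """Build a TCP init URL, falling back to environment variables.

    Hypothetical helper; the defaults below are assumptions.
    """
    addr = master_addr or os.getenv("MASTER_ADDR", "localhost")
    port = master_port or os.getenv("MASTER_PORT", "6000")
    return f"tcp://{addr}:{port}"


def init_distributed_sketch(local_rank, world_size, rank=None,
                            master_addr=None, master_port=None):
    """Bind this process to one GPU and join the process group."""
    # Imported lazily so build_init_method stays usable without PyTorch.
    import torch
    import torch.distributed as dist

    torch.cuda.set_device(local_rank)
    if rank is None:
        rank = local_rank  # assumption: single-node run
    dist.init_process_group(
        backend="nccl",
        init_method=build_init_method(master_addr, master_port),
        world_size=world_size,
        rank=rank,
    )
```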

cogdata.process_single_entry module

This module is the entry point for DataProcessor.run_single; it is called by DataProcessor.run_monitor.