cogdata¶
Subpackages¶
cogdata.arguments module¶
This module defines functions for parsing console arguments.
cogdata.cli module¶
This module is the entry point for the console command.
cogdata.data_manager module¶
This file defines DataManager.
-
class
cogdata.data_manager.
DataManager
¶ Bases:
object
A manager of all datasets.
-
static
clean
(args)¶ Clean all files in a task subfolder
- Parameters
args (argparse.Namespace) – Arguments provided by the console
-
static
fetch_datasets
(base_dir)¶ Get the names of all created datasets (datasets with a `config.json`).
- Parameters
base_dir (str) – The root folder path
- Returns
A list of created dataset names.
- Return type
list[str]
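The lookup described above can be pictured with a minimal sketch. It is not the cogdata implementation: `fetch_datasets_sketch`, `make_demo_tree`, and the `marker` parameter are hypothetical names, and the sketch assumes that a "created dataset" is simply a direct subfolder of `base_dir` containing a `config.json`.

```python
import os
import tempfile


def fetch_datasets_sketch(base_dir, marker="config.json"):
    """Return the sorted names of subfolders of base_dir that contain `marker`."""
    names = []
    with os.scandir(base_dir) as entries:
        for entry in entries:
            # Assumption: a folder counts as a created dataset only if the marker file exists.
            if entry.is_dir() and os.path.isfile(os.path.join(entry.path, marker)):
                names.append(entry.name)
    return sorted(names)


def make_demo_tree(root):
    """Build a tiny demo layout: two folders with a marker file, one without."""
    for name, has_marker in [("dataset1", True), ("dataset2", True), ("scratch", False)]:
        folder = os.path.join(root, name)
        os.makedirs(folder)
        if has_marker:
            with open(os.path.join(folder, "config.json"), "w") as f:
                f.write("{}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        make_demo_tree(root)
        print(fetch_datasets_sketch(root))
```

Sorting the result is a choice made for the sketch so that the output is deterministic; the original does not state an ordering.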
-
static
fetch_datasets_states
(base_dir, task_id)¶ Get the status of the datasets (datasets with a `config.json`) in the task whose ID is `task_id`.
- Parameters
base_dir (str) – The root folder path
task_id (str) – The ID of an existing task.
- Returns
- A tuple containing:
all_datasets ([str]): A list of created dataset names.
processed ([str]): A list of processed dataset names.
hanging ([str]): A list of names of datasets that are still being processed.
unprocessed ([str]): A list of unprocessed dataset names.
additional ([str]): A list of names of datasets that exist only in processed form, from migration.
- Return type
tuple
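To make the five-part return value concrete, here is a small sketch of how such a partition could be computed. It is an illustration only: `partition_states_sketch` and its inputs are hypothetical, and the sketch assumes that "additional" means processed names with no matching created dataset. How cogdata actually detects processed or hanging datasets is not described here.

```python
def partition_states_sketch(all_datasets, processed_found, hanging_found):
    """Split dataset names into the five groups described above.

    all_datasets: names of created datasets.
    processed_found: names for which processed output exists (assumed input).
    hanging_found: names whose processing has started but not finished (assumed input).
    """
    processed_found = set(processed_found)
    hanging_found = set(hanging_found)
    processed = [d for d in all_datasets if d in processed_found]
    hanging = [d for d in all_datasets
               if d in hanging_found and d not in processed_found]
    unprocessed = [d for d in all_datasets
                   if d not in processed_found and d not in hanging_found]
    # Assumption: names that are processed but were never created locally.
    additional = sorted(processed_found - set(all_datasets))
    return all_datasets, processed, hanging, unprocessed, additional


if __name__ == "__main__":
    groups = partition_states_sketch(["a", "b", "c"], {"a", "x"}, {"b"})
    all_datasets, processed, hanging, unprocessed, additional = groups
    print(processed, hanging, unprocessed, additional)
```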
-
static
list
(args)¶ List all datasets in the current directory.
- Parameters
args (argparse.Namespace) – Arguments provided by the console
Note
dataset1(233 MB) rar json processed(10MB)
dataset2(10 GB) zip json_ks unprocessed
current taskname: image_text_tokenization
number(2) raw(10.23GB) processed(10MB)
unprocessed: dataset2
-
static
load_task
(base_dir, id)¶ Load the task config (JSON) by task ID.
- Parameters
base_dir (str) – The root folder path
id (str) – The ID of an existing task.
- Returns
The JSON config of the task.
- Return type
dict
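As a rough illustration of loading a task config into a dict, the sketch below reads a JSON file from a per-task folder. The folder pattern `cogdata_task_{}`, the function name `load_task_sketch`, and the parameter name `task_id` (used instead of `id` to avoid shadowing the builtin) are assumptions made for the example; only the file name `cogdata_config.json` appears in this documentation.

```python
import json
import os
import tempfile


def load_task_sketch(base_dir, task_id,
                     folder_format="cogdata_task_{}",  # hypothetical folder layout
                     config_name="cogdata_config.json"):
    """Load the JSON config of a task and return it as a dict."""
    path = os.path.join(base_dir, folder_format.format(task_id), config_name)
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        task_dir = os.path.join(root, "cogdata_task_demo")
        os.makedirs(task_dir)
        # The key below is made up for the demo; real configs may differ.
        with open(os.path.join(task_dir, "cogdata_config.json"), "w", encoding="utf-8") as f:
            json.dump({"task_type": "image_text_tokenization"}, f)
        print(load_task_sketch(root, "demo"))
```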
-
static
merge
(args)¶ Merge all currently processed datasets.
- Parameters
args (argparse.Namespace) – Arguments provided by the console
-
static
new_dataset
(args)¶ Create a dataset subfolder with a template (cogdata_info.json) in it. The data files must be handled manually.
- Parameters
args (argparse.Namespace) – Arguments provided by the console
-
static
new_task
(args)¶ Create a cogdata_workspace subfolder and a cogdata_config.json containing the configs in args.
- Parameters
args (argparse.Namespace) – Arguments provided by the console
-
static
process
(args)¶ Process one or more of the detected unprocessed datasets (as specified in args).
- Parameters
args (argparse.Namespace) – Arguments provided by the console
-
static
process_single
(args)¶ - Parameters
args (argparse.Namespace) – Arguments provided by the console
-
static
split
(args)¶ Split the merged files into N parts.
- Parameters
args (argparse.Namespace) – Arguments provided by the console
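The idea of splitting into N parts can be sketched on an in-memory list. This is only an illustration of near-equal partitioning: `split_evenly` is a hypothetical helper, and the unit that cogdata actually splits on (bytes, samples, or files) is not stated here.

```python
def split_evenly(items, n):
    """Split `items` into n contiguous parts whose sizes differ by at most one."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    base, extra = divmod(len(items), n)
    parts, start = [], 0
    for i in range(n):
        # The first `extra` parts receive one additional item each.
        size = base + (1 if i < extra else 0)
        parts.append(items[start:start + size])
        start += size
    return parts


if __name__ == "__main__":
    print(split_evenly(list(range(10)), 3))
```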
cogdata.data_processor module¶
This file defines DataProcessor.
-
class
cogdata.data_processor.
DataProcessor
¶ Bases:
object
Multi-GPU processor.
-
run_monitor
(current_dir, taskid, args)¶ Launch k run_single processes (by command line rather than multiprocessing, for the dataloader).
First clean the tmp files left by previous runs, then monitor the progress of all processes through their outputs in tmp files, using utils.progress_record. Finally, wait for the processes and merge the k files, using the helper in saver.
- Parameters
current_dir (str) – Task folder created by DataManager.new_task
taskid (str) – ID of the ongoing task
args (argparse.Namespace) – Arguments provided by the console
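The launch-and-monitor pattern described for run_monitor can be sketched with the standard library alone. Everything here is a stand-in: `WORKER` is a toy script in place of run_single, `run_monitor_sketch` and the `progress_<rank>.txt` file names are hypothetical, and the final merge step as well as utils.progress_record are not reproduced.

```python
import os
import subprocess
import sys
import tempfile
import time

# Toy stand-in for a worker: it overwrites its progress file after each step.
WORKER = """
import sys
path = sys.argv[1]
for step in range(1, 4):
    with open(path, "w") as f:
        f.write(f"{step}/3")
"""


def read_progress(path):
    """Return the current content of a progress file, or '' if it does not exist yet."""
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        return ""


def run_monitor_sketch(k, tmp_dir, poll_interval=0.05):
    """Launch k worker processes by command and poll their progress files."""
    paths = [os.path.join(tmp_dir, f"progress_{rank}.txt") for rank in range(k)]
    # Clean tmp files from previous runs first.
    for path in paths:
        if os.path.exists(path):
            os.remove(path)
    procs = [subprocess.Popen([sys.executable, "-c", WORKER, path]) for path in paths]
    # Monitor until every process has exited.
    while any(proc.poll() is None for proc in procs):
        snapshot = [read_progress(path) for path in paths]  # could be displayed here
        time.sleep(poll_interval)
    codes = [proc.wait() for proc in procs]
    return [read_progress(path) for path in paths], codes


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_monitor_sketch(2, tmp))
```

Communicating through files keeps the parent independent of how each worker is implemented, which fits the description of launching workers by command rather than through multiprocessing.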
-
run_single
(local_rank, args_dict)¶ Do the actual processing: create datasets with task.transform_fn, iterate over the dataloader, and run task.process.
- Parameters
local_rank (int) – Local rank set by torch.distributed.launch
args_dict (dict) – Parsed from a string provided by the console
-
-
cogdata.data_processor.
initialize_distributed
(local_rank, world_size, rank=None, master_addr=None, master_port=None)¶ Initialize torch.distributed.
- Parameters
local_rank (int) – Local rank set by torch.distributed.launch
world_size (int) – The number of available GPUs
rank (int) – The actual rank, which takes both the local rank and the global rank into account
master_addr (str) – IP address for torch.distributed TCP connection
master_port (str) – Port for torch.distributed TCP connection
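A minimal sketch of how these parameters might map onto a TCP-based `torch.distributed` setup follows. It is not the cogdata implementation: `resolve_dist_settings` and `initialize_distributed_sketch` are hypothetical names, and the defaults (falling back to `local_rank` for `rank`, to the `MASTER_ADDR` and `MASTER_PORT` environment variables, then to `localhost` and `29500`) and the `nccl` backend are assumptions.

```python
import os


def resolve_dist_settings(local_rank, world_size, rank=None,
                          master_addr=None, master_port=None):
    """Fill in assumed defaults and build a TCP init_method URL."""
    # Assumption: on a single node the global rank equals the local rank.
    rank = local_rank if rank is None else rank
    master_addr = master_addr or os.environ.get("MASTER_ADDR", "localhost")
    master_port = master_port or os.environ.get("MASTER_PORT", "29500")
    return {
        "rank": rank,
        "world_size": world_size,
        "init_method": f"tcp://{master_addr}:{master_port}",
    }


def initialize_distributed_sketch(local_rank, world_size, **kwargs):
    """Initialize torch.distributed over TCP (requires PyTorch with CUDA)."""
    # Imported lazily so the helper above stays usable without PyTorch installed.
    import torch
    import torch.distributed as dist

    settings = resolve_dist_settings(local_rank, world_size, **kwargs)
    torch.cuda.set_device(local_rank)  # bind this process to its GPU
    dist.init_process_group(backend="nccl", **settings)


if __name__ == "__main__":
    print(resolve_dist_settings(0, 2, master_addr="127.0.0.1", master_port="6000"))
```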
cogdata.process_single_entry module¶
This module is the entry point for DataProcessor.run_single, called by DataProcessor.run_monitor.