cogdata.tasks package

cogdata.tasks.image_text_tokenization_task module

class cogdata.tasks.image_text_tokenization_task.ImageTextTokenizationTask(saver, img_sizes, **kwargs)

Bases: cogdata.tasks.base_task.BaseTask

Task that handles image-text tokenization.

__init__(saver, img_sizes, **kwargs) → None

Configure the saver.

get_transform_fn(transform=None)
Parameters

transform (torchvision.transforms) – A torchvision transform. Do not include ToTensor().

Returns

A transform function for images

Return type

function

process(sub_datasets, progress_record=None, dataset_dir='', **kwargs)
Use CUDA to process batches from the dataloader, save the results via the Saver, report progress every 1/5000 (?), and commit the saver at the end.

Parameters
  • sub_datasets ([Dataset]) – All datasets in processing list

  • progress_record (ProgressBar) – The progress bar for this task

  • dataset_dir (str) – The path of the dataset folder

Returns

0 if processing succeeds

Return type

int
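
The control flow described for process (handle each batch, save it, report progress periodically, commit once at the end, return 0) can be sketched in plain Python. This is only an illustration under stated assumptions, not the cogdata implementation: `ListSaver`, `process_sketch`, `encode`, and `report` are hypothetical names, no CUDA or dataloader is involved, and reading "every 1/5000" as "about 5000 reports per run" is a guess.

```python
from typing import Callable, List, Sequence, Tuple


class ListSaver:
    """Hypothetical stand-in for a Saver: buffers results, then commits."""

    def __init__(self) -> None:
        self.buffer: List[int] = []
        self.committed = False

    def save(self, items: Sequence[int]) -> None:
        self.buffer.extend(items)

    def commit(self) -> None:
        self.committed = True


def process_sketch(
    batches: Sequence[Sequence[str]],
    encode: Callable[[Sequence[str]], Sequence[int]],
    saver: ListSaver,
    report: Callable[[int, int], None],
    n_reports: int = 5000,
) -> int:
    total = len(batches)
    # Assumed interpretation: aim for about n_reports progress updates;
    # the interval is never smaller than one batch.
    step = max(1, total // n_reports)
    for i, batch in enumerate(batches, start=1):
        saver.save(encode(batch))      # process the batch and save it
        if i % step == 0 or i == total:
            report(i, total)           # periodic progress update
    saver.commit()                     # single commit after the loop
    return 0                           # 0 signals success


# Usage with toy data: "encoding" a string is just taking its length.
saver = ListSaver()
seen: List[Tuple[int, int]] = []
status = process_sketch(
    batches=[["a", "bc"], ["def"]],
    encode=lambda batch: [len(s) for s in batch],
    saver=saver,
    report=lambda done, total: seen.append((done, total)),
)
```

Committing once after the loop mirrors the "final commit" wording in the description above.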

read_text(txt_files, mode)

Read a text dictionary from text files.

Parameters
  • txt_files ([str]) – All names of the text files

  • mode (str) – The format of the text; modes include json, txt, json_ks, tsv, and dict
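
As a rough illustration of dispatching on mode, the sketch below merges several files into one dictionary. The file layouts are assumptions, not taken from cogdata: "json" is assumed to be a JSON object mapping a key to its text, and "tsv" is assumed to hold the key and the text in the first two tab-separated columns. The function name `read_text_sketch` is hypothetical, and the txt, json_ks, and dict modes are deliberately left out because their formats are not described here.

```python
import csv
import json
from typing import Dict, Iterable


def read_text_sketch(txt_files: Iterable[str], mode: str) -> Dict[str, str]:
    """Merge the contents of several text files into a single dict."""
    text_dict: Dict[str, str] = {}
    for path in txt_files:
        with open(path, encoding="utf-8", newline="") as f:
            if mode == "json":
                # Assumed layout: {"key": "text", ...}
                text_dict.update(json.load(f))
            elif mode == "tsv":
                # Assumed layout: key<TAB>text, one pair per line
                for row in csv.reader(f, delimiter="\t"):
                    if len(row) >= 2:
                        text_dict[row[0]] = row[1]
            else:
                raise ValueError(f"mode {mode!r} is not covered by this sketch")
    return text_dict
```

When two files contain the same key, the file listed later wins, since entries are merged in order.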