cogdata.datasets package

cogdata.datasets.binary_dataset module

class cogdata.datasets.binary_dataset.BinaryDataset(path, length_per_sample, dtype='int32', preload=False, **kwargs)

Bases: Generic[torch.utils.data.dataset.T_co]

Dataset for NumPy binary files.

__getitem__(index)

Get an item by index.

Parameters

index (int) – The selected item’s index.

Returns

item – A torch tensor built from a NumPy array.

Return type

Tensor

__init__(path, length_per_sample, dtype='int32', preload=False, **kwargs)

Split data for multiple processes, get the file pointer and filenames of valid samples, and set the transform function.

Parameters
  • path (str) – The path of the binary file.

  • length_per_sample (int) – Length of a sample (bytes).

  • dtype (str) – Type of numpy array

  • preload (bool) – If True, load the data into memory in __init__. If False, map the file directly with mmap.
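The difference between the two preload modes can be sketched with the standard library alone. This is a minimal illustration, not the cogdata implementation: the class name FixedLengthBinaryFile and the parameter bytes_per_sample are hypothetical, samples are assumed to be native-endian int32 values, and plain tuples stand in for the tensors that BinaryDataset returns.

```python
import mmap
import os
import struct
import tempfile


class FixedLengthBinaryFile:
    """Illustrative stand-in (hypothetical), not the cogdata implementation."""

    def __init__(self, path, bytes_per_sample, preload=False):
        self.bytes_per_sample = bytes_per_sample
        self._file = open(path, "rb")
        if preload:
            # Read the whole file into memory up front.
            self._buf = self._file.read()
        else:
            # Map the file; the OS loads pages lazily on access.
            self._buf = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        # Ignore trailing bytes that do not fill a whole sample.
        self._count = len(self._buf) // bytes_per_sample

    def __len__(self):
        return self._count

    def __getitem__(self, index):
        if not 0 <= index < self._count:
            raise IndexError(index)
        start = index * self.bytes_per_sample
        raw = self._buf[start:start + self.bytes_per_sample]
        # Decode native-endian int32 values (4 bytes each); an assumption.
        return struct.unpack(f"={len(raw) // 4}i", raw)

    def close(self):
        if isinstance(self._buf, mmap.mmap):
            self._buf.close()
        self._file.close()


# Write 3 samples of 4 int32 values each (16 bytes per sample).
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(struct.pack("=12i", *range(12)))
    demo_path = tmp.name

lazy = FixedLengthBinaryFile(demo_path, bytes_per_sample=16, preload=False)
eager = FixedLengthBinaryFile(demo_path, bytes_per_sample=16, preload=True)
n_samples = len(lazy)
second_lazy = lazy[1]
second_eager = eager[1]
lazy.close()
eager.close()
os.remove(demo_path)
```

Both modes return the same samples. Preloading costs memory at construction time, while mapping defers reads until a sample is accessed.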

__len__()

Get the total number of the valid samples.

Returns

The total number of the valid samples.

Return type

int

cogdata.datasets.rar_dataset module

class cogdata.datasets.rar_dataset.StreamingRarDataset(path, world_size=1, rank=0, transform_fn=None)

Bases: torch.utils.data.dataset.Dataset[torch.utils.data.dataset.T_co]

__del__()

Close the rar file

__init__(path, world_size=1, rank=0, transform_fn=None)

Split data for multiple processes, get the file pointer and filenames of valid samples, and set the transform function.

Parameters
  • path (str) – The path of the rar file.

  • world_size (int) – The total number of GPUs

  • rank (int) – The local rank of the current process

  • transform_fn (function) – Used in __getitem__
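One common way to split samples across processes with world_size and rank is a strided slice over the list of valid filenames. The sketch below assumes that scheme; this page does not say whether cogdata uses a strided or a contiguous split, and the helper name shard_for_rank is hypothetical.

```python
def shard_for_rank(filenames, world_size=1, rank=0):
    """Return the filenames handled by one process (strided split, an assumption)."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Process `rank` takes items rank, rank + world_size, rank + 2 * world_size, ...
    return filenames[rank::world_size]


names = [f"img_{i}.jpg" for i in range(7)]
shards = [shard_for_rank(names, world_size=3, rank=r) for r in range(3)]
```

With 7 files and 3 processes the shards hold 3, 2, and 2 files, so under this scheme the per-process length can differ by one between ranks.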

__iter__()

StreamingRarDataset is iterable

__len__()

Get the total number of the valid samples.

Returns

The total number of the valid samples.

Return type

int

__next__()

Returns the next sample in the dataset
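The __iter__ and __next__ pair follows Python's iterator protocol. The sketch below shows that protocol on an in-memory stream of fixed-size records. The class StreamingRecords, its record_size parameter, and the single-argument transform_fn are hypothetical and only stand in for a rar archive read front to back.

```python
import io


class StreamingRecords:
    """Illustrative stand-in (hypothetical): forward-only iteration over a stream."""

    def __init__(self, stream, record_size, transform_fn=None):
        self.stream = stream
        self.record_size = record_size
        self.transform_fn = transform_fn

    def __iter__(self):
        # The object is its own iterator, so it can be consumed only once.
        return self

    def __next__(self):
        record = self.stream.read(self.record_size)
        if len(record) < self.record_size:
            # End of stream (or a truncated final record): stop iterating.
            raise StopIteration
        if self.transform_fn is not None:
            return self.transform_fn(record)
        return record


stream = io.BytesIO(b"aaaabbbbcccc")
dataset = StreamingRecords(stream, record_size=4, transform_fn=bytes.upper)
first = next(dataset)
rest = list(dataset)
```

Because the object returns itself from __iter__, a second pass over the same instance yields nothing. A streaming dataset of this shape has to be reopened to iterate again.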

cogdata.datasets.tar_dataset module

class cogdata.datasets.tar_dataset.TarDataset(path, world_size=1, rank=0, transform_fn=None)

Bases: Generic[torch.utils.data.dataset.T_co]

__getitem__(idx)

Get an item by index.

Parameters

idx (int) – The selected item’s index.

Returns

item – A torch tensor built from a NumPy array.

Return type

Tensor

__init__(path, world_size=1, rank=0, transform_fn=None)

Split data for multiple processes, get the file pointer and filenames of valid samples, and set the transform function.

Parameters
  • path (str) – The path of the tar file.

  • world_size (int) – The total number of GPUs

  • rank (int) – The local rank of the current process

  • transform_fn (function) – Used in __getitem__

__len__()

Get the total number of the valid samples.

cogdata.datasets.zip_dataset module

class cogdata.datasets.zip_dataset.ZipDataset(path, *args, world_size=1, rank=0, transform_fn=None)

Bases: Generic[torch.utils.data.dataset.T_co]

Dataset for zip files.

__getitem__(idx)

Get an item by index.

Parameters

idx (int) – The selected item’s index.

Returns

If transform_fn is not None, return the result of transform_fn.

If transform_fn is None, return a tuple containing

  • fp (file pointer): file pointer of the zip file.

  • full_filename (str): filename of the image.

  • file_size: The size of the raw file in the zip file.
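The two return paths can be sketched with the standard zipfile module. This is an illustration under stated assumptions, not the cogdata implementation: TinyZipDataset and read_size_and_head are hypothetical names, every non-directory entry is treated as a valid sample, and transform_fn is assumed to be called with the same three values (fp, full_filename, file_size) that make up the tuple.

```python
import io
import zipfile


def read_size_and_head(fp, full_filename, file_size):
    """Hypothetical transform_fn: keep the name, the size, and the first 4 bytes."""
    return full_filename, file_size, fp.read(4)


class TinyZipDataset:
    """Illustrative stand-in (hypothetical), not the cogdata implementation."""

    def __init__(self, fileobj, transform_fn=None):
        self.zip = zipfile.ZipFile(fileobj)
        # Treat every non-directory entry as a valid sample (an assumption).
        self.members = [m for m in self.zip.infolist() if not m.is_dir()]
        self.transform_fn = transform_fn

    def __len__(self):
        return len(self.members)

    def __getitem__(self, idx):
        info = self.members[idx]
        fp = self.zip.open(info)
        if self.transform_fn is not None:
            # Assumed call signature, inferred from the tuple form.
            return self.transform_fn(fp, info.filename, info.file_size)
        return fp, info.filename, info.file_size


# Build a small archive in memory so the sketch runs without external files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("images/cat.png", b"\x89PNG-cat-bytes")
    archive.writestr("images/dog.png", b"\x89PNG-dog")
archive_bytes = buffer.getvalue()

raw = TinyZipDataset(io.BytesIO(archive_bytes))
fp, name, size = raw[0]
raw_payload = fp.read()

transformed = TinyZipDataset(io.BytesIO(archive_bytes), transform_fn=read_size_and_head)
result = transformed[1]
```

Returning the open file pointer rather than the decoded bytes leaves decoding to the caller, which is where a transform_fn would typically turn the raw file into an image or tensor.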

__init__(path, *args, world_size=1, rank=0, transform_fn=None)

Split data for multiple processes. Get the file pointer and filenames of valid samples. Set transform function.

Parameters
  • path (str) – The path of the zip file.

  • world_size (int) – The total number of GPUs

  • rank (int) – The local rank of the current process

  • transform_fn (function) – Used in __getitem__

__len__()

Get the total number of the valid samples.

Returns

The total number of the valid samples.

Return type

int