cogdata.datasets package¶

cogdata.datasets.binary_dataset module¶

class cogdata.datasets.binary_dataset.BinaryDataset(path, length_per_sample, dtype='int32', preload=False, **kwargs)¶

Bases: Generic[torch.utils.data.dataset.T_co]

Dataset for NumPy binary files.

__getitem__(index)¶

Get an item by index.

Parameters: index (int) – The selected item’s index.
Returns: item – A torch tensor built from a NumPy array.
Return type: Tensor

__init__(path, length_per_sample, dtype='int32', preload=False, **kwargs)¶

Split the data across multiple processes, get the file pointer and filenames of valid samples, and set the transform function.

Parameters

path (str) – The path of the binary file.
length_per_sample (int) – Length of a sample (in bytes).
dtype (str) – Data type of the NumPy array.
preload (bool) – If True, load the data in __init__. If False, map the file directly with mmap.

__len__()¶

Get the total number of the valid samples.

Returns: The total number of the valid samples.
Return type: int
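
The difference between preload=True and preload=False can be illustrated with the standard library alone. The following is a minimal sketch, not cogdata's implementation: it assumes a file of fixed-length int32 records and treats length_per_sample as a byte count, as the parameter description states. The helper names read_sample_preloaded and read_sample_mmap are hypothetical.

```python
import mmap
import os
import struct
import tempfile

LENGTH_PER_SAMPLE = 16  # bytes per sample: four int32 values (assumed layout)


def write_demo_file(path, num_samples):
    """Write num_samples records of four little-endian int32 values each."""
    with open(path, "wb") as f:
        for i in range(num_samples):
            f.write(struct.pack("<4i", i, i + 1, i + 2, i + 3))


def read_sample_preloaded(data, index):
    """preload=True analogue: the whole file is already in memory as bytes."""
    start = index * LENGTH_PER_SAMPLE
    return struct.unpack("<4i", data[start:start + LENGTH_PER_SAMPLE])


def read_sample_mmap(mapped, index):
    """preload=False analogue: slice the memory map; pages load on demand."""
    start = index * LENGTH_PER_SAMPLE
    return struct.unpack("<4i", mapped[start:start + LENGTH_PER_SAMPLE])


def demo():
    """Read sample 2 both ways and return (num_samples, preloaded, mapped)."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        write_demo_file(path, num_samples=3)
        with open(path, "rb") as f:
            preloaded = f.read()
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
                num_samples = len(mapped) // LENGTH_PER_SAMPLE
                from_mmap = read_sample_mmap(mapped, 2)
        from_memory = read_sample_preloaded(preloaded, 2)
        return num_samples, from_memory, from_mmap
    finally:
        os.remove(path)
```

Both paths return the same values. Preloading costs memory up front, while the memory map defers reads until a sample is actually indexed.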

cogdata.datasets.rar_dataset module¶

class cogdata.datasets.rar_dataset.StreamingRarDataset(path, world_size=1, rank=0, transform_fn=None)¶

Bases: torch.utils.data.dataset.Dataset[torch.utils.data.dataset.T_co]

__del__()¶: Close the rar file

__init__(path, world_size=1, rank=0, transform_fn=None)¶

Split the data across multiple processes, get the file pointer and filenames of valid samples, and set the transform function.

Parameters

path (str) – The path of the rar file.
world_size (int) – The total number of GPUs
rank (int) – The local rank of current process
transform_fn (function) – Used in __getitem__

__iter__()¶: StreamingRarDataset is iterable

__len__()¶

Get the total number of the valid samples.

Returns: The total number of the valid samples.
Return type: int

__next__()¶: Returns the next sample in the dataset
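
Because the class defines both __iter__ and __next__, it can be consumed directly in a for loop. The sketch below shows that protocol on a hypothetical stand-in, StreamingListDataset, which wraps a plain list instead of a rar archive. Applying transform_fn inside __next__ is an assumption, not something this page confirms.

```python
class StreamingListDataset:
    """Hypothetical stand-in: streams items from a list, not a rar file."""

    def __init__(self, items, transform_fn=None):
        self.items = list(items)
        self.transform_fn = transform_fn
        self.position = 0

    def __len__(self):
        return len(self.items)

    def __iter__(self):
        # The object is its own iterator; restart from the first sample.
        self.position = 0
        return self

    def __next__(self):
        if self.position >= len(self.items):
            raise StopIteration
        item = self.items[self.position]
        self.position += 1
        if self.transform_fn is not None:
            return self.transform_fn(item)
        return item
```

A loop such as `for sample in StreamingListDataset(["a", "b"], transform_fn=str.upper)` would then receive the transformed samples one at a time.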

cogdata.datasets.tar_dataset module¶

class cogdata.datasets.tar_dataset.TarDataset(path, world_size=1, rank=0, transform_fn=None)¶

Bases: Generic[torch.utils.data.dataset.T_co]

__getitem__(idx)¶

Get an item by index.

Parameters: idx (int) – The selected item’s index.
Returns: item – A torch tensor built from a NumPy array.
Return type: Tensor

__init__(path, world_size=1, rank=0, transform_fn=None)¶

Split the data across multiple processes, get the file pointer and filenames of valid samples, and set the transform function.

Parameters

path (str) – The path of the tar file.
world_size (int) – The total number of GPUs
rank (int) – The local rank of current process
transform_fn (function) – Used in __getitem__

__len__()¶: Get the total number of the valid samples.
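
How world_size and rank divide an archive between processes is not specified on this page. One common scheme, shown here as a sketch under that assumption, gives each rank a contiguous slice of the member list. The helpers shard_bounds and shard_members are hypothetical and not part of cogdata.

```python
import io
import tarfile


def shard_bounds(total, world_size, rank):
    """Return the [start, end) range of the contiguous shard owned by rank."""
    per_rank = (total + world_size - 1) // world_size  # ceiling division
    start = min(rank * per_rank, total)
    end = min(start + per_rank, total)
    return start, end


def build_demo_tar(num_files):
    """Create an in-memory tar archive holding num_files small text files."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for i in range(num_files):
            payload = f"sample {i}".encode()
            info = tarfile.TarInfo(name=f"{i:03d}.txt")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def shard_members(fileobj, world_size, rank):
    """List the regular-file member names that fall in this rank's shard."""
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        names = [m.name for m in tar.getmembers() if m.isfile()]
    start, end = shard_bounds(len(names), world_size, rank)
    return names[start:end]
```

With 10 members and world_size=4, ranks 0 to 2 each get three members and rank 3 gets the remaining one, so every sample is read by exactly one process.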

cogdata.datasets.zip_dataset module¶

class cogdata.datasets.zip_dataset.ZipDataset(path, *args, world_size=1, rank=0, transform_fn=None)¶

Bases: Generic[torch.utils.data.dataset.T_co]

Dataset for zip files.

__getitem__(idx)¶

Get an item by index.

Parameters

idx (int) – The selected item’s index.

Returns

If transform_fn is not None, the result of calling transform_fn.

If transform_fn is None, a tuple containing:

fp (file pointer) – File pointer of the zip file.
full_filename (str) – Filename of the image.
file_size – The size of a raw file in the zip file.

__init__(path, *args, world_size=1, rank=0, transform_fn=None)¶

Split data for multiple processes. Get the file pointer and filenames of valid samples. Set transform function.

Parameters

path (str) – The path of the zip file.
world_size (int) – The total number of GPUs
rank (int) – The local rank of current process
transform_fn (function) – Used in __getitem__

__len__()¶

Get the total number of the valid samples.

Returns: The total number of the valid samples.
Return type: int
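
A transform_fn presumably receives the same three values that __getitem__ returns when no transform is set. As a sketch under that assumption (the call signature is not confirmed by this page), the hypothetical read_raw_bytes below reads one member from an open zipfile.ZipFile, which stands in for the file pointer.

```python
import io
import zipfile


def read_raw_bytes(fp, full_filename, file_size):
    """Hypothetical transform_fn: return the member's uncompressed bytes."""
    with fp.open(full_filename) as member:
        data = member.read()
    # file_size is assumed to be the uncompressed size of the member.
    if len(data) != file_size:
        raise ValueError(f"size mismatch for {full_filename}")
    return data


def build_demo_zip():
    """Create an in-memory zip archive with two small members."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w") as zf:
        zf.writestr("images/a.txt", b"alpha")
        zf.writestr("images/b.txt", b"bravo!")
    buffer.seek(0)
    return buffer


def load_all(buffer, transform_fn):
    """Apply transform_fn to every regular member, mimicking indexed access."""
    with zipfile.ZipFile(buffer, mode="r") as fp:
        members = [info for info in fp.infolist() if not info.is_dir()]
        return [transform_fn(fp, info.filename, info.file_size) for info in members]
```

Calling load_all(build_demo_zip(), read_raw_bytes) yields the raw bytes of each member. A real transform would typically decode those bytes into an image or tensor instead.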