cogdata.datasets package
cogdata.datasets.binary_dataset module
-
class cogdata.datasets.binary_dataset.BinaryDataset(path, length_per_sample, dtype='int32', preload=False, **kwargs)
Bases: Generic[torch.utils.data.dataset.T_co]
Datasets for numpy binary files
-
__getitem__(index)
Get an item by index.
- Parameters
index (int) – The selected item’s index.
- Returns
item – A torch tensor built from numpy array
- Return type
Tensor
-
__init__(path, length_per_sample, dtype='int32', preload=False, **kwargs)
Split data for multiple processes. Get the file pointer and filenames of valid samples. Set transform function.
- Parameters
path (str) – The path of the numpy binary file.
length_per_sample (int) – Length of a sample (bytes)
dtype (str) – Type of numpy array
preload (bool) – Load the data in __init__ if preload is True. Map the file directly with mmap if preload is False.
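The difference between preloading and memory mapping can be sketched with the standard library alone. This is a minimal illustration, not cogdata's implementation: the helpers `write_samples` and `read_sample`, the on-disk layout (consecutive little-endian int32 values), and the fixed sample size are all assumptions made for the example.

```python
import mmap
import os
import struct
import tempfile

ITEM_SIZE = 4  # bytes per int32 value (dtype assumed for this sketch)


def write_samples(path, samples):
    """Write equal-length lists of ints as consecutive little-endian int32 values."""
    with open(path, "wb") as f:
        for sample in samples:
            f.write(struct.pack(f"<{len(sample)}i", *sample))


def read_sample(buf, index, bytes_per_sample):
    """Decode sample `index` from a bytes-like buffer (bytes or mmap)."""
    start = index * bytes_per_sample
    chunk = buf[start:start + bytes_per_sample]
    return list(struct.unpack(f"<{bytes_per_sample // ITEM_SIZE}i", chunk))


path = os.path.join(tempfile.mkdtemp(), "samples.bin")
write_samples(path, [[1, 2, 3], [4, 5, 6], [7, 8, 9]])
bytes_per_sample = 3 * ITEM_SIZE

# preload=True analogue: read the whole file into memory up front.
with open(path, "rb") as f:
    preloaded = f.read()

# preload=False analogue: map the file and let the OS page data in on access.
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

num_samples = len(preloaded) // bytes_per_sample
second_preloaded = read_sample(preloaded, 1, bytes_per_sample)
second_mapped = read_sample(mapped, 1, bytes_per_sample)
mapped.close()
```

Both buffers yield the same sample. Preloading pays the full read cost once, while mapping defers it to first access, which tends to suit files larger than available memory.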
-
__len__()
Get the total number of the valid samples.
- Returns
The total number of the valid samples.
- Return type
int
-
cogdata.datasets.rar_dataset module
-
class cogdata.datasets.rar_dataset.StreamingRarDataset(path, world_size=1, rank=0, transform_fn=None)
Bases: torch.utils.data.dataset.Dataset[torch.utils.data.dataset.T_co]
-
__del__()
Close the rar file.
-
__init__(path, world_size=1, rank=0, transform_fn=None)
Split data for multiple processes. Get the file pointer and filenames of valid samples. Set transform function.
- Parameters
path (str) – The path of the rar file.
world_size (int) – The total number of GPUs
rank (int) – The local rank of current process
transform_fn (function) – Used in __getitem__
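How `world_size` and `rank` might divide the samples between processes can be sketched as follows. The strided split and the helper name `shard` are assumptions for illustration; the source does not state which splitting scheme the library uses, and a contiguous split would serve equally well.

```python
def shard(items, world_size=1, rank=0):
    """Return the items assigned to `rank` under a strided split (assumed scheme)."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Every world_size-th item, starting at offset `rank`.
    return items[rank::world_size]


filenames = [f"img_{i}.jpg" for i in range(7)]
shards = [shard(filenames, world_size=3, rank=r) for r in range(3)]
```

With this scheme every sample lands in exactly one shard, and shard sizes differ by at most one.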
-
__iter__()
StreamingRarDataset is iterable.
-
__len__()
Get the total number of the valid samples.
- Returns
The total number of the valid samples.
- Return type
int
-
__next__()
Returns the next sample in the dataset.
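The `__iter__` / `__next__` pair listed above follows Python's iterator protocol. A minimal sketch of that protocol, using a hypothetical `StreamingSketch` class over an in-memory list rather than a rar archive, might look like this. Restarting from the beginning in `__iter__` is a choice made for the sketch, not documented behaviour of StreamingRarDataset.

```python
class StreamingSketch:
    """Hypothetical stand-in that streams samples in order."""

    def __init__(self, samples, transform_fn=None):
        self._samples = list(samples)
        self._transform_fn = transform_fn
        self._pos = 0

    def __len__(self):
        return len(self._samples)

    def __iter__(self):
        # The object is its own iterator; rewind so each loop starts afresh.
        self._pos = 0
        return self

    def __next__(self):
        if self._pos >= len(self._samples):
            raise StopIteration
        sample = self._samples[self._pos]
        self._pos += 1
        if self._transform_fn is not None:
            return self._transform_fn(sample)
        return sample


stream = StreamingSketch(["a", "b", "c"], transform_fn=str.upper)
collected = list(stream)
```

Because the object serves as its own iterator, two simultaneous loops over the same instance would share one position. That is usually acceptable for sequential streaming, which reads the archive once from front to back.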
-
cogdata.datasets.tar_dataset module
-
class cogdata.datasets.tar_dataset.TarDataset(path, world_size=1, rank=0, transform_fn=None)
Bases: Generic[torch.utils.data.dataset.T_co]
-
__getitem__(idx)
Get an item by index.
- Parameters
idx (int) – The selected item’s index.
- Returns
item – A torch tensor built from numpy array
- Return type
Tensor
-
__init__(path, world_size=1, rank=0, transform_fn=None)
Split data for multiple processes. Get the file pointer and filenames of valid samples. Set transform function.
- Parameters
path (str) – The path of the tar file.
world_size (int) – The total number of GPUs
rank (int) – The local rank of current process
transform_fn (function) – Used in __getitem__
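Index-based access to the members of a tar archive can be sketched with the standard `tarfile` module. The class `TarSketch`, the helper `build_tar`, and the `(data, name)` signature given to `transform_fn` are hypothetical. They illustrate the general pattern rather than TarDataset's actual behaviour, and the sketch returns raw bytes rather than a tensor.

```python
import io
import tarfile


def build_tar(files):
    """Create an in-memory tar archive from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


class TarSketch:
    """Hypothetical index-based reader over the regular files of a tar archive."""

    def __init__(self, fileobj, transform_fn=None):
        self._tar = tarfile.open(fileobj=fileobj, mode="r")
        # Keep only regular files; directories and links are not samples.
        self._members = [m for m in self._tar.getmembers() if m.isfile()]
        self._transform_fn = transform_fn

    def __len__(self):
        return len(self._members)

    def __getitem__(self, idx):
        member = self._members[idx]
        data = self._tar.extractfile(member).read()
        if self._transform_fn is not None:
            return self._transform_fn(data, member.name)
        return data


archive = build_tar({"a.txt": b"hello", "b.txt": b"world!"})
dataset = TarSketch(archive, transform_fn=lambda data, name: (name, len(data)))
count = len(dataset)
second = dataset[1]
```

Listing the members once in `__init__` makes `__len__` and `__getitem__` cheap afterwards, at the cost of one full scan of the archive when the object is created.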
-
__len__()
Get the total number of the valid samples.
-
cogdata.datasets.zip_dataset module
-
class cogdata.datasets.zip_dataset.ZipDataset(path, *args, world_size=1, rank=0, transform_fn=None)
Bases: Generic[torch.utils.data.dataset.T_co]
Datasets for zip files
-
__getitem__(idx)
Get an item by index.
- Parameters
idx (int) – The selected item’s index.
- Returns
- If transform_fn is not None, return the result of transform_fn.
- If transform_fn is None, return a tuple containing:
fp (file pointer): file pointer of the zip file.
full_filename (str): filename of the image.
file_size: the size of a raw file in the zip file.
-
__init__(path, *args, world_size=1, rank=0, transform_fn=None)
Split data for multiple processes. Get the file pointer and filenames of valid samples. Set transform function.
- Parameters
path (str) – The path of the zip file.
world_size (int) – The total number of GPUs
rank (int) – The local rank of current process
transform_fn (function) – Used in __getitem__
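The return contract of `__getitem__` and the role of `transform_fn` can be sketched with the standard `zipfile` module. `ZipSketch`, `describe`, and the assumption that `transform_fn` receives `(fp, full_filename, file_size)` are hypothetical. The source says only that `transform_fn` is used in `__getitem__`, not what arguments it takes.

```python
import io
import zipfile


class ZipSketch:
    """Hypothetical index-based reader over the files of a zip archive."""

    def __init__(self, fileobj, transform_fn=None):
        self._zip = zipfile.ZipFile(fileobj)
        # Directory entries carry no sample data, so skip them.
        self._members = [i for i in self._zip.infolist() if not i.is_dir()]
        self._transform_fn = transform_fn

    def __len__(self):
        return len(self._members)

    def __getitem__(self, idx):
        info = self._members[idx]
        fp = self._zip.open(info)
        if self._transform_fn is not None:
            return self._transform_fn(fp, info.filename, info.file_size)
        return fp, info.filename, info.file_size


def describe(fp, full_filename, file_size):
    """Example transform: read the member and check its size, closing fp after."""
    with fp:
        return full_filename, len(fp.read()) == file_size


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("images/cat.jpg", b"\xff\xd8fake")
    zf.writestr("images/dog.jpg", b"\xff\xd8faker")
archive_bytes = buf.getvalue()

# Without transform_fn: the raw (fp, full_filename, file_size) tuple.
raw = ZipSketch(io.BytesIO(archive_bytes))
fp, full_filename, file_size = raw[0]
payload = fp.read()
fp.close()

# With transform_fn: whatever the transform returns.
described = ZipSketch(io.BytesIO(archive_bytes), transform_fn=describe)
second = described[1]
```

Returning the open file pointer leaves decoding to the caller, so the caller (or the transform) is responsible for closing it.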
-
__len__()
Get the total number of the valid samples.
- Returns
The total number of the valid samples.
- Return type
int
-