libauc.datasets
This module provides several popular dataset wrappers, collected and adapted from public codebases, for testing LibAUC algorithms. We modified some of them to work with LibAUC algorithms that require tracking an index for each individual sample, such as APLoss, pAUCLoss, and GCLoss. We recommend that users cite the original papers when using these wrappers.
Please refer to the source code for more details about each implementation.
libauc.datasets.cat_vs_dog
libauc.datasets.cifar
- class CIFAR10(root: str, train: bool = True, transform: Callable | None = None, target_transform: Callable | None = None, download: bool = True, return_index: bool = False)[source]
CIFAR10 Dataset.
- Parameters:
root (string) – Root directory of the dataset where the directory cifar-10-batches-py exists or will be saved to if download is set to True.
train (bool, optional) – If True, creates the dataset from the training set; otherwise, creates it from the test set.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g., transforms.RandomCrop.
target_transform (callable, optional) – A function/transform that takes in the target and transforms it.
download (bool, optional) – If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
return_index (bool, optional) – If True, each item is a tuple containing data, target, and index; otherwise, a tuple containing data and target only (default: False).
- base_folder = 'cifar-10-batches-py'
- filename = 'cifar-10-python.tar.gz'
- meta = {'filename': 'batches.meta', 'key': 'label_names', 'md5': '5ff9c542aee3614f3951f8cda6e48888'}
- test_list = [['test_batch', '40351d587109b95175f43aff81a1287e']]
- tgz_md5 = 'c58f30108f718f92721af3b95e74349a'
- train_list = [['data_batch_1', 'c99cafc152244af753f735de768cd75f'], ['data_batch_2', 'd4bba439e000b95fd0a9bffe97cbabec'], ['data_batch_3', '54ebc095f3ab1f0389bbae665268c751'], ['data_batch_4', '634d18415352ddfa80567beed471001a'], ['data_batch_5', '482c414d41f54cd18b22e5b47cb7c3cb']]
- url = 'https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz'
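A minimal usage sketch (the root path, batch size, and transform below are illustrative, not prescribed by the library); with return_index=True, each item carries its dataset index, which is what index-tracking losses such as APLoss expect:

import torchvision.transforms as T
from torch.utils.data import DataLoader
from libauc.datasets import CIFAR10

# return_index=True makes each item a (image, target, index) tuple so that
# per-sample state can be tracked by index-based losses.
train_set = CIFAR10(root='./data', train=True, download=True,
                    transform=T.ToTensor(), return_index=True)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)

for images, targets, index in train_loader:
    # `index` identifies each sample within the dataset
    break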
- class CIFAR100(root: str, train: bool = True, transform: Callable | None = None, target_transform: Callable | None = None, download: bool = True, return_index: bool = False)[source]
CIFAR100 Dataset.
This is a subclass of the CIFAR10 Dataset.
- base_folder = 'cifar-100-python'
- filename = 'cifar-100-python.tar.gz'
- meta = {'filename': 'meta', 'key': 'fine_label_names', 'md5': '7973b15100ade9c7d40fb424638fde48'}
- test_list = [['test', 'f0ef6b0ae62326f3e7ffdfab6717acfc']]
- tgz_md5 = 'eb9058c3a382ffc7106e4002c42a8d85'
- train_list = [['train', '16019d7e3df5f24257cddd939b257f8d']]
- url = 'https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz'
libauc.datasets.stl10
libauc.datasets.chexpert
- class CheXpert(csv_path, image_root_path='', image_size=320, class_index=0, use_frontal=True, use_upsampling=True, flip_label=False, shuffle=True, seed=123, verbose=False, transforms=None, upsampling_cols=['Cardiomegaly', 'Consolidation'], train_cols=['Cardiomegaly', 'Edema', 'Consolidation', 'Atelectasis', 'Pleural Effusion'], return_index=False, mode='train')[source]
- Reference:
Irvin et al., "CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison", AAAI 2019. https://arxiv.org/abs/1901.07031
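A minimal construction sketch for the training split; the CSV path and image root below are placeholders for wherever a separately downloaded copy of the CheXpert data lives:

from torch.utils.data import DataLoader
from libauc.datasets import CheXpert

# class_index selects one of the five train_cols (0 = Cardiomegaly);
# the paths are placeholders for a local copy of the dataset.
train_set = CheXpert(csv_path='CheXpert-v1.0-small/train.csv',
                     image_root_path='CheXpert-v1.0-small/',
                     class_index=0, image_size=320, mode='train')
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)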
libauc.datasets.folder
- class DatasetFolder(root: str, loader: Callable[[str], Any], extensions: Tuple[str, ...] | None = None, transform: Callable | None = None, target_transform: Callable | None = None, is_valid_file: Callable[[str], bool] | None = None, return_index: bool = False)[source]
A generic data loader.
This default directory structure can be customized by overriding the find_classes() method.
- Parameters:
root (string) – Root directory path.
loader (callable) – A function to load a sample given its path.
extensions (tuple[string]) – A tuple of allowed extensions. extensions and is_valid_file should not both be passed.
transform (callable, optional) – A function/transform that takes in a sample and returns a transformed version, e.g., transforms.RandomCrop for images.
target_transform (callable, optional) – A function/transform that takes in the target and transforms it.
is_valid_file (callable, optional) – A function that takes the path of a file and checks whether it is a valid file (used to filter out corrupt files). extensions and is_valid_file should not both be passed.
return_index (bool, optional) – If True, each item is a tuple containing data, target, and index; otherwise, a tuple containing data and target only (default: False).
- find_classes(directory: str) → Tuple[List[str], Dict[str, int]][source]
Find the class folders in a dataset structured as follows:
directory/
├── class_x
│   ├── xxx.ext
│   ├── xxy.ext
│   └── ...
│       └── xxz.ext
└── class_y
    ├── 123.ext
    ├── nsdf3.ext
    └── ...
        └── asd932_.ext
This method can be overridden to only consider a subset of classes, or to adapt to a different dataset directory structure.
- Parameters:
directory (str) – Root directory path, corresponding to self.root.
- Raises:
FileNotFoundError – If directory has no class folders.
- Returns:
List of all classes and a dictionary mapping each class to an index.
- Return type:
Tuple[List[str], Dict[str, int]]
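As a sketch of the override described above, a subclass can restrict the dataset to a subset of the class folders. The kept class name is hypothetical, and this assumes (as in the torchvision implementation this module adapts) that make_dataset only scans the classes present in class_to_idx:

from typing import Dict, List, Tuple
from libauc.datasets.folder import DatasetFolder

class SubsetFolder(DatasetFolder):
    KEEP = {'class_x'}  # hypothetical subset of class folders to keep

    def find_classes(self, directory: str) -> Tuple[List[str], Dict[str, int]]:
        # Reuse the default scan, then drop every class not in KEEP and
        # re-index the survivors contiguously.
        classes, class_to_idx = super().find_classes(directory)
        kept = [c for c in classes if c in self.KEEP]
        return kept, {c: i for i, c in enumerate(kept)}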
- static make_dataset(directory: str, class_to_idx: Dict[str, int], extensions: Tuple[str, ...] | None = None, is_valid_file: Callable[[str], bool] | None = None) → List[Tuple[str, int]][source]
Generates a list of samples of the form (path_to_sample, class).
This can be overridden to e.g. read files from a compressed zip file instead of from the disk.
- Parameters:
directory (str) – Root dataset directory, corresponding to self.root.
class_to_idx (Dict[str, int]) – Dictionary mapping class name to class index.
extensions (optional) – A tuple of allowed extensions. Either extensions or is_valid_file should be passed. Defaults to None.
is_valid_file (optional) – A function that takes the path of a file and checks whether it is a valid file (used to filter out corrupt files). extensions and is_valid_file should not both be passed. Defaults to None.
- Raises:
ValueError – If class_to_idx is empty.
ValueError – If both extensions and is_valid_file are None, or if both are not None.
FileNotFoundError – If no valid file was found for any class.
- Returns:
Samples of the form (path_to_sample, class).
- Return type:
List[Tuple[str, int]]
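For illustration, the static method can be called directly on the directory tree shown earlier (the path is hypothetical); note that exactly one of extensions and is_valid_file is given, per the rules just listed:

from libauc.datasets.folder import DatasetFolder

samples = DatasetFolder.make_dataset(
    directory='directory/',                     # hypothetical root path
    class_to_idx={'class_x': 0, 'class_y': 1},
    extensions=('.ext',),
)
# samples is e.g. [('directory/class_x/xxx.ext', 0), ...]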
- class ImageFolder(root: str, transform: ~typing.Callable | None = None, target_transform: ~typing.Callable | None = None, loader: ~typing.Callable[[str], ~typing.Any] = <function default_loader>, is_valid_file: ~typing.Callable[[str], bool] | None = None, return_index: bool = False)[source]
A generic data loader where the images are arranged in this way by default:
root/dog/xxx.png
root/dog/xxy.png
root/dog/[...]/xxz.png
root/cat/123.png
root/cat/nsdf3.png
root/cat/[...]/asd932_.png
This class inherits from DatasetFolder, so the same methods can be overridden to customize the dataset.
- Parameters:
root (string) – Root directory path.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g., transforms.RandomCrop.
target_transform (callable, optional) – A function/transform that takes in the target and transforms it.
loader (callable, optional) – A function to load an image given its path.
is_valid_file (callable, optional) – A function that takes the path of an image file and checks whether it is a valid file (used to filter out corrupt files).
return_index (bool, optional) – If True, each item is a tuple containing data, target, and index; otherwise, a tuple containing data and target only (default: False).
- find_classes(directory: str) → Tuple[List[str], Dict[str, int]][source]
Finds the class folders in a dataset.
See DatasetFolder for details.
- has_file_allowed_extension(filename: str, extensions: str | Tuple[str, ...]) → bool[source]
Checks whether a file has an allowed extension.
- is_image_file(filename: str) → bool[source]
Checks whether a file has an allowed image extension.
- Parameters:
filename (string) – Path to a file.
- Returns:
True if the filename ends with a known image extension.
- Return type:
bool
- make_dataset(directory: str, class_to_idx: Dict[str, int] | None = None, extensions: str | Tuple[str, ...] | None = None, is_valid_file: Callable[[str], bool] | None = None) → List[Tuple[str, int]][source]
Generates a list of samples of the form (path_to_sample, class).
See DatasetFolder for details.
Note: the class_to_idx parameter is optional here and falls back to the logic of the find_classes function by default.
libauc.datasets.melanoma
libauc.datasets.movielens
- class MoiveLens(root, phase='train', topk=-1, random_seed=123)[source]
A wrapper of the MovieLens dataset.
- cal_ideal_dcg(x, topk=-1)[source]
Compute the ideal DCG for a given list.
- Parameters:
- Outputs:
Ideal DCG
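For intuition, here is a self-contained sketch of the standard ideal-DCG computation: sort the relevance scores into the best possible ranking and apply a logarithmic position discount. The exact gain/discount convention used inside cal_ideal_dcg may differ:

import numpy as np

def ideal_dcg(x, topk=-1):
    # Sort relevance scores in descending order (the "ideal" ranking),
    # optionally truncate to the top-k positions, then discount each
    # position i (1-based) by log2(i + 1).
    rel = np.sort(np.asarray(x, dtype=float))[::-1]
    if topk > 0:
        rel = rel[:topk]
    return float(np.sum(rel / np.log2(np.arange(2, rel.size + 2))))

print(ideal_dcg([3, 1, 2], topk=2))  # 3/log2(2) + 2/log2(3) ≈ 4.26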
- download_dataset(root_dir, dataset='ml-20m')[source]
A helper function to download the MovieLens dataset.
- Parameters:
- Outputs:
The number of users and items in the dataset, and the dataset in pd.DataFrame format.
- class moivelens_evalset(data_path, n_users, n_items, phase)[source]
The PyTorch dataset class for MovieLens dev/test sets.
- Parameters:
- class moivelens_trainset(data_path, n_users, n_items, topk=-1, chunksize=1000)[source]
The PyTorch dataset class for MovieLens training sets.
- Parameters:
- preprocess_movielens(root_dir, dataset='ml-20m', random_seed=42)[source]
A helper function to preprocess the downloaded dataset and build train/dev/test sets as pandas DataFrames.
- Parameters:
- Outputs:
A dict containing n_users and n_items, as well as the train, dev, and test sets (in pd.DataFrame format).
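A hedged construction sketch of the wrapper above; the data directory is illustrative, and the phase values 'train'/'dev'/'test' are assumptions based on the trainset/evalset wrappers documented above:

from libauc.datasets.movielens import MoiveLens  # class name as spelled in the library

# root is a placeholder for a local MovieLens directory; phase names are
# assumed from the dev/test and training wrappers in this module.
train_set = MoiveLens(root='./ml-20m', phase='train', topk=10)
dev_set = MoiveLens(root='./ml-20m', phase='dev')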
libauc.datasets.webdataset
- class WebDataset(input_shards: str, is_train: bool, batch_size: int, preprocess_img: Callable, seed: int = 0, epoch: int = 0, tokenize: Callable | None = None, return_index: bool = False)[source]
An image-text dataset that is stored in webdataset format. For more information on webdataset format, refer to https://github.com/webdataset/webdataset.
- Parameters:
input_shards (str) – Path to the dataset shards.
is_train (bool) – Whether the dataset is for training or evaluation.
batch_size (int) – Batch size per worker.
preprocess_img (Callable) – Function to preprocess the image.
seed (int) – Seed for shuffling the dataset.
epoch (int) – Start epoch number.
tokenize (Optional[Callable]) – Tokenizer function for the text data.
return_index (bool) – Whether to return the index of the data.
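A hedged construction sketch; the shard pattern and the image preprocessing pipeline below are placeholders, not values prescribed by the library:

import torchvision.transforms as T
from libauc.datasets.webdataset import WebDataset

# Placeholder image preprocessing; any callable that maps a PIL image to a
# tensor would do here.
preprocess = T.Compose([T.Resize(224), T.CenterCrop(224), T.ToTensor()])

dataset = WebDataset(
    input_shards='/data/train-{000000..000999}.tar',  # placeholder shard pattern
    is_train=True,
    batch_size=128,
    preprocess_img=preprocess,
    seed=0,
    return_index=True,
)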
- class detshuffle2(bufsize=1000, initial=100, seed=0, epoch=-1)[source]
Shuffles samples according to the seed and epoch.
- group_by_keys_nothrow(data, keys=<function base_plus_ext>, lcase=True, suffixes=None, handler=None)[source]
Returns a function over an iterator that groups key-value pairs into samples.
- Parameters:
keys – Function that splits the key into a key and an extension (default: base_plus_ext).
lcase – Convert suffixes to lower case (default: True).