libauc.datasets
This module provides several popular dataset wrappers, collected and adapted from public codebases, for testing LibAUC algorithms. We modified some of them to work with LibAUC algorithms that require tracking an index for each individual sample, such as APLoss, pAUCLoss, and GCLoss. We recommend that users cite the original papers when using these wrappers.
Please refer to the source code for more details about each implementation.
libauc.datasets.cat_vs_dog
libauc.datasets.cifar
- class CIFAR10(root: str, train: bool = True, transform: Callable | None = None, target_transform: Callable | None = None, download: bool = True, return_index: bool = False)[source]
CIFAR10 Dataset.
- Parameters:
root (string) – Root directory of the dataset where the directory cifar-10-batches-py exists or will be saved to if download is set to True.
train (bool, optional) – If True, creates the dataset from the training set; otherwise, creates it from the test set.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g., transforms.RandomCrop.
target_transform (callable, optional) – A function/transform that takes in the target and transforms it.
download (bool, optional) – If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
return_index (bool, optional) – If True, each item is a tuple containing data, target, and index; otherwise, a tuple containing data and target only (default: False).
- base_folder = 'cifar-10-batches-py'
- filename = 'cifar-10-python.tar.gz'
- meta = {'filename': 'batches.meta', 'key': 'label_names', 'md5': '5ff9c542aee3614f3951f8cda6e48888'}
- test_list = [['test_batch', '40351d587109b95175f43aff81a1287e']]
- tgz_md5 = 'c58f30108f718f92721af3b95e74349a'
- train_list = [['data_batch_1', 'c99cafc152244af753f735de768cd75f'], ['data_batch_2', 'd4bba439e000b95fd0a9bffe97cbabec'], ['data_batch_3', '54ebc095f3ab1f0389bbae665268c751'], ['data_batch_4', '634d18415352ddfa80567beed471001a'], ['data_batch_5', '482c414d41f54cd18b22e5b47cb7c3cb']]
- url = 'https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz'
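A minimal usage sketch (the root path, batch size, and transform below are illustrative, not prescribed by the library); with return_index=True, each item carries its dataset index, which is what index-tracking losses such as APLoss expect:

import torchvision.transforms as T
from torch.utils.data import DataLoader
from libauc.datasets import CIFAR10

# return_index=True makes each item a (image, target, index) tuple so that
# per-sample state can be tracked by index-based losses.
train_set = CIFAR10(root='./data', train=True, download=True,
                    transform=T.ToTensor(), return_index=True)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)

for images, targets, index in train_loader:
    # `index` identifies each sample within the dataset
    break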
- class CIFAR100(root: str, train: bool = True, transform: Callable | None = None, target_transform: Callable | None = None, download: bool = True, return_index: bool = False)[source]
CIFAR100 Dataset.
This is a subclass of the CIFAR10 Dataset.
- base_folder = 'cifar-100-python'
- filename = 'cifar-100-python.tar.gz'
- meta = {'filename': 'meta', 'key': 'fine_label_names', 'md5': '7973b15100ade9c7d40fb424638fde48'}
- test_list = [['test', 'f0ef6b0ae62326f3e7ffdfab6717acfc']]
- tgz_md5 = 'eb9058c3a382ffc7106e4002c42a8d85'
- train_list = [['train', '16019d7e3df5f24257cddd939b257f8d']]
- url = 'https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz'
libauc.datasets.stl10
libauc.datasets.chexpert
- class CheXpert(csv_path, image_root_path='', image_size=320, class_index=0, use_frontal=True, use_upsampling=True, flip_label=False, shuffle=True, seed=123, verbose=False, transforms=None, upsampling_cols=['Cardiomegaly', 'Consolidation'], train_cols=['Cardiomegaly', 'Edema', 'Consolidation', 'Atelectasis', 'Pleural Effusion'], return_index=False, mode='train')[source]
- Reference:
Irvin et al., "CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison", AAAI 2019. https://arxiv.org/abs/1901.07031
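A minimal construction sketch for the training split; the CSV path and image root below are placeholders for wherever a separately downloaded copy of the CheXpert data lives:

from torch.utils.data import DataLoader
from libauc.datasets import CheXpert

# class_index selects one of the five train_cols (0 = Cardiomegaly);
# the paths are placeholders for a local copy of the dataset.
train_set = CheXpert(csv_path='CheXpert-v1.0-small/train.csv',
                     image_root_path='CheXpert-v1.0-small/',
                     class_index=0, image_size=320, mode='train')
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)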
libauc.datasets.folder
- class DatasetFolder(root: str, loader: Callable[[str], Any], extensions: Tuple[str, ...] | None = None, transform: Callable | None = None, target_transform: Callable | None = None, is_valid_file: Callable[[str], bool] | None = None, return_index: bool = False)[source]
A generic data loader.
This default directory structure can be customized by overriding the find_classes() method.
- Parameters:
root (string) – Root directory path.
loader (callable) – A function to load a sample given its path.
extensions (tuple[string]) – A tuple of allowed extensions. extensions and is_valid_file should not both be passed.
transform (callable, optional) – A function/transform that takes in a sample and returns a transformed version, e.g., transforms.RandomCrop for images.
target_transform (callable, optional) – A function/transform that takes in the target and transforms it.
is_valid_file (callable, optional) – A function that takes the path of a file and checks whether it is a valid file (used to filter out corrupt files). extensions and is_valid_file should not both be passed.
return_index (bool, optional) – If True, each item is a tuple containing data, target, and index; otherwise, a tuple containing data and target only (default: False).
- find_classes(directory: str) → Tuple[List[str], Dict[str, int]][source]
Find the class folders in a dataset structured as follows:
directory/
├── class_x
│   ├── xxx.ext
│   ├── xxy.ext
│   └── ...
│       └── xxz.ext
└── class_y
    ├── 123.ext
    ├── nsdf3.ext
    └── ...
        └── asd932_.ext
This method can be overridden to only consider a subset of classes, or to adapt to a different dataset directory structure.
- Parameters:
directory (str) – Root directory path, corresponding to self.root.
- Raises:
FileNotFoundError – If directory has no class folders.
- Returns:
List of all classes and a dictionary mapping each class to an index.
- Return type:
Tuple[List[str], Dict[str, int]]
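As a sketch of the override described above, a subclass can restrict the dataset to a subset of the class folders. The kept class name is hypothetical, and this assumes (as in the torchvision implementation this module adapts) that make_dataset only scans the classes present in class_to_idx:

from typing import Dict, List, Tuple
from libauc.datasets.folder import DatasetFolder

class SubsetFolder(DatasetFolder):
    KEEP = {'class_x'}  # hypothetical subset of class folders to keep

    def find_classes(self, directory: str) -> Tuple[List[str], Dict[str, int]]:
        # Reuse the default scan, then drop every class not in KEEP and
        # re-index the survivors contiguously.
        classes, class_to_idx = super().find_classes(directory)
        kept = [c for c in classes if c in self.KEEP]
        return kept, {c: i for i, c in enumerate(kept)}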
- static make_dataset(directory: str, class_to_idx: Dict[str, int], extensions: Tuple[str, ...] | None = None, is_valid_file: Callable[[str], bool] | None = None) → List[Tuple[str, int]][source]
Generates a list of samples of the form (path_to_sample, class).
This can be overridden to e.g. read files from a compressed zip file instead of from the disk.
- Parameters:
directory (str) – Root dataset directory, corresponding to self.root.
class_to_idx (Dict[str, int]) – Dictionary mapping class name to class index.
extensions (optional) – A tuple of allowed extensions. Either extensions or is_valid_file should be passed. Defaults to None.
is_valid_file (optional) – A function that takes the path of a file and checks whether it is a valid file (used to filter out corrupt files). extensions and is_valid_file should not both be passed. Defaults to None.
- Raises:
ValueError – If class_to_idx is empty.
ValueError – If both extensions and is_valid_file are None, or if both are not None.
FileNotFoundError – If no valid file was found for any class.
- Returns:
Samples of the form (path_to_sample, class).
- Return type:
List[Tuple[str, int]]
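For illustration, the static method can be called directly on the directory tree shown earlier (the path is hypothetical); note that exactly one of extensions and is_valid_file is given, per the rules just listed:

from libauc.datasets.folder import DatasetFolder

samples = DatasetFolder.make_dataset(
    directory='directory/',                     # hypothetical root path
    class_to_idx={'class_x': 0, 'class_y': 1},
    extensions=('.ext',),
)
# samples is e.g. [('directory/class_x/xxx.ext', 0), ...]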
- class ImageFolder(root: str, transform: ~typing.Callable | None = None, target_transform: ~typing.Callable | None = None, loader: ~typing.Callable[[str], ~typing.Any] = <function default_loader>, is_valid_file: ~typing.Callable[[str], bool] | None = None, return_index: bool = False)[source]
A generic data loader where the images are arranged in this way by default:
root/dog/xxx.png
root/dog/xxy.png
root/dog/[...]/xxz.png
root/cat/123.png
root/cat/nsdf3.png
root/cat/[...]/asd932_.png
This class inherits from DatasetFolder, so the same methods can be overridden to customize the dataset.
- Parameters:
root (string) – Root directory path.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g., transforms.RandomCrop.
target_transform (callable, optional) – A function/transform that takes in the target and transforms it.
loader (callable, optional) – A function to load an image given its path.
is_valid_file (callable, optional) – A function that takes the path of an image file and checks whether it is a valid file (used to filter out corrupt files).
return_index (bool, optional) – If True, each item is a tuple containing data, target, and index; otherwise, a tuple containing data and target only (default: False).
- find_classes(directory: str) → Tuple[List[str], Dict[str, int]][source]
Finds the class folders in a dataset.
See DatasetFolder for details.
- has_file_allowed_extension(filename: str, extensions: str | Tuple[str, ...]) → bool[source]
Checks whether a file has an allowed extension.
- is_image_file(filename: str) → bool[source]
Checks whether a file has an allowed image extension.
- Parameters:
filename (string) – Path to a file.
- Returns:
True if the filename ends with a known image extension.
- Return type:
bool
- make_dataset(directory: str, class_to_idx: Dict[str, int] | None = None, extensions: str | Tuple[str, ...] | None = None, is_valid_file: Callable[[str], bool] | None = None) → List[Tuple[str, int]][source]
Generates a list of samples of the form (path_to_sample, class).
See DatasetFolder for details.
Note: the class_to_idx parameter is optional here and falls back to the logic of the find_classes function by default.
libauc.datasets.melanoma
libauc.datasets.movielens
- class MoiveLens(root, phase='train', topk=-1, random_seed=123)[source]
A wrapper of the MovieLens dataset.
- cal_ideal_dcg(x, topk=-1)[source]
Compute the ideal DCG for a given list.
- Parameters:
- Outputs:
Ideal DCG
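For intuition, here is a self-contained sketch of the standard ideal-DCG computation: sort the relevance scores into the best possible ranking and apply a logarithmic position discount. The exact gain/discount convention used inside cal_ideal_dcg may differ:

import numpy as np

def ideal_dcg(x, topk=-1):
    # Sort relevance scores in descending order (the "ideal" ranking),
    # optionally truncate to the top-k positions, then discount each
    # position i (1-based) by log2(i + 1).
    rel = np.sort(np.asarray(x, dtype=float))[::-1]
    if topk > 0:
        rel = rel[:topk]
    return float(np.sum(rel / np.log2(np.arange(2, rel.size + 2))))

print(ideal_dcg([3, 1, 2], topk=2))  # 3/log2(2) + 2/log2(3) ≈ 4.26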
- download_dataset(root_dir, dataset='ml-20m')[source]
A helper function to download the MovieLens dataset.
- Parameters:
- Outputs:
The number of users and items in the dataset, and the dataset in pd.DataFrame format.
- class moivelens_evalset(data_path, n_users, n_items, phase)[source]
The PyTorch dataset class for MovieLens dev/test sets.
- Parameters:
- class moivelens_trainset(data_path, n_users, n_items, topk=-1, chunksize=1000)[source]
The PyTorch dataset class for MovieLens training sets.
- Parameters:
- preprocess_movielens(root_dir, dataset='ml-20m', random_seed=42)[source]
A helper function to preprocess the downloaded dataset and build train/dev/test sets as pandas DataFrames.
- Parameters:
- Outputs:
A dict containing n_users and n_items, as well as the train, dev, and test sets (in pd.DataFrame format).
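A hedged construction sketch of the wrapper above; the data directory is illustrative, and the phase values 'train'/'dev'/'test' are assumptions based on the trainset/evalset wrappers documented above:

from libauc.datasets.movielens import MoiveLens  # class name as spelled in the library

# root is a placeholder for a local MovieLens directory; phase names are
# assumed from the dev/test and training wrappers in this module.
train_set = MoiveLens(root='./ml-20m', phase='train', topk=10)
dev_set = MoiveLens(root='./ml-20m', phase='dev')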
libauc.datasets.webdataset
- class WebDataset(input_shards: str, is_train: bool, batch_size: int, preprocess_img: Callable, seed: int = 0, epoch: int = 0, tokenize: Callable | None = None, return_index: bool = False)[source]
An image-text dataset that is stored in webdataset format. For more information on webdataset format, refer to https://github.com/webdataset/webdataset.
- Parameters:
input_shards (str) – Path to the dataset shards.
is_train (bool) – Whether the dataset is for training or evaluation.
batch_size (int) – Batch size per worker.
preprocess_img (Callable) – Function to preprocess the image.
seed (int) – Seed for shuffling the dataset.
epoch (int) – Start epoch number.
tokenize (Optional[Callable]) – Tokenizer function for the text data.
return_index (bool) – Whether to return the index of the data.
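A hedged construction sketch; the shard pattern and the image preprocessing pipeline below are placeholders, not values prescribed by the library:

import torchvision.transforms as T
from libauc.datasets.webdataset import WebDataset

# Placeholder image preprocessing; any callable that maps a PIL image to a
# tensor would do here.
preprocess = T.Compose([T.Resize(224), T.CenterCrop(224), T.ToTensor()])

dataset = WebDataset(
    input_shards='/data/train-{000000..000999}.tar',  # placeholder shard pattern
    is_train=True,
    batch_size=128,
    preprocess_img=preprocess,
    seed=0,
    return_index=True,
)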
- class detshuffle2(bufsize=1000, initial=100, seed=0, epoch=-1)[source]
Shuffles samples according to the seed and epoch.
- group_by_keys_nothrow(data, keys=<function base_plus_ext>, lcase=True, suffixes=None, handler=None)[source]
Returns a function over an iterator that groups key-value pairs into samples.
- Parameters:
keys – Function that splits the key into a key and an extension (default: base_plus_ext).
lcase – Convert suffixes to lower case (default: True).