libauc.datasets

This module provides several popular dataset wrappers, collected and adapted from public codebases, for testing LibAUC algorithms. Some of them are modified to support LibAUC algorithms that require tracking an index for each individual sample, such as APLoss, pAUCLoss, and GCLoss. We recommend that users cite the original papers when using these wrappers. Here is an overview of this module:

Dataset                                              Reference
---------------------------------------------------  ----------------------
CAT_VS_DOG: Cat_vs_Dog                               elson2007asirra
CIFAR10: Cifar10                                     krizhevsky2009learning
CIFAR100: Cifar100                                   krizhevsky2009learning
STL10: STL10                                         coates2011analysis
CheXpert: CheXpert                                   irvin2019chexpert
MoiveLens: MovieLens20M                              ml_yt_trailers
Melanoma: Melanoma                                   rotemberg2021patient
ImageFolder: A generic data loader                   PyTorch
WebDataset: Efficient dataset for large-scale data   WebDataset

Please refer to the source code for more details about each implementation.

libauc.datasets.cat_vs_dog

CAT_VS_DOG(root='./data/', train=True)[source]
load_data(data_path, label_path)[source]
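
A minimal usage sketch. That CAT_VS_DOG returns an (images, labels) pair of NumPy arrays is an assumption based on the load_data helper above:

from libauc.datasets import CAT_VS_DOG

# Assumption: the helper returns (images, labels) as NumPy arrays.
train_images, train_labels = CAT_VS_DOG(root='./data/', train=True)
test_images, test_labels = CAT_VS_DOG(root='./data/', train=False)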

libauc.datasets.cifar

class CIFAR10(root: str, train: bool = True, transform: Callable | None = None, target_transform: Callable | None = None, download: bool = True, return_index: bool = False)[source]

CIFAR10 Dataset.

Parameters:
  • root (string) – Root directory of dataset where directory cifar-10-batches-py exists or will be saved to if download is set to True.

  • train (bool, optional) – If True, creates dataset from training set, otherwise creates from test set.

  • transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g. transforms.RandomCrop.

  • target_transform (callable, optional) – A function/transform that takes in the target and transforms it.

  • download (bool, optional) – If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.

  • return_index (bool, optional) – If True, each sample is returned as a tuple of (data, target, index); otherwise only (data, target) is returned (default: False). See the usage sketch below.

as_array()[source]
base_folder = 'cifar-10-batches-py'
download() None[source]
extra_repr() str[source]
filename = 'cifar-10-python.tar.gz'
meta = {'filename': 'batches.meta', 'key': 'label_names', 'md5': '5ff9c542aee3614f3951f8cda6e48888'}
test_list = [['test_batch', '40351d587109b95175f43aff81a1287e']]
tgz_md5 = 'c58f30108f718f92721af3b95e74349a'
train_list = [['data_batch_1', 'c99cafc152244af753f735de768cd75f'], ['data_batch_2', 'd4bba439e000b95fd0a9bffe97cbabec'], ['data_batch_3', '54ebc095f3ab1f0389bbae665268c751'], ['data_batch_4', '634d18415352ddfa80567beed471001a'], ['data_batch_5', '482c414d41f54cd18b22e5b47cb7c3cb']]
url = 'https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz'
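
A minimal usage sketch of return_index, assuming the torchvision-style behavior documented above (an untransformed sample is a PIL image, so ToTensor is applied here):

import torch
from torchvision import transforms
from libauc.datasets import CIFAR10

# With return_index=True each sample is a (data, target, index) triple; the
# index is what APLoss, pAUCLoss, and GCLoss use to track individual samples.
train_set = CIFAR10(root='./data/', train=True,
                    transform=transforms.ToTensor(), return_index=True)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
images, targets, indices = next(iter(loader))
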
class CIFAR100(root: str, train: bool = True, transform: Callable | None = None, target_transform: Callable | None = None, download: bool = True, return_index: bool = False)[source]

CIFAR100 Dataset.

This is a subclass of the CIFAR10 Dataset.

base_folder = 'cifar-100-python'
filename = 'cifar-100-python.tar.gz'
meta = {'filename': 'meta', 'key': 'fine_label_names', 'md5': '7973b15100ade9c7d40fb424638fde48'}
test_list = [['test', 'f0ef6b0ae62326f3e7ffdfab6717acfc']]
tgz_md5 = 'eb9058c3a382ffc7106e4002c42a8d85'
train_list = [['train', '16019d7e3df5f24257cddd939b257f8d']]
url = 'https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz'

libauc.datasets.stl10

STL10(root='./data/', split='train')[source]
load_file(data_file, labels_file=None)[source]

libauc.datasets.chexpert

class CheXpert(csv_path, image_root_path='', image_size=320, class_index=0, use_frontal=True, use_upsampling=True, flip_label=False, shuffle=True, seed=123, verbose=False, transforms=None, upsampling_cols=['Cardiomegaly', 'Consolidation'], train_cols=['Cardiomegaly', 'Edema', 'Consolidation', 'Atelectasis', 'Pleural Effusion'], return_index=False, mode='train')[source]
Reference: irvin2019chexpert
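
A minimal construction sketch (the CSV and image-root paths are placeholders for a local CheXpert download):

from libauc.datasets import CheXpert

# class_index selects the target disease column; under the default
# train_cols ordering, index 0 corresponds to Cardiomegaly.
train_set = CheXpert(csv_path='./CheXpert-v1.0-small/train.csv',  # placeholder path
                     image_root_path='./CheXpert-v1.0-small/',
                     class_index=0,
                     mode='train')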

libauc.datasets.folder

class DatasetFolder(root: str, loader: Callable[[str], Any], extensions: Tuple[str, ...] | None = None, transform: Callable | None = None, target_transform: Callable | None = None, is_valid_file: Callable[[str], bool] | None = None, return_index: bool = False)[source]

A generic data loader.

This default directory structure can be customized by overriding the find_classes() method.

Parameters:
  • root (string) – Root directory path.

  • loader (callable) – A function to load a sample given its path.

  • extensions (tuple[string]) – A list of allowed extensions. extensions and is_valid_file should not both be passed.

  • transform (callable, optional) – A function/transform that takes in a sample and returns a transformed version, e.g. transforms.RandomCrop for images.

  • target_transform (callable, optional) – A function/transform that takes in the target and transforms it.

  • is_valid_file (callable, optional) – A function that takes the path of a file and checks whether it is a valid file (used to filter out corrupt files). extensions and is_valid_file should not both be passed.

  • return_index (bool, optional) – If True, each sample is returned as a tuple of (data, target, index); otherwise only (data, target) is returned (default: False).

find_classes(directory: str) Tuple[List[str], Dict[str, int]][source]

Find the class folders in a dataset structured as follows:

directory/
├── class_x
│   ├── xxx.ext
│   ├── xxy.ext
│   └── ...
│       └── xxz.ext
└── class_y
    ├── 123.ext
    ├── nsdf3.ext
    └── ...
    └── asd932_.ext

This method can be overridden to only consider a subset of classes, or to adapt to a different dataset directory structure (see the sketch below).

Parameters:

directory (str) – Root directory path, corresponding to self.root

Raises:

FileNotFoundError – If directory has no class folders.

Returns:

List of all classes and dictionary mapping each class to an index.

Return type:

(Tuple[List[str], Dict[str, int]])
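
A hypothetical sketch of such an override, restricting the dataset to two class folders (the class names are placeholders):

from typing import Dict, List, Tuple
from libauc.datasets.folder import DatasetFolder

class TwoClassFolder(DatasetFolder):
    # Hypothetical subclass: keep only the 'cat' and 'dog' class folders.
    def find_classes(self, directory: str) -> Tuple[List[str], Dict[str, int]]:
        classes, _ = super().find_classes(directory)
        kept = [c for c in classes if c in ('cat', 'dog')]
        return kept, {cls: i for i, cls in enumerate(kept)}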

static make_dataset(directory: str, class_to_idx: Dict[str, int], extensions: Tuple[str, ...] | None = None, is_valid_file: Callable[[str], bool] | None = None) List[Tuple[str, int]][source]

Generates a list of samples of a form (path_to_sample, class).

This can be overridden to e.g. read files from a compressed zip file instead of from the disk.

Parameters:
  • directory (str) – root dataset directory, corresponding to self.root.

  • class_to_idx (Dict[str, int]) – Dictionary mapping class name to class index.

  • extensions (optional) – A list of allowed extensions. Either extensions or is_valid_file should be passed. Defaults to None.

  • is_valid_file (optional) – A function that takes the path of a file and checks whether it is a valid file (used to filter out corrupt files). extensions and is_valid_file should not both be passed. Defaults to None.

Raises:
  • ValueError – In case class_to_idx is empty.

  • ValueError – In case both extensions and is_valid_file are None, or both are not None.

  • FileNotFoundError – In case no valid file was found for any class.

Returns:

samples of a form (path_to_sample, class)

Return type:

List[Tuple[str, int]]

class ImageFolder(root: str, transform: Callable | None = None, target_transform: Callable | None = None, loader: Callable[[str], Any] = <function default_loader>, is_valid_file: Callable[[str], bool] | None = None, return_index: bool = False)[source]

A generic data loader where the images are arranged in this way by default:

root/dog/xxx.png
root/dog/xxy.png
root/dog/[...]/xxz.png

root/cat/123.png
root/cat/nsdf3.png
root/cat/[...]/asd932_.png

This class inherits from DatasetFolder so the same methods can be overridden to customize the dataset.

Parameters:
  • root (string) – Root directory path.

  • transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g. transforms.RandomCrop.

  • target_transform (callable, optional) – A function/transform that takes in the target and transforms it.

  • loader (callable, optional) – A function to load an image given its path.

  • is_valid_file (callable, optional) – A function that takes the path of an image file and checks whether it is a valid file (used to filter out corrupt files).

  • return_index (bool, optional) – If True, each sample is returned as a tuple of (data, target, index); otherwise only (data, target) is returned (default: False).
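
A minimal usage sketch against the layout above (the root path is a placeholder):

from torchvision import transforms
from libauc.datasets import ImageFolder

dataset = ImageFolder(root='./root/',  # placeholder; contains dog/ and cat/ subfolders
                      transform=transforms.ToTensor(),
                      return_index=True)
image, target, index = dataset[0]  # return_index=True adds the sample index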

accimage_loader(path: str) Any[source]
default_loader(path: str) Any[source]
find_classes(directory: str) Tuple[List[str], Dict[str, int]][source]

Finds the class folders in a dataset.

See DatasetFolder for details.

has_file_allowed_extension(filename: str, extensions: str | Tuple[str, ...]) bool[source]

Checks whether a file has an allowed extension.

Parameters:
  • filename (string) – path to a file

  • extensions (tuple of strings) – extensions to consider (lowercase)

Returns:

True if the filename ends with one of given extensions

Return type:

bool

is_image_file(filename: str) bool[source]

Checks whether a file has an allowed image extension.

Parameters:

filename (string) – path to a file

Returns:

True if the filename ends with a known image extension

Return type:

bool

make_dataset(directory: str, class_to_idx: Dict[str, int] | None = None, extensions: str | Tuple[str, ...] | None = None, is_valid_file: Callable[[str], bool] | None = None) List[Tuple[str, int]][source]

Generates a list of samples of a form (path_to_sample, class).

See DatasetFolder for details.

Note: The class_to_idx parameter is optional here; by default the logic of the find_classes function is used.

pil_loader(path: str) Image[source]

libauc.datasets.melanoma

class Melanoma(root, test_size=0.2, is_test=False, transforms=None)[source]
Reference: rotemberg2021patient
get_augmentations_v1(image_size=256, is_test=True)[source]

libauc.datasets.movielens

class MoiveLens(root, phase='train', topk=-1, random_seed=123)[source]

A wrapper for the MovieLens dataset.

collate_batch(feed_dicts)[source]
get_batch(index: int, batchsize: int)[source]
get_num_televant_pairs()[source]
cal_ideal_dcg(x, topk=-1)[source]

Compute the ideal DCG for a given list.

Parameters:
  • x (list) – A list of ratings.

  • topk (int, optional) – If topk=-1, then compute ideal DCG over the full list; otherwise compute over the topk items of the list.

Outputs:

Ideal DCG
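
For reference, a sketch of the computation under the standard exponential-gain DCG definition (an assumption; cal_ideal_dcg's exact gain and discount choices may differ):

import numpy as np

def ideal_dcg_sketch(x, topk=-1):
    # The ideal ranking sorts ratings in descending order; truncate to topk.
    rel = np.sort(np.asarray(x, dtype=np.float64))[::-1]
    if topk > 0:
        rel = rel[:topk]
    # Exponential-gain DCG: sum_i (2^rel_i - 1) / log2(i + 1) for ranks i = 1..n.
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum((2.0 ** rel - 1.0) / discounts))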

df_to_dict(df)[source]

Convert the input pd.DataFrame into a dict.

download_dataset(root_dir, dataset='ml-20m')[source]

A helper function to download the MovieLens dataset.

Parameters:
  • root_dir (str) – Root directory of the downloaded dataset.

  • dataset (str, optional) – The name of the dataset to be downloaded.

Outputs:

The number of users and items in the dataset, and the dataset in pd.DataFrame format.

class moivelens_evalset(data_path, n_users, n_items, phase)[source]

The PyTorch dataset class for the MovieLens dev/test sets.

Parameters:
  • data_path (str) – file name, ‘dev.csv’ or ‘test.csv’

  • n_users (int) – number of users

  • n_items (int) – number of items

  • phase (string) – ‘dev’ or ‘test’

get_batch(index, batchsize)[source]
class moivelens_trainset(data_path, n_users, n_items, topk=-1, chunksize=1000)[source]

The PyTorch dataset class for the MovieLens training set.

Parameters:
  • data_path (str) – file name, default: ‘train.csv’.

  • n_users (int) – number of users

  • n_items (int) – number of items

  • topk (int, optional) – topk value is used to compute the ideal DCG for each user.

collate_batch(feed_dicts)[source]
get_num_televant_pairs()[source]
preprocess_movielens(root_dir, dataset='ml-20m', random_seed=42)[source]

A helper function to preprocess the downloaded dataset and build the train/dev/test sets as pandas DataFrames.

Parameters:
  • root_dir (str) – Root directory of the downloaded dataset.

  • dataset (str, optional) – The name of the dataset to be preprocessed.

Outputs:

a dict that contains: n_users, n_items, as well as train, dev, and test sets (in pd.DataFrame format)
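
A hypothetical end-to-end sketch combining download_dataset and preprocess_movielens; the return order of download_dataset and the dict keys for the splits are assumptions based on the documented outputs:

from libauc.datasets.movielens import download_dataset, preprocess_movielens

# Assumed return order, matching the documented outputs above.
n_users, n_items, ratings_df = download_dataset('./data/', dataset='ml-20m')

data = preprocess_movielens('./data/', dataset='ml-20m', random_seed=42)
n_users, n_items = data['n_users'], data['n_items']
train_df = data['train']  # split key names ('train', 'dev', 'test') are assumptions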

libauc.datasets.webdataset

class SharedEpoch(epoch: int = 0)[source]

Epoch number for distributed training.

get_value()[source]
set_value(epoch)[source]
class WebDataset(input_shards: str, is_train: bool, batch_size: int, preprocess_img: Callable, seed: int = 0, epoch: int = 0, tokenize: Callable | None = None, return_index: bool = False)[source]

An image-text dataset that is stored in webdataset format. For more information on webdataset format, refer to https://github.com/webdataset/webdataset.

Parameters:
  • input_shards (str) – Path to the dataset shards.

  • is_train (bool) – Whether the dataset is for training or evaluation.

  • batch_size (int) – Batch size per worker.

  • preprocess_img (Callable) – Function to preprocess the image.

  • seed (int) – Seed for shuffling the dataset.

  • epoch (int) – Start epoch number.

  • tokenize (Optional[Callable]) – Tokenizer function for the text data.

  • return_index (bool) – Whether to return the index of the data.
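
A minimal construction sketch (the shard pattern and image pipeline are placeholders):

from torchvision import transforms
from libauc.datasets import WebDataset

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
train_data = WebDataset(
    input_shards='./shards/train-{00000..00099}.tar',  # placeholder shard pattern
    is_train=True,
    batch_size=256,
    preprocess_img=preprocess,
    return_index=True,
)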

class detshuffle2(bufsize=1000, initial=100, seed=0, epoch=-1)[source]

Shuffle according to seed and epoch.

run(src)[source]
filter_no_caption_or_no_image(sample)[source]

Check whether a sample has both a caption and an image.

group_by_keys_nothrow(data, keys=<function base_plus_ext>, lcase=True, suffixes=None, handler=None)[source]

Returns a function over an iterator that groups key-value pairs into samples.

Parameters:
  • keys – function that splits the key into key and extension (base_plus_ext)

  • lcase – convert suffixes to lower case (Default value = True)

log_and_continue(exn)[source]

Call in an exception handler to ignore any exception, issue a warning, and continue.

pytorch_worker_seed(increment=0)[source]

Get the dataloader worker seed from PyTorch.

tarfile_to_samples_nothrow(src, handler=<function log_and_continue>)[source]

A re-implementation of webdataset's tarfile-to-samples logic that uses group_by_keys_nothrow, so malformed samples are logged and skipped instead of raising.