libauc.sampler
The sampler module is summarized as follows:
- ControlledDataSampler: base class for controlled data samplers.
- DualSampler: customizes the number of positives and negatives in mini-batch data for binary classification tasks.
- TriSampler: customizes the number of positives and negatives in mini-batch data for multi-label classification or ranking tasks.
- class ControlledDataSampler(dataset, batch_size, labels=None, shuffle=True, num_pos=None, num_sampled_tasks=None, sampling_rate=0.5, random_seed=2023)[source]
Base class for Controlled Data Sampler.
- class DualSampler(dataset, batch_size, labels=None, shuffle=True, num_pos=None, num_sampled_tasks=None, sampling_rate=0.5, random_seed=2023)[source]
DualSampler customizes the number of positives and negatives in mini-batch data for binary classification tasks. For more details, please refer to the LibAUC paper [1].
- Parameters:
dataset (torch.utils.data.Dataset) – PyTorch dataset object for training or evaluation.
batch_size (int) – number of samples per mini-batch.
sampling_rate (float) – ratio of positive samples to the total number of samples per task in a mini-batch (default: 0.5).
num_pos (int, optional) – number of positive samples in a mini-batch (default: None).
labels (list or array, optional) – labels for the dataset (default: None).
shuffle (bool) – whether to shuffle the data before sampling mini-batch data (default: True).
num_sampled_tasks (int) – number of tasks sampled from the original dataset. If None is given, all labels (tasks) are used for training (default: None).
random_seed (int) – random seed for reproducibility (default: 2023).
Example
>>> sampler = libauc.sampler.DualSampler(trainSet, batch_size=32, sampling_rate=0.5)
>>> trainloader = torch.utils.data.DataLoader(trainSet, batch_size=32, sampler=sampler, shuffle=False)
>>> data, targets, index = next(iter(trainloader))
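As a quick sanity check (a sketch, assuming trainSet yields binary 0/1 targets as in the example above), the number of positives in each sampled batch should equal int(sampling_rate * batch_size):

>>> # sampling_rate=0.5 with batch_size=32 gives 16 positives per batch
>>> int((targets == 1).sum())
16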
Note
Practical Tips:
In DualSampler, num_pos is equivalent to int(sampling_rate * batch_size). Use num_pos if you want to fix the exact number of positive samples per mini-batch; otherwise, sampling_rate is the required parameter by default.
For sampling_rate, we recommend a value slightly higher than the proportion of positive samples in your training dataset, as in the sketch below. For instance, if the positive ratio in your dataset is 0.01, consider setting sampling_rate to 0.05, 0.1, or 0.2.
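A minimal sketch of both options (assuming trainSet is your training dataset and train_labels is a hypothetical 0/1 array of its labels):

import numpy as np
from libauc.sampler import DualSampler

# train_labels is a hypothetical 0/1 label array for trainSet.
pos_ratio = float(np.mean(train_labels))  # proportion of positives, e.g. ~0.01

# Option 1: set sampling_rate slightly above the dataset's positive ratio.
sampler = DualSampler(trainSet, batch_size=32, sampling_rate=0.1)

# Option 2: equivalently, fix the exact number of positives per mini-batch;
# int(0.1 * 32) = 3 positives in every batch of 32.
sampler = DualSampler(trainSet, batch_size=32, num_pos=int(0.1 * 32))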
- Reference:
[1] Zhuoning Yuan, Dixian Zhu, Zi-Hao Qiu, Gang Li, Xuanhui Wang, Tianbao Yang. LibAUC: A Deep Learning Library for X-Risk Optimization. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2023.
- class TriSampler(dataset, batch_size_per_task, num_sampled_tasks=None, sampling_rate=0.5, mode='classification', labels=None, shuffle=True, num_pos=None, random_seed=2023)[source]
TriSampler customizes the number of positives and negatives in mini-batch data for multi-label classification or ranking tasks. For more details, please refer to the LibAUC paper [1].
- Parameters:
dataset (torch.utils.data.Dataset) – PyTorch dataset object for training or evaluation.
batch_size_per_task (int) – number of samples per mini-batch for each task.
num_sampled_tasks (int) – number of tasks sampled from the original dataset. If None is given, all labels (tasks) are used for training (default: None).
sampling_rate (float) – ratio of positive samples to the total number of samples per task in a mini-batch (default: 0.5).
num_pos (int, optional) – number of positive samples in a mini-batch (default: None).
mode (str, optional) – sampling mode for classification or ranking tasks (default: 'classification').
labels (list or array, optional) – labels for the dataset (default: None).
shuffle (bool) – whether to shuffle the data before sampling mini-batch data (default: True).
random_seed (int) – random seed for reproducibility (default: 2023).
Example
>>> sampler = libauc.sampler.TriSampler(trainSet, batch_size_per_task=32, num_sampled_tasks=10, sampling_rate=0.5)
>>> trainloader = torch.utils.data.DataLoader(trainSet, batch_size=320, sampler=sampler, shuffle=False)
>>> data, targets, index = next(iter(trainloader))
>>> data_id, task_id = index
Note
TriSampler returns an index tuple of (sample_id, task_id), which requires a small change in your dataset's __getitem__ for training. See the example below:

import torch

class SampleDataset(torch.utils.data.Dataset):
    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, index):
        # TriSampler passes a composite index: (sample_id, task_id).
        index, task_id = index
        data = self.inputs[index]
        target = self.targets[index]
        return data, target, (index, task_id)
Note
Practical Tips:
In classification mode, batch_size_per_task * num_sampled_tasks is the total batch_size. If num_sampled_tasks is not specified, all labels will be used.
In ranking mode, batch_size_per_task is the number of items sampled per query (user), num_pos is the number of positive items per user, and num_sampled_tasks is the number of users sampled from the dataset per mini-batch. For example, batch_size_per_task=310, num_pos=10, num_sampled_tasks=256 means that each mini-batch contains 256 users, each with 10 positive and 300 negative items; see the sketch below.
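A minimal sketch of this ranking configuration (assuming trainSet is a ranking-style dataset whose __getitem__ unpacks the (sample_id, task_id) index as shown in the note above; the mode value 'ranking' is an assumption based on the mode parameter's description):

import torch
from libauc.sampler import TriSampler

# 'ranking' is assumed to be the non-default mode string
# ('classification' is the documented default).
sampler = TriSampler(trainSet, batch_size_per_task=310, num_pos=10,
                     num_sampled_tasks=256, mode='ranking')

# Total mini-batch size is batch_size_per_task * num_sampled_tasks.
trainloader = torch.utils.data.DataLoader(
    trainSet, batch_size=310 * 256, sampler=sampler, shuffle=False)
data, targets, index = next(iter(trainloader))
data_id, task_id = index  # composite index convention from the note above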