libauc.sampler
The sampler module is summarized as follows:
- ControlledDataSampler: base class for controlled data samplers.
- DualSampler: customizes the number of positives and negatives in mini-batch data for binary classification tasks.
- TriSampler: customizes the number of positives and negatives in mini-batch data for multi-label classification or ranking tasks.
- class ControlledDataSampler(dataset, batch_size, labels=None, shuffle=True, num_pos=None, num_sampled_tasks=None, sampling_rate=0.5, random_seed=2023)[source]
Base class for Controlled Data Sampler.
- class DualSampler(dataset, batch_size, labels=None, shuffle=True, num_pos=None, num_sampled_tasks=None, sampling_rate=0.5, random_seed=2023)[source]
DualSampler customizes the number of positives and negatives in mini-batch data for binary classification tasks. For more details, please refer to the LibAUC paper [1].
- Parameters:
dataset (torch.utils.data.Dataset) – PyTorch dataset object for training or evaluation.
batch_size (int) – number of samples per mini-batch.
sampling_rate (float) – ratio of positive samples to the total number of samples per task in a mini-batch (default: 0.5).
num_pos (int, optional) – number of positive samples in a mini-batch (default: None).
labels (list or array, optional) – labels for the dataset (default: None).
shuffle (bool) – whether to shuffle the data before sampling mini-batch data (default: True).
num_sampled_tasks (int) – number of tasks sampled from the original dataset. If None is given, all labels (tasks) are used for training (default: None).
random_seed (int) – random seed for reproducibility (default: 2023).
Example
>>> sampler = libauc.sampler.DualSampler(trainSet, batch_size=32, sampling_rate=0.5)
>>> trainloader = torch.utils.data.DataLoader(trainSet, batch_size=32, sampler=sampler, shuffle=False)
>>> data, targets, index = next(iter(trainloader))
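As a quick sanity check (a sketch, assuming trainSet yields binary 0/1 targets as in the example above), the number of positives in each sampled batch should equal int(sampling_rate * batch_size):

>>> # sampling_rate=0.5 with batch_size=32 gives 16 positives per batch
>>> int((targets == 1).sum())
16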
Note
Practical Tips:
In DualSampler, num_pos is equivalent to int(sampling_rate * batch_size). Use num_pos if you want to fix the exact number of positive samples per mini-batch; otherwise, sampling_rate is the required parameter by default.
For sampling_rate, we recommend a value slightly higher than the proportion of positive samples in your training dataset, as in the sketch below. For instance, if the positive ratio in your dataset is 0.01, consider setting sampling_rate to 0.05, 0.1, or 0.2.
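A minimal sketch of both options (assuming trainSet is your training dataset and train_labels is a hypothetical 0/1 array of its labels):

import numpy as np
from libauc.sampler import DualSampler

# train_labels is a hypothetical 0/1 label array for trainSet.
pos_ratio = float(np.mean(train_labels))  # proportion of positives, e.g. ~0.01

# Option 1: set sampling_rate slightly above the dataset's positive ratio.
sampler = DualSampler(trainSet, batch_size=32, sampling_rate=0.1)

# Option 2: equivalently, fix the exact number of positives per mini-batch;
# int(0.1 * 32) = 3 positives in every batch of 32.
sampler = DualSampler(trainSet, batch_size=32, num_pos=int(0.1 * 32))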
- Reference:
[1] Zhuoning Yuan, Dixian Zhu, Zi-Hao Qiu, Gang Li, Xuanhui Wang, Tianbao Yang. LibAUC: A Deep Learning Library for X-Risk Optimization. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2023.
- class TriSampler(dataset, batch_size_per_task, num_sampled_tasks=None, sampling_rate=0.5, mode='classification', labels=None, shuffle=True, num_pos=None, random_seed=2023)[source]
TriSampler customizes the number of positives and negatives in mini-batch data for multi-label classification or ranking tasks. For more details, please refer to the LibAUC paper [1].
- Parameters:
dataset (torch.utils.data.Dataset) – PyTorch dataset object for training or evaluation.
batch_size_per_task (int) – number of samples per mini-batch for each task.
num_sampled_tasks (int) – number of tasks sampled from the original dataset. If None is given, all labels (tasks) are used for training (default: None).
sampling_rate (float) – ratio of positive samples to the total number of samples per task in a mini-batch (default: 0.5).
num_pos (int, optional) – number of positive samples in a mini-batch (default: None).
mode (str, optional) – sampling mode for classification or ranking tasks (default: 'classification').
labels (list or array, optional) – labels for the dataset (default: None).
shuffle (bool) – whether to shuffle the data before sampling mini-batch data (default: True).
random_seed (int) – random seed for reproducibility (default: 2023).
Example
>>> sampler = libauc.sampler.TriSampler(trainSet, batch_size_per_task=32, num_sampled_tasks=10, sampling_rate=0.5)
>>> trainloader = torch.utils.data.DataLoader(trainSet, batch_size=320, sampler=sampler, shuffle=False)
>>> data, targets, index = next(iter(trainloader))
>>> data_id, task_id = index
Note
TriSampler returns an index tuple of (sample_id, task_id), which requires a small change in your dataset's __getitem__ for training. See the example below:

import torch

class SampleDataset(torch.utils.data.Dataset):
    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, index):
        # TriSampler passes a composite index: (sample_id, task_id).
        index, task_id = index
        data = self.inputs[index]
        target = self.targets[index]
        return data, target, (index, task_id)
Note
Practical Tips:
In classification mode, batch_size_per_task * num_sampled_tasks is the total batch_size. If num_sampled_tasks is not specified, all labels will be used.
In ranking mode, batch_size_per_task is the number of items sampled per query (user), num_pos is the number of positive items per user, and num_sampled_tasks is the number of users sampled from the dataset per mini-batch. For example, batch_size_per_task=310, num_pos=10, num_sampled_tasks=256 means that each mini-batch contains 256 users, each with 10 positive and 300 negative items; see the sketch below.
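A minimal sketch of this ranking configuration (assuming trainSet is a ranking-style dataset whose __getitem__ unpacks the (sample_id, task_id) index as shown in the note above; the mode value 'ranking' is an assumption based on the mode parameter's description):

import torch
from libauc.sampler import TriSampler

# 'ranking' is assumed to be the non-default mode string
# ('classification' is the documented default).
sampler = TriSampler(trainSet, batch_size_per_task=310, num_pos=10,
                     num_sampled_tasks=256, mode='ranking')

# Total mini-batch size is batch_size_per_task * num_sampled_tasks.
trainloader = torch.utils.data.DataLoader(
    trainSet, batch_size=310 * 256, sampler=sampler, shuffle=False)
data, targets, index = next(iter(trainloader))
data_id, task_id = index  # composite index convention from the note above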