Dual Sampler controls the number of positive and negative samples in each mini-batch for binary classification tasks.
For more details, please refer to the LibAUC paper [1]_.
batch_size_per_gpu (int) – number of samples per mini-batch for each GPU.
sampling_rate (float) – ratio of the number of positive samples to the total number of samples per task in a mini-batch (default: 0.5).
num_pos (int, optional) – number of positive samples in a mini-batch (default: None).
labels (list or array, optional) – list or array of labels for the dataset (default: None).
shuffle (bool) – whether to shuffle the data before sampling mini-batches (default: True).
num_sampled_tasks (int) – number of tasks sampled from the original dataset. If None, all labels (tasks) are used for training (default: None).
random_seed (int) – random seed for reproducibility (default: 2023).
In DualSampler, num_pos is equivalent to int(sampling_rate*batch_size). Set num_pos if you want to fix the exact number of positive samples per mini-batch; otherwise, sampling_rate is required and determines that number.
For sampling_rate, we recommend a value slightly higher than the proportion of positive samples in your training dataset. For instance, if the proportion of positive samples in your dataset is 0.01, you might consider setting sampling_rate to 0.05, 0.1, or 0.2.
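The relationship between sampling_rate and num_pos described above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the helper name `positives_per_batch` is hypothetical and not part of LibAUC; it simply mirrors the int(sampling_rate*batch_size) rule stated in the note.

```python
def positives_per_batch(batch_size, sampling_rate=0.5, num_pos=None):
    """Hypothetical helper (not a LibAUC function): return (num_pos, num_neg)."""
    if num_pos is None:
        # Rule from the note above: num_pos = int(sampling_rate * batch_size)
        num_pos = int(sampling_rate * batch_size)
    return num_pos, batch_size - num_pos


print(positives_per_batch(64, sampling_rate=0.1))  # (6, 58)
print(positives_per_batch(64, num_pos=8))          # (8, 56)
```

Because of the int() truncation, a sampling_rate of 0.1 with 64 samples gives 6 positives rather than 6.4, which is one reason to pass num_pos directly when an exact count matters.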
TriSampler controls the number of positive and negative samples in each mini-batch for multi-label classification or ranking tasks. For more details, please refer to the LibAUC paper [1]_.
batch_size_per_task (int) – number of samples per mini-batch for each task.
num_sampled_tasks_per_gpu (int) – number of tasks sampled from the original dataset for each GPU. If None, all labels (tasks) are used for training (default: None).
sampling_rate (float) – ratio of the number of positive samples to the total number of samples per task in a mini-batch (default: 0.5).
num_pos (int, optional) – number of positive samples in a mini-batch (default: None).
mode (str, optional) – sampling mode, for classification or ranking tasks (default: 'classification').
labels (list or array, optional) – list or array of labels for the dataset (default: None).
shuffle (bool) – whether to shuffle the data before sampling mini-batches (default: True).
random_seed (int) – random seed for reproducibility (default: 2023).
TriSampler returns index tuples of the form (sample_id, task_id), which requires a slight change to your dataloader for training. See the example below:
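A minimal sketch of the kind of change involved, assuming a map-style dataset whose `__getitem__` receives the (sample_id, task_id) tuple instead of a plain integer. The class name `SampleDataset`, its fields, and the returned triple are illustrative assumptions; in practice the class would typically subclass `torch.utils.data.Dataset`, which is omitted here to keep the sketch dependency-free.

```python
class SampleDataset:
    """Toy map-style dataset (hypothetical, for illustration only)."""

    def __init__(self, inputs, targets):
        self.inputs = inputs      # one feature entry per sample
        self.targets = targets    # one row of per-task labels per sample

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, index):
        # The sampler is described as yielding (sample_id, task_id) tuples,
        # so unpack the tuple rather than treating index as an integer.
        sample_id, task_id = index
        data = self.inputs[sample_id]
        target = self.targets[sample_id]
        # Pass the index tuple through so the training loop can see the task.
        return data, target, (sample_id, task_id)


dataset = SampleDataset(inputs=[[0.1], [0.2], [0.3]],
                        targets=[[1, 0], [0, 1], [1, 1]])
data, target, (sample_id, task_id) = dataset[(2, 0)]
```

Returning the index tuple alongside the data is a design choice of this sketch: it lets the training step know which task each sample was drawn for.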
In classification mode, batch_size_per_task*num_sampled_tasks is the total batch_size. If num_sampled_tasks is not specified, all labels will be used.
In ranking mode, batch_size_per_task is the number of items sampled per user (query), num_pos is the number of positive items per user, and num_sampled_tasks is the number of users sampled from the dataset for each mini-batch. For example, batch_size_per_task=310, num_pos=10, num_sampled_tasks=256 means that each mini-batch contains 256 sampled users, each with 10 positive items and 300 negative items.
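The ranking-mode example above can be verified with simple arithmetic. The variable names `num_neg` and `total_items` are introduced here for illustration, and the total assumes every sampled user contributes exactly batch_size_per_task items.

```python
batch_size_per_task = 310   # items drawn for each sampled user
num_pos = 10                # positive items per user
num_sampled_tasks = 256     # users sampled per mini-batch

# Negatives per user are whatever remains after the positives.
num_neg = batch_size_per_task - num_pos
# Assumed total number of (user, item) pairs in one mini-batch.
total_items = batch_size_per_task * num_sampled_tasks

print(num_neg, total_items)  # 300 79360
```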
Dual Sampler controls the number of positive and negative samples in each mini-batch for binary classification tasks.
For more details, please refer to the LibAUC paper [1]_.
batch_size (int) – number of samples per mini-batch.
sampling_rate (float) – ratio of the number of positive samples to the total number of samples per task in a mini-batch (default: 0.5).
num_pos (int, optional) – number of positive samples in a mini-batch (default: None).
labels (list or array, optional) – list or array of labels for the dataset (default: None).
shuffle (bool) – whether to shuffle the data before sampling mini-batches (default: True).
num_sampled_tasks (int) – number of tasks sampled from the original dataset. If None, all labels (tasks) are used for training (default: None).
random_seed (int) – random seed for reproducibility (default: 2023).
In DualSampler, num_pos is equivalent to int(sampling_rate*batch_size). Set num_pos if you want to fix the exact number of positive samples per mini-batch; otherwise, sampling_rate is required and determines that number.
For sampling_rate, we recommend a value slightly higher than the proportion of positive samples in your training dataset. For instance, if the proportion of positive samples in your dataset is 0.01, you might consider setting sampling_rate to 0.05, 0.1, or 0.2.
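To make the effect of sampling_rate on an imbalanced dataset concrete, here is a minimal sketch of drawing one mini-batch with a fixed number of positives. It is not LibAUC's implementation: the function `sample_dual_batch` is hypothetical, and sampling with replacement is an assumption made here so that rare positives can always fill their quota.

```python
import random


def sample_dual_batch(labels, batch_size, sampling_rate=0.5, seed=2023):
    """Hypothetical sketch: indices for one mini-batch with fixed positives."""
    rng = random.Random(seed)
    pos_ids = [i for i, y in enumerate(labels) if y == 1]
    neg_ids = [i for i, y in enumerate(labels) if y == 0]
    num_pos = int(sampling_rate * batch_size)
    num_neg = batch_size - num_pos
    # Assumption: sample with replacement so scarce positives can be reused.
    batch = rng.choices(pos_ids, k=num_pos) + rng.choices(neg_ids, k=num_neg)
    rng.shuffle(batch)
    return batch


labels = [1] * 2 + [0] * 98           # 2% positives in the dataset
batch = sample_dual_batch(labels, batch_size=20, sampling_rate=0.1)
print(sum(labels[i] for i in batch))  # 2
```

With uniform sampling, a batch of 20 from this dataset would often contain no positives at all; fixing the count guarantees each batch contributes positive-negative pairs.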
TriSampler controls the number of positive and negative samples in each mini-batch for multi-label classification or ranking tasks. For more details, please refer to the LibAUC paper [1]_.
batch_size_per_task (int) – number of samples per mini-batch for each task.
num_sampled_tasks (int) – number of tasks sampled from the original dataset. If None, all labels (tasks) are used for training (default: None).
sampling_rate (float) – ratio of the number of positive samples to the total number of samples per task in a mini-batch (default: 0.5).
num_pos (int, optional) – number of positive samples in a mini-batch (default: None).
mode (str, optional) – sampling mode, for classification or ranking tasks (default: 'classification').
labels (list or array, optional) – list or array of labels for the dataset (default: None).
shuffle (bool) – whether to shuffle the data before sampling mini-batches (default: True).
random_seed (int) – random seed for reproducibility (default: 2023).
TriSampler returns index tuples of the form (sample_id, task_id), which requires a slight change to your dataloader for training. See the example below:
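As one illustration of working with these index tuples, the sketch below picks out the label that belongs to each tuple's task from a multi-label target matrix. The helper `labels_for_batch` and the toy data are hypothetical, and whether your training loop needs this step depends on how your loss consumes the task indices.

```python
# Toy multi-label targets: rows are samples, columns are tasks.
targets = [[1, 0, 0],
           [0, 1, 0],
           [1, 1, 0],
           [0, 0, 1]]

# Made-up index tuples in the (sample_id, task_id) form described above.
batch_indices = [(0, 0), (1, 0), (2, 1), (3, 1)]


def labels_for_batch(targets, batch_indices):
    """Hypothetical helper: for each tuple, return the label of its task."""
    return [targets[sample_id][task_id] for sample_id, task_id in batch_indices]


print(labels_for_batch(targets, batch_indices))  # [1, 0, 1, 0]
```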
In classification mode, batch_size_per_task*num_sampled_tasks is the total batch_size. If num_sampled_tasks is not specified, all labels will be used.
In ranking mode, batch_size_per_task is the number of items sampled per user (query), num_pos is the number of positive items per user, and num_sampled_tasks is the number of users sampled from the dataset for each mini-batch. For example, batch_size_per_task=310, num_pos=10, num_sampled_tasks=256 means that each mini-batch contains 256 sampled users, each with 10 positive items and 300 negative items.
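For classification mode, the total batch size rule stated above can be written as a small helper. The function `total_batch_size` and its `num_labels` argument are hypothetical names used only for this sketch; they encode the stated fallback that all labels are used when num_sampled_tasks is not specified.

```python
def total_batch_size(batch_size_per_task, num_labels, num_sampled_tasks=None):
    """Hypothetical helper: total mini-batch size in classification mode."""
    # If num_sampled_tasks is not given, every label (task) is used.
    tasks = num_labels if num_sampled_tasks is None else num_sampled_tasks
    return batch_size_per_task * tasks


print(total_batch_size(32, num_labels=14, num_sampled_tasks=5))  # 160
print(total_batch_size(32, num_labels=14))                       # 448
```

Since the total grows linearly with the number of tasks, sampling a subset of tasks is one way to keep memory use bounded on datasets with many labels.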