libauc.utils

This module contains two submodules:

  • utils offers a list of commonly used utility functions.

  • paper_utils provides various wrapper functions used for reproducing the experiments in the corresponding research papers.

libauc.utils.utils

class ImbalancedDataGenerator(imratio=None, shuffle=True, random_seed=0, verbose=False)[source]
transform(data, targets, imratio=None)[source]
check_array_shape(array, shape)[source]
check_array_type(array)[source]
check_class_labels(labels)[source]
check_imbalance_ratio(labels)[source]
check_tensor_shape(tensor, shape)[source]
get_time()[source]
select_mean(array, threshold=0)[source]
set_all_seeds(SEED)[source]
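
Example

An illustrative sketch (not from the original documentation): the dummy arrays below stand in for real image data and class labels, and the imports assume these utilities are exposed from libauc.utils.

>>> import numpy as np
>>> from libauc.utils import ImbalancedDataGenerator, set_all_seeds, check_imbalance_ratio
>>> set_all_seeds(123)                                   # fix python/numpy/torch random seeds
>>> data = np.random.randn(1000, 32, 32, 3)              # dummy image-like features
>>> targets = np.random.randint(0, 10, size=1000)        # dummy CIFAR10-style class labels
>>> generator = ImbalancedDataGenerator(imratio=0.1, shuffle=True, random_seed=0, verbose=True)
>>> imb_data, imb_targets = generator.transform(data, targets, imratio=0.1)  # binarized labels, ~10% positives
>>> check_imbalance_ratio(imb_targets)                   # report the resulting positive/negative split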

libauc.utils.paper_utils

class CosineLRScheduler(optimizer: Optimizer, t_initial: int, t_mul: float = 1.0, lr_min: float = 0.0, decay_rate: float = 1.0, warmup_t=0, warmup_lr_init=0, warmup_prefix=True, cycle_limit=0, t_in_epochs=True, noise_range_t=None, noise_pct=0.67, noise_std=1.0, noise_seed=42, initialize=True)[source]

Cosine learning-rate decay with restarts, as described in the SGDR paper: https://arxiv.org/abs/1608.03983. Inspired by https://github.com/allenai/allennlp/blob/master/allennlp/training/learning_rate_schedulers/cosine.py

get_cycle_length(cycles=0)[source]
get_epoch_values(epoch: int)[source]
get_update_values(num_updates: int)[source]
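
Example

An illustrative sketch (not from the original documentation): the model and training-loop body are placeholders, and the import assumes the scheduler is exposed from libauc.utils.

>>> import torch
>>> from libauc.utils import CosineLRScheduler
>>> model = torch.nn.Linear(10, 1)                       # placeholder model
>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
>>> scheduler = CosineLRScheduler(optimizer, t_initial=100, lr_min=1e-5, warmup_t=5, warmup_lr_init=1e-4)
>>> for epoch in range(100):
>>>     ...                                              # one epoch of training
>>>     scheduler.step(epoch + 1)                        # compute the learning rate for the next epoch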
MIL_aggregation(bag_X, model, mode='mean', tau=0.1, device=None)[source]

Computes the bag-level prediction by aggregating over all instances in the input bag. Note that MIL_aggregation is not recommended for back-propagation, as it may exceed GPU memory limits.

Parameters:
  • bag_X (array-like, required) – data features for all instances from a bag with shape [number_of_instances, …].

  • model (pytorch model, required) – model that generates predictions (or more generally related outputs) from instance-level.

  • mode (str, required) – the stochastic pooling mode for MIL, default: mean.

  • tau (float, optional) – the temperature parameter for stochastic softmax (smoothed-max) pooling, default: 0.1.

  • device (torch.device, optional) – device for running the code, default: None (uses GPU if available).

Example

>>> model = FFNN_stoc_MIL(num_classes=1, dims=DIMS)
>>> train_data_bags, train_labels, index = data
>>> for i in range(len(train_data_bags)):
>>>   y_pred[i] = MIL_aggregation(bag_X=train_data_bags[i], model=model, mode='att')
Reference:
MIL_evaluate_auc(dataloader, model, mode='max', tau=0.1)[source]

A high-level wrapper for AUC evaluation under the Multiple Instance Learning setting.

Parameters:
  • dataloader (torch.utils.data.dataloader, required) – dataloader for loading data.

  • model (pytorch model, required) – model that generates predictions (or more generally related outputs) from instance-level.

  • mode (str, required) – the stochastic pooling mode for MIL, default: max.

  • tau (float, optional) – the temperature parameter for stochastic softmax (smoothed-max) pooling, default: 0.1.

Example

>>> traindSet = TabularDataset(data, label)
>>> trainloader =  torch.utils.data.DataLoader(dataset=traindSet, batch_size=BATCH_SIZE, collate_fn=collate_fn)
>>> model = FFNN_stoc_MIL(num_classes=1, dims=DIMS)
>>> tr_auc = MIL_evaluate_auc(trainloader, model, mode='att')
Reference:
MIL_sampling(bag_X, model, instance_batch_size=4, mode='mean', tau=0.1, device=None)[source]

Multiple-instance sampling for the stochastic pooling operations. It uniformly samples instances at random from each bag and applies the pooling calculation corresponding to the chosen pooling method.

Parameters:
  • bag_X (array-like, required) – data features for all instances from a bag with shape [number_of_instances, …].

  • model (pytorch model, required) – model that generates predictions (or more generally related outputs) from instance-level.

  • instance_batch_size (int, required) – the maximal instance batch size for each bag, default: 4.

  • mode (str, required) – the stochastic pooling mode for MIL, default: mean.

  • tau (float, optional) – the temperature parameter for stochastic softmax (smoothed-max) pooling, default: 0.1.

  • device (torch.device, optional) – device for running the code, default: None (uses GPU if available).

Example

>>> model = FFNN_stoc_MIL(num_classes=1, dims=DIMS)
>>> train_data_bags, train_labels, index = data
>>> for i in range(len(train_data_bags)):
>>>   y_pred[i] = MIL_sampling(bag_X=train_data_bags[i], model=model, instance_batch_size=instance_batch_size, mode='att')
Reference:
class Scheduler(optimizer: Optimizer, param_group_field: str, noise_range_t=None, noise_type='normal', noise_pct=0.67, noise_std=1.0, noise_seed=None, initialize: bool = True)[source]

Parameter Scheduler Base Class.

A scheduler base class that can be used to schedule any optimizer parameter groups.

Unlike the built-in PyTorch schedulers, this is intended to be called consistently:

  • At the END of each epoch, before incrementing the epoch count, to calculate next epoch’s value

  • At the END of each optimizer update, after incrementing the update count, to calculate next update’s value

The schedulers built on this should try to remain as stateless as possible (for simplicity).

This family of schedulers attempts to avoid the confusion around the meaning of 'last_epoch' and of -1 values for special behaviour. All epoch and update counts must be tracked in the training code and explicitly passed to the schedulers on the corresponding step or step_update call.

Reference:
get_epoch_values(epoch: int)[source]
get_update_values(num_updates: int)[source]
load_state_dict(state_dict: Dict[str, Any]) → None[source]
state_dict() → Dict[str, Any][source]
step(epoch: int, metric: float = None) → None[source]
step_update(num_updates: int, metric: float = None)[source]
update_groups(values)[source]
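
Example

A hypothetical minimal subclass, for illustration only; it assumes the base class stores each parameter group's initial value in base_values (as in the timm scheduler this base class appears to follow), and it reuses the optimizer and epoch counter from the surrounding training code.

>>> class LinearDecayScheduler(Scheduler):
>>>     # linearly decay 'lr' to zero over total_epochs, driven by the externally tracked epoch count
>>>     def __init__(self, optimizer, total_epochs):
>>>         super().__init__(optimizer, param_group_field='lr')
>>>         self.total_epochs = total_epochs
>>>     def get_epoch_values(self, epoch):
>>>         frac = max(0.0, 1.0 - epoch / self.total_epochs)
>>>         return [v * frac for v in self.base_values]
>>> scheduler = LinearDecayScheduler(optimizer, total_epochs=100)
>>> scheduler.step(epoch + 1)                            # call at the end of each epoch, as described above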
adjust_lr(learning_rate, lr_schedule, optimizer, epoch)[source]
batch_to_gpu(batch, device='cuda')[source]
collate_fn(list_items)[source]

A basic collate function that takes a list of (x, y, index) tuples and collates the x, y, and index entries separately.

Parameters:

list_items (list, required) – list of tuples (x, y, index)

Example

>>> traindSet = TabularDataset(data, label)
>>> trainloader =  torch.utils.data.DataLoader(dataset=traindSet, batch_size=BATCH_SIZE, collate_fn=collate_fn)
evaluate(model, data_set, topks, metrics)[source]

The returned prediction is a 2D array; each row corresponds to all the candidates for one instance, with the ground-truth item placed first. Example: ground-truth items [1, 2], two negative items per instance [[3, 4], [5, 6]], giving predictions like [[1, 3, 4], [2, 5, 6]].

evaluate_method(predictions, ratings, topk, metrics)[source]
Parameters:
  • predictions – (-1, n_candidates) shape, the first column is the score for the ground-truth item

  • ratings – (# of users, # of pos items)

  • topk – top-K value list

  • metrics – metric string list

Returns:

a result dict whose keys have the form metric@topk
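
Example

A hand-worked illustration in plain numpy of the prediction layout described above (this is not the library's implementation; HR@K and NDCG@K are used here only as representative metrics).

>>> import numpy as np
>>> scores = np.array([[0.9, 0.3, 0.4],                  # ground-truth item scored highest -> rank 1
>>>                    [0.2, 0.5, 0.6]])                 # ground-truth item scored lowest  -> rank 3
>>> gt_rank = (scores >= scores[:, :1]).sum(axis=1)      # 1-based rank of the first (ground-truth) column
>>> hr_at_2 = (gt_rank <= 2).mean()                      # HR@2   = 0.5
>>> ndcg_at_2 = np.where(gt_rank <= 2, 1.0 / np.log2(gt_rank + 1), 0.0).mean()  # NDCG@2 = 0.5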

format_metric(result_dict)[source]