AUC-Margin loss with squared-hinge surrogate loss for optimizing AUROC. The objective function is defined as:
\[\begin{split}\min _{\substack{\mathbf{w} \in \mathbb{R}^d \\(a, b) \in \mathbb{R}^2}} \max _{\alpha \in \mathbb{R}^+} f(\mathbf{w}, a, b, \alpha):=\mathbb{E}_{\mathbf{z}}[F(\mathbf{w}, a, b, \alpha ; \mathbf{z})]\end{split}\]
where
\[\begin{split}F(\mathbf{w},a,b,\alpha; \mathbf{z}) &=(1-p)(h_{\mathbf{w}}(x)-a)^2\mathbb{I}_{[y=1]} +p(h_{\mathbf{w}}(x)-b)^2\mathbb{I}_{[y=-1]} \\
&+2\alpha(p(1-p)m+ p h_{\mathbf{w}}(x)\mathbb{I}_{[y=-1]}-(1-p)h_{\mathbf{w}}(x)\mathbb{I}_{[y=1]})\\
&-p(1-p)\alpha^2\end{split}\]
\(h_{\mathbf{w}}\) is the prediction scoring function, e.g., a deep neural network, \(p\) is the ratio of positive samples to all samples, \(a\), \(b\) are the running statistics of
the positive and negative predictions, \(\alpha\) is an auxiliary variable derived from the problem formulation, and \(m\) is the margin term. We denote this version of AUCMLoss as v1.
To remove the class prior \(p\) from the above formulation, the objective can be rewritten using expectations conditioned on the positive and negative classes, so that \(p\) no longer appears explicitly.
We denote this version of AUCMLoss as v2. The optimization algorithm for solving the above objectives is implemented as PESG. For the derivations, please refer to the original paper [1]_.
Parameters:
margin (float) – margin for squared-hinge surrogate loss (default: 1.0).
imratio (float, optional) – the ratio of the number of positive samples to the number of total samples in the training dataset, i.e., \(p\) in the above formulation.
If this value is not given, it will be automatically calculated with mini-batch samples.
This value is ignored when version is set to 'v2'.
version (str, optional) – whether to include prior \(p\) in the objective function (default: 'v1').
It is recommended to use v2 of AUCMLoss by setting version='v2' to get better performance. The v2 version requires the use of DualSampler.
epoch_decay is a regularization parameter similar to weight_decay and can be tuned in the same range.
For complex tasks, it is recommended to pretrain the model with a standard loss (e.g., cross-entropy) and then switch to AUCMLoss for finetuning with a smaller learning rate.
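For illustration, a minimal usage sketch (the sigmoid scoring, tensor shapes, and call pattern are assumptions rather than the library's official example; pair the loss with the PESG optimizer as noted above):
>>> loss_fn = AUCMLoss(margin=1.0)                                    # v1; use version='v2' together with DualSampler for v2
>>> preds = torch.sigmoid(torch.randn(32, 1, requires_grad=True))     # scores in (0, 1) standing in for model outputs
>>> targets = torch.empty(32, dtype=torch.long).random_(2)            # binary labels
>>> loss = loss_fn(preds, targets)
>>> loss.backward()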
Compositional AUC loss that optimizes \(L_{\mathrm{AUC}}(\mathbf{w}-\alpha\nabla L_{\mathrm{CE}}(\mathbf{w}))\), where \(L_{\mathrm{AUC}}\) refers to AUCMLoss, \(L_{\mathrm{CE}}\) refers to CrossEntropyLoss, and \(\alpha\) refers to the step size for the inner updates.
The optimization algorithm for solving this objective is implemented as PDSCA. For the derivations, please refer to the original paper [2]_.
Parameters:
margin (float) – margin for squared-hinge surrogate loss (default: 1.0).
imratio (float, optional) – the ratio of the number of positive samples to the number of total samples in the training dataset, i.e., \(p\) in the above formulation.
If this value is not given, it will be automatically calculated with mini-batch samples.
This value is ignored when version is set to 'v2'.
k (int, optional) – number of steps for inner updates. For example, when k is set to 2, the optimizer will alternately execute two steps optimizing CrossEntropyLoss followed by a single step optimizing AUCMLoss during training (default: 1).
version (str, optional) – whether to include prior \(p\) in the objective function (default: 'v1').
As CompositionalAUCLoss is built on AUCMLoss, it likewise comes in two versions.
It is recommended to use the v2 version by setting version='v2' to get better performance.
Note
Practical Tips:
By default, k is set to 1. You may consider increasing it to a larger number to potentially improve performance.
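For illustration, a minimal sketch of using k (the forward call is assumed to mirror AUCMLoss; pair the loss with the PDSCA optimizer as noted above):
>>> loss_fn = CompositionalAUCLoss(k=2)                               # two inner CrossEntropyLoss steps per AUCMLoss step, per the description above
>>> preds = torch.sigmoid(torch.randn(32, 1, requires_grad=True))
>>> targets = torch.empty(32, dtype=torch.long).random_(2)
>>> loss = loss_fn(preds, targets)                                    # the loss alternates internally between the CE and AUC phases
>>> loss.backward()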
where \(\ell(\mathbf{w}; \mathbf{x}_s, \mathbf{x}_i)\) is a surrogate function of the non-continuous indicator function \(\mathbb{I}(h(\mathbf{x}_s)\geq h(\mathbf{x}_i))\), and \(h(\cdot)\) is the prediction function,
e.g., a deep neural network.
The optimization algorithm for solving this objective is implemented as SOAP. For the derivations, please refer to the original paper [3]_.
This class is also aliased as APLoss.
Parameters:
data_len (int) – total number of samples in the training dataset.
gamma (float, optional) – parameter for moving average estimator (default: 0.9).
surr_loss (str, optional) – the choice for surrogate loss used for problem formulation (default: 'squared_hinge').
margin (float, optional) – margin for squared-hinge surrogate loss (default: 1.0).
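For illustration, a minimal usage sketch (passing each sample's dataset index is assumed to be required by the moving average estimator; see also the MIDAM examples below):
>>> loss_fn = APLoss(data_len=N, gamma=0.9, margin=1.0)               # N: total number of training samples
>>> preds = torch.sigmoid(torch.randn(32, 1, requires_grad=True))
>>> targets = torch.empty(32, dtype=torch.long).random_(2)
>>> loss = loss_fn(preds, targets, index=torch.arange(32))            # in practice, pass the dataset indices of the mini-batch
>>> loss.backward()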
A wrapper for Partial AUC losses to optimize One-way and Two-way Partial AUROC. By default, One-way Partial AUC (OPAUC) refers to SOPAs and
Two-way Partial AUC (TPAUC) refers to SOTAs. The usage of each loss is the same as that of the original loss.
Parameters:
mode (str) – the specific loss function to be used in the backend: '1w' for One-way Partial AUC or '2w' for Two-way Partial AUC (default: '1w').
**kwargs – the required arguments for the selected loss function.
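For illustration, a minimal sketch of selecting a backend through the wrapper (the class name pAUCLoss and the forwarded keyword arguments are assumptions based on the description above):
>>> loss_fn = pAUCLoss(mode='1w', data_len=N)                         # One-way Partial AUC backend
>>> # loss_fn = pAUCLoss(mode='2w', data_len=N)                       # Two-way Partial AUC backend
>>> loss = loss_fn(preds, targets, index=torch.arange(32))            # call pattern of the selected backend loss (assumed)
>>> loss.backward()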
Partial AUC loss based on DRO-CVaR to optimize One-way Partial AUROC (OPAUC). The loss focuses on optimizing OPAUC in the range [0, beta] for false positive rate. The objective is defined as
where \(L(\mathbf w; \mathbf x_i, \mathbf x_j)\) is the surrogate pairwise loss for one positive sample and one negative sample, e.g., squared hinge loss, logistic loss, etc., and \(\mathbf s\) is the dual variable from the DRO-CVaR formulation that is minimized in the loss function. For a positive sample \(\mathbf x_i\), any pairwise losses smaller than \(s_i\) are truncated, so the loss focuses on the harder negative samples; as a consequence, pAUC_CVaR_Loss optimizes pAUC over the region with an upper-bounded FPR (false positive rate).
The optimization algorithm for solving this objective is implemented as SOPA. For the derivations, please refer to the original paper [4]_.
Parameters:
data_len (int) – total number of samples in the training dataset.
pos_len (int) – total number of positive samples in the training dataset.
margin (float, optional) – margin term for squared-hinge surrogate loss (default: 1.0).
beta (float) – upper bound of False Positive Rate (FPR) used for optimizing pAUC (default: 0.2).
eta (float) – step size for updating the dual variables in the DRO-CVaR formulation (default: 0.1).
surr_loss (string, optional) – surrogate loss used in the problem formulation (default: 'squared_hinge').
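For illustration, a minimal usage sketch (the per-sample index argument for maintaining the dual variables \(s_i\) is an assumption):
>>> loss_fn = pAUC_CVaR_Loss(data_len=N, pos_len=N_pos, margin=1.0, beta=0.2, eta=0.1)
>>> preds = torch.sigmoid(torch.randn(32, 1, requires_grad=True))
>>> targets = torch.empty(32, dtype=torch.long).random_(2)
>>> loss = loss_fn(preds, targets, index=torch.arange(32))            # dataset indices of the mini-batch samples
>>> loss.backward()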
Partial AUC loss based on KL-DRO to optimize One-way Partial AUROC (OPAUC). In contrast to conventional AUC, partial AUC pays more attention to the harder samples. By leveraging Distributionally Robust Optimization (DRO), the objective is defined as
where \(L(\mathbf{w}; \mathbf{x_i}, \mathbf{x_j})\) is the surrogate pairwise loss function for one positive data and one negative data, e.g., squared hinge loss, \(\mathbf{S}_+\) and \(\mathbf{S}_-\) denote the subsets of the dataset which contain only positive samples and negative samples, respectively.
The optimization algorithm for solving the above objective is implemented as SOPAs. For the derivation of the above formulation, please refer to the original paper [4]_.
Parameters:
data_len (int) – total number of samples in the training dataset.
gamma (float) – parameter for moving average estimator (default: 0.9).
surr_loss (string, optional) – surrogate loss used in the problem formulation (default: 'squared_hinge').
margin (float, optional) – margin for squared-hinge surrogate loss (default: 1.0).
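For illustration, a minimal construction sketch (the class name pAUC_DRO_Loss and the forward call with per-sample indices are assumptions; the pattern mirrors the CVaR variant above):
>>> loss_fn = pAUC_DRO_Loss(data_len=N, gamma=0.9, margin=1.0)
>>> loss = loss_fn(preds, targets, index=torch.arange(32))            # preds/targets as in the sketches above
>>> loss.backward()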
where \(L(\mathbf w; \mathbf x_i, \mathbf x_j)\) is the surrogate pairwise loss for one positive sample and one negative sample, e.g., squared hinge loss, logistic loss, etc. In this formulation, we implicitly handle the \(\alpha\) and \(\beta\) ranges of TPAUC by tuning \(\lambda\) and \(\lambda'\) (for coding purposes, we rename \(\lambda\) as Lambda and \(\lambda'\) as tau). The loss focuses on both the harder positive and the harder negative samples, and hence optimizes TPAUC in the upper-left corner region of the ROC space.
The optimization algorithm for solving the above objective is implemented as SOTAs. For the derivation of the above formulation, please refer to the original paper [4]_.
Parameters:
data_len (int) – total number of samples in the training dataset.
margin (float, optional) – margin term used in surrogate loss (default: 1.0).
The gamma parameters should be tuned in the range (0, 1) for better performance; some suggested value pairs are (0.1, 0.1), (0.5, 0.5), and (0.9, 0.9).
margin can be tuned in {0.1,0.3,0.5,0.7,0.9,1.0} for better performance.
Lambda and tau can be tuned in the range (0.1, 10) for better performance.
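For illustration, a construction sketch using the tuning tips above (the class name tpAUC_KL_Loss and the exact keyword names are assumptions; Lambda and tau follow the renaming noted earlier):
>>> loss_fn = tpAUC_KL_Loss(data_len=N, margin=1.0, Lambda=1.0, tau=1.0)   # Lambda, tau tuned in (0.1, 10)
>>> loss = loss_fn(preds, targets, index=torch.arange(32))                 # per-sample dataset indices (assumed)
>>> loss.backward()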
Pairwise AUC loss to optimize AUROC based on different surrogate losses. For optimizing this objective, we can use existing optimizers in LibAUC or PyTorch, such as SGD or AdamW.
Parameters:
surr_loss (str) – surrogate loss for optimizing pairwise AUC loss. The available options are ‘logistic’,
‘squared’, ‘squared_hinge’, ‘barrier_hinge’ (default: 'squared_hinge').
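For illustration, a minimal sketch comparing surrogate choices (the class name PairwiseAUCLoss and the forward call are assumptions; optimize with a standard optimizer such as SGD or AdamW as noted above):
>>> loss_fn = PairwiseAUCLoss(surr_loss='logistic')                   # or 'squared', 'squared_hinge', 'barrier_hinge'
>>> preds = torch.sigmoid(torch.randn(32, 1, requires_grad=True))
>>> targets = torch.empty(32, dtype=torch.long).random_(2)
>>> loss = loss_fn(preds, targets)
>>> loss.backward()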
Mean Average Precision loss based on squared-hinge surrogate loss to optimize mAP and mAP@k. This is an extension of APLoss.
Parameters:
data_len (int) – total number of samples in the training dataset.
num_labels (int) – number of unique labels (tasks) in the dataset.
margin (float, optional) – margin for the squared-hinge surrogate loss (default: 1.0).
gamma (float, optional) – parameter for the moving average estimator (default: 0.9).
top_k (int, optional) – mAP@k optimization is activated if top_k > 0; top_k=-1 represents mAP (default: -1).
surr_loss (str, optional) – type of surrogate loss to use. Choices are ‘squared_hinge’, ‘squared’,
‘logistic’, ‘barrier_hinge’ (default: 'squared_hinge').
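For illustration, a construction sketch for a multi-label setting (the class name meanAveragePrecisionLoss is an assumption, and the forward call may additionally require a task_id argument; check the API for the exact signature):
>>> loss_fn = meanAveragePrecisionLoss(data_len=N, num_labels=10, margin=1.0, gamma=0.9, top_k=-1)
>>> preds = torch.sigmoid(torch.randn(32, 10, requires_grad=True))    # one score per label/task
>>> targets = torch.empty(32, 10).random_(2)                          # multi-label binary targets
>>> loss = loss_fn(preds, targets, index=torch.arange(32))            # per-sample dataset indices (assumed)
>>> loss.backward()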
AUC-Margin loss with squared-hinge surrogate loss to optimize multi-label AUROC. This is an extension of AUCMLoss.
Parameters:
margin (float) – margin term for squared-hinge surrogate loss (default: 1.0).
num_labels (int) – number of labels for the dataset.
imratio (float, optional) – the ratio of the number of positive samples to the number of total samples in the training dataset.
If this value is not given, the mini-batch statistics will be used instead.
version (str, optional) – whether to include prior \(p\) in the objective function (default: 'v1').
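For illustration, a minimal sketch (the class name MultiLabelAUCMLoss is an assumption; the forward call is assumed to follow AUCMLoss with one column of scores per label):
>>> loss_fn = MultiLabelAUCMLoss(margin=1.0, num_labels=10)
>>> preds = torch.sigmoid(torch.randn(32, 10, requires_grad=True))
>>> targets = torch.empty(32, 10).random_(2)
>>> loss = loss_fn(preds, targets)
>>> loss.backward()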
where \(h_q(\mathbf{x}_i^q;\mathbf{w})\) is the predicted score of \(\mathbf{x}_i^q\) with respect to \(q\), \(y_i^q\) is the relevance score of \(\mathbf{x}_i^q\) with respect to \(q\), \(N\) is the total number of queries, \(N_q\) is the total number of items to be ranked for query \(q\),
\(S_q\) denotes the set of items to be ranked by query \(q\), and \(S_q^+\) denotes the set of relevant items for query \(q\).
Parameters:
num_pos (int) – number of positive items sampled for each user
gamma (float) – the factor for moving average, i.e., gamma in our paper [1]_.
eps (float, optional) – a small value to avoid divide-zero error (default: 1e-10)
Example
>>> loss_fn = libauc.losses.ListwiseCELoss(N=1000, num_pos=10, gamma=0.1)   # assume we have 1000 relevant query-item pairs
>>> predictions = torch.randn((32, 10+20), requires_grad=True)              # we sample 32 queries/users, with 10 positive and 20 negative items per query/user
>>> batch = {'user_item_id': torch.randint(low=0, high=1000-1, size=(32, 10+20))}  # ids for all sampled query-item pairs in the batch
>>> loss = loss_fn(predictions, batch)
>>> loss.backward()
where \(\psi(\cdot)\) is a smooth Lipschitz continuous function used to approximate \(\mathbb{I}(\cdot\ge 0)\), e.g., the sigmoid function, and \(f_{q,i}(g)\) denotes \(\frac{1}{Z_q^K}\frac{1-2^{y_i^q}}{\log_2(N_q g+1)}\). The objective formulation for SONG is a special case of
that for K-SONG, where the \(\psi(\cdot)\) function is a constant.
Parameters:
num_pos (int) – number of positive items sampled for each user
gamma0 (float) – the moving average factor of \(u_{q,i}\), i.e., \(\beta_0\) in our paper, in the range (0.0, 1.0);
this hyper-parameter can be tuned for better performance (default: 0.9)
gamma1 (float, optional) – the moving average factor of \(s_{q}\) and \(v_{q}\) (default: 0.9)
eta0 (float, optional) – step size of lambda (default: 0.01)
margin (float, optional) – margin for squared hinge loss (default: 1.0)
topk (int, optional) – NDCG@k optimization is activated if topk > 0; topk=-1 represents SONG (default: -1)
topk_version (string, optional) – ‘theo’ or ‘prac’ (default: 'theo')
sigmoid_alpha (float, optional) – a hyperparameter for sigmoid function, psi(x) = sigmoid(x * sigmoid_alpha) (default: 1.0)
Example
>>> loss_fn = libauc.losses.NDCGLoss(N=1000, num_user=100, num_item=5000, num_pos=10, gamma0=0.1, topk=-1)  # SONG (topk=-1) / K-SONG (e.g., topk=100)
>>> predictions = torch.randn((32, 10+20), requires_grad=True)  # we sample 32 queries/users, with 10 positive and 20 negative items per query/user
>>> batch = {'rating': torch.randint(low=0, high=5, size=(32, 10+20)),       # ratings (e.g., in the range of [0,1,2,3,4]) for each sampled query-item pair
...          'user_id': torch.randint(low=0, high=100-1, size=(32,)),        # id for each sampled query
...          'num_pos_items': torch.randint(low=0, high=1000, size=(32,)),   # number of all relevant items for each sampled query
...          'ideal_dcg': torch.rand(32),                                    # ideal DCG precomputed for each sampled query (in the range of (0.0, 1.0))
...          'user_item_id': torch.randint(low=0, high=1000-1, size=(32, 10+20))}  # ids for all sampled query-item pairs in the batch
>>> loss = loss_fn(predictions, batch)
>>> loss.backward()
Stochastic Optimization of Global Contrastive Loss (GCL) and Robust Global Contrastive Loss (RGCL) for learning representations for unimodal tasks (e.g., image-image). The objective for optimizing GCL (i.e., objective for SogCLR) is defined as
where \(h_i(\mathbf{z})=E(\mathcal{A}(\mathbf{x}_i))^{\mathrm{T}}E(\mathbf{z})-E(\mathcal{A}(\mathbf{x}_i))^{\mathrm{T}}E(\mathcal{A}^{\prime}(\mathbf{x}_i))\), \(\mathcal{A}\) and \(\mathcal{A}^{\prime}\) are two data
augmentation operations, \(S_i^-\) denotes all negative samples for anchor data \(\mathbf{x}_i\), and \(E(\cdot)\) represents the image encoder. In iSogCLR, \(\mathbf{\tau}_i\) is the individualized
temperature for \(\mathbf{x}_i\).
Parameters:
N (int) – number of samples in the training dataset (default: 100000)
tau (float) – temperature parameter for global contrastive loss. If enable_isogclr is True, this value is used as the initial value of the learnable temperature parameters (default: 0.1)
device (torch.device) – the device for the inputs (default: None)
distributed (bool) – whether to use distributed training (default: False)
enable_isogclr (bool, optional) – whether to enable iSogCLR. If True, then the algorithm will optimize individualized temperature parameters for all samples (default: False)
eta (float, optional) – the step size for updating temperature parameters in iSogCLR (default: 0.01)
rho (float, optional) – the hyperparameter \(\rho\) in Eq. (6) in iSogCLR [2]_ (default: 0.3)
tau_min (float, optional) – lower bound of learnable temperature in iSogCLR (default: 0.05)
tau_max (float, optional) – upper bound of learnable temperature in iSogCLR (default: 0.7)
beta (float, optional) – the momentum parameter for updating temperature parameters in iSogCLR (default: 0.9)
gamma (float, optional) – the moving average factor for the dynamic loss, in the range (0.0, 1.0) (default: 0.9)
gamma_schedule (str, optional) – the schedule for gamma. Options are ‘constant’ (fixed gamma) and ‘cosine’ (decaying from 1.0 to gamma) (default: 'cosine')
gamma_decay_epochs (int, optional) – After this number of epochs, gamma will decrease to the value set by the option gamma. Used only when gamma_schedule is ‘cosine’. We recommend a value of total_training_epochs // 2 (default: -1)
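For illustration, a minimal sketch of a self-supervised step (the class name GCLoss, the mode string, and the forward call with per-sample indices are assumptions):
>>> loss_fn = GCLoss(mode='unimodal', N=50000, tau=0.1)
>>> feat1 = torch.nn.functional.normalize(torch.randn(32, 128, requires_grad=True), dim=-1)   # features of augmented view 1
>>> feat2 = torch.nn.functional.normalize(torch.randn(32, 128, requires_grad=True), dim=-1)   # features of augmented view 2
>>> loss = loss_fn(feat1, feat2, index=torch.arange(32))              # dataset indices for the moving-average estimator (assumed)
>>> loss.backward()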
Stochastic Optimization of Global Contrastive Loss (GCL) and Robust Global Contrastive Loss (RGCL) for learning
representations for bimodal task (e.g., image-text). The objective for optimizing GCL (i.e., objective for SogCLR) is defined as
where \((\mathbf{x}_i, \mathbf{t}_i) \in D\) is an image-text pair, \(h_{\mathbf{x}_i}(\mathbf{t})=E_I(\mathbf{x}_i)^{\mathrm{T}}E_T(\mathbf{t}) - E_I(\mathbf{x}_i)^{\mathrm{T}}E_T(\mathbf{t}_i)\), \(h_{\mathbf{t}_i}(\mathbf{x})=E_I(\mathbf{x})^{\mathrm{T}}E_T(\mathbf{t}_i) - E_I(\mathbf{x}_i)^{\mathrm{T}}E_T(\mathbf{t}_i)\),
and \(E_I(\cdot)\) and \(E_T(\cdot)\) are the image and text encoders, respectively. In iSogCLR, \(\mathbf{\tau}_i\) and \(\mathbf{\tau}^{\prime}_i\) are the individualized temperatures for \(\mathbf{x}_i\) and \(\mathbf{t}_i\), respectively.
Parameters:
N (int) – number of samples in the training dataset (default: 100000)
tau (float) – temperature parameter for global contrastive loss. If enable_isogclr is True, this value is used as the initial value of the learnable temperature parameters (default: 0.1)
gamma (float) – the moving average factor for the dynamic loss, in the range (0.0, 1.0) (default: 0.9)
cache_labels (bool) – whether to cache labels for mini-batch data (default: True)
rank (int) – unique ID given to a process for distributed training (default: 0)
world_size (int) – total number of processes for distributed training (default: 1)
distributed (bool) – whether to use distributed training (default: False)
enable_isogclr (bool, optional) – whether to enable iSogCLR. If True, then the algorithm will optimize individualized temperature parameters for all samples (default: False)
eta (float, optional) – the step size for updating temperature parameters in iSogCLR (default: 0.01)
rho (float, optional) – the hyperparameter \(\rho\) in Eq. (6) in iSogCLR [2]_ (default: 6.0)
tau_min (float, optional) – lower bound of learnable temperature in iSogCLR (default: 0.005)
tau_max (float, optional) – upper bound of learnable temperature in iSogCLR (default: 0.05)
beta (float, optional) – the momentum parameter for updating temperature parameters in iSogCLR (default: 0.9)
gamma_schedule (str, optional) – the schedule for gamma. Options are ‘constant’ (fixed gamma) and ‘cosine’ (decaying from 1.0 to gamma) (default: 'cosine')
gamma_decay_epochs (int, optional) – After this number of epochs, gamma will decrease to the value set by the option gamma. Used only when gamma_schedule is ‘cosine’. We recommend a value of total_training_epochs // 2 (default: -1)
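For illustration, a minimal sketch for image-text training (the class name GCLoss, the mode string, and the forward call are assumptions, mirroring the unimodal sketch above):
>>> loss_fn = GCLoss(mode='bimodal', N=1000000, tau=0.01)
>>> img_feat = torch.nn.functional.normalize(torch.randn(32, 256, requires_grad=True), dim=-1)  # image features from the image encoder
>>> txt_feat = torch.nn.functional.normalize(torch.randn(32, 256, requires_grad=True), dim=-1)  # text features from the text encoder
>>> loss = loss_fn(img_feat, txt_feat, index=torch.arange(32))        # per-pair dataset indices (assumed)
>>> loss.backward()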
Multiple Instance Deep AUC Maximization (MIDAM) loss for optimizing AUROC under the Multiple Instance Learning (MIL) setting. This is a wrapper that supports both stochastic Smoothed-MaX (MIDAM-smx) pooling and stochastic Attention (MIDAM-att) pooling via the mode argument.
>>> loss_fn = MIDAMLoss(mode='softmax', data_len=N, margin=para)
>>> preds = torch.randn(32, 1, requires_grad=True)
>>> target = torch.empty(32, dtype=torch.long).random_(2)
>>> # in practice, index should be the indices of your data (bag index for multiple instance learning).
>>> loss = loss_fn(exps=preds, y_true=target, index=torch.arange(32))
>>> loss.backward()
>>> loss_fn = MIDAMLoss(mode='attention', data_len=N, margin=para)
>>> preds = torch.randn(32, 1, requires_grad=True)
>>> denoms = torch.rand(32, 1, requires_grad=True) + 0.01
>>> target = torch.empty(32, dtype=torch.long).random_(2)
>>> # in practice, index should be the indices of your data (bag index for multiple instance learning).
>>> # denoms should be the stochastic denominator values output from your model.
>>> loss = loss_fn(sn=preds, sd=denoms, y_true=target, index=torch.arange(32))
>>> loss.backward()
Multiple Instance Deep AUC Maximization with stochastic Attention (MIDAM-att) Pooling is used for optimizing the AUROC under Multiple Instance Learning (MIL) setting.
The Attention Pooling is defined as
where \(g(\mathbf w;\mathbf x)\) is a parametric function, e.g., \(g(\mathbf w; \mathbf x)=\mathbf w_a^{\top}\text{tanh}(V e(\mathbf w_e; \mathbf x))\), where \(V\in\mathbb R^{m\times d_o}\) and \(\mathbf w_a\in\mathbb R^m\).
And \(\delta(\mathbf w;\mathbf x) = \mathbf w_c^{\top}e(\mathbf w_e; \mathbf x)\) is the prediction score from each instance, which will be combined with attention weights.
We optimize the following AUC loss with the Attention Pooling:
The optimization algorithm for solving the above objective is implemented as MIDAM. The stochastic pooling loss only requires partial data from each bag in the mini-batch. For more details about the formulations, please refer to the original paper [1]_.
Parameters:
margin (float, optional) – margin parameter for AUC loss (default: 0.5).
gamma (float, optional) – moving average parameter for numerator and denominator on attention calculation (default: 0.9).
device (torch.device, optional) – the device used for computing loss, e.g., ‘cpu’ or ‘cuda’ (default: None)
Example
>>> loss_fn = MIDAM_attention_pooling_loss(data_len=data_length, margin=margin, tau=tau, gamma=gamma)
>>> preds = torch.randn(32, 1, requires_grad=True)
>>> denoms = torch.rand(32, 1, requires_grad=True) + 0.01
>>> target = torch.empty(32, dtype=torch.long).random_(2)
>>> # in practice, index should be the indices of your data (bag index for multiple instance learning).
>>> # denoms should be the stochastic denominator values output from your model.
>>> loss = loss_fn(sn=preds, sd=denoms, y_true=target, index=torch.arange(32))
>>> loss.backward()
Reference:
Note
To use MIDAM_attention_pooling_loss, we need to track the index of each sample (bag) in the training dataset. To do so, see the example below:
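A minimal sketch of such an index-returning dataset (illustrative only; the class below is hypothetical and your own dataset will differ):
>>> class IndexedBagDataset(torch.utils.data.Dataset):                # hypothetical wrapper that also returns the bag index
...     def __init__(self, bags, labels):
...         self.bags, self.labels = bags, labels
...     def __len__(self):
...         return len(self.labels)
...     def __getitem__(self, index):
...         return self.bags[index], self.labels[index], index        # the returned index is later passed to the loss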
Multiple Instance Deep AUC Maximization with stochastic Smoothed-MaX (MIDAM-smx) Pooling. This loss is used for optimizing the AUROC under Multiple Instance Learning (MIL) setting.
The Smoothed-MaX Pooling is defined as
where \(\phi(\mathbf w;\mathbf x)\) is the prediction score for instance \(\mathbf x\) and \(\tau>0\) is a hyperparameter.
We optimize the following AUC loss with the Smoothed-MaX Pooling:
The optimization algorithm for solving the above objective is implemented as MIDAM. The stochastic pooling loss only requires partial data from each bag in the mini-batch. For more details about the formulations, please refer to the original paper [1]_.
Parameters:
margin (float, optional) – margin parameter for AUC loss (default: 0.5).
tau (float) – temperature parameter for smoothed max pooling (default: 0.1).
gamma (float, optional) – moving average parameter for pooling operation (default: 0.9).
device (torch.device, optional) – the device used for computing loss, e.g., ‘cpu’ or ‘cuda’ (default: None)
Example
>>> loss_fn = MIDAM_softmax_pooling_loss(data_len=data_length, margin=margin, tau=tau, gamma=gamma)
>>> preds = torch.randn(32, 1, requires_grad=True)
>>> target = torch.empty(32, dtype=torch.long).random_(2)
>>> # in practice, index should be the indices of your data (bag index for multiple instance learning).
>>> loss = loss_fn(exps=preds, y_true=target, index=torch.arange(32))
>>> loss.backward()
Reference:
Note
To use MIDAM_softmax_pooling_loss, we need to track the index of each sample (bag) in the training dataset. To do so, see the example below:
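A minimal sketch of passing the tracked index through a training step (illustrative only; the loader is assumed to yield (data, target, index) as in the dataset sketch above):
>>> for data, target, index in train_loader:                          # loader built from an index-returning dataset
...     exps = model(data)                                            # smoothed-max pooled scores from your MIL model
...     loss = loss_fn(exps=exps, y_true=target, index=index)
...     loss.backward()
...     optimizer.step(); optimizer.zero_grad()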
A wrapper to call a specific surrogate loss function.
Parameters:
loss_name (str) – type of surrogate loss function to fetch, including ‘squared_hinge’, ‘squared’, ‘logistic’, ‘barrier_hinge’ (default: 'squared_hinge').
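For illustration, a minimal sketch (the function name get_surrogate_loss is an assumption based on the description above):
>>> surr_fn = get_surrogate_loss('squared_hinge')                     # fetch the surrogate callable directly
>>> loss_fn = APLoss(data_len=N, surr_loss='logistic')                # or select a surrogate by name through a loss that accepts surr_loss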