\(\hspace{5mm}\) Decrease \(\eta\) by a decay factor and update \(\mathbf v_{\text{ref}}\) periodically
where \(z_t\) is the data pair \((x_t, y_t)\), \(\lambda_0\) is the epoch-level l2 penalty (i.e., epoch_decay), \(\lambda\) is the l2 penalty (i.e., weight_decay),
and \(\eta\) is the learning rate.
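In LibAUC-style training loops this periodic step is usually applied once at selected epochs rather than every iteration. A minimal sketch, assuming the optimizer exposes an update_regularizer method for this purpose (total_epochs, decay_epochs, and train_one_epoch are illustrative names):
>>> for epoch in range(total_epochs):
...     train_one_epoch(model, train_loader, loss_fn, optimizer)  # illustrative helper
...     if epoch in decay_epochs:
...         # decrease the learning rate by the decay factor and refresh v_ref
...         optimizer.update_regularizer(decay_factor=10)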
betas (Tuple[float, float], optional) – coefficients used for computing
running averages of gradient and its square for ‘adam’ mode (default: (0.9,0.999))
eps (float, optional) – term added to the denominator to improve
numerical stability for ‘adam’ mode (default: 1e-8)
amsgrad (bool, optional) – whether to use the AMSGrad variant of ‘adam’ mode
from the paper On the Convergence of Adam and Beyond (default: False)
state_dict (dict) – optimizer state. Should be an object returned
from a call to state_dict().
Warning
Make sure this method is called after initializing torch.optim.lr_scheduler.LRScheduler,
as calling it beforehand will overwrite the loaded learning rates.
Note
The names of the parameters (if they exist under the “param_names” key of each param group
in state_dict()) will not affect the loading process.
To use the parameters’ names for custom cases (such as when the parameters in the loaded state dict
differ from those initialized in the optimizer),
a custom register_load_state_dict_pre_hook should be implemented to adapt the loaded dict
accordingly.
If param_names exist in loaded state dict param_groups they will be saved and override
the current names, if present, in the optimizer state. If they do not exist in loaded state dict,
the optimizer param_names will remain unchanged.
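For the custom case described above, a pre-hook can rewrite the loaded dict before it is applied; a minimal sketch (the renaming rule is purely illustrative, and optimizer stands for an already-constructed optimizer):
>>> def adapt_param_names(optimizer, state_dict):
...     # e.g., strip a "module." prefix left over from a DataParallel checkpoint
...     for group in state_dict["param_groups"]:
...         if "param_names" in group:
...             group["param_names"] = [n.replace("module.", "") for n in group["param_names"]]
...     return state_dict
>>> optimizer.register_load_state_dict_pre_hook(adapt_param_names)
>>> optimizer.load_state_dict(torch.load("./save_optim.pt"))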
Example
>>> # xdoctest: +SKIP
>>> model = torch.nn.Linear(10, 10)
>>> optim = torch.optim.SGD(model.parameters(), lr=3e-4)
>>> scheduler1 = torch.optim.lr_scheduler.LinearLR(
...     optim,
...     start_factor=0.1,
...     end_factor=1,
...     total_iters=20,
... )
>>> scheduler2 = torch.optim.lr_scheduler.CosineAnnealingLR(
...     optim,
...     T_max=80,
...     eta_min=3e-5,
... )
>>> lr = torch.optim.lr_scheduler.SequentialLR(
...     optim,
...     schedulers=[scheduler1, scheduler2],
...     milestones=[20],
... )
>>> lr.load_state_dict(torch.load("./save_seq.pt"))
>>> # now load the optimizer checkpoint after loading the LRScheduler
>>> optim.load_state_dict(torch.load("./save_optim.pt"))
state: a Dict holding current optimization state. Its content
differs between optimizer classes, but some common characteristics
hold. For example, state is saved per parameter, and the parameter
itself is NOT saved. state is a Dictionary mapping parameter ids
to a Dict with state corresponding to each parameter.
param_groups: a List containing all parameter groups where each
parameter group is a Dict. Each parameter group contains metadata
specific to the optimizer, such as learning rate and weight decay,
as well as a List of parameter IDs of the parameters in the group.
If a param group was initialized with named_parameters() the names
content will also be saved in the state dict.
NOTE: The parameter IDs may look like indices but they are just IDs
associating state with param_group. When loading from a state_dict,
the optimizer will zip the param_group params (int IDs) and the
optimizer param_groups (actual nn.Parameter s) in order to
match state WITHOUT additional verification.
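Concretely, the structure described above looks roughly like this (the values and state keys are illustrative; they differ by optimizer):
>>> optim.state_dict()  # doctest: +SKIP
{
    'state': {
        0: {'momentum_buffer': tensor([...])},
        1: {'momentum_buffer': tensor([...])},
    },
    'param_groups': [
        {'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0, 'params': [0, 1]}
    ],
}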
Primal-Dual Stochastic Compositional Adaptive Algorithm (PDSCA) is used for optimizing CompositionalAUCLoss. For iteration \(t\), the key update steps are summarized as follows:
\(\hspace{5mm}\) Decrease \(\eta_0, \eta_1\) by a decay factor and update \(\mathbf v_{\text{ref}}\) periodically
where \(\lambda_0,\lambda_1\) refer to epoch_decay and weight_decay, \(\eta_0, \eta_1\) refer to the learning rates for the inner updates (\(L_{CE}\)) and the outer updates (\(L_{AUC}\)),
\(\mathbf v_t\) refers to \(\{\mathbf w_t, a_t, b_t\}\), and \(\theta\) refers to the dual variable in CompositionalAUCLoss.
For more details, please refer to Compositional Training for End-to-End Deep AUC Maximization.
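A minimal usage sketch for PDSCA paired with CompositionalAUCLoss (the keyword arguments shown, and the data/targets names, are assumptions; see the parameter list below for the supported options):
>>> loss_fn = libauc.losses.CompositionalAUCLoss()
>>> optimizer = libauc.optimizers.PDSCA(model.parameters(), loss_fn=loss_fn, lr=0.1, momentum=0.9)
>>> optimizer.zero_grad()
>>> loss_fn(model(data), targets).backward()
>>> optimizer.step()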
Parameters:
params (iterable) – iterable of parameters to optimize.
loss_fn (callable) – loss function used for optimization (default: None)
betas (Tuple[float, float], optional) – coefficients used for computing
running averages of gradient and its square (default: (0.9,0.999))
eps (float, optional) – term added to the denominator to improve
numerical stability (default: 1e-8)
amsgrad (boolean, optional) – whether to use the AMSGrad variant of this
algorithm from the paper On the Convergence of Adam and Beyond
(default: False)
betas (Tuple[float, float], optional) – coefficients used for computing
running averages of gradient and its square for ‘adam’ mode (default: (0.9,0.999))
eps (float, optional) – term added to the denominator to improve
numerical stability for ‘adam’ mode (default: 1e-8)
amsgrad (bool, optional) – whether to use the AMSGrad variant of ‘adam’ mode
from the paper On the Convergence of Adam and Beyond (default: False)
betas (Tuple[float, float], optional) – coefficients used for computing
running averages of gradient and its square (default: (0.9,0.999))
eps (float, optional) – term added to the denominator to improve
numerical stability (default: 1e-8)
amsgrad (boolean, optional) – whether to use the AMSGrad variant of this
algorithm from the paper On the Convergence of Adam and Beyond
(default: False)
betas (Tuple[float, float], optional) – coefficients used for computing
running averages of gradient and its square for ‘adam’ mode. (default: (0.9,0.999))
eps (float, optional) – term added to the denominator to improve
numerical stability for ‘adam’ mode (default: 1e-8)
amsgrad (bool, optional) – whether to use the AMSGrad variant of ‘adam’ mode
from the paper On the Convergence of Adam and Beyond (default: False)
Stochastic Optimization for NDCG (SONG) and its top-K variant (K-SONG) are used for optimizing NDCGLoss. The key update steps are summarized as follows:
betas (Tuple[float, float], optional) – coefficients used for computing
running averages of gradient and its square for ‘adam’ mode (default: (0.9,0.999))
eps (float, optional) – term added to the denominator to improve
numerical stability for ‘adam’ mode (default: 1e-8)
amsgrad (bool, optional) – whether to use the AMSGrad variant of ‘adam’ mode
from the paper On the Convergence of Adam and Beyond (default: False)
device (torch.device, optional) – the device used for optimization, e.g., ‘cpu’ or ‘cuda’ (default: None)
Example
>>> optimizer = libauc.optimizers.SONG(model.parameters(), lr=0.1, momentum=0.9)
>>> optimizer.zero_grad()
>>> loss_fn(predictions, batch).backward()  # loss_fn can be ListwiseCE_Loss or NDCG_Loss
>>> optimizer.step()
betas (Tuple[float, float], optional) – coefficients used for computing
running averages of gradient and its square for ‘adam’ mode (default: (0.9,0.999))
eps (float, optional) – term added to the denominator to improve
numerical stability for ‘adam’ mode (default: 1e-8)
amsgrad (bool, optional) – whether to use the AMSGrad variant of ‘adam’ mode
from the paper On the Convergence of Adam and Beyond (default: False)
\(\hspace{5mm}\) Update model \(\mathbf{w}_t\) by the Momentum or Adam optimizer
where \(h_i(z)=E(\mathcal{A}(\mathbf{x}_i))^{\top} E(z) - E(\mathcal{A}(\mathbf{x}_i))^{\top} E(\mathcal{A}^{\prime}(\mathbf{x}_i))\), \(\mathbf{B}_i = \{\mathcal{A}(\mathbf{x}), \mathcal{A}^{\prime}(\mathbf{x}): \mathcal{A},\mathcal{A}^{\prime}\in\mathcal{P},\mathbf{x}\in \mathbf{B} \backslash \mathbf{x}_i \}\),
\(\Omega=\{\tau : \tau \ge \tau_0 \}\) is the constraint set for each learnable \(\tau_i\), and \(\Pi\) is the projection operator.
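Since \(\Omega\) only imposes a lower bound, the projection \(\Pi_\Omega\) amounts to elementwise clipping at \(\tau_0\); a minimal sketch in PyTorch (tau and tau_0 are illustrative names for the learnable temperatures and the lower bound):
>>> with torch.no_grad():
...     tau.clamp_(min=tau_0)  # project each tau_i back into Omega = {tau >= tau_0}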
betas (Tuple[float, float], optional) – coefficients used for computing
running averages of gradient and its square for ‘adam’ mode (default: (0.9,0.999))
eps (float, optional) – term added to the denominator to improve
numerical stability for ‘adam’ mode (default: 1e-8)
amsgrad (bool, optional) – whether to use the AMSGrad variant of ‘adam’ mode
from the paper On the Convergence of Adam and Beyond (default: False)
MIDAM (Multiple Instance Deep AUC Maximization) is used for optimizing the MIDAMLoss (softmax or attention pooling based AUC loss).
Notice that \(h(\mathbf w; \mathcal X_i)=f_2(f_1 (\mathbf w;\mathcal X_i))\) is the bag-level prediction after the pooling operation. Denote the moving-average estimate of the bag-level prediction for the i-th bag at the t-th iteration by \(s_i^t\). The gradient estimators are built from \(s_i^t\); the key sampling and update steps are summarized as follows (a sketch of the moving-average update is given after the steps below):
\(\hspace{5mm}\) Sample a batch of positive bags \(\mathcal S_+^t\subset\mathcal D_+\) and a batch of negative bags \(\mathcal S_-^t\subset\mathcal D_-\).
\(\hspace{5mm}\) For each \(i \in \mathcal S^t=\mathcal S_+^t\cup \mathcal S_-^t\):
\(\hspace{5mm}\) Sample a mini-batch of instances \(\mathcal B^t_i\subset\mathcal X_i\) and update:
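A sketch of the moving-average update for \(s_i^t\) and the resulting gradient estimator, based on the description above (\(\gamma\in(0,1]\) is the moving-average parameter; the exact constants may differ in the implementation):
\[ s_i^t = (1-\gamma)\, s_i^{t-1} + \gamma\, f_1(\mathbf w_t; \mathcal B_i^t), \qquad \widehat\nabla_i \approx \nabla f_2(s_i^t)\, \nabla f_1(\mathbf w_t; \mathcal B_i^t), \]
so the bag-level prediction in the AUC gradient is replaced by its moving-average estimate rather than recomputed over the whole bag \(\mathcal X_i\).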
amsgrad (boolean, optional) – whether to use the AMSGrad variant of this
algorithm from the paper On the Convergence of Adam and Beyond
(default: False)
device (torch.device, optional) – the device used for optimization, e.g., ‘cpu’ or ‘cuda’ (default: None)
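A minimal usage sketch pairing MIDAM with MIDAMLoss (the loss constructor arguments and the forward call shown here are assumptions; consult the MIDAMLoss documentation for the exact signature):
>>> loss_fn = libauc.losses.MIDAMLoss(data_len=len(train_dataset))
>>> optimizer = libauc.optimizers.MIDAM(model.parameters(), loss_fn=loss_fn, lr=0.1, momentum=0.9)
>>> optimizer.zero_grad()
>>> loss_fn(y_pred, y_true, index).backward()
>>> optimizer.step()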
Implements the AdamW algorithm. This code is adapted from the PyTorch codebase.
The original Adam algorithm was proposed in Adam: A Method for Stochastic Optimization.
The AdamW variant was proposed in Decoupled Weight Decay Regularization.
Parameters:
params (iterable) – iterable of parameters to optimize
lr (float, optional) – learning rate (default: 1e-3)
betas (Tuple[float, float], optional) – coefficients used for computing
running averages of gradient and its square (default: (0.9,0.999))
eps (float, optional) – term added to the denominator to improve
numerical stability (default: 1e-8)
amsgrad (boolean, optional) – whether to use the AMSGrad variant of this
algorithm from the paper On the Convergence of Adam and Beyond
(default: False)
device (torch.device, optional) – the device used for optimization, e.g., ‘cpu’ or ‘cuda’ (default: None).
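Usage mirrors the stock torch.optim.AdamW; a minimal sketch (model, inputs, targets, and loss_fn are illustrative names):
>>> optimizer = libauc.optimizers.AdamW(model.parameters(), lr=1e-3)
>>> optimizer.zero_grad()
>>> loss_fn(model(inputs), targets).backward()
>>> optimizer.step()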