.. _dist_train:
Distributed Training
================================================================================================================================
| **Author**: Gang Li, Xiyuan Wei, Tianbao Yang
Introduction
------------------------------------------------------------------------------------
This tutorial demonstrates how to perform distributed training with LibAUC.
We provide example scripts for:
1. Optimizing `AUCMLoss + PESG` for AUROC
2. Optimizing `APLoss + SOAP` for AUPRC
3. Optimizing `NDCGLoss + SONG` for NDCG
Download the example scripts from the `examples` folder in the LibAUC repository.
Run the following command based on the number of GPUs you have.
The entry command is:
.. code:: bash

   torchrun --nproc_per_node=<num_gpus> 20_Distributed_Training_for_Optimizing_AUROC.py
   # or torchrun --nproc_per_node=<num_gpus> 20_Distributed_Training_for_Optimizing_AUPRC.py
   # or torchrun --nproc_per_node=<num_gpus> 20_Distributed_Training_for_Optimizing_NDCG.py
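The GPU count can also be filled in automatically instead of hard-coded. A minimal sketch, assuming the NVIDIA driver tools are installed (`nvidia-smi -L` prints one line per visible GPU):

.. code:: bash

   # Count visible GPUs (nvidia-smi -L prints one line per device)
   # and pass the count to torchrun.
   NUM_GPUS=$(nvidia-smi -L | wc -l)
   torchrun --nproc_per_node="$NUM_GPUS" 20_Distributed_Training_for_Optimizing_AUROC.py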
Customize Your Training
------------------------------------------------------------------------------------
Feel free to adapt these scripts to suit your specific training needs.
Below are some key points to support distributed training with LibAUC.
1. Use `DistributedDualSampler` or `DistributedTriSampler` as your train_sampler.
2. Pass `device` as an argument to your loss function and optimizer. For example,
.. code:: python

   loss_fn = APLoss(data_len=len(trainSet), margin=margin, gamma=gamma, device=device)
   optimizer = SOAP(model.parameters(), lr=lr, mode='adam', weight_decay=weight_decay, device=device)
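Under `torchrun`, each worker process can derive its own `device` from the `LOCAL_RANK` environment variable that `torchrun` sets. A minimal sketch (the helper name and `use_cuda` flag are illustrative, not part of LibAUC):

.. code:: python

   import os

   def get_device(use_cuda: bool = True) -> str:
       """Derive the per-process device string from torchrun's LOCAL_RANK.

       torchrun sets LOCAL_RANK for each worker it spawns; worker 0 maps
       to cuda:0, worker 1 to cuda:1, and so on.
       """
       local_rank = int(os.environ.get("LOCAL_RANK", "0"))
       return f"cuda:{local_rank}" if use_cuda else "cpu"

The resulting string can then be passed as the `device` argument to the loss function and optimizer.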