.. _dist_train:

Distributed Training
================================================================================================================================
| **Author**: Gang Li, Xiyuan Wei, Tianbao Yang

Introduction
------------------------------------------------------------------------------------

This tutorial demonstrates how to perform distributed training with LibAUC. We provide example scripts for:

1. Optimizing `AUCMLoss + PESG` for AUROC
2. Optimizing `APLoss + SOAP` for AUPRC
3. Optimizing `NDCGLoss + SONG` for NDCG

Download the example scripts from the `examples` folder in the LibAUC repository, then run the command below that matches the number of GPUs you have. The entry command is:

.. code:: bash

   torchrun --nproc_per_node=<num_gpus> 20_Distributed_Training_for_Optimizing_AUROC.py
   # or
   torchrun --nproc_per_node=<num_gpus> 20_Distributed_Training_for_Optimizing_AUPRC.py
   # or
   torchrun --nproc_per_node=<num_gpus> 20_Distributed_Training_for_Optimizing_NDCG.py

Customize Your Training
------------------------------------------------------------------------------------

Feel free to adapt these scripts to suit your specific training needs. Below are some key points for supporting distributed training with LibAUC:

1. Use `DistributedDualSampler` or `DistributedTriSampler` as your train sampler.
2. Pass `device` as an argument to your loss function and optimizer. For example:

.. code:: python

   loss_fn = APLoss(data_len=len(trainSet), margin=margin, gamma=gamma, device=device)
   optimizer = SOAP(model.parameters(), lr=lr, mode='adam', weight_decay=weight_decay, device=device)
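`torchrun` launches one process per GPU and communicates each process's identity through environment variables such as `RANK`, `WORLD_SIZE`, and `LOCAL_RANK`. As a minimal sketch (not taken from the example scripts; the single-process fallback defaults are an assumption), a script launched this way typically reads them like so:

.. code:: python

   import os

   # torchrun sets these for every process it spawns; the defaults
   # below are a fallback so the script also runs without torchrun.
   rank = int(os.environ.get("RANK", 0))            # global index of this process
   world_size = int(os.environ.get("WORLD_SIZE", 1))  # total number of processes
   local_rank = int(os.environ.get("LOCAL_RANK", 0))  # GPU index on this machine

   print(rank, world_size, local_rank)

`local_rank` is what you would pass to `torch.device(f"cuda:{local_rank}")` to pin each process to its own GPU.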
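The distributed samplers above exist so that each process trains on a disjoint shard of the dataset rather than on duplicate copies. To illustrate the idea only (this is not LibAUC's actual sampler implementation, and `shard_indices` is a hypothetical helper), a simple rank-based sharding scheme assigns every `world_size`-th index to each rank:

.. code:: python

   def shard_indices(num_samples, rank, world_size):
       """Assign every world_size-th index to this rank, so the
       per-process shards are disjoint and jointly cover the dataset."""
       return list(range(rank, num_samples, world_size))

   # With 10 samples split across 2 processes:
   print(shard_indices(10, 0, 2))  # indices seen by rank 0: [0, 2, 4, 6, 8]
   print(shard_indices(10, 1, 2))  # indices seen by rank 1: [1, 3, 5, 7, 9]

LibAUC's `DistributedDualSampler` and `DistributedTriSampler` additionally preserve the positive/negative (or query/document) composition of each batch that their losses require, which a plain shard like this does not do.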