Distributed Training
Author: Gang Li, Xiyuan Wei, Tianbao Yang
Introduction
This tutorial demonstrates how to perform distributed training with LibAUC. We provide example scripts for:
Optimizing AUCMLoss + PESG for AUROC
Optimizing APLoss + SOAP for AUPRC
Optimizing NDCGLoss + SONG for NDCG
Download the example scripts from the examples folder in the LibAUC Repository and run the following command, adjusting it to the number of GPUs you have.
The entry command is:
torchrun --nproc_per_node=<NUM_GPUS> 20_Distributed_Training_for_Optimizing_AUROC.py
# or torchrun --nproc_per_node=<NUM_GPUS> 20_Distributed_Training_for_Optimizing_AUPRC.py
# or torchrun --nproc_per_node=<NUM_GPUS> 20_Distributed_Training_for_Optimizing_NDCG.py
Customize Your Training
Feel free to adapt these scripts to your specific training needs. Below are the key points for enabling distributed training with LibAUC.
Use DistributedDualSampler or DistributedTriSampler as your train_sampler (see the sketch after this list).
Pass device as an argument to your loss function and optimizer. For example,
loss_fn = APLoss(data_len=len(trainSet), margin=margin, gamma=gamma, device=device)
optimizer = SOAP(model.parameters(), lr=lr, mode='adam', weight_decay=weight_decay, device=device)
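Putting these points together, below is a minimal sketch of a distributed setup for the AUPRC example. The process-group initialization and DistributedDataParallel wrapping follow standard PyTorch practice; the import path and constructor arguments for DistributedDualSampler, as well as the placeholder hyperparameter values, model, and trainSet, are assumptions that you should adapt to your own script.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from libauc.losses import APLoss
from libauc.optimizers import SOAP
from libauc.sampler import DistributedDualSampler  # import path is an assumption

# torchrun launches one process per GPU and sets LOCAL_RANK for each of them
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)
device = torch.device('cuda', local_rank)

# trainSet and model are assumed to be defined earlier in your script;
# the sampler's constructor arguments here are illustrative
train_sampler = DistributedDualSampler(trainSet, batch_size=64, sampling_rate=0.5)
train_loader = DataLoader(trainSet, batch_size=64, sampler=train_sampler, num_workers=4)

# wrap the model with DistributedDataParallel so gradients are synchronized across GPUs
model = model.to(device)
model = DDP(model, device_ids=[local_rank])

# pass device to both the loss function and the optimizer
loss_fn = APLoss(data_len=len(trainSet), margin=1.0, gamma=0.9, device=device)
optimizer = SOAP(model.parameters(), lr=1e-3, mode='adam', weight_decay=1e-4, device=device)

If the distributed samplers expose a set_epoch method like PyTorch's DistributedSampler, call it at the start of each epoch so that every process draws a different shuffle; check the sampler's documentation to confirm.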