Distributed Training


Author: Gang Li, Xiyuan Wei, Tianbao Yang

Introduction

This tutorial demonstrates how to perform distributed training with LibAUC. We provide example scripts for:

  1. Optimizing AUCMLoss + PESG for AUROC

  2. Optimizing APLoss + SOAP for AUPRC

  3. Optimizing NDCGLoss + SONG for NDCG

Download the example scripts from the examples folder in the LibAUC Repository, then run the following command according to the number of GPUs available.

The entry command is:

torchrun --nproc_per_node=<NUM_GPUS>  20_Distributed_Training_for_Optimizing_AUROC.py
# or torchrun --nproc_per_node=<NUM_GPUS>  20_Distributed_Training_for_Optimizing_AUPRC.py
# or torchrun --nproc_per_node=<NUM_GPUS>  20_Distributed_Training_for_Optimizing_NDCG.py
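The same scripts also scale across machines using torchrun's standard rendezvous flags (these are generic PyTorch launcher options, not LibAUC-specific; the master host and port below are placeholders you must fill in):

```shell
# Single node with 4 GPUs:
torchrun --nproc_per_node=4 20_Distributed_Training_for_Optimizing_AUROC.py

# Two nodes with 4 GPUs each (run one command per node; the rendezvous
# endpoint must be a host:port reachable from every node):
torchrun --nnodes=2 --nproc_per_node=4 \
    --rdzv_backend=c10d --rdzv_endpoint=<MASTER_HOST>:29500 \
    20_Distributed_Training_for_Optimizing_AUROC.py
```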

Customize Your Training

Feel free to adapt these scripts to suit your specific training needs. Below are some key points to support distributed training with LibAUC.

  1. Use DistributedDualSampler or DistributedTriSampler as your train_sampler.

  2. Pass device as an argument to your loss function and optimizer. For example,

loss_fn = APLoss(data_len=len(trainSet), margin=margin, gamma=gamma, device=device)
optimizer = SOAP(model.parameters(), lr=lr, mode='adam', weight_decay=weight_decay, device=device)
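Putting the two points together, a minimal end-to-end sketch might look like the following. The process-group setup uses only standard torch.distributed calls driven by the environment variables torchrun sets; the LibAUC import paths (libauc.losses.APLoss, libauc.optimizers.SOAP, libauc.sampler.DistributedDualSampler) and the sampler's constructor arguments are assumptions based on the names above, so check the example scripts for the exact imports and signatures.

```python
import os
import torch
import torch.distributed as dist

def setup_distributed():
    """Initialize the default process group from the environment variables
    that torchrun sets (RANK, WORLD_SIZE, LOCAL_RANK)."""
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    if not dist.is_initialized():
        dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
    return rank, world_size, local_rank

def build_trainer(train_set, model, margin=0.6, gamma=0.1, lr=1e-3, weight_decay=1e-4):
    # LibAUC imports are kept inside the function so the sketch can be read
    # (and the setup helper exercised) without the package installed. The
    # sampler's module path and arguments here are assumptions.
    from libauc.losses import APLoss
    from libauc.optimizers import SOAP
    from libauc.sampler import DistributedDualSampler

    rank, world_size, local_rank = setup_distributed()
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    if torch.cuda.is_available():
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
    else:
        model = torch.nn.parallel.DistributedDataParallel(model)

    # Point 1: use the distributed sampler as the train sampler
    # (constructor arguments assumed for illustration).
    sampler = DistributedDualSampler(train_set, batch_size=64, sampling_rate=0.5)
    loader = torch.utils.data.DataLoader(train_set, batch_size=64, sampler=sampler)

    # Point 2: pass `device` to both the loss function and the optimizer.
    loss_fn = APLoss(data_len=len(train_set), margin=margin, gamma=gamma, device=device)
    optimizer = SOAP(model.parameters(), lr=lr, mode='adam',
                     weight_decay=weight_decay, device=device)
    return model, loader, loss_fn, optimizer
```

When launched with torchrun, each process picks up its own RANK and LOCAL_RANK, so the same script runs unchanged on one GPU or many.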