
I am not able to initialize the process group in PyTorch for a BERT model. I tried to initialize it using the following code:

import torch
import datetime

torch.distributed.init_process_group(
    backend='nccl',
    init_method='env://',
    timeout=datetime.timedelta(0, 1800),
    world_size=0,
    rank=0,
    store=None,
    group_name=''
)

and then tried to call the get_world_size() function:

num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()

Full code:

train_examples = None
num_train_optimization_steps = None
if do_train:
    train_examples = processor.get_train_examples(data_dir)
    num_train_optimization_steps = int(
        len(train_examples) / train_batch_size / gradient_accumulation_steps) * num_train_epochs
    if local_rank != -1:
        import datetime
        torch.distributed.init_process_group(
            backend='nccl',
            init_method='env://',
            timeout=datetime.timedelta(0, 1800),
            world_size=0,
            rank=0,
            store=None,
            group_name=''
        )
        num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()
        print(num_train_optimization_steps)

4 Answers

I solved the problem by referring to https://github.com/NVIDIA/apex/issues/99. Specifically, launch the script with:

python -m torch.distributed.launch xxx.py
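
The launcher spawns one process per GPU and sets the environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) that init_method='env://' reads, so the script should not hard-code world_size=0 or rank=0. A minimal sketch of a launcher-compatible script (the --local_rank handling and the 1800-second timeout are illustrative assumptions, not from the question):

import argparse
import datetime

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to each spawned process
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

# Bind this process to its GPU before initializing the NCCL backend
torch.cuda.set_device(args.local_rank)

# rank and world_size are taken from the launcher's environment variables
dist.init_process_group(
    backend='nccl',
    init_method='env://',
    timeout=datetime.timedelta(seconds=1800),
)

print(f"rank {dist.get_rank()} of {dist.get_world_size()}")
dist.destroy_process_group()

Launched with, for example: python -m torch.distributed.launch --nproc_per_node=4 xxx.py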

1 Comment

When I do this I get a "No module named 'fire'" error, even though the package is already installed.

Just an update: instead of running

$ python -m torch.distributed.launch --use_env train_script.py

you now only need to run

$ torchrun train_script.py

as indicated in the PyTorch documentation, which deprecates torch.distributed.launch in favor of torchrun.
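
torchrun behaves as if --use_env were passed: it does not add a --local_rank argument, and the script reads LOCAL_RANK from the environment instead. A minimal sketch of a torchrun-compatible train_script.py (the names are illustrative):

import os

import torch
import torch.distributed as dist

local_rank = int(os.environ['LOCAL_RANK'])  # set by torchrun for each process
torch.cuda.set_device(local_rank)

dist.init_process_group(backend='nccl', init_method='env://')
print(f"rank {dist.get_rank()} of {dist.get_world_size()} on GPU {local_rank}")
dist.destroy_process_group()

Run with, for example: torchrun --nproc_per_node=4 train_script.py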


How to set up distributed training is described in this guide: https://huggingface.co/blog/pytorch-ddp-accelerate-transformers

But you can also do the setup by adding the following lines to your code:

import os

import torch
import torch.distributed as dist

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'

# rank must be smaller than world_size; for a single process,
# use rank=0 and world_size=1
dist.init_process_group(backend='nccl', init_method='env://', rank=0, world_size=1)
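
Note that this initializes only a single process; to actually use all four visible GPUs without an external launcher, each GPU needs its own process with its own rank. A minimal sketch using torch.multiprocessing.spawn (the worker function is illustrative):

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # each spawned process joins the group under its own rank
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    print(f"rank {rank} of {dist.get_world_size()}")
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()  # one process per GPU
    mp.spawn(worker, args=(world_size,), nprocs=world_size)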


You can also add these lines to your script if you want to run it directly with plain python (helpful for debugging):

import os

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'
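
With env:// initialization, the rank and world size can also come from the RANK and WORLD_SIZE environment variables, so a single-process debug run needs nothing else. A minimal sketch, assuming one GPU:

import os

import torch.distributed as dist

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'
os.environ['RANK'] = '0'         # single debug process
os.environ['WORLD_SIZE'] = '1'

dist.init_process_group(backend='nccl')  # init_method defaults to env://
print(dist.get_world_size())  # prints 1
dist.destroy_process_group()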
