
I am not able to initialize the process group in PyTorch for a BERT model. I tried to initialize it using the following code:

import torch
import datetime

torch.distributed.init_process_group(
    backend='nccl',
    init_method='env://',
    timeout=datetime.timedelta(0, 1800),
    world_size=0,
    rank=0,
    store=None,
    group_name=''
)

and then tried to call the get_world_size() function:

num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()

Full code:

train_examples = None
num_train_optimization_steps = None
if do_train:
    train_examples = processor.get_train_examples(data_dir)
    num_train_optimization_steps = int(
        len(train_examples) / train_batch_size / gradient_accumulation_steps) * num_train_epochs
    if local_rank != -1:
        import datetime
        torch.distributed.init_process_group(
            backend='nccl',
            init_method='env://',
            timeout=datetime.timedelta(0, 1800),
            world_size=0,
            rank=0,
            store=None,
            group_name=''
        )
        num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()
        print(num_train_optimization_steps)

4 Answers

I solved the problem by referring to https://github.com/NVIDIA/apex/issues/99. Specifically, launch the script with:

python -m torch.distributed.launch xxx.py
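
The launcher spawns one process per GPU and sets the environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) that init_method='env://' reads, so the script should not hard-code world_size=0 or rank=0. A minimal sketch of a launcher-compatible script (the --local_rank handling and the 1800-second timeout are illustrative assumptions, not from the question):

import argparse
import datetime

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to each spawned process
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

# Bind this process to its GPU before initializing the NCCL backend
torch.cuda.set_device(args.local_rank)

# rank and world_size are taken from the launcher's environment variables
dist.init_process_group(
    backend='nccl',
    init_method='env://',
    timeout=datetime.timedelta(seconds=1800),
)

print(f"rank {dist.get_rank()} of {dist.get_world_size()}")
dist.destroy_process_group()

Launched with, for example: python -m torch.distributed.launch --nproc_per_node=4 xxx.py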

1 Comment

When I do this I get a "No module named 'fire'" error, even though the package is already installed.

Just an update: instead of running

$ python -m torch.distributed.launch --use_env train_script.py

you now only need to run

$ torchrun train_script.py

as indicated in the PyTorch documentation, which deprecates torch.distributed.launch in favor of torchrun.
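
torchrun behaves as if --use_env were passed: it does not add a --local_rank argument, and the script reads LOCAL_RANK from the environment instead. A minimal sketch of a torchrun-compatible train_script.py (the names are illustrative):

import os

import torch
import torch.distributed as dist

local_rank = int(os.environ['LOCAL_RANK'])  # set by torchrun for each process
torch.cuda.set_device(local_rank)

dist.init_process_group(backend='nccl', init_method='env://')
print(f"rank {dist.get_rank()} of {dist.get_world_size()} on GPU {local_rank}")
dist.destroy_process_group()

Run with, for example: torchrun --nproc_per_node=4 train_script.py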


How to set up distributed training is described in this guide: https://huggingface.co/blog/pytorch-ddp-accelerate-transformers

But you can also do the setup by adding the following lines to your code:

import os

import torch
import torch.distributed as dist

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'

# rank must be smaller than world_size; for a single process,
# use rank=0 and world_size=1
dist.init_process_group(backend='nccl', init_method='env://', rank=0, world_size=1)
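
Note that this initializes only a single process; to actually use all four visible GPUs without an external launcher, each GPU needs its own process with its own rank. A minimal sketch using torch.multiprocessing.spawn (the worker function is illustrative):

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # each spawned process joins the group under its own rank
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    print(f"rank {rank} of {dist.get_world_size()}")
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()  # one process per GPU
    mp.spawn(worker, args=(world_size,), nprocs=world_size)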


You can also add these lines to your script if you want to run it directly with plain python (helpful for debugging):

import os

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'
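
With env:// initialization, the rank and world size can also come from the RANK and WORLD_SIZE environment variables, so a single-process debug run needs nothing else. A minimal sketch, assuming one GPU:

import os

import torch.distributed as dist

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'
os.environ['RANK'] = '0'         # single debug process
os.environ['WORLD_SIZE'] = '1'

dist.init_process_group(backend='nccl')  # init_method defaults to env://
print(dist.get_world_size())  # prints 1
dist.destroy_process_group()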
