I am having an issue with WandB: it disconnects frequently and non-deterministically. The details are below.

Error Message:

WandB run initialized successfully. Run ID: f782c567-5e34-4544-b7dc-a717e0af18dd, Name: repercussion-24-07-21-01-32-50-536
wandb: Network error (ConnectionError), entering retry loop.

WandB run initialized successfully. Run ID: f782c567-5e34-4544-b7dc-a717e0af18dd, Name: repercussion-24-07-21-01-32-50-536
wandb: Network error (ConnectionError), entering retry loop.

Code Implementation:

import os
import wandb
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_fixed

class WandbRun:
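    def __init__(self, project: str, entity: str, config: dict | None = None, tags: list | None = None, run_name: str = "", run_id: str = ""):
        self.project = project
        self.entity = entity
        # Avoid mutable default arguments, which would be shared across instances.
        self.config = config if config is not None else {}
        self.tags = tags if tags is not None else []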
        self.run_name = run_name
        self.run_id = run_id
        self.run = None
        self.initialize_run()

    @retry(
        stop=stop_after_attempt(5),
        wait=wait_fixed(2),
        # Retry on both error types that the except clause below re-raises.
        retry=retry_if_exception_type((wandb.CommError, ConnectionError)),
    )
    def initialize_run(self):
        params = {
            "project": self.project,
            "entity": self.entity,
            "config": self.config,
            "tags": self.tags,
            "resume": "allow",
        }
        if self.run_name:
            params["name"] = self.run_name
        if self.run_id:
            params["id"] = self.run_id
        try:
            self.run = wandb.init(**params)
            print(f"WandB run initialized successfully. Run ID: {self.run.id}, Name: {self.run.name}")
        except (wandb.CommError, ConnectionError) as e:
            print(f"WandB initialization failed: {e}")
            raise
        except Exception as e:
            print(f"Unexpected error during WandB initialization: {e}")
            raise

# Usage
run = WandbRun(
    project="my_project",
    entity="my_entity",
    config={"param1": "value1"},
    tags=["tag1", "tag2"],
    run_name="my_run_name"
)
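
For reference, here is a minimal stdlib-only sketch of what a retry decorator of this kind does, assuming it behaves roughly like the tenacity setup above: a retry happens only when the wrapped call itself raises one of the listed exception types. The names `retry_on`, `FakeCommError`, and `flaky_init` are hypothetical and exist only for illustration; they are not part of wandb or tenacity. In the log above, the network error is reported after the "initialized successfully" line, so it may originate outside the decorated call, in which case a wrapper like this would never see it. That is an assumption based on the message order, not something I have confirmed.

```python
import functools
import time


def retry_on(exceptions, attempts=5, delay=0.0):
    """Retry the wrapped callable when it raises one of `exceptions`.

    Hypothetical helper for illustration; not tenacity's implementation.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of attempts: surface the last error
                    time.sleep(delay)  # fixed wait between attempts
        return wrapper
    return decorator


class FakeCommError(Exception):
    """Hypothetical stand-in for a library-specific communication error."""


calls = {"count": 0}


@retry_on((FakeCommError, ConnectionError), attempts=5)
def flaky_init():
    # Fails twice, then succeeds, mimicking a transient network problem.
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeCommError("transient failure")
    return "run-initialized"


result = flaky_init()
```

An exception type that is not in the tuple passes straight through without a retry, which is why the tuple given to the decorator should match the errors that the wrapped function re-raises.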

Environment Details:

WandB version: 0.13.7
Python version: 3.10
Cloud provider: GCP

I have also tried the WandB troubleshooting guide: https://docs.wandb.ai/guides/technical-faq/troubleshooting#how-do-i-deal-with-network-issues.

Any suggestions about why this happens, or how to fix it, would be greatly appreciated.

Thanks :)

1 Answer

Are you still experiencing this issue? What type of setup are you running your code from? Is this a distributed setup?
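
If it helps with answering those questions, something like the following sketch could collect the relevant details in one go. It only uses the standard library. `collect_setup_info` is a hypothetical helper name, and reading `WORLD_SIZE`, `RANK`, and `LOCAL_RANK` as a sign of a distributed launch is an assumption based on common launcher conventions, not something wandb defines. Proxy variables are reported by name only, since their values can contain credentials.

```python
import os
import platform
from importlib import metadata


def collect_setup_info(env=None):
    """Gather environment details that are useful when reporting network issues."""
    env = os.environ if env is None else env
    try:
        wandb_version = metadata.version("wandb")
    except metadata.PackageNotFoundError:
        wandb_version = "not installed"

    proxy_vars = ("HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY")
    # Assumption: these variables indicate a distributed launch.
    dist_vars = ("WORLD_SIZE", "RANK", "LOCAL_RANK")

    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "wandb": wandb_version,
        "wandb_mode": env.get("WANDB_MODE", "unset"),
        "wandb_base_url": env.get("WANDB_BASE_URL", "unset"),
        # Names only: proxy values may embed usernames or passwords.
        "proxies_set": sorted(
            name for name in proxy_vars
            if env.get(name) or env.get(name.lower())
        ),
        "distributed_env": {name: env[name] for name in dist_vars if name in env},
    }


if __name__ == "__main__":
    for key, value in collect_setup_info().items():
        print(f"{key}: {value}")
```

Running it on the machine where the disconnects occur and pasting the output into the question would cover most of what I am asking about.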
