I am having an issue with WandB: it disconnects frequently and non-deterministically. The details are below.
Error Message:
WandB run initialized successfully. Run ID: f782c567-5e34-4544-b7dc-a717e0af18dd, Name: repercussion-24-07-21-01-32-50-536
wandb: Network error (ConnectionError), entering retry loop.
Code Implementation:
import os
import wandb
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_fixed


class WandbRun:
    # Defaults are None rather than {} / [] so instances never share a mutable default.
    def __init__(self, project: str, entity: str, config: dict | None = None, tags: list | None = None, run_name: str = "", run_id: str = ""):
        self.project = project
        self.entity = entity
        self.config = config if config is not None else {}
        self.tags = tags if tags is not None else []
        self.run_name = run_name
        self.run_id = run_id
        self.run = None
        self.initialize_run()

    # Retry up to 5 times, 2 seconds apart. wandb.CommError is included so that
    # the errors caught and re-raised below are actually retried.
    @retry(
        stop=stop_after_attempt(5),
        wait=wait_fixed(2),
        retry=retry_if_exception_type((wandb.CommError, ConnectionError)),
    )
    def initialize_run(self):
        params = {
            "project": self.project,
            "entity": self.entity,
            "config": self.config,
            "tags": self.tags,
            "resume": "allow",
        }
        # Only pass name/id when they were provided.
        if self.run_name:
            params["name"] = self.run_name
        if self.run_id:
            params["id"] = self.run_id
        try:
            self.run = wandb.init(**params)
            print(f"WandB run initialized successfully. Run ID: {self.run.id}, Name: {self.run.name}")
        except (wandb.CommError, ConnectionError) as e:
            print(f"WandB initialization failed: {e}")
            raise
        except Exception as e:
            print(f"Unexpected error during WandB initialization: {e}")
            raise

# Usage
run = WandbRun(
    project="my_project",
    entity="my_entity",
    config={"param1": "value1"},
    tags=["tag1", "tag2"],
    run_name="my_run_name",
)
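
To make the retry behaviour easier to reason about without tenacity or a WandB account, here is a minimal stdlib-only sketch of the policy configured on initialize_run (5 attempts, fixed wait, retry only on the listed exception types). The names retry_on and flaky_init are hypothetical and exist only for illustration; flaky_init merely simulates a call that fails twice before succeeding. One difference from tenacity: when attempts run out, tenacity by default raises RetryError (unless reraise=True is set), whereas this sketch re-raises the original exception. Note that a wrapper like this only reacts to exceptions raised inside the decorated call; a message that is logged after the call has already returned successfully would not trigger it.

```python
import functools
import time


def retry_on(exc_types, attempts=5, wait_seconds=2.0):
    """Hypothetical helper: retry on exc_types with a fixed wait between attempts."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exc_types:
                    if attempt == attempts:
                        raise  # out of attempts: surface the last error
                    time.sleep(wait_seconds)
        return wrapper
    return decorator


# Counts how many times the simulated initializer was invoked.
calls = {"count": 0}


@retry_on(ConnectionError, attempts=5, wait_seconds=0.0)
def flaky_init():
    """Hypothetical stand-in for an init call: fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "run-ok"


result = flaky_init()
print(result, "after", calls["count"], "attempts")
```

Exceptions of any type not passed to retry_on propagate immediately on the first attempt, which matches how retry_if_exception_type filters in the code above.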
Environment Details:
WandB version: 0.13.7
Python version: 3.10
Cloud provider: GCP
Also, I have tried the WandB troubleshooting guide: https://docs.wandb.ai/guides/technical-faq/troubleshooting#how-do-i-deal-with-network-issues.
Any suggestions on why this happens, or how to fix it, would be greatly appreciated.
Thanks :)