I encountered this error while trying to run the Hugging Face Trainer on multiple GPUs:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
I use a T5 model from which I extract only the encoder. I shard the encoder across two devices, wrap it with LoRA, and attach a classifier head to it.
This is the model code:
import torch
import torch.nn as nn
from accelerate import dispatch_model
from transformers.modeling_outputs import SequenceClassifierOutput


class ProtT5ForClassification(nn.Module):
    def __init__(self, encoder, device_map):
        super().__init__()
        self.encoder = encoder  # already sharded
        hidden = self.encoder.config.d_model
        # create classifier but don’t push it to a device yet
        self.classifier = nn.Linear(hidden, 1, bias=True).to(torch.float16)
        # dispatch classifier to follow the encoder device map
        # simplest: put it entirely on the last shard (cuda:1 here)
        dispatch_model(self.classifier, device_map={"": "cuda:1"})
        self.loss_fn = nn.BCEWithLogitsLoss()

    def masked_mean_pool(self, hidden_states, attention_mask):
        # average token embeddings, ignoring padded positions
        mask = attention_mask.unsqueeze(-1).type_as(hidden_states)
        summed = (hidden_states * mask).sum(dim=1)
        denom = mask.sum(dim=1).clamp(min=1e-9)
        return summed / denom

    def forward(self, input_ids=None, attention_mask=None, labels=None, **kwargs):
        # IMPORTANT: do not pass anything other than encoder-expected args to encoder
        enc_out = self.encoder(input_ids=input_ids, attention_mask=attention_mask, return_dict=True)
        last_hidden = enc_out.last_hidden_state
        pooled = self.masked_mean_pool(last_hidden, attention_mask)
        logits = self.classifier.to(pooled.device)(pooled).squeeze(-1)
        loss = None
        if labels is not None:
            labels = labels.float().view(-1)
            loss = self.loss_fn(logits, labels)
        return SequenceClassifierOutput(loss=loss, logits=logits)
I assume the problem is that the classifier head and the final layer of the encoder are not on the same device, so I tried mapping the classifier head and the encoder's last layer to the same device, but the error persists.
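One way to test that assumption is to list where each parameter actually lives. The snippet below is only a minimal sketch: `summarize_devices` is a hypothetical helper name, not part of PyTorch or Accelerate, and it assumes nothing beyond an object that exposes `named_parameters()` with a `.device` attribute on each parameter, as `torch.nn.Module` does.

```python
from collections import defaultdict


def summarize_devices(module):
    """Group parameter names by the device they are stored on.

    Hypothetical helper: works with any object exposing
    named_parameters(), such as a torch.nn.Module.
    """
    by_device = defaultdict(list)
    for name, param in module.named_parameters():
        # str() turns a torch.device such as cuda:1 into a plain dict key
        by_device[str(param.device)].append(name)
    return dict(by_device)
```

Calling `summarize_devices(model)` on the wrapped model should show whether the classifier parameters share a device with the last encoder block. Printing `logits.device` and `labels.device` just before the loss call would additionally show whether the mismatch comes from the inputs rather than from the parameters.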
Could anyone figure out what's wrong?
Thanks