I encountered this error while trying to run the Hugging Face Trainer on multiple GPUs:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

I use a T5 model from which I extract only the encoder. I shard the encoder across two devices, wrap it with LoRA, and attach a classifier head to it.

This is the model code:

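import torch
import torch.nn as nn
from accelerate import dispatch_model
from transformers.modeling_outputs import SequenceClassifierOutput

class ProtT5ForClassification(nn.Module):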
    def __init__(self, encoder, device_map):
        super().__init__()
        self.encoder = encoder  # already sharded
        hidden = self.encoder.config.d_model

        # create classifier but don’t push it to a device yet
        self.classifier = nn.Linear(hidden, 1, bias=True).to(torch.float16)

        # dispatch classifier to follow the encoder device map
        # simplest: put it entirely on the last shard (cuda:1 here)
        dispatch_model(self.classifier, device_map={"": "cuda:1"})

        self.loss_fn = nn.BCEWithLogitsLoss()

    def masked_mean_pool(self, hidden_states, attention_mask):
        mask = attention_mask.unsqueeze(-1).type_as(hidden_states)
        summed = (hidden_states * mask).sum(dim=1)
        denom = mask.sum(dim=1).clamp(min=1e-9)
        return summed / denom

    def forward(self, input_ids=None, attention_mask=None, labels=None, **kwargs):
        # IMPORTANT: do not pass anything other than encoder-expected args to encoder
        enc_out = self.encoder(input_ids=input_ids, attention_mask=attention_mask, return_dict=True)
        last_hidden = enc_out.last_hidden_state
        pooled = self.masked_mean_pool(last_hidden, attention_mask)
        logits = self.classifier.to(pooled.device)(pooled).squeeze(-1)
        
        loss = None
        if labels is not None:
            labels = labels.float().view(-1)
            loss = self.loss_fn(logits, labels)

        return SequenceClassifierOutput(loss=loss, logits=logits)

I assume the problem is that the classifier head and the final layer of the encoder are not on the same device, so I tried mapping the classifier head and the last encoder layer to the same device, but the error persists.
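To narrow this down, a device check like the one below might help. It is only a minimal sketch and not part of the model code above: `report_devices` and `loss_on_logits_device` are hypothetical helper names, and the idea that the labels might sit on a different device than the logits (for example, labels on cuda:0 while the logits come out on cuda:1) is an unverified assumption. The demo runs on CPU so it can be tried without the sharded model.

```python
import torch
import torch.nn as nn


def report_devices(**tensors):
    """Return a {name: device string} map for every tensor that is not None."""
    return {name: str(t.device) for name, t in tensors.items() if t is not None}


def loss_on_logits_device(logits, labels, loss_fn):
    """Move the labels to the device of the logits before computing the loss.

    Assumption: the mismatch comes from the labels, not from the encoder or
    the classifier head. This has not been confirmed for the model above.
    """
    labels = labels.float().view(-1).to(logits.device)
    return loss_fn(logits, labels)


# Tiny CPU-only demonstration; on a sharded model the reported devices
# could differ between the entries.
logits = torch.zeros(4)
labels = torch.tensor([0, 1, 0, 1])
devices = report_devices(logits=logits, labels=labels)
loss = loss_on_logits_device(logits, labels, nn.BCEWithLogitsLoss())
print(devices, loss.item())
```

Calling `report_devices(input_ids=input_ids, attention_mask=attention_mask, pooled=pooled, logits=logits, labels=labels)` inside `forward` would show which pair of tensors ends up on different devices.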

Could anyone figure out what's wrong?

Thanks
