23
$\begingroup$

I have x_data and labels separately. How can I combine and load them in the model using torch.utils.data.DataLoader?

I created a dataset whose training data has 20k samples, with the labels stored separately. Let's say I want to load the dataset into the model, shuffle it each time, and use a batch size of my choosing. DataLoader does that. How can I combine the data and labels and pass them to DataLoader so that I can train a model in PyTorch?

$\endgroup$
2
  • $\begingroup$ See discussion on StackOverflow here: stackoverflow.com/questions/41924453/… $\endgroup$ Commented Mar 13, 2019 at 17:00
  • $\begingroup$ I found this example using TensorDataset to be helpful: stackoverflow.com/questions/55588201/… If x_data and labels are both PyTorch tensors, you can combine them into a TensorDataset and then create a DataLoader from that TensorDataset. $\endgroup$ Commented Jun 11, 2020 at 7:54

2 Answers

17
$\begingroup$

Assuming both x_data and labels are lists or NumPy arrays:

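import torch

# Pair each sample with its label; a plain list of [x, y] pairs works as a map-style dataset
train_data = [[x, y] for x, y in zip(x_data, labels)]

# shuffle=True reshuffles the data every time the loader is iterated
trainloader = torch.utils.data.DataLoader(train_data, shuffle=True, batch_size=100)
i1, l1 = next(iter(trainloader))
print(i1.shape)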
$\endgroup$
2
  • $\begingroup$ I end up getting errors saying that train_data is an np.object rather than a tensor. $\endgroup$ Commented Nov 28, 2020 at 19:47
  • 2
    $\begingroup$ For me, it worked. You can even use a shorter version: trainloader = torch.utils.data.DataLoader([ [x_data[i], labels[i]] for i in range(len(labels))], shuffle=True, batch_size=100). Thank you @ASHu2 $\endgroup$ Commented Jun 15, 2021 at 8:40
7
$\begingroup$

I think the standard way is to create a Dataset class object from the arrays and pass the Dataset object to the DataLoader.

One solution is to inherit from the Dataset class and define a custom class that implements __len__() and __getitem__(), passing X and y to __init__(self, X, y).
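A minimal sketch of such a custom class might look like the following. The class name ArrayDataset, the random stand-in data, and the choice of float32 features with int64 labels are assumptions for illustration, not something the question specifies.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class ArrayDataset(Dataset):
    """Hypothetical map-style dataset wrapping a feature array and a label array."""

    def __init__(self, X, y):
        assert len(X) == len(y), "X and y must have the same number of rows"
        # Convert once up front so __getitem__ stays cheap
        self.X = torch.as_tensor(X, dtype=torch.float32)
        self.y = torch.as_tensor(y, dtype=torch.long)

    def __len__(self):
        # Number of samples; DataLoader uses this to build its index sampler
        return len(self.X)

    def __getitem__(self, i):
        # Return one (features, label) pair; DataLoader stacks these into batches
        return self.X[i], self.y[i]


# Stand-in data (assumed shapes): 20 samples with 5 features each, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.integers(0, 2, size=20)

dataset = ArrayDataset(X, y)
loader = DataLoader(dataset, batch_size=4, shuffle=True)
xb, yb = next(iter(loader))
print(xb.shape, yb.shape)
```

A custom class like this pays off mainly when each sample needs extra work in __getitem__(), such as loading from disk or applying augmentation.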

For your simple case with two arrays, where __getitem__() needs to do nothing beyond returning row i, you can instead convert the arrays to Tensor objects and pass them to TensorDataset.

Run the following code for a self-contained example.

# Create a dataset like the one you describe
from sklearn.datasets import make_classification
X, y = make_classification()

# Load the necessary PyTorch packages
import torch
from torch.utils.data import DataLoader, TensorDataset

# Create a dataset from several tensors with matching first dimension
# Samples will be drawn from the first dimension (rows)
# Features become float32; integer class labels become int64,
# which losses such as CrossEntropyLoss expect for class indices
dataset = TensorDataset(torch.tensor(X, dtype=torch.float32),
                        torch.tensor(y, dtype=torch.long))

# Create a data loader from the dataset
# Type of sampling (here: reshuffle on every pass) and batch size are specified at this step
loader = DataLoader(dataset, batch_size=3, shuffle=True)

# Quick test
next(iter(loader))
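
To actually train with such a loader, you iterate over it once per epoch. The sketch below is only an illustration under assumed settings: the random stand-in data, the single linear layer, the SGD learning rate, and the two epochs are all placeholders. With shuffle=True, the loader draws a new random order each time it is iterated.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data (assumed shapes): 10 samples, 4 features, binary labels
X = torch.randn(10, 4)
y = torch.randint(0, 2, (10,))
dataset = TensorDataset(X, y)

# shuffle=True draws a new random order each time the loader is iterated
loader = DataLoader(dataset, batch_size=3, shuffle=True)

# Placeholder model, loss, and optimizer for illustration only
model = torch.nn.Linear(4, 2)
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(2):
    seen = 0
    for xb, yb in loader:
        # xb: (batch, 4) features, yb: (batch,) labels; the last batch may be smaller
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
        seen += len(xb)
    print(f"epoch {epoch}: processed {seen} samples")
```

Because 10 is not a multiple of 3, the final batch of each epoch holds a single sample; pass drop_last=True to DataLoader if every batch must be the same size.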
$\endgroup$
