Loading own train data and labels in dataloader using pytorch?

Question

I have x_data and labels separately. How can I combine them and load them into the model using torch.utils.data.DataLoader?

I created my own dataset: the training data has 20k samples, and the labels are stored separately. Let's say I want to load the dataset into the model, shuffle it each time, and use a batch size of my choosing. DataLoader does that. How can I combine the data and labels and pass them to DataLoader so that I can train the model in PyTorch?

See the discussion on Stack Overflow here: stackoverflow.com/questions/41924453/…
– Johannes, Commented Mar 13, 2019 at 17:00
I found this example using TensorDataset helpful: stackoverflow.com/questions/55588201/… If x_data and labels are both PyTorch tensors, you can combine them into a TensorDataset and then create a DataLoader from that TensorDataset.
– littleO, Commented Jun 11, 2020 at 7:54

ASHu2 · Accepted Answer · 2019-03-13 14:19:19Z

17

Assuming both x_data and labels are lists or NumPy arrays:

import torch

# Pair each sample with its label; DataLoader accepts any indexable sequence
train_data = []
for i in range(len(x_data)):
    train_data.append([x_data[i], labels[i]])

trainloader = torch.utils.data.DataLoader(train_data, shuffle=True, batch_size=100)
i1, l1 = next(iter(trainloader))  # one batch of inputs and one batch of labels
print(i1.shape)


I end up getting errors saying that train_data is an np.object rather than a tensor.
– Gunner Stone, Commented Nov 28, 2020 at 19:47

For me, it worked. You can even use a shorter version: trainloader = torch.utils.data.DataLoader([[x_data[i], labels[i]] for i in range(len(labels))], shuffle=True, batch_size=100). Thank you @ASHu2
– Leo, Commented Jun 15, 2021 at 8:40


Johannes · Accepted Answer · 2019-03-13 17:06:39Z

I think the standard way is to create a Dataset class object from the arrays and pass the Dataset object to the DataLoader.

One solution is to inherit from the Dataset class and define a custom class that implements __len__() and __getitem__(), where you pass X and y to __init__(self, X, y).
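As a rough sketch of that custom-class approach, the following might work. The class name ArrayDataset, the random placeholder arrays, and the choice of float32 features with integer (long) labels are all assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class ArrayDataset(Dataset):
    """Hypothetical wrapper around a feature array X and a label array y."""

    def __init__(self, X, y):
        assert len(X) == len(y), "X and y must have the same number of rows"
        # Assumed dtypes: float32 features, integer class labels
        self.X = torch.as_tensor(X, dtype=torch.float32)
        self.y = torch.as_tensor(y, dtype=torch.long)

    def __len__(self):
        # Number of samples in the dataset
        return len(self.X)

    def __getitem__(self, i):
        # Return the i-th (sample, label) pair
        return self.X[i], self.y[i]


# Placeholder data standing in for your own arrays
X = np.random.rand(20, 5)
y = np.random.randint(0, 2, size=20)

dataset = ArrayDataset(X, y)
loader = DataLoader(dataset, batch_size=4, shuffle=True)
xb, yb = next(iter(loader))
```

A custom class like this is mainly worth the extra code when __getitem__() has to do more than index a row, for example loading a file from disk or applying a transform.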

For your simple case with two arrays and no need for a special __getitem__() function beyond taking the values in row i, you can also convert the arrays into Tensor objects and pass them to TensorDataset.

Run the following code for a self-contained example.

# Create a dataset like the one you describe
from sklearn.datasets import make_classification
X, y = make_classification()

# Load necessary PyTorch packages
from torch.utils.data import DataLoader, TensorDataset
from torch import Tensor

# Create dataset from several tensors with matching first dimension
# Samples will be drawn from the first dimension (rows)
# Note: Tensor(y) gives float32 labels; cast with .long() if your loss expects integer class indices
dataset = TensorDataset(Tensor(X), Tensor(y))

# Create a data loader from the dataset
# Type of sampling and batch size are specified at this step
loader = DataLoader(dataset, batch_size=3)

# Quick test: fetch one batch of (inputs, labels)
next(iter(loader))
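The question also asks for shuffling, which the loader above does not enable. Here is a minimal sketch, assuming random placeholder tensors and a batch size of 8 chosen only for illustration: passing shuffle=True makes DataLoader reshuffle the data at every epoch, and the loader can be iterated directly in a training loop.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder tensors standing in for your own data and labels
X = torch.randn(20, 5)
y = torch.randint(0, 2, (20,))

dataset = TensorDataset(X, y)
# shuffle=True draws the samples in a new random order each epoch
loader = DataLoader(dataset, batch_size=8, shuffle=True)

seen = 0
batch_sizes = []
for epoch in range(2):
    for xb, yb in loader:
        # The forward pass, loss computation, and optimizer step would go here
        seen += len(xb)
        batch_sizes.append(len(xb))
```

The last batch of each epoch is smaller when the dataset size is not a multiple of the batch size; DataLoader's drop_last=True option discards it if a fixed batch size is required.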
