2

I am beginner pytorch user, and I am trying to use dataloader.

Actually, I am trying to implement this into my network but it takes a very long time to load. And so, I debugged my network to see if the network itself has the problem, but it turns out it has something to with my dataloader class. Here is the code:

 from torch.utils.data import Dataset, DataLoader
 import numpy as np
 import pandas as pd

class DiabetesDataset(Dataset):

  def __init__(self, csv):
      self.xy = pd.read_csv(csv)

  def __len__(self):
      return len(self.xy)

  def __getitem__(self, index):
       self.x_data = torch.Tensor(xy.iloc[:, 0:-1].values)
       self.y_data = torch.Tensor(xy.iloc[:, [-1]].values)
       return self.x_data[index], self.y_data[index]

 dataset = DiabetesDataset("trial.csv")
 train_loader = DataLoader(dataset=dataset,
                      batch_size=1,
                      shuffle=True,
                      num_workers=2)`

 for a in train_loader:
    print(a)

To verify that the dataloader causes all the delay, I created a dummy csv file with 2 columns of 1s and 2s, for a total of 10 samples for each columns. Then, I looped over the train_loader object, it has been more than 1 hr and it is still running, considering that the sample size is small and batch size is set to 1.

I am not sure as to what the error to my code is and it is causing this issue.

Any comments/inputs are greatly appreciated!

1 Answer 1

3

There are some bugs in your code - could you check if this works (it is working on my computer with your toy example):

from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd
import torch


class DiabetesDataset(Dataset):

    def __init__(self, csv):
        self.xy = pd.read_csv(csv)

    def __len__(self):
        return len(self.xy)

    def __getitem__(self, index):
        x_data = torch.Tensor(self.xy.iloc[:, 0:-1].values)
        y_data = torch.Tensor(self.xy.iloc[:, [-1]].values)
        return x_data[index], y_data[index]


dataset = DiabetesDataset("trial.csv")


train_loader = DataLoader(
    dataset=dataset,
    batch_size=1,
    shuffle=True,
    num_workers=2)

if __name__ == '__main__':
    for a in train_loader:
        print(a)

Edit: Your code is not working because you are missing a self in the __getitem__ method (self.xy.iloc...) and because you do not have a if __name__ == '__main__ at the end of your script. For the second error, see RuntimeError on windows trying python multiprocessing

Sign up to request clarification or add additional context in comments.

5 Comments

Thanks for the quick response! I copied and pasted your code, but it is still running (5 minutes have elapsed).
Are you using the if __name__ == '__main__' at the end?
Could you try setting num_workers=0 in the DataLoader?
it worked!! thank you very much. Would you know if this has something to do with the computing power/specs of my laptop?
I am really not an expert. You should read about subprocesses. There is something wrong if you can't use 2 subprocesses to load the data, but I cannot tell you the source. You could start by looking at your task manager to check your CPU usage.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.