
I am trying to write a custom torch data loader so that large CSV files can be loaded incrementally (by chunks).

I have a rough idea of how to do that. However, I keep getting some PyTorch error that I do not know how to solve.


import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

# Create dummy csv data
nb_samples = 110
a = np.arange(nb_samples)
df = pd.DataFrame(a, columns=['data'])
df.to_csv('data.csv', index=False)


# Create Dataset
class CSVDataset(Dataset):
    def __init__(self, path, chunksize, nb_samples):
        self.path = path
        self.chunksize = chunksize
        self.len = nb_samples / self.chunksize

    def __getitem__(self, index):
        x = next(
            pd.read_csv(
                self.path,
                skiprows=index * self.chunksize + 1,  #+1, since we skip the header
                chunksize=self.chunksize,
                names=['data']))
        x = torch.from_numpy(x.data.values)
        return x

    def __len__(self):
        return self.len


dataset = CSVDataset('data.csv', chunksize=10, nb_samples=nb_samples)
loader = DataLoader(dataset, batch_size=10, num_workers=1, shuffle=False)

for batch_idx, data in enumerate(loader):
    print('batch: {}\tdata: {}'.format(batch_idx, data))

Running this raises the error: TypeError: 'float' object cannot be interpreted as an integer

1 Answer


The error is caused by this line:

self.len = nb_samples / self.chunksize

In Python 3, dividing with / always produces a float, but len() requires __len__() to return an integer. You therefore have to round self.len and/or convert it to an integer, for example with floor division:

self.len = nb_samples // self.chunksize

The double slash (//) is floor division: it rounds down, and with two integer operands it returns an integer.
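
To illustrate the difference, here is a small sketch in plain Python (no torch needed). The helper `num_chunks` and its `keep_partial` flag are hypothetical names, not part of the original code; the ceiling variant assumes you might want to keep a final partial chunk when the sample count is not a multiple of the chunk size.

```python
nb_samples = 110
chunksize = 10

true_div = nb_samples / chunksize    # true division: always a float
floor_div = nb_samples // chunksize  # floor division: int for int operands


def num_chunks(nb_samples, chunksize, keep_partial=False):
    """Hypothetical helper: number of chunks, always returned as an int."""
    if keep_partial:
        # Integer ceiling division, so a trailing partial chunk is counted.
        return (nb_samples + chunksize - 1) // chunksize
    # Floor division drops a trailing partial chunk.
    return nb_samples // chunksize


print(type(true_div).__name__, true_div)
print(type(floor_div).__name__, floor_div)
print(num_chunks(115, 10), num_chunks(115, 10, keep_partial=True))
```

Note that plain floor division silently drops the last rows if nb_samples is not divisible by chunksize.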

Edit: You actually CAN return a float from __len__(); the error only occurs once len(dataset) is called. So I guess len(dataset) is called somewhere inside the DataLoader class.
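
A minimal sketch of that behaviour, using a hypothetical class `FloatLen` as a stand-in for the Dataset (no torch involved): calling `__len__()` directly returns the float, and only the built-in `len()` raises.

```python
class FloatLen:
    """Hypothetical stand-in for a Dataset whose __len__ returns a float."""

    def __len__(self):
        return 110 / 10  # true division, so this is a float


obj = FloatLen()

# Calling the method directly is allowed and just returns the float.
direct = obj.__len__()

# The built-in len() checks the return type and raises TypeError.
try:
    len(obj)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(direct)
print(error_message)
```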


2 Comments

Many thanks for this suggestion. However, with this fix I get a new error: DataLoader worker (pid(s) 18357) exited unexpectedly
This error is unrelated, but maybe this answer helps.
