
I have a file containing paths to images I would like to load into PyTorch, while utilizing the built-in DataLoader features (multiprocess loading pipeline, data augmentations, and so on).

import os


def create_links():
    config = ConfigProvider.config()
    data_dir = "/myfolder"

    full_path_list = []
    assert os.path.isdir(data_dir)
    # walk the whole tree and collect the full path of every file found
    for dirpath, _, filenames in os.walk(data_dir):
        for filename in filenames:
            full_path_list.append(os.path.join(dirpath, filename))

    # write one path per line
    with open(config.data.links_file, 'w+') as links_file:
        for full_path in full_path_list:
            links_file.write(f"{full_path}\n")


def read_links_file_to_list():
    config = ConfigProvider.config()
    links_file_path = config.data.links_file
    if not os.path.isfile(links_file_path):
        raise RuntimeError("did you forget to create a file with links to images? Try using 'create_links()'")
    with open(links_file_path, 'r') as links_file:
        # strip trailing newlines so the paths can be opened directly
        return [line.strip() for line in links_file]

So I have a list of file paths (or a generator, or whatever works): file_list = read_links_file_to_list().

How can I build a PyTorch DataLoader around it, and how would I use it?

3 Comments

  • pytorch.org/tutorials/beginner/data_loading_tutorial.html Commented Jul 18, 2020 at 18:29
  • You can't get much more beginner than the docs at the pytorch.org official website with the suffix beginner/data_loading_tutorial.html. If you want to break into the DL field you are going to have to do some reading and work - there is no way around it. Commented Jul 18, 2020 at 18:36
  • Here is another way it was explained as an answer to a similar post: stackoverflow.com/a/78862393/26775041 Commented Aug 12, 2024 at 15:51

1 Answer

What you want is a custom Dataset: subclass torch.utils.data.Dataset and implement __len__ and __getitem__. The __getitem__ method is where you apply transforms such as data augmentation. To give you an idea of what this looks like in practice, here is a custom Dataset I wrote the other day:

import os

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset


class GTSR43Dataset(Dataset):
    """German Traffic Sign Recognition dataset."""
    def __init__(self, root_dir, train_file, transform=None):
        self.root_dir = root_dir
        self.train_file_path = train_file
        self.label_df = pd.read_csv(os.path.join(self.root_dir, self.train_file_path))
        self.transform = transform
        self.classes = list(self.label_df['ClassId'].unique())

    def __getitem__(self, idx):
        """Return (image, target) after resize and preprocessing."""
        # column 7 holds the relative image path, column 6 the class id (layout of this CSV)
        img_path = os.path.join(self.root_dir, self.label_df.iloc[idx, 7])

        X = Image.open(img_path)
        y = self.class_to_index(self.label_df.iloc[idx, 6])

        if self.transform:
            X = self.transform(X)

        return X, y

    def class_to_index(self, class_name):
        """Returns the index of a given class."""
        return self.classes.index(class_name)

    def index_to_class(self, class_index):
        """Returns the class of a given index."""
        return self.classes[class_index]

    def get_class_count(self):
        """Return a dict of label occurrences."""
        return dict(self.label_df.ClassId.value_counts())

    def __len__(self):
        """Returns the length of the dataset."""
        return len(self.label_df)
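
To connect this to your setup: below is a minimal sketch of a Dataset built directly around the list returned by read_links_file_to_list(), wrapped in a DataLoader. The class name ImagePathsDataset, the transform, the batch size, and the worker count are illustrative assumptions, not anything prescribed by PyTorch.

from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms


class ImagePathsDataset(Dataset):
    """Hypothetical dataset backed by a plain list of image file paths."""
    def __init__(self, file_list, transform=None):
        # tolerate trailing newlines in case the list came straight from readlines()
        self.paths = [p.strip() for p in file_list if p.strip()]
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")
        if self.transform:
            img = self.transform(img)
        return img


# usage, assuming read_links_file_to_list() from the question
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
dataset = ImagePathsDataset(read_links_file_to_list(), transform=transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

for batch in loader:
    pass  # each batch is a stacked tensor of shape (batch_size, 3, 224, 224)

Setting num_workers > 0 gives you the multiprocess loading pipeline you mentioned, and the transform passed to the Dataset is where the augmentations go; if you also have labels, return an (image, label) tuple from __getitem__ as in the class above.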

1 Comment

Good job. As a side note for the viewers, the full name for Dataset is torch.utils.data.Dataset.
