
I have about 20000 documents in subdirectories, and I would like to read them all and append them into one list of lists. This is my code so far:

topics =os.listdir(my_directory)
df =[]
for topic in topics:
    files = os.listdir (my_directory+ '/'+ topic)
    print(files)

    for file in files: 
        print(file)
        f = open(my_directory+ '/'+ topic+ '/'+file, 'r', encoding ='latin1')
        data = f.read().replace('\n', ' ')
        print(data)
        f.close()
    df = np.append(df, data)

However, this is inefficient, and it takes a long time to read the files and append them to the df list. My expected output is:

 df = [[doc1], [doc2], [doc3], [doc4], ..., [doc20000]]

I ran the above code and it took more than 6 hours and was still not finished (it probably got through about half of the documents). How can I change the code to make it faster?

  • I notice that you've flagged this as 'machine learning', and as such I won't answer your question exactly but give a couple of suggestions. It's generally bad practice to load all of your data into memory simultaneously, especially since you can perform reads while you're doing your other calculations. You should use the multiprocessing module to take advantage of another core to go and collect the next N batches while your model is computing gradients (or whatever it does); see the sketch after these comments. Otherwise, your code looks fine (it could be improved with context managers), but it needs to be multi-threaded. Commented May 24, 2020 at 5:46
  • As an aside, since df = np.append(df, data) is outside of the loop, you are throwing away all but the last data. Commented May 24, 2020 at 5:46
  • Opening 20 000 text files takes a lot of time in itself. Perhaps you could write a separate script to convert those into something like 100 CSV files, which are a lot faster to read? Commented May 24, 2020 at 5:49
  • Remove the print(data) call in the loop. Printing stuff takes a surprisingly long time what with all the scrolling, and it can be even slower if you're running the script in an IDE or something other than the terminal. Commented May 24, 2020 at 5:53
  • How big are these files, and do you have enough RAM to hold them? It shouldn't take hours to read enough data to swamp your RAM. At some point you may start thrashing the swap file, but eventually it'll all blow up. Commented May 24, 2020 at 6:03
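
The prefetching idea from the first comment might look roughly like the sketch below: a background worker reads the next batch of files into a bounded queue while the main loop works on the current batch. This is only a hedged illustration, not anyone's actual code; it uses a thread and a queue rather than the multiprocessing module (reading files is mostly I/O bound, so a thread is usually enough), list_files, iterate_batches, and train_on_batch are hypothetical names, and my_directory is the variable from the question.

import os
import queue
import threading

def list_files(root):
    # Collect every file path as root/<topic>/<file>, matching the question's layout.
    paths = []
    for topic in os.listdir(root):
        topic_dir = os.path.join(root, topic)
        for name in os.listdir(topic_dir):
            paths.append(os.path.join(topic_dir, name))
    return paths

def producer(paths, q, batch_size=100):
    # Read files in batches and hand them to the consumer through a bounded queue.
    for start in range(0, len(paths), batch_size):
        batch = []
        for path in paths[start:start + batch_size]:
            with open(path, encoding='latin1') as f:
                batch.append(f.read().replace('\n', ' '))
        q.put(batch)      # blocks if the consumer falls behind
    q.put(None)           # sentinel: no more batches

def iterate_batches(root):
    q = queue.Queue(maxsize=2)   # keep at most two batches in memory at a time
    threading.Thread(target=producer, args=(list_files(root), q), daemon=True).start()
    while True:
        batch = q.get()
        if batch is None:
            break
        yield batch

# Hypothetical usage:
# for batch in iterate_batches(my_directory):
#     train_on_batch(batch)    # placeholder for whatever work is done per batch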

3 Answers


There is only so much you can do to speed up disk access. You can use threads to overlap some of the file reads with the latin1 decoding and newline replacement, but realistically it won't make a huge difference.

import multiprocessing.pool
import os

import numpy as np

MEG = 2**20

# Build the full list of file paths up front.
filelist = []
for topic in os.listdir(my_directory):
    topic_dir = os.path.join(my_directory, topic)
    for file in os.listdir(topic_dir):
        filelist.append(os.path.join(topic_dir, file))

def worker(filename):
    # While one thread waits on the disk, others can decode and replace newlines.
    with open(filename, encoding='latin1', buffering=MEG) as f:
        return f.read().replace('\n', ' ')

with multiprocessing.pool.ThreadPool() as pool:
    datalist = pool.map(worker, filelist, chunksize=1)

df = np.array(datalist)

2 Comments

  • Can it be done using MPI too? Can you suggest a solution based on MPI?
  • Sure, but I think MPI would be slower because you'd have to copy data between processes. This task is likely I/O bound, and just letting one thread wait on reads while another is converting is about the best you'll get.

Generator functions allow you to declare a function that behaves like an iterator, i.e. it can be used in a for loop. See the Python documentation on generators and the classic "lazy function generator" pattern:

import pandas as pd

def read_in_chunks(file, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file.read(chunk_size)
        if not data:
            break
        yield data


# Plain usage of the generator: handle one piece at a time.
with open('big_file.dat') as f:
    for piece in read_in_chunks(f):
        process_data(piece)  # process_data is a placeholder for your own handling


# Wrap the generator in a file-like object so pandas can read from it.
class Reader(object):
    def __init__(self, g):
        self.g = g
    def read(self, n=0):
        try:
            return next(self.g)
        except StopIteration:
            return ''

with open('big_file.dat') as f:
    chunks = pd.read_csv(Reader(read_in_chunks(f)), chunksize=10000)
    df = pd.concat(list(chunks))  # stack the row chunks back together
df.to_csv("output.csv", index=False)

2 Comments

  • This is not a big file. The dataset contains 20000 documents in subdirectories. My aim is to read them one by one and append them into a list of lists.
  • OK, there is a Reader() which puts the data into a DataFrame and writes it to CSV.

Note

I misread the line df = np.append(df, data) and assumed you were appending to a DataFrame, not to a numpy array. So my comment is somewhat irrelevant, but I am leaving it for others who may misread it like me or who have a similar problem with pandas' DataFrame append.


Actual Problem

It looks like the question you are asking may not address your actual problem. Have you measured the performance of your two most important calls?

  • files = os.listdir (my_directory+ '/'+ topic)
  • df = np.append(df, data)

The way you formatted your code makes me think there is a bug: df = np.append(df, data) is outside the inner for loop's scope, so only your last data is appended to your data frame. If that's just a problem with code formatting here in the post and you really do append 20k files to your data frame, then this may be the problem: appending to a DataFrame is slow.

Potential Solution

As usual, slow performance can be tackled by throwing more memory at the problem. If you have enough memory to load all of the files beforehand and only then insert them into a DataFrame, this could prove to be faster.

The key is not to deal with any pandas operations until you have loaded all the data. Only then should you use DataFrame's from_records or one of its other factory methods.
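
For illustration, here is a minimal sketch of that "read everything first, build the DataFrame once" approach, assuming the same directory layout and my_directory variable as in the question; the single 'text' column is just an assumed name.

import os
import pandas as pd

# Read every document into a plain Python list first; no pandas calls yet.
texts = []
for topic in os.listdir(my_directory):
    topic_dir = os.path.join(my_directory, topic)
    for name in os.listdir(topic_dir):
        with open(os.path.join(topic_dir, name), encoding='latin1') as f:
            texts.append(f.read().replace('\n', ' '))

# One constructor call instead of 20000 appends.
df = pd.DataFrame.from_records([(t,) for t in texts], columns=['text'])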

A nice SO question with a little more discussion: Improve Row Append Performance On Pandas DataFrames

TL;DR

  1. Measure the time it takes to read all the files without dealing with pandas at all.
  2. If that proves to be much, much faster and you have enough memory to load all the files' contents at once, use another way to construct your DataFrame, e.g. DataFrame.from_records.

