I have about 20,000 documents in subdirectories. I would like to read them all and collect them into a single list of lists. This is my code so far:
import os
import numpy as np

topics = os.listdir(my_directory)
df = []
for topic in topics:
    files = os.listdir(my_directory + '/' + topic)
    print(files)
    for file in files:
        print(file)
        f = open(my_directory + '/' + topic + '/' + file, 'r', encoding='latin1')
        data = f.read().replace('\n', ' ')
        print(data)
        f.close()
        df = np.append(df, data)
However, this is inefficient: it takes a long time to read the files and append them to the df list. My expected output is:
df = [[doc1], [doc2], [doc3], [doc4], ..., [doc20000]]
I ran the above code and after more than 6 hours it still had not finished (it had probably processed about half of the documents). How can I change the code to make it faster?
If df = np.append(df, data) is outside of the loop, you are throwing all but the last data away.

Remove the print(data) call in the loop. Printing stuff takes a surprisingly long time, what with all the scrolling, and it can be even slower if you're running the script in an IDE or something other than the terminal.
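
Putting the comments together, here is a minimal sketch of one way the loop could be restructured; it is not the only option. It drops the print calls and collects the documents in a plain Python list instead of calling np.append, which copies the whole array on every call and also flattens the result rather than producing a list of lists. The function name read_documents, the sorted() calls (added only to make the order deterministic) and the isdir/isfile checks are assumptions for illustration, not part of the original code.

```python
import os

def read_documents(root_dir, encoding="latin1"):
    """Return [[doc1], [doc2], ...] for every file one level below root_dir."""
    docs = []
    for topic in sorted(os.listdir(root_dir)):  # sorted() is an assumption
        topic_dir = os.path.join(root_dir, topic)
        if not os.path.isdir(topic_dir):
            continue  # skip stray files sitting next to the topic folders
        for name in sorted(os.listdir(topic_dir)):
            path = os.path.join(topic_dir, name)
            if not os.path.isfile(path):
                continue  # skip nested directories
            # The with-block closes the file even if reading fails.
            with open(path, "r", encoding=encoding) as f:
                text = f.read().replace("\n", " ")
            # list.append does not copy the existing elements.
            docs.append([text])
    return docs
```

With the variable from the question, this would be called as df = read_documents(my_directory). If a NumPy array is really needed afterwards, it can be built once at the end with np.array(df) rather than grown inside the loop.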