I have a dataframe of file names and their paths, where each path is one continuous string, e.g.:
    files = pandas.DataFrame(...)

           name       path
    0      file1.txt  \\drive\folder1\folder2\folder3\...\file1.txt
    1      file2.pdf  \\drive\folder1\file2.pdf
    2      file3.xls  \\drive\folder1\folder2\folder3\...\folder21\file3.xls
    n      ...        ...
The frame has about 1.02E+06 entries; the folder depth is at most 21 levels, but varies greatly. The goal is to have a dataframe in the format:
        name       level1   level2   level3   level4   ...  level21
    0   file1.txt  folder1  folder2  folder3  0        ...  0
    1   file2.pdf  folder1  0        0        0        ...  0
    2   file3.xls  folder1  folder2  folder3  folder4  ...  folder21
    ...
I split the file location string into an array of folder names, which can be padded with zeros if the path is shorter than the maximum depth:

    import os
    import numpy as np

    def path_split(name):
        return np.array(os.path.normpath(name).split(os.sep)[7:])

    files = files.assign(plist=files['path'].apply(path_split))
Then I add a column with the number of folders in each file's path:

    files = files.assign(len_plist=files.plist.map(len))
The problem here is that the split paths create nested arrays within the dataframe. Next, I create an empty DataFrame with one column per folder level (21 here) and one row per file (1.02E+06 here):
    import pandas as pd

    max_folder = files['len_plist'].max()   # maximum number of folder levels
    levelcos = ['flevel_{}'.format(i) for i in np.arange(max_folder)]
    levels = pd.DataFrame(np.zeros((files.shape[0], max_folder)),
                          columns=levelcos, index=files.index)
Now I fill the empty frame with the entries of the path arrays:

    def fill_rows(df, array):
        # write each split path into its row, dropping the last element (the file name)
        for i, row in enumerate(array):
            df.iloc[i, :row.shape[0] - 1] = row[:-1]
        return df

    levels = fill_rows(levels, files.plist.values)
This takes a lot of time, since the varying lengths of the path arrays do not allow a vectorized solution right away. If I have to loop over all 1.02E+06 rows of the dataframe, it would take at least 34 h and maybe up to 200 h.
First and foremost, I want to optimize the filling of the dataframe; in a second step I would split the dataframe, parallelize the operations, and assemble the frame again afterwards.
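For what it is worth, one direction I am considering is to let the DataFrame constructor build the ragged table in a single call instead of assigning row by row with .iloc. The following is only a sketch, assuming files['plist'] already holds the arrays returned by path_split:

    # sketch of a possible vectorised fill (not benchmarked)
    folder_lists = files['plist'].map(lambda a: list(a[:-1]))        # folders only, file name dropped
    levels = pd.DataFrame(folder_lists.tolist(), index=files.index)  # shorter paths become NaN
    levels.columns = ['flevel_{}'.format(i) for i in range(levels.shape[1])]
    levels = levels.fillna(0)                                        # pad shorter paths with zeros

This pushes the per-row work into the DataFrame constructor, which should be much faster than row-wise .iloc assignment, but I have not measured it yet.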
Edit: added the clarification that a shorter path can be padded with zeros up to the maximum length.
Should I change path_split to always return an array of that maximum size?
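If that is the better route, a minimal sketch of such a padded variant could look like this (path_split_padded and MAX_DEPTH are placeholder names; the fixed [7:] prefix is kept from path_split above):

    MAX_DEPTH = 21  # maximum number of folder levels in the drive

    def path_split_padded(name, max_depth=MAX_DEPTH):
        # same split as path_split, but padded with zeros to a fixed length
        parts = os.path.normpath(name).split(os.sep)[7:]
        folders = parts[:-1]                                # drop the file name
        padded = folders + [0] * (max_depth - len(folders))
        return np.array(padded[:max_depth], dtype=object)

With every row having the same length, the whole levels table could then be built in one go, e.g. with np.vstack(files['plist'].values).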