0

I have a dataframe of the file name and its path in the format of a continuous string:

e.g:

 files = pandas.Dataframe((   
      name            path
 0    file1.txt       \\drive\folder1\folder2\folder3\...\file1.txt   
 1    file2.pdf       \\drive\folder1\file2.pdf 
 2    file3.xls       \\drive\folder1\folder2\folder3\...\folder21\file3.xls  
 n   ...            ...))

The size of the frame is about 1.02E+06 entries, the depth of the drive is at most 21 folders, but varies greatly. The goal is have a dataframe in the format of:

     name           level1     level2     level3    level4  ...  level21
0    file.txt       folder1    folder2    folder3      0    ...    0      
1    file.pdf       folder1       0          0         0    ...    0   
2    file3.xls      folder1    folder2    folder3   folder4 ...  folder21
...

I split the string of the file location and created an array with, which can be filled up with zeros if the path is shorter:

files = files.assign(plist=files['path'].iloc[:].apply(path_split))

def path_split(name):
     return np.array(os.path.normpath(name).split(os.sep)[7:])

Add a column with number of folder in the files path:

files = files.assign(len_plist = files.plist.iloc[:].map(len))

The problem here is that the split path string creates an nested arrays within the dataframe. Then an empty Dataframe with the number of columns in the quantity of folders ( 21 here) and rows accordin to the number of files (1.02E+06 here):

max_folder = files['len_plist'].max()  # get the maximum amount of folders    
levelcos = [ 'flevel_{}'.format(i) for i in np.arange(max_folder)]   
levels = pd.DataFrame(np.zeros((files.shape[0],max_folder)),   
                      columns =levelcos, index = files.index )

and now I fill the empty frame with the entries of the path array:

levels = fill_rows(levels,files.plist.values)   

def fill_rows(df,array):
    for i,row in enumerate(array):
        df.iloc[i,:row.shape[0] - 1] = row[:-1]
    return df

This takes a lot of time, since the varying length of the path arrays does not allow a vectorize solution right away. If I need to loop all 1.02E+06 rows of the dataframe, it would take at least 34h maybe up to 200h.

First and foremost, I want to optimize the filling of the dataframe and in a second step I would split the dataframe, parallelize the operations and assemble the frame again afterwards.

edit: added clarification, that a shorter path can be filled up to the maximum length with zeros.

2
  • Can you know the maximum depth of a path? You speak of 21, put produce example code with only 7... Is it an option to initially create all the columns for that maximum and have path_split to always return an array of that maximum size? Commented Feb 26, 2019 at 8:41
  • Updated the question to clarify that the maximum amount of folders is indeed 21 and if a path is shorter, the entry remains '0'. Commented Feb 26, 2019 at 9:14

1 Answer 1

1

Maybe I'm missing something but why doesn't this work for you?

expanded = files['path'].str.split(os.path.sep, expand=True).fillna(0)
expanded = expanded.rename(columns=lambda x: 'level_' + str(x))
df = pd.concat([files.name, expanded], axis=1)
Sign up to request clarification or add additional context in comments.

4 Comments

Thank you that works very much. The expand function option was the piece missing. Only a little fine tuning needed: The filename itsseld is still always in the very last level column, not the last folder the file is in. Can I drop the last column of every row and skip a couple in the front in-between here?
Sorry, I don't know what you mean. For me df seems to have the format you describes aboce as goal.
the first row of the data frame for file1.txt looks kile this: [file.txt, folder1, folder2, folder3, file.txt,0,0,0,..,0]. Since at the end of the path the file itself is writen. I want to split away the file name and have only the path. I wondered if it is possible here or i need to do it beforehanded.
You should split the filename first with files['path'].str.rsplit(os.path.sep, n=1, expand=True)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.