I have a dataframe of file names and their paths, where each path is one continuous string, e.g.:
    files = pandas.DataFrame(...)

           name       path
    0      file1.txt  \\drive\folder1\folder2\folder3\...\file1.txt
    1      file2.pdf  \\drive\folder1\file2.pdf
    2      file3.xls  \\drive\folder1\folder2\folder3\...\folder21\file3.xls
    n      ...        ...
The frame has about 1.02E+06 entries; the folder depth is at most 21 levels, but varies greatly. The goal is to have a dataframe in the format:
        name       level1   level2   level3   level4   ...  level21
    0   file1.txt  folder1  folder2  folder3  0        ...  0
    1   file2.pdf  folder1  0        0        0        ...  0
    2   file3.xls  folder1  folder2  folder3  folder4  ...  folder21
    ...
I split the file location string into an array of folder names, which can be padded with zeros if the path is shorter than the maximum depth:

    import os
    import numpy as np

    def path_split(name):
        return np.array(os.path.normpath(name).split(os.sep)[7:])

    files = files.assign(plist=files['path'].apply(path_split))
Then I add a column with the number of folders in each file's path:

    files = files.assign(len_plist=files.plist.map(len))
The problem here is that the split paths create nested arrays within the dataframe. Next, I create an empty DataFrame with one column per folder level (21 here) and one row per file (1.02E+06 here):
    import pandas as pd

    max_folder = files['len_plist'].max()   # maximum number of folder levels
    levelcos = ['flevel_{}'.format(i) for i in np.arange(max_folder)]
    levels = pd.DataFrame(np.zeros((files.shape[0], max_folder)),
                          columns=levelcos, index=files.index)
Now I fill the empty frame with the entries of the path arrays:

    def fill_rows(df, array):
        # write each split path into its row, dropping the last element (the file name)
        for i, row in enumerate(array):
            df.iloc[i, :row.shape[0] - 1] = row[:-1]
        return df

    levels = fill_rows(levels, files.plist.values)
This takes a lot of time, since the varying lengths of the path arrays do not allow a vectorized solution right away. If I have to loop over all 1.02E+06 rows of the dataframe, it would take at least 34 h and maybe up to 200 h.
First and foremost, I want to optimize the filling of the dataframe; in a second step I would split the dataframe, parallelize the operations, and assemble the frame again afterwards.
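For what it is worth, one direction I am considering is to let the DataFrame constructor build the ragged table in a single call instead of assigning row by row with .iloc. The following is only a sketch, assuming files['plist'] already holds the arrays returned by path_split:

    # sketch of a possible vectorised fill (not benchmarked)
    folder_lists = files['plist'].map(lambda a: list(a[:-1]))        # folders only, file name dropped
    levels = pd.DataFrame(folder_lists.tolist(), index=files.index)  # shorter paths become NaN
    levels.columns = ['flevel_{}'.format(i) for i in range(levels.shape[1])]
    levels = levels.fillna(0)                                        # pad shorter paths with zeros

This pushes the per-row work into the DataFrame constructor, which should be much faster than row-wise .iloc assignment, but I have not measured it yet.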
Edit: added the clarification that a shorter path can be padded with zeros up to the maximum length.
Should I change path_split to always return an array of that maximum size?
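If that is the better route, a minimal sketch of such a padded variant could look like this (path_split_padded and MAX_DEPTH are placeholder names; the fixed [7:] prefix is kept from path_split above):

    MAX_DEPTH = 21  # maximum number of folder levels in the drive

    def path_split_padded(name, max_depth=MAX_DEPTH):
        # same split as path_split, but padded with zeros to a fixed length
        parts = os.path.normpath(name).split(os.sep)[7:]
        folders = parts[:-1]                                # drop the file name
        padded = folders + [0] * (max_depth - len(folders))
        return np.array(padded[:max_depth], dtype=object)

With every row having the same length, the whole levels table could then be built in one go, e.g. with np.vstack(files['plist'].values).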