I am wondering if there is a more performant way to iterate through a pandas dataframe and concatenate values in different columns.

For example, the code below works:

import pandas as pd
from pathlib import Path

data = {'subdir': ['tom', 'phil', 'ava'],
        'filename':['9.wav', '8.wav', '7.wav'],
        'text':['Pizza','Strawberries and yogurt', 'potato']}

df = pd.DataFrame(data, columns = ['subdir', 'filename', 'text'])

df.head()

example_path = Path(r"C:\Hello\World")
for index, row in df.iterrows():
    full_path = example_path.joinpath(row['subdir'], row['filename'])
    print(full_path)
    text = row['text']
    print(text)

Output:

C:\Hello\World\tom\9.wav
Pizza
C:\Hello\World\phil\8.wav
Strawberries and yogurt
C:\Hello\World\ava\7.wav
potato

However, I have a large number of rows and would like to do this as fast as possible. What is the best way? I am taking pieces of a path (the subdirectory and the base file name) and concatenating them as I iterate through the dataframe.

I will also likely be grabbing data from adjacent columns (like 'text' in the example) and storing it as I iterate, so I'd like to do this all in one pass. Once I have gathered all of the data in list- or Series-like structures, I will use those pieces to build a dictionary or dataframe.

Thank you.

2 Answers

1

Since you are using Path, you can just do:

 example_path/df.filename

Output (my system is Linux, hence the forward slash; note that this example joins only the filename, not the subdirectory):

0    C:\Hello\World/9.wav
1    C:\Hello\World/8.wav
2    C:\Hello\World/7.wav
Name: filename, dtype: object

Note that string operations in pandas are usually not vectorized. The expression above may well be just a wrapper around a Python-level for loop.
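
If the work ends up in a Python-level loop either way, a plain list comprehension over the zipped columns is one option to try. This is only a minimal sketch using the question's data; the names paths and result are made up for illustration, and I have not benchmarked it against the other approaches here:

```python
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    'subdir': ['tom', 'phil', 'ava'],
    'filename': ['9.wav', '8.wav', '7.wav'],
    'text': ['Pizza', 'Strawberries and yogurt', 'potato'],
})

example_path = Path(r"C:\Hello\World")

# zip walks both columns in lockstep, so no per-row Series is built
# (building one per row is what iterrows() does).
paths = [example_path.joinpath(subdir, filename)
         for subdir, filename in zip(df['subdir'], df['filename'])]

# Gather the paths and the adjacent 'text' column in one result frame.
# 'result' is a hypothetical name, not something from the question.
result = pd.DataFrame({'path': paths, 'text': df['text']})
print(result)
```

The same zip can take more columns (for example df['text']) if other values are needed inside the loop.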


2 Comments

Is the / operator supposed to be faster than joinpath? The method I posted already works decently fast; I was just wondering whether there is a better way to iterate and combine these rows than iterrows().
AFAIK, it's just shorthand for the same operation as joinpath. And as I noted, I doubt it's going to be significantly faster, if at all. The fastest could just be a list/Series comprehension.
1

You can always make a path column in your df using the .apply method:

import pandas as pd
import pathlib

data = {'subdir': ['tom', 'phil', 'ava'],
        'filename':['9.wav', '8.wav', '7.wav'],
        'text':['Pizza','Strawberries and yogurt', 'potato']}

df = pd.DataFrame(data, columns = ['subdir', 'filename', 'text'])



df["path"] = df[['subdir','filename']].apply(
    lambda x:pathlib.Path(
        r"C:\Hello\World\{}\{}".format(
            x['subdir'],x['filename']
        )
    ),
    axis=1
)

print(df[['path','text']])

Out:

                        path                     text
0   C:\Hello\World\tom\9.wav                    Pizza
1  C:\Hello\World\phil\8.wav  Strawberries and yogurt
2   C:\Hello\World\ava\7.wav                   potato

1 Comment

Thanks for this. Creating the column this way is so much faster than iterrows(). I used it to rewrite my code without hardcoding the raw string path: apply(lambda x: example_path.joinpath(x['subdir'], x['filename']), axis=1)
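
For reference, the variant described in this comment might look like the sketch below when written out in full. It assumes the same example data and the example_path variable from the question; only the lambda body differs from the answer above:

```python
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    'subdir': ['tom', 'phil', 'ava'],
    'filename': ['9.wav', '8.wav', '7.wav'],
    'text': ['Pizza', 'Strawberries and yogurt', 'potato'],
})

# Base directory kept in a variable instead of being hardcoded
# inside a format string.
example_path = Path(r"C:\Hello\World")

# axis=1 passes each row to the lambda; joinpath appends the
# subdirectory and the file name to the base path.
df['path'] = df[['subdir', 'filename']].apply(
    lambda x: example_path.joinpath(x['subdir'], x['filename']),
    axis=1,
)

print(df[['path', 'text']])
```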
