Fastest way to iterate over a Pandas Dataframe while concatenating values from multiple columns

Question

I am wondering if there is a more performant way to iterate through a pandas dataframe and concatenate values in different columns.

For example, I have the following working:

import pandas as pd
from pathlib import Path

data = {'subdir': ['tom', 'phil', 'ava'],
        'filename':['9.wav', '8.wav', '7.wav'],
        'text':['Pizza','Strawberries and yogurt', 'potato']}

df = pd.DataFrame(data, columns = ['subdir', 'filename', 'text'])

df.head()

example_path = Path(r"C:\Hello\World")
for index, row in df.iterrows():
    full_path = example_path.joinpath(row['subdir'], row['filename'])
    print(full_path)
    text = row['text']
    print(text)

Output:

C:\Hello\World\tom\9.wav
Pizza
C:\Hello\World\phil\8.wav
Strawberries and yogurt
C:\Hello\World\ava\7.wav
potato

However, I have a large number of rows and would like to do this as fast as possible. What is the best way? I am taking pieces of a path (the subdirectory and the base file name) and joining them as I iterate through the dataframe.

I will also likely grab data from adjacent columns (like 'text' in the example) and store it as I iterate, so I'd like to do this all in one pass. After gathering everything in list- or Series-like structures, I will build a dictionary or dataframe from the pieces.

Thank you.

Quang Hoang · Accepted Answer · 2020-06-26 22:10:52Z

1

Since you are using Path, you can just do:

 example_path/df.filename

Output (my system is Linux):

0    C:\Hello\World/9.wav
1    C:\Hello\World/8.wav
2    C:\Hello\World/7.wav
Name: filename, dtype: object

Note that string operations are usually not vectorized, so the code above might well be just a wrapper around a for loop. Also note that it joins only the filename column; the subdir column is not part of the result.
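If plain strings are acceptable instead of Path objects, both columns can be combined column-wise. The following is only a sketch under that assumption: the names base, rel and full are illustrative, the separator is hardcoded to a backslash to match the Windows-style path in the question, and pandas may still loop over the elements internally.

```python
import pandas as pd

df = pd.DataFrame({
    "subdir": ["tom", "phil", "ava"],
    "filename": ["9.wav", "8.wav", "7.wav"],
    "text": ["Pizza", "Strawberries and yogurt", "potato"],
})

# Assumption: the base directory is kept as a plain string, not a Path.
base = r"C:\Hello\World"

# Series.str.cat joins two string columns element by element with a separator.
rel = df["subdir"].str.cat(df["filename"], sep="\\")

# Prefix every element with the base directory.
full = base + "\\" + rel

print(full.tolist())
```

The result is a Series of strings, so each element would still need to be wrapped in Path if path methods are required later.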


2 Comments

Coldchain9 Over a year ago

Is the / method supposed to be faster than joinpath? The method I posted already works decently fast; I was just wondering if there is a better way to iterate and combine these rows as I go than iterrows().

Quang Hoang Over a year ago

AFAIK, it's just shorthand for the same operation as joinpath. And as I noted, I doubt it's going to be significantly faster, if at all. The fastest could just be a list/Series comprehension.
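A list comprehension of the kind mentioned here could look roughly like the sketch below. It is not the commenter's code: the names records and result are illustrative, and PureWindowsPath is swapped in for Path only so that the backslash-separated output is the same on every operating system.

```python
from pathlib import PureWindowsPath

import pandas as pd

df = pd.DataFrame({
    "subdir": ["tom", "phil", "ava"],
    "filename": ["9.wav", "8.wav", "7.wav"],
    "text": ["Pizza", "Strawberries and yogurt", "potato"],
})

# Assumption: PureWindowsPath instead of Path, for OS-independent output.
example_path = PureWindowsPath(r"C:\Hello\World")

# zip walks the three columns in lockstep, so no per-row Series is built
# the way iterrows() builds one for every row.
records = [
    {"path": example_path.joinpath(subdir, filename), "text": text}
    for subdir, filename, text in zip(df["subdir"], df["filename"], df["text"])
]

# The collected pieces can be turned back into a dataframe in one step.
result = pd.DataFrame(records)
print(result)
```

This gathers the path and the adjacent text value in a single pass, which is what the question asks for; whether it is actually the fastest option would have to be measured on the real data.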

Phillyclause89 · Accepted Answer · 2020-06-26 22:22:24Z

1

You can always make a path column in your df using the .apply method:

import pandas as pd
import pathlib

data = {'subdir': ['tom', 'phil', 'ava'],
        'filename':['9.wav', '8.wav', '7.wav'],
        'text':['Pizza','Strawberries and yogurt', 'potato']}

df = pd.DataFrame(data, columns = ['subdir', 'filename', 'text'])

df["path"] = df[['subdir','filename']].apply(
    lambda x:pathlib.Path(
        r"C:\Hello\World\{}\{}".format(
            x['subdir'],x['filename']
        )
    ),
    axis=1
)

print(df[['path','text']])

Out:

                        path                     text
0   C:\Hello\World\tom\9.wav                    Pizza
1  C:\Hello\World\phil\8.wav  Strawberries and yogurt
2   C:\Hello\World\ava\7.wav                   potato


1 Comment

Coldchain9 Over a year ago

Thanks for this. Creating the column this way is so much faster than iterrows(). I used it to rewrite my code without hardcoding the raw string path: apply(lambda x: example_path.joinpath(x['subdir'], x['filename']), axis=1)
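Put together as a complete script, the rewrite described in this comment might look like the following sketch. PureWindowsPath is an assumption made here (the question uses Path) so the output uses backslashes on any operating system.

```python
from pathlib import PureWindowsPath

import pandas as pd

df = pd.DataFrame({
    "subdir": ["tom", "phil", "ava"],
    "filename": ["9.wav", "8.wav", "7.wav"],
    "text": ["Pizza", "Strawberries and yogurt", "potato"],
})

# Assumption: PureWindowsPath instead of Path, for OS-independent output.
example_path = PureWindowsPath(r"C:\Hello\World")

# axis=1 passes each row to the lambda; joinpath appends both pieces to the
# base path, so the base directory is not repeated inside a format string.
df["path"] = df.apply(
    lambda x: example_path.joinpath(x["subdir"], x["filename"]),
    axis=1,
)

print(df[["path", "text"]])
```

The path column holds path objects rather than strings, so attributes such as .name or .suffix remain available on each element.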
