Creating new (more detailed) data frame with Pandas based on index data frame

Question

I apologize for the neophyte question, but I'm having a hard time figuring out Pandas' data frames. I have one data frame with something like

df_index:
Product    Title
100000     Sample main product
200000     Non-consecutive main sample

I have another data frame with a more detailed list of the products with formats, like

df_details:
Product                    Title
100000                    Sample main product
100000-Format-English     Sample product details
100000-Format-Spanish     Sample product details
100000-Format-French      Sample product details
110000                    Another sample main product
110000-Format-English     Another sample details
110000-Format-Spanish     Another sample details
120000                    Yet another sample main product
120000-Format-English     Yet another sample details
120000-Format-Spanish     Yet another sample details
...
200000                    Non-consecutive main sample
200000-Format-English     Non-consecutive sample details
200000-Format-Spanish     Non-consecutive sample details

I want to create a new data frame based on df_details, but only for the products that appear in df_index. Ideally, it would look something like:

new_df:
Product                    Title
100000                    Sample main product
100000-Format-English     Sample product details
100000-Format-Spanish     Sample product details
100000-Format-French      Sample product details
200000                    Non-consecutive main sample
200000-Format-English     Non-consecutive sample details
200000-Format-Spanish     Non-consecutive sample details

Here's what I've tried so far:

new_df = df_details[df_details['Product'][0:5] == df_index['Product'][0:5]]

That gives me a error:

ValueError: Can only compare identically-labeled Series objects

I've also tried

new_df = pd.merge(df_index, df_details, 
  left_on=['Product'[0:5]], right_index=True, how='left')

Which does give me a resulting data set, but not the kind I want—it doesn't include the details rows with the format information.

Fabio Lamanna · Accepted Answer · 2016-12-22 19:04:37Z

2

You should be able to use .isin() as:

new_df = df_details[df_details['Product'].isin(df_index['Product']]

This will perform a mask looking up only the common indices.

EDIT: this works only whether the column has the same string. To solve this you can use str.contains() with:

import re

# create a pattern to look for
pat ='|'.join(map(re.escape, df_index['Product']))

# Create the mask
new_df = df_details[df_details['Product'].str.contains(pat)]

This works if the column is formatted as string.

edited Dec 22, 2016 at 19:04

answered Dec 22, 2016 at 18:13

Fabio Lamanna

21.7k24 gold badges95 silver badges126 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

Ted Petrou Over a year ago

Nice. was just writing the same solution.

nathan.hunt Over a year ago

That does work, sort of. It doesn't give me the 200000-Format-English style rows, though. Possibly because it's not an exact match for the rows that look like 200000?

Fabio Lamanna Over a year ago

@nathan.hunt yes you're right, I thought all rows have the same format...this performs a lookup on 'same-format' rows... thinking on a more general solution..

Fabio Lamanna Over a year ago

This should help stackoverflow.com/questions/27975069/…

nathan.hunt Over a year ago

@FabioLamanna Thanks so much for your help! I did use re.escape, and it worked, but for some reason I can't figure out, it also gave me some rows I wasn't expecting… and which weren't found in df_index, as near as I can tell. I did figure out how to solve the problem with for and if loops with pandas' .iterrow(), though, so I'll post that.

nathan.hunt · Accepted Answer · 2016-12-23 17:42:13Z

Here's how I managed to pull this off—I'm sure it's not pretty, or the fastest way to accomplish it, but it does work.

I used pandas' .itterow() with some for and if loops to go through the data frame row-by-row:

# create a list based on the 'Product' column of df_index
increment = 0
index_list = []
for product, row in df_index.iterrows():
    prod_num = df_index.product.iloc[increment]
    index_list.append(prod_num)
    increment += 1

#construct a new data frame based on the rows in df_details that are found in index_list
new_df = pd.DataFrame(columns=detail_df.columns)
increment_detail = 0
for product, row in df_details.iterrows():
    prod_num_detail = df_details.product.iloc[increment_detail]
    prod_num_detail = prod_num_detail[0:6]
    if str(prod_num_detail) in dupe_list:
        new_df = new_df.append(df_details.iloc[increment_detail])
        increment_detail += 1
    else:
        increment_detail += 1

Collectives™ on Stack Overflow

Creating new (more detailed) data frame with Pandas based on index data frame

2 Answers 2

5 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

5 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related