1

I apologize for the neophyte question, but I'm having a hard time figuring out Pandas' data frames. I have one data frame with something like

df_index:
Product    Title
100000     Sample main product
200000     Non-consecutive main sample

I have another data frame with a more detailed list of the products with formats, like

df_details:
Product                    Title
100000                    Sample main product
100000-Format-English     Sample product details
100000-Format-Spanish     Sample product details
100000-Format-French      Sample product details
110000                    Another sample main product
110000-Format-English     Another sample details
110000-Format-Spanish     Another sample details
120000                    Yet another sample main product
120000-Format-English     Yet another sample details
120000-Format-Spanish     Yet another sample details
...
200000                    Non-consecutive main sample
200000-Format-English     Non-consecutive sample details
200000-Format-Spanish     Non-consecutive sample details

I want to create a new data frame based on df_details, but only for the products that appear in df_index. Ideally, it would look something like:

new_df:
Product                    Title
100000                    Sample main product
100000-Format-English     Sample product details
100000-Format-Spanish     Sample product details
100000-Format-French      Sample product details
200000                    Non-consecutive main sample
200000-Format-English     Non-consecutive sample details
200000-Format-Spanish     Non-consecutive sample details

Here's what I've tried so far:

new_df = df_details[df_details['Product'][0:5] == df_index['Product'][0:5]]

That gives me a error:

ValueError: Can only compare identically-labeled Series objects

I've also tried

new_df = pd.merge(df_index, df_details, 
  left_on=['Product'[0:5]], right_index=True, how='left')

Which does give me a resulting data set, but not the kind I want—it doesn't include the details rows with the format information.

2 Answers 2

2

You should be able to use .isin() as:

new_df = df_details[df_details['Product'].isin(df_index['Product']]

This will perform a mask looking up only the common indices.

EDIT: this works only whether the column has the same string. To solve this you can use str.contains() with:

import re

# create a pattern to look for
pat ='|'.join(map(re.escape, df_index['Product']))

# Create the mask
new_df = df_details[df_details['Product'].str.contains(pat)]

This works if the column is formatted as string.

Sign up to request clarification or add additional context in comments.

5 Comments

Nice. was just writing the same solution.
That does work, sort of. It doesn't give me the 200000-Format-English style rows, though. Possibly because it's not an exact match for the rows that look like 200000?
@nathan.hunt yes you're right, I thought all rows have the same format...this performs a lookup on 'same-format' rows... thinking on a more general solution..
@FabioLamanna Thanks so much for your help! I did use re.escape, and it worked, but for some reason I can't figure out, it also gave me some rows I wasn't expecting… and which weren't found in df_index, as near as I can tell. I did figure out how to solve the problem with for and if loops with pandas' .iterrow(), though, so I'll post that.
0

Here's how I managed to pull this off—I'm sure it's not pretty, or the fastest way to accomplish it, but it does work.

I used pandas' .itterow() with some for and if loops to go through the data frame row-by-row:

# create a list based on the 'Product' column of df_index
increment = 0
index_list = []
for product, row in df_index.iterrows():
    prod_num = df_index.product.iloc[increment]
    index_list.append(prod_num)
    increment += 1

#construct a new data frame based on the rows in df_details that are found in index_list
new_df = pd.DataFrame(columns=detail_df.columns)
increment_detail = 0
for product, row in df_details.iterrows():
    prod_num_detail = df_details.product.iloc[increment_detail]
    prod_num_detail = prod_num_detail[0:6]
    if str(prod_num_detail) in dupe_list:
        new_df = new_df.append(df_details.iloc[increment_detail])
        increment_detail += 1
    else:
        increment_detail += 1

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.