
I have been trying this for a while and I am stuck. Here is the problem:

I am working with some metadata about texts that I have in CSV files. It looks like this:

(screenshot of the metadata table)

The real table is longer and more complex, but it follows the same logic: every row is a text and every column is a different aspect of the text. Some of the columns have too much variation, and I want to remodel them into a simpler scheme. For example, in narrative-perspective I want to change the values homodiegetic and autodiegetic to non-heterodiegetic. I define this new model in another CSV file called keywords that looks like this:

(screenshot of the keywords table)

As you can see, every column of the metadata becomes a row in the new keywords model, where the old value is in the term_value column and the new value is in the new_model column.
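To make the structure concrete, here is a tiny sketch of the two tables in code (just a few illustrative rows; the real files are much longer):

import pandas as pd

#illustrative sketch of the metadata table
df_metadata = pd.DataFrame({
    "idno": ["ne0001", "ne0002"],
    "author-name": ["Baroja", "Galdos"],
    "narrative-perspective": ["homodiegetic", "heterodiegetic"],
})

#illustrative sketch of the keywords table
df_keywords = pd.DataFrame({
    "term_type": ["narrative-perspective"] * 4,
    "term_value": ["heterodiegetic", "other/mixed",
                   "homodiegetic", "autodiegetic"],
    "new_model": ["heterodiegetic", "n-heterodiegetic",
                  "n-heterodiegetic", "n-heterodiegetic"],
})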

So I need to map or replace these values using Pandas. This is what I have for now:

import re
import pandas as pd

df_metadata = pd.read_csv("/metadata.csv", encoding="utf-8", sep=",")
df_keywords = pd.read_csv("/keywords.csv", encoding="utf-8", sep="\t")

for column_metadata,value_metadata in df_metadata.iteritems():

    if str(column_metadata) in list(df_keywords.loc[:,"term_type"]):
        
        df_metadata.loc[df_metadata[column_metadata] == value_metadata, column_metadata] = df_keywords.loc[df_keywords["term_value"] == value_metadata, ["new_model"]]

And Python always gives this error back:

"ValueError: Series lengths must match to compare"

I think the problem is with value_metadata in the second part of the replacement with loc, I mean here:

df_keywords.loc[df_keywords["term_value"] == value_metadata, ["new_model"]]

The thing I don't understand is why value_metadata works in the first part of this command but not in the second one...

Please, I would appreciate any help. Maybe there is a simpler way than iterating through the dataframes... I am very open to any suggestion. Best regards, José

1 Answer


You can first create a MultiIndex in df_keywords for faster selecting by slicers, and then in a loop map the new values by the old ones:

df_keywords.set_index(['term_type','term_value'], inplace=True)

idx = pd.IndexSlice
#first mapping in column narrative-perspective
print (df_keywords.loc[idx['narrative-perspective',:]].to_dict()['new_model'])
{'heterodiegetic': 'heterodiegetic', 'other/mixed': 'n-heterodiegetic', 
 'homodiegetic': 'n-heterodiegetic', 'autodiegetic': 'n-heterodiegetic'}

#column names for replacing    
L = ['narrative-perspective','narrator','protagonist-gender']
for col in L:
    df_metadata[col] = df_metadata[col].map(df_keywords.loc[idx[col,:]].to_dict()['new_model'])

print (df_metadata)
     idno author-name narrative-perspective        narrator protagonist-gender
0  ne0001      Baroja      n-heterodiegetic    third-person               male
1  ne0002      Galdos        heterodiegetic    third-person             n-male
2  ne0003      Galdos      n-heterodiegetic    third-person               male
3  ne0004      Galdos      n-heterodiegetic    third-person             n-male
4  ne0005      Galdos        heterodiegetic    third-person             n-male
5  ne0006      Galdos        heterodiegetic    third-person               male
6  ne0007        Sawa        heterodiegetic    third-person             n-male
7  ne0008    Zamacois        heterodiegetic    third-person             n-male
8  ne0009      Galdos        heterodiegetic    third-person             n-male
9  ne0011      Galdos      n-heterodiegetic  n-third-person               male

Also, to_dict can be omitted and you can map by a Series instead:

df_keywords.set_index(['term_type','term_value'], inplace=True)
idx = pd.IndexSlice

#first mapping in column narrative-perspective
print (df_keywords.loc[idx['narrative-perspective',:]]['new_model'])
term_value
autodiegetic      n-heterodiegetic
heterodiegetic      heterodiegetic
homodiegetic      n-heterodiegetic
other/mixed       n-heterodiegetic
Name: new_model, dtype: object

L = ['narrative-perspective','narrator','protagonist-gender']
for col in L:
    df_metadata[col] = df_metadata[col].map(df_keywords.loc[idx[col,:]]['new_model'])

print (df_metadata)
     idno author-name narrative-perspective        narrator protagonist-gender
0  ne0001      Baroja      n-heterodiegetic    third-person               male
1  ne0002      Galdos        heterodiegetic    third-person             n-male
2  ne0003      Galdos      n-heterodiegetic    third-person               male
3  ne0004      Galdos      n-heterodiegetic    third-person             n-male
4  ne0005      Galdos        heterodiegetic    third-person             n-male
5  ne0006      Galdos        heterodiegetic    third-person               male
6  ne0007        Sawa        heterodiegetic    third-person             n-male
7  ne0008    Zamacois        heterodiegetic    third-person             n-male
8  ne0009      Galdos        heterodiegetic    third-person             n-male
9  ne0011      Galdos      n-heterodiegetic  n-third-person               male
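
If many columns need remapping, a possible alternative is to build one nested dict and let DataFrame.replace apply it in a single call. This is only a sketch and assumes df_keywords still has its flat term_type/term_value/new_model columns (i.e. before set_index); unlike map, replace leaves values that have no entry in keywords untouched instead of turning them into NaN:

#build {column: {old_value: new_value}} from the keywords table
mapping = {term_type: grp.set_index('term_value')['new_model'].to_dict()
           for term_type, grp in df_keywords.groupby('term_type')}

#replace only touches the columns that appear as keys in mapping
df_metadata = df_metadata.replace(mapping)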

3 Comments

Wuah, many many thanks! :) It works! What would you suggest as the best way to create the list of metadata columns to remodel, so that it depends on the input of both files? Because now I have these three columns, but tomorrow I might have 20... I did this, which works, but surely you have a better way: ` list_ = []; for column_name in df_metadata.columns.values: if str(column_name) in list(df_keywords.loc[:,"term_type"]): list_.append(column_name); print(list_); ` Sorry, I can't get the code in the comment to render nicely!
I think the easiest is drop_duplicates - print (df_keywords.term_type.drop_duplicates().tolist()) - see the sketch below the comments.
Thank you! Btw, I got my first nice day on Stack Overflow ;)
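
A short sketch of the drop_duplicates suggestion above, for building the column list dynamically (assuming df_metadata and df_keywords as in the question; note that after set_index the term_type values live in the index, so they have to be taken from there):

#if df_keywords still has its flat columns:
L = df_keywords['term_type'].drop_duplicates().tolist()

#if set_index(['term_type','term_value']) was already applied, as in the answer:
L = df_keywords.index.get_level_values('term_type').unique().tolist()

#keep only the values that actually exist as columns in the metadata
L = [c for c in L if c in df_metadata.columns]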
