
I have been trying this for a while and I am stuck. Here is the problem:

I am working with some metadata about texts that I have in CSV files. It looks like this:

(screenshot of the metadata table)

The real table is longer and more complex, but it follows the same logic: every row is a text and every column is a different aspect of the text. Some of the columns have too much variation, and I want to remodel them into a simpler scheme. For example, in narrative-perspective I want to change the values homodiegetic and autodiegetic to non-heterodiegetic. I define this new model in another CSV file called keywords that looks like this:

(screenshot of the keywords table)

As you can see, every column of the metadata becomes a row in the new keywords model, where the old value is in the term_value column and the new value is in the new_model column.
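To make the structure concrete, here is a tiny sketch of the two tables in code (just a few illustrative rows; the real files are much longer):

import pandas as pd

#illustrative sketch of the metadata table
df_metadata = pd.DataFrame({
    "idno": ["ne0001", "ne0002"],
    "author-name": ["Baroja", "Galdos"],
    "narrative-perspective": ["homodiegetic", "heterodiegetic"],
})

#illustrative sketch of the keywords table
df_keywords = pd.DataFrame({
    "term_type": ["narrative-perspective"] * 4,
    "term_value": ["heterodiegetic", "other/mixed",
                   "homodiegetic", "autodiegetic"],
    "new_model": ["heterodiegetic", "n-heterodiegetic",
                  "n-heterodiegetic", "n-heterodiegetic"],
})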

So I need to map or replace these values using Pandas. This is what I have for now:

import re
import pandas as pd

df_metadata = pd.read_csv("/metadata.csv", encoding="utf-8", sep=",")
df_keywords = pd.read_csv("/keywords.csv", encoding="utf-8", sep="\t")

for column_metadata,value_metadata in df_metadata.iteritems():

    if str(column_metadata) in list(df_keywords.loc[:,"term_type"]):
        
        df_metadata.loc[df_metadata[column_metadata] == value_metadata, column_metadata] = df_keywords.loc[df_keywords["term_value"] == value_metadata, ["new_model"]]

And Python always gives this error back:

"ValueError: Series lengths must match to compare"

I think the problem is with value_metadata in the second part of the replacement with loc, I mean here:

df_keywords.loc[df_keywords["term_value"] == value_metadata, ["new_model"]]

The thing I don't understand is why value_metadata works in the first part of this command but not in the second one...

Please, I would appreciate any help. Maybe there is a simpler way than iterating through the dataframes... I am very open to any suggestion. Best regards, José

1 Answer


You can first create a MultiIndex in df_keywords for faster selecting by slicers, and then in a loop map the new values by the old ones:

df_keywords.set_index(['term_type','term_value'], inplace=True)

idx = pd.IndexSlice
#first mapping in column narrative-perspective
print (df_keywords.loc[idx['narrative-perspective',:]].to_dict()['new_model'])
{'heterodiegetic': 'heterodiegetic', 'other/mixed': 'n-heterodiegetic', 
 'homodiegetic': 'n-heterodiegetic', 'autodiegetic': 'n-heterodiegetic'}

#column names for replacing    
L = ['narrative-perspective','narrator','protagonist-gender']
for col in L:
    df_metadata[col] = df_metadata[col].map(df_keywords.loc[idx[col,:]].to_dict()['new_model'])

print (df_metadata)
     idno author-name narrative-perspective        narrator protagonist-gender
0  ne0001      Baroja      n-heterodiegetic    third-person               male
1  ne0002      Galdos        heterodiegetic    third-person             n-male
2  ne0003      Galdos      n-heterodiegetic    third-person               male
3  ne0004      Galdos      n-heterodiegetic    third-person             n-male
4  ne0005      Galdos        heterodiegetic    third-person             n-male
5  ne0006      Galdos        heterodiegetic    third-person               male
6  ne0007        Sawa        heterodiegetic    third-person             n-male
7  ne0008    Zamacois        heterodiegetic    third-person             n-male
8  ne0009      Galdos        heterodiegetic    third-person             n-male
9  ne0011      Galdos      n-heterodiegetic  n-third-person               male

Also, to_dict can be omitted and you can map by a Series instead:

df_keywords.set_index(['term_type','term_value'], inplace=True)
idx = pd.IndexSlice

#first mapping in column narrative-perspective
print (df_keywords.loc[idx['narrative-perspective',:]]['new_model'])
term_value
autodiegetic      n-heterodiegetic
heterodiegetic      heterodiegetic
homodiegetic      n-heterodiegetic
other/mixed       n-heterodiegetic
Name: new_model, dtype: object

L = ['narrative-perspective','narrator','protagonist-gender']
for col in L:
    df_metadata[col] = df_metadata[col].map(df_keywords.loc[idx[col,:]]['new_model'])

print (df_metadata)
     idno author-name narrative-perspective        narrator protagonist-gender
0  ne0001      Baroja      n-heterodiegetic    third-person               male
1  ne0002      Galdos        heterodiegetic    third-person             n-male
2  ne0003      Galdos      n-heterodiegetic    third-person               male
3  ne0004      Galdos      n-heterodiegetic    third-person             n-male
4  ne0005      Galdos        heterodiegetic    third-person             n-male
5  ne0006      Galdos        heterodiegetic    third-person               male
6  ne0007        Sawa        heterodiegetic    third-person             n-male
7  ne0008    Zamacois        heterodiegetic    third-person             n-male
8  ne0009      Galdos        heterodiegetic    third-person             n-male
9  ne0011      Galdos      n-heterodiegetic  n-third-person               male
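
If many columns need remapping, a possible alternative is to build one nested dict and let DataFrame.replace apply it in a single call. This is only a sketch and assumes df_keywords still has its flat term_type/term_value/new_model columns (i.e. before set_index); unlike map, replace leaves values that have no entry in keywords untouched instead of turning them into NaN:

#build {column: {old_value: new_value}} from the keywords table
mapping = {term_type: grp.set_index('term_value')['new_model'].to_dict()
           for term_type, grp in df_keywords.groupby('term_type')}

#replace only touches the columns that appear as keys in mapping
df_metadata = df_metadata.replace(mapping)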

3 Comments

Wuah, many many thanks! :) It works! What would you suggest as the best way to create the list of metadata columns to remodel, so that it depends on the input of both files? Because now I have these three columns, but tomorrow I might have 20... I did this, which works, but surely you have a better way: ` list_ = []; for column_name in df_metadata.columns.values: if str(column_name) in list(df_keywords.loc[:,"term_type"]): list_.append(column_name); print(list_); ` Sorry, I can't get the code in the comment to render nicely!
I think the easiest is drop_duplicates - print (df_keywords.term_type.drop_duplicates().tolist()) - see the sketch below the comments.
Thank you! Btw, I got my first nice day on Stack Overflow ;)
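
A short sketch of the drop_duplicates suggestion above, for building the column list dynamically (assuming df_metadata and df_keywords as in the question; note that after set_index the term_type values live in the index, so they have to be taken from there):

#if df_keywords still has its flat columns:
L = df_keywords['term_type'].drop_duplicates().tolist()

#if set_index(['term_type','term_value']) was already applied, as in the answer:
L = df_keywords.index.get_level_values('term_type').unique().tolist()

#keep only the values that actually exist as columns in the metadata
L = [c for c in L if c in df_metadata.columns]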
