I have three csv files we can call a, b, and c. File a has geographic information including zip codes. File b has statistical data. File c has only zip codes.

I used pandas to read a and b into dataframes (a_df and b_df) and joined them on the columns the two dataframes share (intermediate_df). File c was read into a dataframe in which the zip code column had an integer type, so I had to convert it to string so that zip codes are not treated as integers. However, after the conversion c_df reports that column as object, and I cannot join c_df with intermediate_df to make final_df.

To illustrate what I mean:

import pandas as pd

a_data = pd.read_csv("a.csv")
b_data = pd.read_csv("b.csv", dtype={'zipcode': 'str'})
a_df = pd.DataFrame(a_data)
b_df = pd.DataFrame(b_data)

# file c conversion
c_data = pd.read_csv("slcsp.csv", dtype={'zipcode': 'str'})
print ("This is c data types: ", c_data.dtypes)
c_conversion = c_data['zipcode'].apply(str)
print ("This is c_conversion data types: ", c_conversion.dtypes)
c_df = pd.DataFrame(c_conversion)
print ("This is c_df data types: ", c_df.dtypes)

# Joining on the two common columns to avoid duplicates
joined_ab_df = pd.merge(a_df, b_df, on=['state', 'area'])

# Dropping columns that are not needed anymore
ab_for_analysis_df = joined_ab_df.drop(['county_code', 'name', 'area'], axis=1)

# Time to analyze this dataframe. Let's pick out only the silver values for
# a specific attribute
silver_only_df = ab_for_analysis_df[ab_for_analysis_df.metal_name == 'Silver']

# Getting second lowest value of silver only
sorted_silver = silver_only_df.groupby('zipcode')['rate'].nsmallest(2)
sorted_silver_df = sorted_silver.to_frame()

print ("We cleaned up our data. Let's join the dataframes.")
print ("Final result...")
print (c_df.dtypes)
final_df = pd.merge(sorted_silver_df,c_df, on ='zipcode')

This is what we get after running it:

This is c_data types:  zipcode     object
rate       float64
dtype: object
This is c_conversion_data types:  object
This is c_df data types:  zipcode    object
dtype: object
zipcode    object
dtype: object

We cleaned up our data. Let's join the dataframes.
This is the final result...
KeyError: 'zipcode'

Any idea why it changed data types and how do I then fix it so it all joins in the end?

  • Can you add print(c_df.columns) and print(sorted_silver_df.columns)? Commented Oct 20, 2017 at 6:14
  • So the second-to-last line, print (c_df.dtypes), doesn't print either? That's bizarre. I recommend using ipython/jupyter and the %debug magic function, so that you can step through these kinds of errors. Commented Oct 20, 2017 at 6:19
  • It's a weird problem, @AndyHayden. The print (c_df.dtypes) call works, though it gives weird results. Commented Oct 20, 2017 at 6:30

1 Answer

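Converting to str always gives dtype object, because pandas stores Python strings in object columns. The conversion itself did not fail.

To check that the values really are strings, check the type of each element: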

print (c_data['zipcode'].apply(type))
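
A minimal sketch of that check, using made-up zip codes rather than the asker's file (the series names here are hypothetical). It shows the column-level dtype next to the element-level types. On the pandas versions current when this answer was written the dtype prints as object; newer releases may report a dedicated string dtype instead, so the sketch only relies on the element types.

```python
import pandas as pd

# Hypothetical zip codes standing in for the asker's data.
zips_as_int = pd.Series([36749, 2111, 64148], name="zipcode")
zips_as_str = zips_as_int.apply(str)

# Column-level dtype: this is what .dtypes reports for the column.
print(zips_as_str.dtype)

# Element-level types: this is what actually tells you the values are strings.
print(zips_as_str.apply(type))

# A compact yes/no version of the same check.
all_strings = all(isinstance(z, str) for z in zips_as_str)
print(all_strings)
```

Note that an integer column has already lost any leading zeros before the conversion, which is why reading the file with dtype={'zipcode': 'str'}, as the question does, is the safer route.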

As for your last error:

You need reset_index, because otherwise zipcode is in the index, not a column:

sorted_silver_df = silver_only_df.groupby('zipcode')['rate'].nsmallest(2).reset_index()
final_df = pd.merge(sorted_silver_df,c_df, on ='zipcode')
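
To see where zipcode ends up, here is a small sketch with invented rates; only the column names are taken from the question. It prints the columns before and after reset_index and then runs the merge. Treat it as an illustration under those assumptions, not as a reproduction of the asker's data.

```python
import pandas as pd

# Made-up rates; column names follow the question.
silver_only_df = pd.DataFrame({
    "zipcode": ["36749", "36749", "36749", "64148", "64148", "64148"],
    "rate": [310.0, 295.5, 301.2, 245.2, 251.0, 239.9],
})
c_df = pd.DataFrame({"zipcode": ["36749", "64148"]})

# After groupby + nsmallest, zipcode lives in the index, not in the columns.
sorted_silver = silver_only_df.groupby("zipcode")["rate"].nsmallest(2)
print(sorted_silver.to_frame().columns.tolist())

# reset_index() turns the index levels back into ordinary columns.
sorted_silver_df = sorted_silver.reset_index()
print(sorted_silver_df.columns.tolist())

# Now both frames have a real zipcode column to join on.
final_df = pd.merge(sorted_silver_df, c_df, on="zipcode")
print(final_df)
```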

Thanks to Andy for this alternative (untested):

sorted_silver_df = silver_only_df.groupby('zipcode', as_index=False)['rate'].nsmallest(2)
final_df = pd.merge(sorted_silver_df,c_df, on ='zipcode')

Or use left_index=True and right_on in merge:

sorted_silver = silver_only_df.groupby('zipcode')['rate'].nsmallest(2)
sorted_silver_df = sorted_silver.to_frame()
final_df = pd.merge(sorted_silver_df,c_df, right_on ='zipcode', left_index=True)
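
One caveat on the left_index variant: the result of groupby(...).nsmallest(2) normally carries two index levels (zipcode plus the original row label), while right_on names a single key. That mismatch may be what produces the "len(right_on) must equal the number of levels in the index" error reported in the comments. A hedged sketch with invented data, which drops the inner level before merging:

```python
import pandas as pd

# Made-up rates; column names follow the question.
silver_only_df = pd.DataFrame({
    "zipcode": ["36749", "36749", "36749", "64148", "64148", "64148"],
    "rate": [310.0, 295.5, 301.2, 245.2, 251.0, 239.9],
})
c_df = pd.DataFrame({"zipcode": ["36749", "64148"]})

sorted_silver = silver_only_df.groupby("zipcode")["rate"].nsmallest(2)
print(sorted_silver.index.nlevels)  # zipcode plus the original row label

# Drop the inner level so the index holds only zipcode.
sorted_silver_df = sorted_silver.reset_index(level=1, drop=True).to_frame()

# The single-level index now lines up with the single right_on key.
final_df = pd.merge(sorted_silver_df, c_df, left_index=True, right_on="zipcode")
print(final_df)
```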

14 Comments

You can also use as_index=False in the groupby (instead of reset_index).
Thanks, I added it to the answer. But sometimes it does not work, so I added a note that it is untested ;)
Thanks. I can't seem to get that to work either: 50 <class 'str'> Name: zipcode, dtype: object This is c_data types: object This is c_df data types: zipcode object dtype: object zipcode object dtype: object We cleaned up our data. Let's join the dataframes. Final result... raise ValueError('len(right_on) must equal the number ' ValueError: len(right_on) must equal the number of levels in the index of "left"
So it does not work? Then the problem is in the data, maybe some whitespace, I guess. Is it possible to share your data (gdocs, dropbox, or by email) if it is not confidential?
I think it is because pd.merge(sorted_silver_df,c_df, on ='zipcode') failed...