
My goal is to create a bar graph from my .csv data that shows the relationship between work year (x) and wage (y), grouped by gender (separate bars).

First, I want to bin the variable 'workyear' into three groups: (1) more than 10 years, (2) exactly 10 years, and (3) less than 10 years. Then I would like to draw the bar graph split by gender (1=female, 0=male).

Part of my data looks like this:

...    workyear gender wage 
513         12    0  15.00
514         16    0  12.67
515         14    1   7.38
516         16    0  15.56
517         12    1   7.45
518         14    1   6.25
519         16    1   6.25
520         17    0   9.37
....

To do this, I tried to replace the variable's values with three group labels, and I used matplotlib for the plot.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

#load data 
df=pd.DataFrame.from_csv('data.csv', index_col=None)
print(df)
df.sort_values("workyear", ascending=True, inplace=True)

#parameters
bar_width = 0.2

#replacing Education year -> Education level grouped by given criteria.
#But I got an error.
df.loc[df.workyear<10, 'workyear'] = 'G1'
df.loc[df.workyear==10, 'workyear'] = 'G2'
df.loc[df.workyear>10, 'workyear']='G3'

#plotting
plt.bar(x, df.education[df.gender==1], bar_width, yerr=df.wage,color='y', label='female')
plt.bar(x+bar_width, df.education[df.gender==0], bar_width, yerr=df.wage, color='c', label='male')

I want to see the bar graph like this (please consider '+' as a bar):

y=wage|                 + +
      | +        +      + +
      | +      + +      + +
      | + +    + +      + +
      |_______________________ x=work year (3-group)
        >10     10       10<  

But this is what I actually got... (yes. all errors)

Traceback (most recent call last):
File "data.py", line 21, in <module>
df.loc[df.workyear>10, 'workyear']='G3'
in wrapper
res = na_op(values, other)
in na_op
result = _comp_method_OBJECT_ARRAY(op, x, y)
in _comp_method_OBJECT_ARRAY
result = lib.scalar_compare(x, y, op)
File "pandas\_libs\lib.pyx", line 769, in pandas._libs.lib.scalar_compare (pandas\_libs\lib.c:13717) 
TypeError: unorderable types: str() > int()

Could you please advise me?

  • Convert df.workyear to a numeric type beforehand. Commented Dec 5, 2017 at 10:35
  • Try this df.apply(pd.to_numeric) right before the plotting. Commented Dec 5, 2017 at 10:42
  • @Goyo: did you mean to convert df.workyear like this? -> df.loc[df.workyear>10, 'workyear']='3'? I am a beginner in Python, so I am not sure how to solve this at all. Commented Dec 5, 2017 at 10:42
  • @Tom Wojcik Thank you. But I got an error with df.loc[df.workyear>10, 'workyear']='3'. Commented Dec 5, 2017 at 10:43
  • @Goyo: I still got the same error. Commented Dec 5, 2017 at 10:46

1 Answer

A more direct way:

df['Age']=pd.cut(df.workyear,[1,13,14,100])
df['Gender']=df.gender.map({0:'male',1:'female'})
df.pivot_table(values='wage',index='Age',columns='Gender').plot.bar()

which gives:

[image: bar chart of wage by work-year bin, one bar per gender]
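For context, the TypeError in the question comes from writing the string labels back into workyear itself: after the first assignment the column holds a mix of strings and integers, so a later comparison such as df.workyear>10 ends up comparing str with int. Putting the labels in a new column avoids that. Below is a minimal sketch of the same pd.cut approach adapted to the three groups asked for. It assumes workyear holds whole years; the column name workgroup, the labels, and the last four rows of the sample frame are made up for illustration (the first eight rows are from the question).

```python
import pandas as pd

# First eight rows are the sample from the question; the last four are
# hypothetical rows added so that every group is non-empty.
df = pd.DataFrame({
    "workyear": [12, 16, 14, 16, 12, 14, 16, 17, 8, 6, 10, 10],
    "gender":   [0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1],
    "wage":     [15.00, 12.67, 7.38, 15.56, 7.45, 6.25, 6.25, 9.37,
                 5.00, 7.00, 8.00, 6.00],
})

# Make sure the column is numeric (unparsable entries become NaN).
df["workyear"] = pd.to_numeric(df["workyear"], errors="coerce")

# pd.cut bins are right-closed by default: (-inf, 9], (9, 10], (10, inf).
# For whole years this means: less than 10, exactly 10, more than 10.
bins = [float("-inf"), 9, 10, float("inf")]
labels = ["<10", "10", ">10"]

# Write the labels to a NEW column so workyear stays numeric.
df["workgroup"] = pd.cut(df["workyear"], bins=bins, labels=labels)
df["Gender"] = df["gender"].map({0: "male", 1: "female"})

# Mean wage per (group, gender); one row per group, one column per gender.
table = df.pivot_table(values="wage", index="workgroup", columns="Gender",
                       aggfunc="mean", observed=False)
print(table)

# With matplotlib installed, this draws one pair of bars per group:
# ax = table.plot.bar()
# ax.set_ylabel("mean wage")
```

If workyear can be fractional, the edges 9 and 10 would put 9.5 into the "10" group, so the edges would need adjusting.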


9 Comments

There was no 10 in your sample data :). For you it is [min,10,11,max]. With pd.cut's default (right=True), ( is excluded and ] is included; pass right=False for the reverse.
I accidentally deleted my previous comment. Thank you so much again @B. M. I actually have more data (n=534). I changed the code using [min,10,11, max], but I still have an error.
I edited because I realized that mean is not necessary, since ('G','gender') is a unique key. Is it better?
Or are there any other ways to cut the variable into three groups? Let's say the first group includes 0 ~ 9.9, the second group has 10 only, and the last group is 10.1 and above.
[min,9.9,10.1, max] ?
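One other way to get the three groups, sketched here as an alternative to pd.cut rather than something from the answer, is to spell out the conditions with np.select. That sidesteps the question of which bin edge is included and also works for fractional years. The sample values and the name workgroup are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical work-year values, including fractional ones near 10.
years = pd.Series([0, 9.9, 10, 10.1, 12, 17], name="workyear")

# One boolean condition per group, checked in order.
conditions = [years < 10, years == 10, years > 10]
labels = ["<10", "10", ">10"]

# Rows matching none of the conditions (e.g. NaN) get "unknown".
group = pd.Series(
    np.select(conditions, labels, default="unknown"),
    index=years.index,
    name="workgroup",
)
print(pd.concat([years, group], axis=1))
```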