I am reading data from an Excel file and processing it in Python. However, one column contains some strings where I need integers. When I try to sort the data, I get an error because the column mixes numbers and strings.

I am trying to count the number of murders committed by perpetrators of each age in the file.

This is my code to do so:

import collections
import json

import pandas as pd

xl = pd.ExcelFile('Murders.xlsx')
df = xl.parse('Sheet1')
#df = df[df["Perpetrator Age"].ne("Blanks")]
age = df['Perpetrator Age']

#print(df["Perpetrator Age"].dtype)
# Sorting a column that mixes int and str values is what raises the TypeError
freq1 = collections.Counter(df['Perpetrator Age'].sort_values())
freq = [{'Perpetrator_Age': m, 'Freq': f} for m, f in freq1.items()]
with open("MurderPerpAge.js", "w") as file:
    file.write(json.dumps(freq))

I have tried using the Filter button built into Excel; however, there still appear to be strings in the data. This is the error:

TypeError: '<' not supported between instances of 'int' and 'str'

I expect the output to be ordered by age, as shown in the example below:

[{"Perpetrator_Age": 15, "Freq": 5441}, {"Perpetrator_Age": 17, "Freq": 14196},...
  • What do you want to do with the strings inside the Excel data? Do you want to reject the records? Or somehow fix them so they will be brought into Python? Commented May 11, 2019 at 6:40
  • An example of input, expected output and the code you use may help get you a solution... Commented May 11, 2019 at 6:45
  • Can you look at the data that causes the error? This could lead you to the solution. Maybe you can locate the problem and convert the strings into integers before comparing them. (I don't speak python, though) Commented May 11, 2019 at 9:09
  • I edited the question for some clarity. Commented May 11, 2019 at 15:49
  • @WolfgangJacques I can't edit the data because I have over 600,000 rows in the file Commented May 11, 2019 at 16:00

1 Answer

I would recommend converting the column with Series.astype('int16') (int16 because age has a very limited range) and then sorting by that column:

df['Perpetrator Age'] = df['Perpetrator Age'].astype('int16')
df = df.sort_values(by='Perpetrator Age')

In [14]: df['Perpetrator Age'].astype('int16').sort_values(axis=0).head()                                 
Out[14]: 
83    15
62    15
64    15
27    15
48    17
Name: Perpetrator Age, dtype: int16
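
Note that astype raises a ValueError if the column holds strings that cannot be parsed as integers (the commented-out line in the question hints at a value like "Blanks"). In that case, a minimal sketch along the following lines might help. It assumes that unparseable rows should simply be dropped, and it uses a small made-up DataFrame in place of Murders.xlsx; pd.to_numeric with errors="coerce" turns such values into NaN so they can be inspected and removed before counting:

```python
import json

import pandas as pd

# Hypothetical sample standing in for the 'Perpetrator Age' column of Murders.xlsx
df = pd.DataFrame({"Perpetrator Age": [15, "17", " ", 15, "Blanks", 17, 17]})

# Convert to numbers; anything that cannot be parsed becomes NaN
ages = pd.to_numeric(df["Perpetrator Age"], errors="coerce")

# Optional: look at the distinct values that failed to convert
bad_values = df.loc[ages.isna(), "Perpetrator Age"].unique()
print(bad_values)

# Assumption: rows with unparseable ages are dropped rather than repaired
ages = ages.dropna().astype("int16")

# Count occurrences of each age, ordered by age
counts = ages.value_counts().sort_index()
freq = [{"Perpetrator_Age": int(age), "Freq": int(n)} for age, n in counts.items()]
print(json.dumps(freq))
```

Because value_counts().sort_index() orders the result by age, no separate sort or collections.Counter is needed, and the list can be written out with json.dumps as in the question.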

I hope it helps!
