How to filter strings out of integer column in Excel to process in Python

Question

I am reading data from an Excel file and processing it in Python. However, the column appears to contain some strings, while I need all the values to be integers. When I try to sort the data, I get an error because it is comparing numbers with strings.

I am trying to count the number of murders committed by perpetrators of each age in the file.

This is my code to do so:

import collections
import json

import pandas as pd

xl = pd.ExcelFile('Murders.xlsx')
df = xl.parse('Sheet1')
#df = df[df["Perpetrator Age"].ne("Blanks")]
age = df['Perpetrator Age']

#print(df["Perpetrator Age"].dtype)
# sort_values() is where the TypeError is raised when the column mixes int and str
freq1 = collections.Counter(df['Perpetrator Age'].sort_values())
freq = [{'Perpetrator_Age': m, 'Freq': f} for m, f in freq1.items()]
# Write the frequency list out as JSON
with open("MurderPerpAge.js", "w+") as file:
    file.write(json.dumps(freq))

I have tried using the Filter button built into Excel; however, there still appear to be strings in the data. This is the error I get:

TypeError: '<' not supported between instances of 'int' and 'str'

I expect the output to be ordered by age, as in the example below:

[{"Perpetrator_Age": 15, "Freq": 5441}, {"Perpetrator_Age": 17, "Freq": 14196},...

What do you want to do with the strings inside the Excel data? Do you want to reject the records? Or somehow fix them so they will be brought into Python?
– wavery, Commented May 11, 2019 at 6:40
An example of input, expected output and the code you use may help get you a solution...
– Solar Mike, Commented May 11, 2019 at 6:45
Can you look at the data that causes the error? This could lead you to the solution. Maybe you can locate the problem and convert the strings into integers before comparing them. (I don't speak Python, though)
– Wolfgang Jacques, Commented May 11, 2019 at 9:09
@WolfgangJacques I can't edit the data because I have over 600,000 rows in the file
– treatyoself, Commented May 11, 2019 at 16:00

ottobricks · Accepted Answer · 2019-05-12 02:00:36Z

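I would recommend using the Series method astype('int16') (int16 because you are dealing with ages, which have a very limited range). Note that astype raises a ValueError if the column contains strings that cannot be parsed as integers, so such values have to be removed or converted first.

df['Perpetrator Age'] = df['Perpetrator Age'].astype('int16')
df = df.sort_values(by='Perpetrator Age')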

In [14]: df['Perpetrator Age'].astype('int16').sort_values(axis=0).head()                                 
Out[14]: 
83    15
62    15
64    15
27    15
48    17
Name: Perpetrator Age, dtype: int16

I hope it helps!
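
If the column also holds text that cannot be converted (the commented-out line in the question suggests a value such as "Blanks"), one option is to coerce the column with pd.to_numeric before casting. The following is only a minimal sketch under that assumption: the small DataFrame is hypothetical sample data standing in for the sheet read from Murders.xlsx, and dropping the unparseable rows is a choice the question does not confirm.

```python
import json

import pandas as pd

# Hypothetical sample standing in for the 'Perpetrator Age' column of Murders.xlsx.
df = pd.DataFrame({"Perpetrator Age": [17, "15", "unknown", 15, "Blanks", 17, 17]})

# Convert every value to a number; anything that cannot be parsed becomes NaN.
ages = pd.to_numeric(df["Perpetrator Age"], errors="coerce")

# Drop the unparseable rows (an assumption), then store the ages as int16.
ages = ages.dropna().astype("int16")

# Count how often each age occurs and order the counts by age.
counts = ages.value_counts().sort_index()

# Build the list of records in the shape the question expects.
freq = [{"Perpetrator_Age": int(age), "Freq": int(n)} for age, n in counts.items()]

print(json.dumps(freq))
```

With the real file, the first step would instead read the sheet as in the question, and json.dumps(freq) could be written to MurderPerpAge.js as before.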
