17

I frequently deal with data which is poorly formatted (I.e. number fields are not consistent etc)

There may be other ways, which I am not aware of but the way I format a single column in a dataframe is by using a function and mapping the column to that function.

format = df.column_name.map(format_number)

Question: 1 - what if I have a dataframe with 50 columns, and want to apply that formatting to multiple columns, etc column 1, 3, 5, 7, 9,

Can you go:

format = df.1,3,5,9.map(format_number)

.. This way I could format all my number columns in one line?

3 Answers 3

21

You can do df[['Col1', 'Col2', 'Col3']].applymap(format_number). Note, though that this will return new columns; it won't modify the existing DataFrame. If you want to put the values back in the original, you'll have to do df[['Col1', 'Col2', 'Col3']] = df[['Col1', 'Col2', 'Col3']].applymap(format_number).

Sign up to request clarification or add additional context in comments.

2 Comments

can you refer to the columns by number instead of name as well when doing that? Is there also any way to programatically create that string (which would change depending on the number of columns you had) and apply the format_number function? I.e. the above would work fine if I knew exactly how many columns were in the sheet every time, but If I didn't know the number of columns, and wanted to apply the same function to every column, is there a better way of doing it?
@yoshiserry: If you just want to apply it to all the columns, just do df.applymap(format_number).
3

You could use apply like this:

df.apply(lambda row: format_number(row), axis=1)

You would need to specify the columns though in your format_number function:

def format_number(row):
    row['Col1'] = doSomething(row['Col1']
    row['Col2'] = doSomething(row['Col2'])
    row['Col3'] = doSomething(row['Col3'])

This is not as elegant as @BrenBarn's answer but it has an advantage that the dataframe is modified in place so you don't need to assign the columns back again

4 Comments

could you explain what does it mean that it is modified in place, I'm not sure how that differs from the above solution, given I haven't grasped what lamda does?
@yoshiserry ignoring my code example, if you perform apply to a dataframe then the dataframe itself is modified by any changes in your function so you would not need to assign to the column, you may still need to depending on what your function is doing. The point being that you just need to call df.apply and not need to say do df[['col1','col2','col3']]=df.apply(lambda row: format_number(row), axis=1)), in my code the assignment is done by the format_number function so I guess the assignment is implicit rather than explicit like BrenBarn's answer
great, and so what exactly is the function of Lamda? I still haven't grasped what lamda does?
@yoshiserry there are lots of articles online such as pythonconquerstheuniverse.wordpress.com/2011/08/29/…, stackoverflow.com/questions/890128/python-lambda-why and secnetix.de/olli/Python/lambda_functions.hawk. Essentially it allows you to define an anonymous function so you don't have to define a special function or expression. In this case it captures the row and passes this as an arguement to format_number, it could easily have been df.apply(lambda x: x **2) to square the value
0

try:

df.apply(lambda row: [...], axis=1)

to:

df.transpose().apply(lambda col: [...]).transpose()

I kinda just found out by accident.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.