
I want to experiment with the raw=True option of the pandas apply function, as described on p. 155 of High Performance Python by Gorelick and Ozsvald. However, Python apparently treats raw=True as an argument for the function I'm applying, not for .apply itself! Here's a MWE:

import pandas as pd

df = pd.DataFrame(columns=('a', 'b'))
df.loc[0] = (1, 2)
df.loc[1] = (3, 4)

df['a'] = df['a'].apply(str, raw=True)

When I try to execute this, I get the following error:

TypeError: 'raw' is an invalid keyword argument for str()

The problem persists even if I use a lambda expression:

df['a'] = df['a'].apply(lambda x: str(x), raw=True)

The problem remains if I call a custom-defined function instead of str.

How do I get Pandas to recognize that raw=True is an argument for .apply and NOT str?

  • I'm not sure, but I think it is because you use pd.Series.apply and not pd.DataFrame.apply. Series doesn't seem to accept raw as an argument. Try df.apply(str, raw=True). Is that what you are searching for? Commented Aug 18, 2022 at 21:37
  • @Rabinzel Hmm. I think you've got it. The examples in the book are definitely using the df version, not the ser version. Commented Aug 18, 2022 at 21:39
  • If you still only want to apply it to column a, use double brackets; that way you pass a DataFrame instead of a Series: df[['a']].apply(str, raw=True) Commented Aug 18, 2022 at 21:41
  • That approach does have some side effects, though:

           a  b
    0  [1 3]  2
    1  [1 3]  4

    Commented Aug 18, 2022 at 21:43
  • Hmmk, well there's nothing to be gained from using raw=True with a Series, because pd.Series.apply already passes raw values. raw=True is useful for pd.DataFrame.apply because it passes NumPy arrays instead, which, depending on your function, can improve performance. As you can see in the documentation, there is no raw=True argument for a Series. Commented Aug 18, 2022 at 23:02
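
To make that last comment concrete, here is a minimal sketch (rebuilding the question's df; the lambdas just print what they receive) of what each apply variant passes to the applied function:

import pandas as pd

df = pd.DataFrame({'a': [1, 3], 'b': [2, 4]})

# pd.Series.apply has no raw parameter; the function is called once per
# scalar element, so there is nothing "raw" to request.
df['a'].apply(lambda x: print(type(x)))            # prints the scalar type per element

# pd.DataFrame.apply with the default raw=False passes each column as a Series.
df.apply(lambda col: print(type(col)))             # <class 'pandas.core.series.Series'>

# With raw=True each column arrives as a plain ndarray instead, skipping
# Series construction -- which is where the potential speedup comes from.
df.apply(lambda col: print(type(col)), raw=True)   # <class 'numpy.ndarray'>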

1 Answer


Referring to the comments: I don't think these are side effects. As the documentation states, with raw=True the "function receive[s] ndarray objects", so you pass the whole array to str and convert it to a single string like [1 3]. So you don't convert each value to a string; you convert the whole column to one string.

If you write a little helper function, you can see this:

def conv(col):
    # With raw=True, col arrives as a NumPy ndarray, not a pandas Series.
    print(f"input values: {col}")
    print(f"type input: {type(col)}\n")
    return str(col)

# df[['a']] (double brackets) is a one-column DataFrame, so DataFrame.apply
# accepts raw=True and conv receives the column as an ndarray.
t = df[['a']].apply(conv, raw=True)
print(f"{type(t)}:\n{t}\n")
print(f"first value: {type(t[0])}:\n{t[0]}\n")
print(f"{t[0][0]}")  # first character of the string "[1 3]"

Output:

input values: [1 3]
type input: <class 'numpy.ndarray'>

<class 'pandas.core.series.Series'>:
a    [1 3]
dtype: object

first value: <class 'str'>:
[1 3]

[

2 Comments

So if I want to convert a column from int to string using apply and raw=True, what is the exact syntax I should use?
My knowledge isn't deep enough to give advice here, but I think the answer is just: don't do it. The documentation also says, "If you are just applying a NumPy reduction function this will achieve much better performance." You aren't reducing anything here, so I don't think there is any advantage to using raw.
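
Following up on that exchange, a small sketch of the two idiomatic routes (assuming the question's df): astype for the int-to-string conversion, and raw=True reserved for NumPy reductions over a DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 3], 'b': [2, 4]})

# Converting a column from int to string: astype does this directly,
# no apply (and no raw=True) needed.
df['a'] = df['a'].astype(str)

# Where raw=True actually pays off: a NumPy reduction applied column-wise.
# Each column arrives as a bare ndarray, so np.sum runs without Series overhead.
totals = df[['b']].apply(np.sum, raw=True)
print(totals)  # b    6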
