5

I have a column in my pandas DataFrame df that contains strings with some trailing hex-encoded NULLs (\x00). At least I think that's what they are. When I tried to replace them with:

df['SOPInstanceUID'] = df['SOPInstanceUID'].replace('\x00', '')

the column is not updated. When I do the same with

df['SOPInstanceUID'] = df['SOPInstanceUID'].str.replace('\x00', '')

it's working fine. What's the difference here? (SOPInstanceUID is not an index.)

thanks

2 Answers

11

The former (Series.replace) looks for exact matches of the whole cell value; the latter (Series.str.replace) looks for matches in any part of the string, which is why the latter works for you.

The str methods mirror the standard Python string methods but are vectorised.


8 Comments

Not OP, but thank you for the info. Just a silly question: what do you mean by vectorised here?
@BowenLiu vectorised here means that instead of operating on a single row or value at a time, we operate on the entire column (although in practice it really means multiple values), so it's significantly faster
Thanks a lot for your explanation. So it can operate on multiple values at once, which saves computation time?
@BowenLiu correct, vectorization is in my opinion why you should be using numpy or pandas. Otherwise it's just a fancy data structure that makes indexing easier without any performance gain
Amazing! I never thought about the reasons behind using pandas and numpy for data handling. I just use them because everyone does and they have so many useful functions. But the reason these functions work well and fast is that they vectorize all the data? Could you explain in layman's terms how that works, please? I always thought it iterates through objects one by one, just like a for loop.
2

You did not pass regex=True, so replace only looked for values that exactly match '\x00'. str.replace does not require an exact match of the whole value, hence it worked.

str.replace(old, new[, count])

Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.

DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad', axis=None)

parameter: to_replace : str, regex, list, dict, Series, numeric, or None

str or regex:
str: a string exactly matching to_replace will be replaced with value
regex: regexes matching to_replace will be replaced with value
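As a rough illustration of the regex parameter described above, the sketch below passes regex=True so that replace substitutes inside each string instead of comparing whole values. The DataFrame contents are made up for the example; only the column name comes from the question.

```python
import pandas as pd

# Hypothetical data; only the column name is taken from the question.
df = pd.DataFrame({"SOPInstanceUID": ["1.2.840\x00\x00", "1.2.841"]})

# With regex=True, to_replace is treated as a pattern and matched inside each string.
cleaned = df["SOPInstanceUID"].replace("\x00", "", regex=True)

# If only trailing NULs should go, str.rstrip is a narrower alternative.
trimmed = df["SOPInstanceUID"].str.rstrip("\x00")

print(cleaned.tolist())
print(trimmed.tolist())
```

Either result can be assigned back to df['SOPInstanceUID'] in the same way as in the question.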

They're not actually in the string: you have unescaped control characters, which Python displays using the hexadecimal notation:

You can remove all non-word characters in the following way:

import re

re.sub(r'[^\w]', '', '\x00\x00\x00\x08\x01\x008\xe6\x7f')
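One caveat worth checking before applying this to UID-like values: [^\w] also removes dots. The sketch below uses a made-up UID string to compare it with a narrower pattern that removes only ASCII control characters; that narrower pattern is a suggestion, not something taken from the answer.

```python
import re

# Hypothetical UID-like value with trailing NULs.
uid = "1.2.840.10008\x00\x00"

# [^\w] removes everything that is not a word character, including the dots.
stripped_nonword = re.sub(r"[^\w]", "", uid)

# A narrower pattern (an assumption of this sketch) removes only ASCII control characters.
stripped_control = re.sub(r"[\x00-\x1f\x7f]", "", uid)

print(stripped_nonword)
print(stripped_control)
```

The first result loses the dots that separate the UID components, while the second keeps them and drops only the NULs.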

4 Comments

Ok, thanks to both of you. But when I call replace like this: df['SOPInstanceUID'].replace('\x00', '') I get the string back without trailing NULLs!? So, it seems to match, or is it just some kind of output formatting that doesn't show the NULLs?
You'll need to post raw data and code that demonstrates this. Also, your comment contradicts your question, which states that it didn't work
Yes, sorry. I meant that when I call the method without assigning back to the column, I get a string output in Jupyter without the trailing NULLs. When assigning as in my post, nothing happens. Confusing.
CMari, thanks. That was the missing part! I don't understand it thoroughly, but I'll try.
