5

I have a DataFrame whose rect column contains only strings. I need to extract x, y, w and h from it into separate columns. The dataset is very large, so I need an efficient approach.

df['rect'].head()
0    <Rect (120,168),260 by 120>
1    <Rect (120,168),260 by 120>
2    <Rect (120,168),260 by 120>
3    <Rect (120,168),260 by 120>
4    <Rect (120,168),260 by 120>

So far this solution works, but as you can see it's very messy:

df[['x', 'y', 'w', 'h']] = df['rect'].str.replace(r'<Rect \(', '', regex=True).str.replace(r'\),', ',', regex=True).str.replace(' by ', ',').str.replace('>', '').str.split(',', n=3, expand=True)

Is there a better way, possibly a regex approach?

2
  • Where is that string column created? Commented Sep 4, 2018 at 18:10
  • The string column is created in other functions that I don't have access to, so I have to work from here. Commented Sep 4, 2018 at 18:22

5 Answers

6

Using extractall

df['rect'].str.extractall(r'(\d+)').unstack().loc[:,0]
Out[267]: 
match    0    1    2    3
0      120  168  260  120
1      120  168  260  120
2      120  168  260  120
3      120  168  260  120
4      120  168  260  120

Assign the result to the new columns:

df[['x', 'y', 'w', 'h']] = df['rect'].str.extractall(r'(\d+)').unstack().loc[:,0]
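To make the intermediate shapes easier to follow, here is a minimal sketch of the same extractall idea broken into steps. The two-row sample frame and the names matches and wide are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data in the format shown in the question.
df = pd.DataFrame({"rect": ["<Rect (120,168),260 by 120>",
                            "<Rect (10,20),30 by 40>"]})

# extractall returns one row per regex match, indexed by
# (original row label, match number).
matches = df["rect"].str.extractall(r"(\d+)")

# Column 0 holds the single capture group; unstacking the "match"
# level turns match numbers 0..3 into columns.
wide = matches[0].unstack()
wide.columns = ["x", "y", "w", "h"]

# The extracted values are still strings at this point.
df[["x", "y", "w", "h"]] = wide
```

Because this picks up every run of digits, it assumes each string contains exactly four numbers; a row with more or fewer would change the number of columns after unstacking.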

5

Inline

Produce a copy (zip(*...) transposes the per-row lists of matches into one tuple per column)

df.assign(**dict(zip('xywh', zip(*df.rect.str.findall(r'\d+')))))

                          rect    x    y    w    h
0  <Rect (120,168),260 by 120>  120  168  260  120
1  <Rect (120,168),260 by 120>  120  168  260  120
2  <Rect (120,168),260 by 120>  120  168  260  120
3  <Rect (120,168),260 by 120>  120  168  260  120
4  <Rect (120,168),260 by 120>  120  168  260  120

Or just reassign to df

df = df.assign(**dict(zip('xywh', zip(*df.rect.str.findall(r'\d+')))))

df

                          rect    x    y    w    h
0  <Rect (120,168),260 by 120>  120  168  260  120
1  <Rect (120,168),260 by 120>  120  168  260  120
2  <Rect (120,168),260 by 120>  120  168  260  120
3  <Rect (120,168),260 by 120>  120  168  260  120
4  <Rect (120,168),260 by 120>  120  168  260  120

Inplace

Modify existing df

df[[*'xywh']] = pd.DataFrame(df.rect.str.findall(r'\d+').tolist(), index=df.index)

df

                          rect    x    y    w    h
0  <Rect (120,168),260 by 120>  120  168  260  120
1  <Rect (120,168),260 by 120>  120  168  260  120
2  <Rect (120,168),260 by 120>  120  168  260  120
3  <Rect (120,168),260 by 120>  120  168  260  120
4  <Rect (120,168),260 by 120>  120  168  260  120

2 Comments

Are you sure * works in df[[*'xywh']] = ...? I keep getting SyntaxError: invalid syntax
It's your version of Python. Yes, I'm sure. Unpacking inside a list literal requires Python 3.5 or later; what it's doing is unpacking the string into a list. You could do the same thing with a simple df[list('xywh')] or, more explicitly, with df[['x', 'y', 'w', 'h']]
3

If the strings all follow the specific format <Rect \((\d+),(\d+)\),(\d+) by (\d+)>, you can use this regular expression with the str.extract method:

df[['x','y','w','h']] = df.rect.str.extract(r'<Rect \((\d+),(\d+)\),(\d+) by (\d+)>')

df
#                          rect    x    y    w    h
#0  <Rect (120,168),260 by 120>  120  168  260  120
#1  <Rect (120,168),260 by 120>  120  168  260  120
#2  <Rect (120,168),260 by 120>  120  168  260  120
#3  <Rect (120,168),260 by 120>  120  168  260  120
#4  <Rect (120,168),260 by 120>  120  168  260  120
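One thing to keep in mind with a strict pattern like this: str.extract returns NaN for any row that does not match, and the extracted values are strings. The sketch below shows one possible way to convert them to numbers while tolerating unmatched rows. The sample frame (including the deliberately malformed third row) and the names parts and nums are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample; the last row intentionally does not match.
df = pd.DataFrame({"rect": ["<Rect (120,168),260 by 120>",
                            "<Rect (0,5),10 by 20>",
                            "garbage"]})

pattern = r"<Rect \((\d+),(\d+)\),(\d+) by (\d+)>"
parts = df["rect"].str.extract(pattern)
parts.columns = ["x", "y", "w", "h"]

# to_numeric converts the digit strings; unmatched rows remain missing.
# The nullable "Int64" dtype keeps integers alongside missing values.
nums = parts.apply(pd.to_numeric).astype("Int64")
```

If every row is known to match, a plain astype(int) on the extracted columns would likely be enough.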

3

Use str.extract, which extracts groups from regex into columns:

df['rect'].str.extract(r'\((?P<x>\d+),(?P<y>\d+)\),(?P<w>\d+) by (?P<h>\d+)', expand=True)

Result:

     x    y    w    h
0  120  168  260  120
1  120  168  260  120
2  120  168  260  120
3  120  168  260  120
4  120  168  260  120
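The result above is a separate frame. As a sketch of how it could be attached to the original data, the example below joins it back on the index. The sample frame and the names parts and result are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample data in the format shown in the question.
df = pd.DataFrame({"rect": ["<Rect (120,168),260 by 120>",
                            "<Rect (10,20),30 by 40>"]})

pattern = r"\((?P<x>\d+),(?P<y>\d+)\),(?P<w>\d+) by (?P<h>\d+)"

# Named groups become the column names of the extracted frame.
parts = df["rect"].str.extract(pattern, expand=True)

# join aligns on the index, so each row stays paired with its source string.
# The new columns still hold strings, not integers.
result = df.join(parts)
```

Using join rather than assigning column by column leaves df itself unchanged.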

0

This is one of those cases where it makes sense to "optimize" the data itself instead of trying to morph it into what a consumer wants. It's much easier to change clean data into a specialized format than it is to change a specialized format into something portable.

That said, if you really have to parse this, you can do something like

>>> import re
>>> re.findall(r'\d+', '<Rect (120,168),260 by 120>')
['120', '168', '260', '120']
>>>
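Applied to a whole column, the same re.findall idea might look like the sketch below, which precompiles the pattern and builds integer columns in one pass. The sample frame and the names DIGITS and rows are assumptions; whether this is faster than the pandas .str methods depends on the data and the pandas version, so it is worth timing on the real dataset.

```python
import re

import pandas as pd

# Hypothetical sample data in the format shown in the question.
df = pd.DataFrame({"rect": ["<Rect (120,168),260 by 120>",
                            "<Rect (10,20),30 by 40>"]})

# Compile once and reuse the pattern for every row.
DIGITS = re.compile(r"\d+")

# One list of four integers per row; assumes every string has four numbers.
rows = [[int(n) for n in DIGITS.findall(s)] for s in df["rect"]]

cols = ["x", "y", "w", "h"]
# Reuse df's index so the assignment aligns row for row.
df[cols] = pd.DataFrame(rows, index=df.index, columns=cols)
```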
