Using regex to selectively pull data into pandas dataframe

Question

I am using regex and pandas to read through lines of text in a file and selectively pull data into a dataframe.

Say I have the following line of text

Name : "Bob" Occupation : "Builder" Age : "42" Name : "Jim" Occupation : "" Age : "25"

I want to pull in all of this information into a dataframe so it looks like the following:

Name    Occupation    Age
Bob      Builder       42

I want to ignore reading in any of the information about the second person because their occupation is blank.

Code:

with open(txt, 'r') as f:
    for line in f:
        line = line.strip()
        a = re.findall(r'Name : \"(\S+)\"', line)
        if a:
            b = re.findall(r'Occupation : \"(\S+)\"', line)
            if b:
                c = re.findall(r'Age : \"(\S+)\"', line)
                if c:
                    df = df.append({'Name' : a, 'Occupation' : b, 'Age' : c}, ignore_index = True)

This would return the following (incorrect) dataframe

    Name        Occupation      Age
["Bob", "Jim"]  ["Builder"]  ["42","25"]

I want to modify this code so that it doesn't ever include the situation that "Jim" is in. i.e. if the person has no "occupation" then don't read their info into the dataframe. You can also see that this code is incorrect because it is now saying that "Jim" has an Occupation of "Builder".

If I was given the below line of text:

Name : "Bob" Occupation : "Builder" Age : "42" Name : "Jim" Occupation : "" Age : "25" Name : "Steve" Occupation : "Clerk" Age : "110"

The resulting df would be:

    Name              Occupation             Age
["Bob", "Steve"]  ["Builder", "Clerk"]  ["42","110"]

This is handy because I would no longer run into any indexing issues, so I could then expand this df into my end goal (which I know how to do):

Name  Occupation  Age
Bob   Builder     42
Steve Clerk       110

Is the order of the three keys "Name", "Occupation", "Age" always the same?
– jxc, Commented May 21, 2019 at 19:04
Yes, it is, but in this example there may be multiple values of "Occupation" or "Age" for any given "Name". There may also be no values at all, in which case I wouldn't want any of those to be read in.
– MaxB, Commented May 21, 2019 at 19:12

jxc · Accepted Answer · 2019-05-21 20:34:07Z

Based on your comment that the three keys Name, Occupation and Age always appear in the same order, we can use a single regex pattern to retrieve the field values while making sure the matched values are non-empty. Below is an example using Series.str.extractall():

import pandas as pd

# example texts copied from your post
text = """
Name : "Bob" Occupation : "Builder" Age : "42" Name : "Jim" Occupation : "" Age : "25" Name : "Steve" Occupation : "Clerk" Age : "110"
Name : "Bob" Occupation : "Builder" Age : "42" Name : "Jim" Occupation : "" Age : "25"
"""

# put each non-blank line into a one-column dataframe with column name 'text'
df = pd.DataFrame({'text': text.strip().splitlines()})

# 3 fields which have the same regex sub-pattern
fields = ['Name', 'Occupation', 'Age']

# regex pattern used to retrieve the values of the above fields: three sub-patterns,
# one per field, joined by one or more whitespace characters (\s+)
ptn = r'\s+'.join([ r'{0}\s*:\s*"(?P<{0}>[^"]+)"'.format(f) for f in fields ])
print(ptn)
#Name\s*:\s*"(?P<Name>[^"]+)"\s+Occupation\s*:\s*"(?P<Occupation>[^"]+)"\s+Age\s*:\s*"(?P<Age>[^"]+)"

Where:

The sub-pattern Name\s*:\s*"(?P<Name>[^"]+)" does basically the same as Name : "([^"]+)", but allows zero or more whitespace characters around the colon : and uses a named capturing group.
The plus character + in "([^"]+)" makes sure the value enclosed in double quotes is not empty, so Jim's profile is skipped because his Occupation is empty.
Named capturing groups give us the correct column names after running Series.str.extractall(); otherwise the resulting column names default to 0, 1 and 2.
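To see the effect of the + quantifier in isolation, here is a minimal sketch using plain re. The helper build_pattern is hypothetical (it is not part of the answer above) and uses unnamed groups for brevity; it only swaps the quantifier so the two results can be compared on the first sample line.

```python
import re

line = 'Name : "Bob" Occupation : "Builder" Age : "42" Name : "Jim" Occupation : "" Age : "25"'
fields = ['Name', 'Occupation', 'Age']

def build_pattern(quantifier):
    # Hypothetical helper: same sub-pattern for every field, only the quantifier varies.
    return r'\s+'.join(r'{0}\s*:\s*"([^"]{1})"'.format(f, quantifier) for f in fields)

# '+' requires at least one character between the quotes, so Jim's record cannot match.
non_empty = re.findall(build_pattern('+'), line)

# '*' also accepts an empty value, so Jim's record is returned with Occupation ''.
allow_empty = re.findall(build_pattern('*'), line)

print(non_empty)    # [('Bob', 'Builder', '42')]
print(allow_empty)  # [('Bob', 'Builder', '42'), ('Jim', '', '25')]
```

Because the three fields are matched as one unit, a record with any empty field is dropped as a whole, which is what keeps the columns aligned.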

Then you can check the result from Series.str.extractall():

df['text'].str.extractall(ptn)
          Name Occupation  Age
  match
0 0        Bob    Builder   42
  1      Steve      Clerk  110
1 0        Bob    Builder   42

Drop the level-1 index and you get a dataframe with the original index. You can join this back to the original dataframe if your task uses other columns from it.

df['text'].str.extractall(ptn).reset_index(level=1, drop=True)
###
    Name Occupation  Age
0    Bob    Builder   42
0  Steve      Clerk  110
1    Bob    Builder   42

Rakesh · Accepted Answer · 2019-05-21 14:11:04Z

Use re.finditer with named regex groups.

Ex:

import re
import pandas as pd

s = 'Name : "Bob" Occupation : "Builder" Age : "42" Name : "Jim" Occupation : "" Age : "25"'

name = re.findall(r'Name : \"(.*)\" ', s)
occupation = re.findall(r'Occupation : \"(.*)\" ', s)
age = re.findall(r'Age : \"(.*)\" ', s)

regexPattern = re.compile(r'Name : \"(?P<name>.*?)\"\s+Occupation : \"(?P<occupation>.*?)\"\s+Age : \"(?P<age>.*?)\"')

# keep a match only if all three captured values are non-empty
df = pd.DataFrame([m.groupdict() for m in regexPattern.finditer(s) if all(m.groupdict().values())])
print(df)

Output:

  age name occupation
0  42  Bob    Builder

edited May 21, 2019 at 14:11

answered May 21, 2019 at 13:57


Wiktor Stribiżew · Accepted Answer · 2019-05-21 20:30:08Z

You say these strings have a fixed format, Name comes first, Occupation follows and then comes Age. You may use

import re
import pandas as pd

pat = r'Name\s*:\s*"([^"]+)"\s*Occupation\s*:\s*"([^"]+)"\s*Age\s*:\s*"(\d+)"'
s='Name : "Bob" Occupation : "Builder" Age : "42" Name : "Jim" Occupation : "" Age : "25" Name : "Steve" Occupation : "Clerk" Age : "110"'
# each match is a (name, occupation, age) tuple, which becomes one row
df = pd.DataFrame(re.findall(pat, s), columns=['Name', 'Occupation', 'Age'])

Output:

>>> df
    Name Occupation  Age
0    Bob    Builder   42
1  Steve      Clerk  110

The regex is

Name\s*:\s*"([^"]+)"\s*Occupation\s*:\s*"([^"]+)"\s*Age\s*:\s*"(\d+)"

As the quantifier in the capturing groups is set to + (one or more occurrences), the values will never be empty. To also reject whitespace-only values in the first two groups, you may alter the pattern to Name\s*:\s*"([^"]*[^\s"][^"]*)"\s*Occupation\s*:\s*"([^"]*[^\s"][^"]*)"\s*Age\s*:\s*"(\d+)".
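The difference between the two patterns only shows up when a value consists of whitespace alone. Here is a small sketch; the sample string is modified from the question so that Jim's Occupation is a single space, which is an assumption made for illustration.

```python
import re

plus_pat = r'Name\s*:\s*"([^"]+)"\s*Occupation\s*:\s*"([^"]+)"\s*Age\s*:\s*"(\d+)"'
strict_pat = (r'Name\s*:\s*"([^"]*[^\s"][^"]*)"\s*'
              r'Occupation\s*:\s*"([^"]*[^\s"][^"]*)"\s*'
              r'Age\s*:\s*"(\d+)"')

# Modified sample: Jim's Occupation is " " (one space) instead of "".
s = 'Name : "Bob" Occupation : "Builder" Age : "42" Name : "Jim" Occupation : " " Age : "25"'

# [^"]+ is satisfied by a single space, so Jim still gets through.
with_plus = re.findall(plus_pat, s)

# [^\s"] demands at least one non-whitespace character, so Jim is rejected.
with_strict = re.findall(strict_pat, s)

print(with_plus)    # [('Bob', 'Builder', '42'), ('Jim', ' ', '25')]
print(with_strict)  # [('Bob', 'Builder', '42')]
```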

Details

Name - the literal text Name
\s*:\s* - : enclosed with 0+ whitespaces
" - a double quote
([^"]+) - Group 1: one or more chars other than "
" - a double quote
\s* - 0+ whitespaces
Occupation\s*:\s*" - Occupation, : enclosed with 0+ whitespaces and then "
([^"]+) - Group 2: one or more chars other than "
"\s*Age\s*:\s*" - ", 0+ whitespaces, Age, : enclosed with 0+ whitespaces and then "
(\d+) - Group 3: one or more digits
" - a double quote
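If you want the breakdown above to live next to the pattern itself, one option is re.VERBOSE, which ignores unescaped whitespace in the pattern and allows # comments. This is only a sketch of the same regex in a different layout, not a different technique.

```python
import re

pat = re.compile(r'''
    Name \s*:\s*          # Name, then : enclosed with 0+ whitespaces
    "([^"]+)"             # Group 1: one or more chars other than a double quote
    \s*                   # 0+ whitespaces
    Occupation \s*:\s*    # Occupation, then : enclosed with 0+ whitespaces
    "([^"]+)"             # Group 2: one or more chars other than a double quote
    \s*
    Age \s*:\s*           # Age, then : enclosed with 0+ whitespaces
    "(\d+)"               # Group 3: one or more digits
''', re.VERBOSE)

s = ('Name : "Bob" Occupation : "Builder" Age : "42" '
     'Name : "Jim" Occupation : "" Age : "25" '
     'Name : "Steve" Occupation : "Clerk" Age : "110"')

rows = pat.findall(s)
print(rows)  # [('Bob', 'Builder', '42'), ('Steve', 'Clerk', '110')]
```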
