
I am using regex and pandas to read through lines of text in a file and selectively pull data into a dataframe.

Say I have the following line of text

Name : "Bob" Occupation : "Builder" Age : "42" Name : "Jim" Occupation : "" Age : "25"

I want to pull in all of this information into a dataframe so it looks like the following:

Name    Occupation    Age
Bob      Builder       42

I want to ignore reading in any of the information about the second person because their occupation is blank.

Code:

import re
import pandas as pd

rows = []
with open(txt, 'r') as f:
    for line in f:
        line = line.strip()
        a = re.findall(r'Name : \"(\S+)\"', line)
        if a:
            b = re.findall(r'Occupation : \"(\S+)\"', line)
            if b:
                c = re.findall(r'Age : \"(\S+)\"', line)
                if c:
                    # each cell holds the whole list of matches for that field
                    rows.append({'Name' : a, 'Occupation' : b, 'Age' : c})
df = pd.DataFrame(rows)

This would return the following (incorrect) dataframe

    Name        Occupation      Age
["Bob", "Jim"]  ["Builder"]  ["42","25"]

I want to modify this code so that it doesn't ever include the situation that "Jim" is in. i.e. if the person has no "occupation" then don't read their info into the dataframe. You can also see that this code is incorrect because it is now saying that "Jim" has an Occupation of "Builder".

If I was given the below line of text:

Name : "Bob" Occupation : "Builder" Age : "42" Name : "Jim" Occupation : "" Age : "25" Name : "Steve" Occupation : "Clerk" Age : "110"

The resulting df would be:

    Name              Occupation             Age
["Bob", "Steve"]  ["Builder", "Clerk"]  ["42","110"]

This is handy because I would no longer run into any indexing issues, so I could then expand this df into my end goal (which I already know how to do):

Name  Occupation  Age
Bob   Builder     42
Steve Clerk       110
  • Are the orders of three keys "Name", "Occupation", "Age" always the same? Commented May 21, 2019 at 19:04
  • Yes they are, but in this example there may be multiple values of "Occupation", or "Age" for any given "Name". There may also be no values at all, in which case I wouldn't want any of those to be read in. Commented May 21, 2019 at 19:12

3 Answers


Since, per your comment, the 3 keys Name, Occupation and Age are always in the same order, we can use a single regex pattern to retrieve the field values while making sure the matched values are non-empty. Below is an example using Series.str.extractall():

import pandas as pd

# example texts copied from your post
text = """
Name : "Bob" Occupation : "Builder" Age : "42" Name : "Jim" Occupation : "" Age : "25" Name : "Steve" Occupation : "Clerk" Age : "110"
Name : "Bob" Occupation : "Builder" Age : "42" Name : "Jim" Occupation : "" Age : "25"
"""

# put each line into a one-column dataframe with column name 'text'
df = pd.DataFrame({'text': text.strip().splitlines()})

# 3 fields which have the same regex sub-pattern
fields = ['Name', 'Occupation', 'Age']

# regex pattern used to retrieve the values of the above fields. There are 3 sub-patterns,
# one per field, joined by one or more whitespace characters (\s+)
ptn = r'\s+'.join([ r'{0}\s*:\s*"(?P<{0}>[^"]+)"'.format(f) for f in fields ])
print(ptn)
#Name\s*:\s*"(?P<Name>[^"]+)"\s+Occupation\s*:\s*"(?P<Occupation>[^"]+)"\s+Age\s*:\s*"(?P<Age>[^"]+)"

Where:

  • The sub-pattern Name\s*:\s*"(?P<Name>[^"]+)" does basically the same as Name : "([^"]+)", but allows zero or more whitespace characters on either side of the colon and uses a named capturing group.
  • The plus character + in "([^"]+)" makes sure the value enclosed by double quotes is not empty, so Jim's profile is skipped because his Occupation is empty.
  • Named capturing groups give us the correct column names after running Series.str.extractall(); otherwise the resulting column names default to 0, 1 and 2.
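
To see the effect of the + quantifier outside pandas, here is a minimal sketch that uses only the standard re module. The pattern is built the same way as ptn; the helper name extract_profiles is made up for illustration and is not part of the answer.

```python
import re

FIELDS = ['Name', 'Occupation', 'Age']

# Same construction as ptn: one named group per field, joined by \s+
PATTERN = re.compile(
    r'\s+'.join(r'{0}\s*:\s*"(?P<{0}>[^"]+)"'.format(f) for f in FIELDS)
)

def extract_profiles(line):
    """Return one dict per complete profile (hypothetical helper name)."""
    return [m.groupdict() for m in PATTERN.finditer(line)]

line = ('Name : "Bob" Occupation : "Builder" Age : "42" '
        'Name : "Jim" Occupation : "" Age : "25" '
        'Name : "Steve" Occupation : "Clerk" Age : "110"')

profiles = extract_profiles(line)
for p in profiles:
    print(p)
```

Jim's block never matches because [^"]+ needs at least one character between the quotes, so only Bob and Steve are returned.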

Then you can check the result from Series.str.extractall():

df['text'].str.extractall(ptn)
          Name Occupation  Age
  match
0 0        Bob    Builder   42
  1      Steve      Clerk  110
1 0        Bob    Builder   42

Drop the level-1 index and you get a dataframe that keeps the original index. You can join this back to the original dataframe if your task uses other columns from it.

df['text'].str.extractall(ptn).reset_index(level=1, drop=True)
###
    Name Occupation  Age
0    Bob    Builder   42
0  Steve      Clerk  110
1    Bob    Builder   42

Use re.finditer with named capturing groups.

Ex:

import re
import pandas as pd

s = 'Name : "Bob" Occupation : "Builder" Age : "42" Name : "Jim" Occupation : "" Age : "25"'

# one named group per field; .*? is lazy, so each value stops at the next quote
regexPattern = re.compile(r'Name : \"(?P<name>.*?)\"\s+Occupation : \"(?P<occupation>.*?)\"\s+Age : \"(?P<age>.*?)\"')

# keep a match only if all three captured values are non-empty
df = pd.DataFrame([m.groupdict() for m in regexPattern.finditer(s) if all(m.groupdict().values())])
print(df)

Output:

  name occupation age
0  Bob    Builder  42


You say these strings have a fixed format: Name comes first, Occupation follows, and then comes Age. You may use

import re
import pandas as pd

pat = r'Name\s*:\s*"([^"]+)"\s*Occupation\s*:\s*"([^"]+)"\s*Age\s*:\s*"(\d+)"'
s='Name : "Bob" Occupation : "Builder" Age : "42" Name : "Jim" Occupation : "" Age : "25" Name : "Steve" Occupation : "Clerk" Age : "110"'
# collect one dict per match, then build the dataframe in one go
# (DataFrame.append was removed in pandas 2.0)
rows = [{'Name' : name, 'Occupation' : occupation, 'Age' : age}
        for name, occupation, age in re.findall(pat, s)]
df = pd.DataFrame(rows, columns=['Name', 'Occupation', 'Age'])

Output:

>>> df
    Name Occupation  Age
0    Bob    Builder   42
1  Steve      Clerk  110

The regex is

Name\s*:\s*"([^"]+)"\s*Occupation\s*:\s*"([^"]+)"\s*Age\s*:\s*"(\d+)"

As the quantifier in each capturing group is + (one or more occurrences), the captured values will never be empty. To also reject values that consist only of whitespace in the first two fields, you may alter the pattern to Name\s*:\s*"([^"]*[^\s"][^"]*)"\s*Occupation\s*:\s*"([^"]*[^\s"][^"]*)"\s*Age\s*:\s*"(\d+)".
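
As a rough illustration of the difference between the two value sub-patterns, the sketch below tests each one against a few Occupation fragments. The sample strings and the variable names loose and strict are invented for this example.

```python
import re

# [^"]+ accepts any non-empty value, including whitespace only
loose = re.compile(r'Occupation\s*:\s*"([^"]+)"')
# [^"]*[^\s"][^"]* additionally requires one non-whitespace character
strict = re.compile(r'Occupation\s*:\s*"([^"]*[^\s"][^"]*)"')

samples = [
    'Occupation : "Builder"',
    'Occupation : ""',
    'Occupation : "  "',
    'Occupation : " Clerk "',
]

# (loose matches?, strict matches?) for every sample
results = [(bool(loose.search(s)), bool(strict.search(s))) for s in samples]
for s, r in zip(samples, results):
    print(repr(s), r)
```

Only the whitespace-only value is treated differently: the first pattern accepts it and the second rejects it.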

Details

  • Name - the literal text Name
  • \s*:\s* - : enclosed with 0+ whitespaces
  • " - a double quote
  • ([^"]+) - Group 1: one or more chars other than "
  • " - a double quote
  • \s* - 0+ whitespaces
  • Occupation\s*:\s*" - Occupation, : enclosed with 0+ whitespaces and then "
  • ([^"]+) - Group 2: one or more chars other than "
  • "\s*Age\s*:\s*" - ", 0+ whitespaces, Age, : enclosed with 0+ whitespaces and then "
  • (\d+) - Group 3: one or more digits
  • " - a double quote
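
The breakdown above can also be kept next to the pattern itself. This is a sketch that assumes you are free to compile the pattern with re.VERBOSE, which ignores unescaped whitespace and allows # comments inside the pattern; the name PAT is chosen only for this example.

```python
import re

PAT = re.compile(r'''
    Name \s*:\s* "         # Name, colon enclosed with 0+ whitespaces, opening quote
    ([^"]+)                # group 1: one or more chars other than a double quote
    " \s*                  # closing quote, 0+ whitespaces
    Occupation \s*:\s* "   # Occupation, colon, opening quote
    ([^"]+)                # group 2: one or more chars other than a double quote
    " \s*                  # closing quote, 0+ whitespaces
    Age \s*:\s* "          # Age, colon, opening quote
    (\d+)                  # group 3: one or more digits
    "                      # closing quote
''', re.VERBOSE)

s = ('Name : "Bob" Occupation : "Builder" Age : "42" '
     'Name : "Jim" Occupation : "" Age : "25" '
     'Name : "Steve" Occupation : "Clerk" Age : "110"')

matches = PAT.findall(s)
print(matches)
```

With three capturing groups, re.findall returns one (name, occupation, age) tuple per complete profile, so the loop over the matches works unchanged.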
