I am using regex and pandas to read through lines of text in a file and selectively pull data into a dataframe.
Say I have the following line of text:
Name : "Bob" Occupation : "Builder" Age : "42" Name : "Jim" Occupation : "" Age : "25"
I want to pull this information into a dataframe so that it looks like the following:
Name  Occupation  Age
Bob   Builder     42
I want to ignore reading in any of the information about the second person because their occupation is blank.
Code:
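import re
import pandas as pd

rows = []
with open(txt, 'r') as f:  # txt holds the path to the input file
    for line in f:
        line = line.strip()
        # Each findall scans the whole line, so the three lists are built independently
        a = re.findall(r'Name : "(\S+)"', line)
        if a:
            b = re.findall(r'Occupation : "(\S+)"', line)
            if b:
                c = re.findall(r'Age : "(\S+)"', line)
                if c:
                    # Collect rows in a list (DataFrame.append was removed in pandas 2.0)
                    rows.append({'Name': a, 'Occupation': b, 'Age': c})
df = pd.DataFrame(rows)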
This would return the following (incorrect) dataframe:
Name            Occupation   Age
["Bob", "Jim"]  ["Builder"]  ["42", "25"]
I want to modify this code so that it never includes a person in Jim's situation, i.e. if a person has no occupation, none of their information should be read into the dataframe. The current output is also wrong in another way: the lists have different lengths, so matching them up by position effectively says that "Jim" has an Occupation of "Builder".
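To make the mismatch concrete, here is a minimal sketch that runs the same three patterns against the sample line directly, without the file or the dataframe. The variable names `names`, `occupations`, and `ages` are only for illustration and are not part of my code above.

```python
import re

line = 'Name : "Bob" Occupation : "Builder" Age : "42" Name : "Jim" Occupation : "" Age : "25"'

# Each pattern is searched across the whole line on its own,
# so nothing ties a name to the occupation or age that follows it.
names = re.findall(r'Name : "(\S+)"', line)
occupations = re.findall(r'Occupation : "(\S+)"', line)
ages = re.findall(r'Age : "(\S+)"', line)

# The empty occupation "" cannot match (\S+), so that list comes out shorter.
print(len(names), len(occupations), len(ages))
```

Because the occupation list is shorter than the other two, position alone can no longer tell which occupation belongs to which name.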
If I were given the line of text below:
Name : "Bob" Occupation : "Builder" Age : "42" Name : "Jim" Occupation : "" Age : "25" Name : "Steve" Occupation : "Clerk" Age : "110"
The df I would like the modified code to produce is:
Name              Occupation            Age
["Bob", "Steve"]  ["Builder", "Clerk"]  ["42", "110"]
This is handy because the three lists would then have the same length, so I would no longer run into any indexing issues and could expand this df into my end goal (which I already know how to do):
Name   Occupation  Age
Bob    Builder     42
Steve  Clerk       110
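One direction that might work, as a rough sketch rather than a tested solution, is to match a whole person at a time with a single pattern, so that a record with an empty occupation fails to match as a unit. The pattern name `RECORD` and the helper `parse_records` are hypothetical. I am also assuming that the three fields always appear in the order Name, Occupation, Age with exactly this spacing, and I have swapped `\S+` for `[^"]+` so that the occupation must contain at least one character that is not a quote.

```python
import re

# One pattern per person: the occupation group requires at least one
# non-quote character, so Occupation : "" makes the whole record fail.
RECORD = re.compile(r'Name : "([^"]*)" Occupation : "([^"]+)" Age : "([^"]*)"')

def parse_records(line):
    """Return one dict per person whose occupation is not blank."""
    return [
        {"Name": name, "Occupation": occupation, "Age": age}
        for name, occupation, age in RECORD.findall(line)
    ]

line = (
    'Name : "Bob" Occupation : "Builder" Age : "42" '
    'Name : "Jim" Occupation : "" Age : "25" '
    'Name : "Steve" Occupation : "Clerk" Age : "110"'
)
records = parse_records(line)
print(records)
```

Since each match already yields one complete row, the list of dicts could presumably be passed straight to `pd.DataFrame(records)`, which would skip the intermediate lists-in-cells step entirely.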