Python Text File to Data Frame with Specific Pattern

Question

I am trying to convert a bunch of text files into a data frame using Pandas.

Each text file contains plain text that starts with two relevant pieces of information: the Number and the Register variables.

Then, the text files have some random text that should not be taken into consideration.

Finally, the text files contain groups of information: the share number, the person's name, birth date, address, and some additional rows that start with a lowercase letter. Every group follows the same pattern: its first row starts with a number (the id), followed by the word "SHARE".

Here is an example:

Number 01600 London                           Register  4314

Some random text...

 1 SHARE: 73/1284
   John Smith
   BORN: 1960-01-01 ADR: Streetname 3/2   1000
   f 4222/2001
   h 1334/2000
   i 5774/2000
 4 SHARE: 58/1284
   Boris Morgan
   BORN: 1965-01-01 ADR: Streetname 4   2000
   c 4222/1988
   f 4222/2000

I need to transform the text into a data frame with the following output, where each group is stored in one row:

Number	Register	City	Id	Share	Name	Born	c	f	h	i
01600	4314	London	1	73/1284	John Smith	1960-01-01	NaN	4222/2001	1334/2000	5774/2000
01600	4314	London	4	58/1284	Boris Morgan	1965-01-01	4222/1988	4222/2000	NaN	NaN

My initial approach was to first read the text file and then apply a regular expression for each case:

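import pandas as pd
import re

# read the whole file into one string; the with block closes the file afterwards
with open(r'Test.txt', 'r') as f:
    text = f.read()

# print every line fragment that starts at the word SHARE
for line in re.findall(r'SHARE.*', text):
    print(line)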

But probably there is a better way to do it.

Any help is highly appreciated. Thanks in advance.

RJ Adriaansen · Accepted Answer · 2021-07-07 19:33:11Z


This can be done without regex, using list comprehensions and string splitting:

import pandas as pd

text = '''Number 01600 London                           Register  4314

Some random text...

 1 SHARE: 73/1284
   John Smith
   BORN: 1960-01-01 ADR: Streetname 3/2   1000
   f 4222/2001
   h 1334/2000
   i 5774/2000
 4 SHARE: 58/1284
   Boris Morgan
   BORN: 1965-01-01 ADR: Streetname 4   2000
   c 4222/1988
   f 4222/2000'''

text = [i.strip() for i in text.splitlines()] # create a list of lines

data = []

# extract metadata from first line
number = text[0].split()[1]
city = text[0].split()[2]
register = text[0].split()[4]

# collect the line numbers where new items start
# (enumerate avoids text.index(), which returns the first match for duplicated lines)
indices = [idx for idx, line in enumerate(text) if 'SHARE' in line]
# slice the list between consecutive start lines to get one list of lines per item
items = [text[i:j] for i, j in zip(indices, indices[1:] + [None])]

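for item in items:
    d = {
        'Number': number,
        'Register': register,
        'City': city,
        'Id': int(item[0].split()[0]),
        'Share': item[0].split(': ')[1],
        'Name': item[1],
        'Born': item[2].split()[1],
    }
    merged_rows = []

    for line in item[3:]:
        # split on the first run of whitespace only, so values may contain spaces
        parts = line.split(None, 1)
        if len(parts) == 2 and len(parts[0]) == 1 and parts[0].islower():
            merged_rows.append(parts)  # new row: [letter, value]
        elif parts and merged_rows:
            # wrapped row: join it to the previous value (the space is an assumption)
            merged_rows[-1][1] += ' ' + line
    d.update({name: value for name, value in merged_rows})
    data.append(d)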

# load the list of dicts as a DataFrame
df = pd.DataFrame(data)

Output:

	Number	Register	City	Id	Share	Name	Born	f	h	i	c
0	01600	4314	London	1	73/1284	John Smith	1960-01-01	4222/2001	1334/2000	5774/2000	nan
1	01600	4314	London	4	58/1284	Boris Morgan	1965-01-01	4222/2000	nan	nan	4222/1988
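
Since the question mentions a bunch of text files, the same logic could be wrapped in a function and applied to each file. The following is only a sketch under a few assumptions: all files sit in one folder, share the layout shown above, and have a one-word city in the first line. The names `parse_text`, `load_folder`, `FIXED_COLUMNS` and the `*.txt` pattern are hypothetical and not part of the original answer. Unlike the loop above, this sketch ignores wrapped rows. It also sorts the single-letter columns so they appear in the c, f, h, i order requested in the question.

```python
from pathlib import Path

import pandas as pd

FIXED_COLUMNS = ['Number', 'Register', 'City', 'Id', 'Share', 'Name', 'Born']


def parse_text(text):
    """Return one dict per SHARE group found in the text of a single file."""
    lines = [line.strip() for line in text.splitlines()]
    header = lines[0].split()
    meta = {'Number': header[1], 'Register': header[4], 'City': header[2]}

    # line numbers where a new group starts
    starts = [idx for idx, line in enumerate(lines) if 'SHARE' in line]
    records = []
    for start, end in zip(starts, starts[1:] + [None]):
        group = lines[start:end]
        record = dict(meta)
        record['Id'] = int(group[0].split()[0])
        record['Share'] = group[0].split(': ')[1]
        record['Name'] = group[1]
        record['Born'] = group[2].split()[1]
        for line in group[3:]:
            parts = line.split(None, 1)
            # keep only "<lowercase letter> <value>" rows in this sketch
            if len(parts) == 2 and len(parts[0]) == 1 and parts[0].islower():
                record[parts[0]] = parts[1]
        records.append(record)
    return records


def load_folder(folder, pattern='*.txt'):
    """Parse every matching file in the folder into one DataFrame."""
    records = []
    for path in sorted(Path(folder).glob(pattern)):
        records.extend(parse_text(path.read_text()))
    df = pd.DataFrame(records)
    # fixed columns first, then the letter columns in alphabetical order
    letter_columns = sorted(c for c in df.columns if c not in FIXED_COLUMNS)
    return df.reindex(columns=FIXED_COLUMNS + letter_columns)
```

Calling `load_folder('some_folder')` (a placeholder path) would then return one row per group across all files, with missing letters filled as NaN.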

edited Jul 7, 2021 at 19:33

answered Jul 5, 2021 at 22:03

RJ Adriaansen


10 Comments

roboes Over a year ago

First of all, thank you very much for your answer. For some rows starting with a letter (similar to my example above with the letters c, f, h, i) (code: d.update({name: value for name,value in (s.split() for s in i[3:])})) I am receiving the following error: "ValueError: too many values to unpack (expected 2)". Is it possible maybe to ignore some specific letters or increase the expected amount of information for each? Thanks in advance!

RJ Adriaansen Over a year ago

There are possibly multiple space-separated values on that line. You could replace s.split() with s.split(" ", 1). This will split on the first space only.
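
To illustrate the difference, here is a small sketch with a made-up row that has extra space-separated values after the letter (the sample value is hypothetical, not taken from the original files):

```python
line = 'f 4222/2001 extra note'  # hypothetical row with more than two tokens

# Splitting on every run of whitespace yields more than two pieces,
# so unpacking the result into (name, value) would raise ValueError.
print(line.split())  # ['f', '4222/2001', 'extra', 'note']

# Limiting the split to the first space yields at most two pieces.
name, value = line.split(' ', 1)
print(name)   # f
print(value)  # 4222/2001 extra note
```

If the letter and the value may be separated by several spaces, `line.split(None, 1)` splits on the first run of whitespace instead of a single space.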

roboes Over a year ago

Now I finally understand why it is not fully working. The s.split(" ", 1) was very helpful, but it seems that in some cases the row is split into two. So basically what I need is to collect all the content up to the next row that starts with a lowercase letter (a, b, h, etc.). Do you know how to solve it? As the answer above is correct, I have already marked it as correct. :)
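
One way to collect everything up to the next row that starts with a lowercase letter is to treat any other row as a continuation of the previous one. This is a minimal sketch, assuming wrapped rows should be joined with a single space; `merge_rows` and the sample rows are hypothetical:

```python
def merge_rows(rows):
    """Group rows into letter/value pairs, folding wrapped rows into the previous pair."""
    merged = []
    for row in rows:
        parts = row.split(None, 1)
        starts_new = len(parts) == 2 and len(parts[0]) == 1 and parts[0].islower()
        if starts_new:
            merged.append(parts)  # [letter, value]
        elif parts and merged:
            # not a "<letter> <value>" row: append it to the previous value
            merged[-1][1] += ' ' + row.strip()
    return dict(merged)


rows = ['f 4222/2001 first part', 'second part', 'h 1334/2000']
print(merge_rows(rows))
# {'f': '4222/2001 first part second part', 'h': '1334/2000'}
```

A wrapped row that itself begins with a single lowercase letter followed by a space (for example "a house") would be mistaken for a new row, so this heuristic only holds if such rows do not occur in the files.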

RJ Adriaansen Over a year ago

Ah great that you found the problem. I have updated the answer with a solution.

roboes Over a year ago

Thank you! Unfortunately the proposed answer did not work - I receive the following error: "IndentationError: expected an indented block (<string>, line 7)". Even trying with the original example.
