I am trying to convert a bunch of text files into a data frame using Pandas.
Each text file contains simple text which starts with two relevant information: the Number and the Register variables.
Then, the text files have some random text we should not be taken into consideration.
Last, the text files contains information such as the share number, the name of the person, birth date, address and some additional rows that start with a lowercase letter. Each group contains such information, and the pattern is always the same: the first row for the group is defined by a number (hereby id), followed by the "SHARE" word.
Here is an example:
Number 01600 London Register 4314
Some random text...
1 SHARE: 73/1284
John Smith
BORN: 1960-01-01 ADR: Streetname 3/2 1000
f 4222/2001
h 1334/2000
i 5774/2000
4 SHARE: 58/1284
Boris Morgan
BORN: 1965-01-01 ADR: Streetname 4 2000
c 4222/1988
f 4222/2000
I need to transform the text into a data frame with the following output, where each group is stored in one row:
| Number | Register | City | Id | Share | Name | Born | c | f | h | i |
|---|---|---|---|---|---|---|---|---|---|---|
| 01600 | 4314 | London | 1 | 73/1284 | John Smith | 1960-01-01 | NaN | 4222/2001 | 1334/2000 | 5774/2000 |
| 01600 | 4314 | London | 4 | 58/1284 | Boris Morgan | 1965-01-01 | 4222/1988 | 4222/2000 | NaN | NaN |
My initial approach was to first import the text file and apply regular expression for each case:
import pandas as pd
import re
df = open(r'Test.txt', 'r').read()
for line in re.findall('SHARE.*', df):
print(line)
But probably there is a better way to do it.
Any help is highly appreciated. Thanks in advance.