0

I am trying to convert a bunch of text files into a data frame using Pandas.

Each text file contains simple text which starts with two relevant information: the Number and the Register variables.

Then, the text files have some random text we should not be taken into consideration.

Last, the text files contains information such as the share number, the name of the person, birth date, address and some additional rows that start with a lowercase letter. Each group contains such information, and the pattern is always the same: the first row for the group is defined by a number (hereby id), followed by the "SHARE" word.

Here is an example:

Number 01600 London                           Register  4314

Some random text...

 1 SHARE: 73/1284
   John Smith
   BORN: 1960-01-01 ADR: Streetname 3/2   1000
   f 4222/2001
   h 1334/2000
   i 5774/2000
 4 SHARE: 58/1284
   Boris Morgan
   BORN: 1965-01-01 ADR: Streetname 4   2000
   c 4222/1988
   f 4222/2000

I need to transform the text into a data frame with the following output, where each group is stored in one row:

Number Register City Id Share Name Born c f h i
01600 4314 London 1 73/1284 John Smith 1960-01-01 NaN 4222/2001 1334/2000 5774/2000
01600 4314 London 4 58/1284 Boris Morgan 1965-01-01 4222/1988 4222/2000 NaN NaN

My initial approach was to first import the text file and apply regular expression for each case:

import pandas as pd
import re

df = open(r'Test.txt', 'r').read()

for line in re.findall('SHARE.*', df):
   print(line)

But probably there is a better way to do it.

Any help is highly appreciated. Thanks in advance.

1 Answer 1

2

This can be done without regex with list comprehension and splitting strings:

import pandas as pd

text = '''Number 01600 London                           Register  4314

Some random text...

 1 SHARE: 73/1284
   John Smith
   BORN: 1960-01-01 ADR: Streetname 3/2   1000
   f 4222/2001
   h 1334/2000
   i 5774/2000
 4 SHARE: 58/1284
   Boris Morgan
   BORN: 1965-01-01 ADR: Streetname 4   2000
   c 4222/1988
   f 4222/2000'''

text = [i.strip() for i in text.splitlines()] # create a list of lines

data = []

# extract metadata from first line
number = text[0].split()[1]
city = text[0].split()[2]
register = text[0].split()[4]

# create a list of the index numbers of the lines where new items start
indices = [text.index(i) for i in text if 'SHARE' in i]
# split the list by the retrieved indexes to get a list of lists of items
items = [text[i:j] for i, j in zip([0]+indices, indices+[None])][1:]

for i in items:
    d = {'Number': number, 'Register': register, 'City': city, 'Id': int(i[0].split()[0]), 'Share': i[0].split(': ')[1], 'Name': i[1], 'Born': i[2].split()[1], }
    items = list(s.split() for s in i[3:])
    merged_items = []

    for i in items:
        if len(i[0]) == 1 and i[0].isalpha():
            merged_items.append(i)
        else:
            merged_items[-1][-1] = merged_items[-1][-1] + i[0]
    d.update({name: value for name,value in merged_items})
    data.append(d)

#load the list of dicts as a dataframe
df = pd.DataFrame(data)

Output:

Number Register City Id Share Name Born f h i c
0 01600 4314 London 1 73/1284 John Smith 1960-01-01 4222/2001 1334/2000 5774/2000 nan
1 01600 4314 London 4 58/1284 Boris Morgan 1965-01-01 4222/2000 nan nan 4222/1988
Sign up to request clarification or add additional context in comments.

10 Comments

First of all, thank you very much for your answer. For some rows starting with a letter (similar to my example above with the letters c, f, h, i) (code: d.update({name: value for name,value in (s.split() for s in i[3:])})) I am receiving the following error: "ValueError: too many values to unpack (expected 2)". Is it possible maybe to ignore some specific letters or increase the expected amount of information for each? Thanks in advance!
There are possibly multiple space-separated values on that line. You could replace s.split() with s.split(" ", 1). This will split on the first space only.
Now I finally understood why it is not fully working. The s.split(" ", 1) was very helpful - but is seems that in some cases, the row is split into two. So basically what I need is to get all the content until the next row where it starts with a lowercase letter (a, b, h, etc.). Do you know how to solve it? As the answer above is correct, I have already marked it as correct. :)
Ah great that you found the problem. I have updated the answer with a solution.
Thank you! Unfortunately the proposed answer did not work - I receive the following error: "IndentationError: expected an indented block (<string>, line 7)". Even trying with the original example.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.