python extracting data from file to dataframe

Question

I have some sort of generic index imported with

f = open(indexfile, "r")

and the resulting object is a _io.TextIOWrapper that looks like this:

GROUP_FIELD_NAME:ID
GROUP_FIELD_VALUE:1 
GROUP_FIELD_NAME:NAME
GROUP_FIELD_VALUE:Joe 
GROUP_OFFSET:0
GROUP_LENGTH:1234
GROUP_FILENAME:/tmp/something1
GROUP_FIELD_NAME:ID
GROUP_FIELD_VALUE:2 
GROUP_FIELD_NAME:NAME
GROUP_FIELD_VALUE:Jenny 
GROUP_OFFSET:1235
GROUP_LENGTH:12
GROUP_FILENAME:/tmp/something2

Some data fields are extracted by combining a corresponding _NAME and _VALUE line, while others only require looking at the name (_OFFSET, _LENGTH, _FILENAME), e.g. by looping through each line and populating lists, something like this:

import pandas as pd

ID = []
NAME = []
GROUP_LENGTH = []
GROUP_OFFSET = []
GROUP_FILENAME = []

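for line in f:
    key, _, value = line.strip().partition(':')
    if key == 'GROUP_OFFSET':
        GROUP_OFFSET.append(value)
    elif key == 'GROUP_FIELD_NAME' and value == 'ID':
        # the matching GROUP_FIELD_VALUE is on the next line
        ID.append(next(f).strip().partition(':')[2])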


a = {'ID': ID,
     'NAME': NAME,
     'GROUP_LENGTH': GROUP_LENGTH,
     'GROUP_OFFSET': GROUP_OFFSET,
     'GROUP_FILENAME': GROUP_FILENAME     
     }

df = pd.DataFrame.from_dict(a, orient='index')

df = df.transpose()

How can I get to something like this:

ID     NAME    GROUP_LENGTH    GROUP_OFFSET    GROUP_FILENAME
1      Joe     1234            0               /tmp/something1
2      Jenny   12              1235            /tmp/something2

The file is imported using f = open(indexfile, "r"), and the resulting object is a _io.TextIOWrapper.
– Ullsokk, Commented Sep 20, 2019 at 11:11

RomanPerekhrest · Accepted Answer · 2019-09-20 11:53:17Z

2

Accumulate the records in collections.OrderedDict objects, starting a new record each time a GROUP_FIELD_NAME:ID line is seen:

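import pandas as pd
from collections import OrderedDict

with open('input.ind') as f:
    records = []
    for line in f:
        # split on the first ':' only, so values may themselves contain ':'
        name, val = line.strip().split(':', 1)
        if name == 'GROUP_FIELD_NAME':
            if val == 'ID':
                # an ID field marks the start of a new record
                records.append(OrderedDict())
            # the field's value is on the following GROUP_FIELD_VALUE line
            records[-1][val] = next(f).strip().split(':', 1)[1]
        else:
            # GROUP_OFFSET, GROUP_LENGTH and GROUP_FILENAME carry their value directly
            records[-1][name] = val

df = pd.DataFrame(records)
print(df)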

The expected output:

  ID   NAME GROUP_OFFSET GROUP_LENGTH   GROUP_FILENAME
0  1    Joe            0         1234  /tmp/something1
1  2  Jenny         1235           12  /tmp/something2
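Note that every column in the output above holds strings, because the values are taken straight from the text. The following is a minimal sketch of the same accumulate-per-record idea, wrapped in a hypothetical helper named parse_index (not part of the original answer) that also skips blank lines and converts GROUP_OFFSET and GROUP_LENGTH to integers. It assumes that ID is always the first field of a record and that offsets and lengths are whole numbers; the sample text is copied from the question.

```python
import io

import pandas as pd

# Sample index content, copied from the question.
sample = """GROUP_FIELD_NAME:ID
GROUP_FIELD_VALUE:1
GROUP_FIELD_NAME:NAME
GROUP_FIELD_VALUE:Joe
GROUP_OFFSET:0
GROUP_LENGTH:1234
GROUP_FILENAME:/tmp/something1
GROUP_FIELD_NAME:ID
GROUP_FIELD_VALUE:2
GROUP_FIELD_NAME:NAME
GROUP_FIELD_VALUE:Jenny
GROUP_OFFSET:1235
GROUP_LENGTH:12
GROUP_FILENAME:/tmp/something2
"""


def parse_index(lines):
    """Hypothetical helper: turn an iterable of index lines into a DataFrame."""
    lines = iter(lines)
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        # Split on the first ':' only, so values may contain ':' themselves.
        name, val = line.split(":", 1)
        if name == "GROUP_FIELD_NAME":
            if val == "ID":
                # Assumption: ID is always the first field of a record.
                records.append({})
            # The value sits on the following GROUP_FIELD_VALUE line.
            records[-1][val] = next(lines).strip().split(":", 1)[1]
        else:
            records[-1][name] = val
    df = pd.DataFrame(records)
    # Assumption: offsets and lengths are always whole numbers.
    return df.astype({"GROUP_OFFSET": int, "GROUP_LENGTH": int})


df = parse_index(io.StringIO(sample))
print(df.dtypes)
```

With a real file, the same helper could be called as parse_index(f) inside a with open(...) as f: block, since a file object is also an iterable of lines. Plain dicts keep insertion order in Python 3.7 and later, so OrderedDict is not needed here.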

edited Sep 20, 2019 at 11:53

answered Sep 20, 2019 at 11:51

RomanPerekhrest


1 Comment

Ullsokk Over a year ago

This worked out great! Thank you for a brilliant solution

Lore · Answer · 2019-09-20 11:07:51Z

0

If you want to obtain a DataFrame directly, I suggest using read_csv with the sep parameter set to ':' (and header=None, since the file has no header row).

You should then have a DataFrame with two columns: one with the names and the other with the values.

Then you can use, for example, groupby to group the rows and apply operations to each group. An example from the official documentation:

>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
...                               'Parrot', 'Parrot'],
...                    'Max Speed': [380., 370., 24., 26.]})
>>> df
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
>>> df.groupby(['Animal']).mean()
        Max Speed
Animal
Falcon      375.0
Parrot       25.0

Finally, with transpose, you can obtain the final DataFrame.
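The answer does not spell these steps out for the index file itself, so here is one possible sketch of the read_csv route. It is an assumption about how the idea could be carried through, not the answerer's own code: it uses pivot rather than groupby and transpose, the column names key, value, record and column are made up for illustration, and it assumes no value contains a ':' (otherwise read_csv would see extra fields). The sample text is copied from the question and read through io.StringIO; a file path could be passed to read_csv instead.

```python
import io

import pandas as pd

# Sample index content, copied from the question.
sample = """GROUP_FIELD_NAME:ID
GROUP_FIELD_VALUE:1
GROUP_FIELD_NAME:NAME
GROUP_FIELD_VALUE:Joe
GROUP_OFFSET:0
GROUP_LENGTH:1234
GROUP_FILENAME:/tmp/something1
GROUP_FIELD_NAME:ID
GROUP_FIELD_VALUE:2
GROUP_FIELD_NAME:NAME
GROUP_FIELD_VALUE:Jenny
GROUP_OFFSET:1235
GROUP_LENGTH:12
GROUP_FILENAME:/tmp/something2
"""

# Two string columns: the text before the colon and the text after it.
raw = pd.read_csv(io.StringIO(sample), sep=":", header=None,
                  names=["key", "value"], dtype=str)
raw["value"] = raw["value"].str.strip()

# A new record starts at every GROUP_FIELD_NAME:ID row.
is_start = (raw["key"] == "GROUP_FIELD_NAME") & (raw["value"] == "ID")
raw["record"] = is_start.cumsum()

# A GROUP_FIELD_VALUE row takes its column name from the row just before it;
# every other row keeps its own key as the column name.
is_value = raw["key"] == "GROUP_FIELD_VALUE"
raw["column"] = raw["key"].where(~is_value, raw["value"].shift())

# Drop the GROUP_FIELD_NAME rows, then spread each record across one row.
rows = raw[raw["key"] != "GROUP_FIELD_NAME"]
df = rows.pivot(index="record", columns="column", values="value")

# pivot sorts the columns alphabetically; restore the order asked for.
df = df[["ID", "NAME", "GROUP_LENGTH", "GROUP_OFFSET", "GROUP_FILENAME"]]
df = df.reset_index(drop=True)
df.columns.name = None
print(df)
```

As in the loop-based answer, all values remain strings here; numeric columns would need an explicit conversion afterwards.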

edited Sep 20, 2019 at 11:07

answered Sep 20, 2019 at 11:06

Lore


2 Comments

Ullsokk Over a year ago

The file is not a CSV, but some Generic Indexer file (see ibm.com/support/knowledgecenter/en/SSQHWE_9.5.0/…). Can I "force" it to be read as a CSV?

Lore Over a year ago

I think you can try. At the very least, try converting it to a .txt file, exporting it, or something similar.
