Python: Create a pandas DataFrame from a txt file

Question

I have a file.txt like this, and I want to create a DataFrame from it with pandas.

# NISN:- 1234567
# FullName:- Joe Doe
# FirstName:- Joe
# LastName:- Doe
# School:- Klima
# E-mail:- [email protected]

# NISN:- 8901234
# FullName:- Jenny Low
# FirstName:- Jenny
# LastName:- Low
# School:- Kimcil
# E-mail:- [email protected]

How can I convert it to a DataFrame like this?

NISN     FullName   FirstName  LastName  School  E-mail
1234567  Joe Doe    Joe        Doe       Klima   [email protected]
8901234  Jenny Low  Jenny      Low       Kimcil  [email protected]

Edit:

I found some malformed lines in the file, like the sample below. How can I handle them?

# NISN:- 123456

7
# FullName:- Joe Doe
# FirstName:- Joe
# LastName:- Doe
# School:- Klima
# E-mail:- [email protected]



# NISN:- 8901234
# FullName:- Jenny Low
# FirstName:- Jenny
# LastName:- Low
# School:- Kimc

il
# E-mail:- [email protected]

jezrael · Accepted Answer · 2020-10-30 07:07:27Z

2

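If each group has exactly 6 rows, you can use read_csv with the names parameter to read 2 columns. In the separator, \s* is a regex for zero or more whitespace characters after :-:

import pandas as pd

file = "file.txt"
# raw string so \s reaches the regex engine unchanged
df = pd.read_csv(file, sep=r":-\s*", names=['a', 'b'], engine='python')
print(df)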
              a                b
0        # NISN          1234567
1    # FullName          Joe Doe
2   # FirstName              Joe
3    # LastName              Doe
4      # School            Klima
5      # E-mail    [email protected]
6        # NISN          8901234
7    # FullName        Jenny Low
8   # FirstName            Jenny
9    # LastName              Low
10     # School           Kimcil
11     # E-mail  [email protected]

An alternative way to read the file: use a separator that does not occur in the data, such as | or ¥, so that each line is read into a single column, then use Series.str.split with n=1 to split only on the first separator:

df = pd.read_csv(file, sep="|", names=['data'])
print(df)
                          data
0             # NISN:- 1234567
1         # FullName:- Joe Doe
2            # FirstName:- Joe
3             # LastName:- Doe
4             # School:- Klima
5     # E-mail:- [email protected]
6             # NISN:- 8901234
7       # FullName:- Jenny Low
8          # FirstName:- Jenny
9             # LastName:- Low
10           # School:- Kimcil
11  # E-mail:- [email protected]

# raw string; \s* also accepts lines with no space after ":-"
df = df.pop('data').str.split(r':-\s*', n=1, expand=True)
df.columns = ['a', 'b']
print(df)
              a                b
0        # NISN          1234567
1    # FullName          Joe Doe
2   # FirstName              Joe
3    # LastName              Doe
4      # School            Klima
5      # E-mail    [email protected]
6        # NISN          8901234
7    # FullName        Jenny Low
8   # FirstName            Jenny
9    # LastName              Low
10     # School           Kimcil
11     # E-mail  [email protected]
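
As a small illustration of why the split is limited to the first separator, the sketch below uses a made-up value that itself contains `:-`. A line like that is one possible cause of a "saw 3 fields" error when `:-` is used directly as the read_csv separator. The sample text and the `raw`/`parts` names are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical sample: the School value itself contains ":-".
text = "# NISN:- 1234567\n# School:- Klima:- East\n"

# Read each line whole; "|" is assumed not to occur in the data.
raw = pd.read_csv(io.StringIO(text), sep="|", names=["data"])

# n=1 splits on the first separator only, so the rest of the value is kept.
parts = raw["data"].str.split(r":-\s*", n=1, expand=True)
parts.columns = ["a", "b"]
print(parts)
```

With n=1 the second row keeps everything after the first `:-` in column b instead of being broken into a third field.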

Then use Series.str.strip and reshape the values of column b with numpy.ndarray.reshape:

df['a'] = df['a'].str.strip('# ')
df = pd.DataFrame(df.b.to_numpy().reshape(-1, 6), columns=df.a.iloc[:6].rename(None))
print(df)
      NISN   FullName FirstName LastName  School           E-mail
0  1234567    Joe Doe       Joe      Doe   Klima    [email protected]
1  8901234  Jenny Low     Jenny      Low  Kimcil  [email protected]
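
The reshape only works when the number of rows is an exact multiple of the group size. A minimal sketch of a guarded version follows; `reshape_records` is a hypothetical helper name, and the three-field sample is shortened from the question's data.

```python
import pandas as pd

def reshape_records(df, n_fields=6):
    """Turn a long a/b table into one row per record (hypothetical helper)."""
    if len(df) % n_fields != 0:
        # reshape(-1, n_fields) would fail here with a less helpful message
        raise ValueError(f"{len(df)} rows is not a multiple of {n_fields}")
    columns = df["a"].iloc[:n_fields].tolist()
    values = df["b"].to_numpy().reshape(-1, n_fields)
    return pd.DataFrame(values, columns=columns)

long_df = pd.DataFrame({
    "a": ["NISN", "FullName", "School"] * 2,
    "b": ["1234567", "Joe Doe", "Klima", "8901234", "Jenny Low", "Kimcil"],
})
wide = reshape_records(long_df, n_fields=3)
print(wide)
```

If the check fails, some record is missing a line or has an extra one, and the pivot approach is probably the safer choice.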

If some values may be missing but NISN always exists for each group, use DataFrame.pivot with a helper column that distinguishes the groups: compare column a with the first field name, NISN, and take Series.cumsum:

df['a'] = df['a'].str.strip('# ')
df['idx'] = df['a'].eq('NISN').cumsum()
df = df.pivot(index='idx', columns='a', values='b').reset_index(drop=True)
print(df)
a           E-mail FirstName   FullName LastName     NISN  School
0    [email protected]       Joe    Joe Doe      Doe  1234567   Klima
1  [email protected]     Jenny  Jenny Low      Low  8901234  Kimcil
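
Note that pivot sorts the columns alphabetically, as the output above shows. The sketch below, on a shortened made-up sample where the second record has no School line, shows how a missing field becomes NaN and one way to restore the original field order with reindex. The `order` and `wide` names are only illustrative.

```python
import pandas as pd

# Long-format sample; the second record has no School line (made-up gap).
df = pd.DataFrame({
    "a": ["NISN", "FullName", "School", "NISN", "FullName"],
    "b": ["1234567", "Joe Doe", "Klima", "8901234", "Jenny Low"],
})

# Remember the order in which field names first appear.
order = df["a"].drop_duplicates().tolist()

# A new group starts at every NISN line.
df["idx"] = df["a"].eq("NISN").cumsum()

wide = (
    df.pivot(index="idx", columns="a", values="b")
      .reindex(columns=order)                      # restore original field order
      .rename_axis(index=None, columns=None)       # drop the 'idx' and 'a' labels
      .reset_index(drop=True)
)
print(wide)
```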

edited Oct 30, 2020 at 7:07

answered Oct 30, 2020 at 6:44

jezrael


9 Comments

Hendra Over a year ago

Hi, thank you for the answer. I'm getting an error: ParserError: Expected 2 fields in line 2737, saw 3. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.

jezrael Over a year ago

@Hendra - added alternative solution for read file.

Hendra Over a year ago

It turns out that my file is bad. I found some lines that are not properly formatted. How can I find the badly formatted lines and fix them?

Hendra Over a year ago

I don't know. I can't search manually because there are 4500 lines. I have added sample bad lines to my post.

Hendra Over a year ago

Hi, sorry, I can't share my file, but my problem is solved. Thank you @jezrael.


Cameron Riddell · Accepted Answer · 2020-10-30 07:02:08Z

1

You can iterate over the lines of your file in Python and store the relevant data in a dictionary before converting it to a DataFrame:

import pandas as pd
from collections import defaultdict

data = defaultdict(list)
with open("file.txt") as my_file:
    for line in my_file:
        line = line.strip("# \n")        # clean up whitespace and # for lines
        if not line:                     # skip empty lines
            continue

        # split on the first ":-" only; the space after it is optional
        name, value = line.split(":-", 1)
        data[name.strip()].append(value.strip())

df = pd.DataFrame.from_dict(data)

print(df)
      NISN   FullName FirstName LastName  School           E-mail
0  1234567    Joe Doe       Joe      Doe   Klima    [email protected]
1  8901234  Jenny Low     Jenny      Low  Kimcil  [email protected]
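
For the wrapped lines shown in the question's edit, a more tolerant variant of this loop might look like the sketch below. It rests on an assumption that is not confirmed anywhere in the thread: a non-empty line that does not look like a field is the broken-off tail of the previous value and should be joined to it without a space. `FIELD` and `parse_records` are hypothetical names, and the e-mail address is made up.

```python
import re
from collections import defaultdict

# "# name:- value", with the space after ":-" optional.
FIELD = re.compile(r"^#\s*([^:]+?)\s*:-\s*(.*)$")

def parse_records(lines):
    """Collect 'name:- value' lines into {name: [values]} (hypothetical helper).

    Assumption: a non-empty line that does not look like a field is the
    wrapped tail of the previous value and is appended without a space.
    """
    data = defaultdict(list)
    last_name = None
    for line in lines:
        line = line.strip()
        if not line:                      # skip empty lines
            continue
        match = FIELD.match(line)
        if match:
            last_name, value = match.groups()
            data[last_name].append(value)
        elif last_name is not None:       # continuation of the previous value
            data[last_name][-1] += line
    return dict(data)

sample = """# NISN:- 123456

7
# School:- Kimc

il
# E-mail:-joe@example.com
"""
data = parse_records(sample.splitlines())
print(data)
```

The same function accepts an open file object in place of `sample.splitlines()`, and `pd.DataFrame(data)` builds the table as long as every field appears the same number of times.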

answered Oct 30, 2020 at 7:02

Cameron Riddell


4 Comments

Hendra Over a year ago

Hi friend, why am I getting the error ValueError: not enough values to unpack (expected 2, got 1)?

Cameron Riddell Over a year ago

Most likely you have lines in your file that are not representative of your post. Certain lines that don't contain information that should be skipped are not being skipped. Are there any lines that don't contain this format # name:- value that are not simply blank lines? Alternatively, are you sure that there are no pairings that are missing the space e.g. # name:-value?
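
The check described in this comment can be sketched as a small scan that reports every non-blank line that is not in the exact `# name:- value` form, together with its line number. `STRICT` and `find_bad_lines` are hypothetical names, and the pattern is only an assumption about what counts as well formed.

```python
import re

# Strict form used in the question: "# name:- value" with one space after ":-".
STRICT = re.compile(r"^# [^:]+:- \S.*$")

def find_bad_lines(lines):
    """Return (line_number, text) for non-blank lines not in the strict form."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if text.strip() and not STRICT.match(text):
            bad.append((number, text))
    return bad

sample = ["# NISN:- 123456\n", "\n", "7\n", "# School:-Klima\n", "# LastName:- Doe\n"]
print(find_bad_lines(sample))
```

Passing an open file object instead of `sample` scans the real file. A value cut off by a line break still matches the pattern, so it is the stray tail on the following line that gets reported.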

Hendra Over a year ago

It turns out that my file is bad. I found some lines that are not properly formatted. How can I find the badly formatted rows and fix them?

Hendra Over a year ago

My problem is solved with your code, thank you. You were right: I found lines like # name:-value.
