0

i have file.txt like this. i want create dataframe with pandas.

# NISN:- 1234567
# FullName:- Joe Doe
# FirstName:- Joe
# LastName:- Doe
# School:- Klima
# E-mail:- [email protected]

# NISN:- 8901234
# FullName:- Jenny Low
# FirstName:- Jenny
# LastName:- Low
# School:- Kimcil
# E-mail:- [email protected]

how to convert dataframe to this?

NISN    Fullname    FirstName   LastName    School  E-mail
1234567 Joe Doe     Joe     Doe     Klima   [email protected]
8901234 Jenny Low   Jenny       Low     Kimcil  [email protected]

edited


i found sample bad line in file. how to handle this?
# NISN:- 123456

7
# FullName:- Joe Doe
# FirstName:- Joe
# LastName:- Doe
# School:- Klima
# E-mail:- [email protected]



# NISN:- 8901234
# FullName:- Jenny Low
# FirstName:- Jenny
# LastName:- Low
# School:- Kimc

il
# E-mail:- [email protected]

2 Answers 2

2

If each group has exactly 6 rows you can use read_csv with name parameter for 2 columns, in separator \s* is regex for zero or more spaces after :-:

df = pd.read_csv(file, sep=":-\s*", names=['a', 'b'], engine='python')
print (df)
              a                b
0        # NISN          1234567
1    # FullName          Joe Doe
2   # FirstName              Joe
3    # LastName              Doe
4      # School            Klima
5      # E-mail    [email protected]
6        # NISN          8901234
7    # FullName        Jenny Low
8   # FirstName            Jenny
9    # LastName              Low
10     # School           Kimcil
11     # E-mail  [email protected]

Alternative for read file - used separator which not exist in data like | or ¥ and then use Series.str.split, but only by first separator by n=1:

df = pd.read_csv(file, sep="|", names=['data'])
print (df)
                          data
0             # NISN:- 1234567
1         # FullName:- Joe Doe
2            # FirstName:- Joe
3             # LastName:- Doe
4             # School:- Klima
5     # E-mail:- [email protected]
6             # NISN:- 8901234
7       # FullName:- Jenny Low
8          # FirstName:- Jenny
9             # LastName:- Low
10           # School:- Kimcil
11  # E-mail:- [email protected]

df = df.pop('data').str.split(':-\s', n=1, expand=True)
df.columns = ['a','b']
print (df)
              a                b
0        # NISN          1234567
1    # FullName          Joe Doe
2   # FirstName              Joe
3    # LastName              Doe
4      # School            Klima
5      # E-mail    [email protected]
6        # NISN          8901234
7    # FullName        Jenny Low
8   # FirstName            Jenny
9    # LastName              Low
10     # School           Kimcil
11     # E-mail  [email protected]

Then use Series.str.strip and reshape valus of column b by numpy.ndarray.reshape:

df['a'] = df['a'].str.strip('# ')
df = pd.DataFrame(df.b.to_numpy().reshape(-1, 6), columns = df.a.iloc[:6].rename(None))
print (df)
      NISN   FullName FirstName LastName  School           E-mail
0  1234567    Joe Doe       Joe      Doe   Klima    [email protected]
1  8901234  Jenny Low     Jenny      Low  Kimcil  [email protected]

If possible some values missing, but always NISN exist for each group use DataFrame.pivot with helper column for distinguish each group by compare a by first value NISN and Series.cumsum:

df['a'] = df['a'].str.strip('# ')
df['idx'] = df['a'].eq('NISN').cumsum()
df = df.pivot(index='idx', columns='a', values='b').reset_index(drop=True)
print (df)
a           E-mail FirstName   FullName LastName     NISN  School
0    [email protected]       Joe    Joe Doe      Doe  1234567   Klima
1  [email protected]     Jenny  Jenny Low      Low  8901234  Kimcil
Sign up to request clarification or add additional context in comments.

9 Comments

hi thank you for answer, i getting error ParserError: Expected 2 fields in line 2737, saw 3. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
@Hendra - added alternative solution for read file.
it turns out that my file is bad. I found some lines that are not properly formatted. how to find non-formatted lines and resolve them?
i dont know, i cant search manually, cuz there 4500 lines. i have added sample bad lines in my post.
hi, sorry i cant share my file. but my problem is solved. thank you @jezrael
|
1

You can iterate over the lines of your file in python and store the relevant data into a dictionary before converting it to a DataFrame

import pandas as pd
from collections import defaultdict

data = defaultdict(list)
with open("file.txt") as my_file:
    for line in my_file:
        line = line.strip("# \n")        # clean up whitespace and # for lines
        if not line:                     # skip empty lines
            continue

        name, value = line.split(":- ") 
        data[name].append(value)
    
df = pd.DataFrame.from_dict(data)

print(df)
      NISN   FullName FirstName LastName  School           E-mail
0  1234567    Joe Doe       Joe      Doe   Klima    [email protected]
1  8901234  Jenny Low     Jenny      Low  Kimcil  [email protected]

4 Comments

hi friend, why i getting error ValueError: not enough values to unpack (expected 2, got 1)
Most likely you have lines in your file that are not representative of your post. Certain lines that don't contain information that should be skipped are not being skipped. Are there any lines that don't contain this format # name:- value that are not simply blank lines? Alternatively, are you sure that there are no pairings that are missing the space e.g. # name:-value?
it turns out that my file is bad. I found some lines that are not properly formatted. how to find non-formatted rows and resolve them?
my problem solved with your code. thank you. you true, i found # name:-value

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.