python regex to read text file and split row to column

Question

My text file format is like below

ID  col_A   col_B   col_C
1   0.26    0.11    0.18
2   0.27    0.12    0.17
3   0.21    0.10    0.15
----------------------------
AVG 0.25    0.11    0.17
----------------------------
ID  col_D   col_E   col_F
1   0.23    0.18    0.20
2   0.24    0.14    0.17
3   0.23    0.10    0.13
----------------------------
AVG 0.23    0.14    0.17
----------------------------

I'm attempting to use python and regex to export two separate csv files with the format like below

Table 1

ID	col_A	col_B	col_C	col_D	col_E	col_F
1	0.26	0.11	0.18	0.23	0.18	0.20
2	0.27	0.12	0.17	0.24	0.14	0.17
3	0.21	0.10	0.15	0.23	0.10	0.13

Table 2

	col_A	col_B	col_C	col_D	col_E	col_F
AVG	0.25	0.11	0.17	0.23	0.14	0.17

Here's my code:

import re
import pandas as pd


with open('test.txt') as file:
        lines = file.readlines()
        regex = r'\A(?P<ID>\S+)\s*(?P<COL_A>\S+)\s*(?P<COL_B>\S+)\s*(?P<COL_C>\S+)'
        data = []
        
        for line in lines:
            m = re.search(regex, line)
            if m != None:
                data.append([m.group(1),m.group(2),m.group(3),m.group(4)])
        
        df = pd.DataFrame(data)
        
df.to_csv('test.csv', index = False)

My code would result in a strange format like

0	1	2	3
ID	col_A	col_B	col_C
1	0.26	0.11	0.18
2	0.27	0.12	0.17
3	0.21	0.10	0.15
------	---------	---------	---------
AVG	0.25	0.11	0.17
------	---------	---------	---------
ID	col_D	col_E	col_F
1	0.23	0.18	0.20
2	0.24	0.14	0.17
3	0.23	0.10	0.13
------	---------	---------	---------
AVG	0.23	0.14	0.17
------	---------	---------	---------

How can I modify my code to achieve my request? Thank you!

Wiktor Stribiżew · Accepted Answer · 2022-01-28 16:13:48Z

1

You can use

import re, itertools
import pandas as pd

data = []
colnames = []
avg = []
start = True
is_next_avg = False
id_start_label = ''

with open('test.txt', 'r') as file:
    for line in file:
        line = line.strip()
        if start and line.replace('-', '').strip():
            id_start_label = line.split()[0]
            colnames.extend(line.split()[1:])
            start = False
        elif '---' in line and not line.replace('-', '').strip():
            if start:
                continue
            is_next_avg = not is_next_avg
        elif is_next_avg:
            avg.extend(line.split()[1:])
            is_next_avg = not is_next_avg
            start = not start
        else:
            data.append(line.split())

colnames.insert(0, id_start_label)
data = sorted(data, key=lambda x: x[0])
new_data = []
for key, g in itertools.groupby(data, key=lambda x: x[0]):
    entry = [key]
    entry.extend( [i for x in g for i in x[1:]] )
    new_data.append(entry)

df = pd.DataFrame(new_data, columns=colnames)
df2 = pd.DataFrame(avg, columns=['AVG'])
df2 = df2.transpose()
df2.columns = df.columns[1:]

Output:

>>> df
  ID col_A col_B col_C col_D col_E col_F
0  1  0.26  0.11  0.18  0.23  0.18  0.20
1  2  0.27  0.12  0.17  0.24  0.14  0.17
2  3  0.21  0.10  0.15  0.23  0.10  0.13

>>> df2
    col_A col_B col_C col_D col_E col_F
AVG  0.25  0.11  0.17  0.23  0.14  0.17

answered Jan 28, 2022 at 16:13

Wiktor Stribiżew

631k41 gold badges502 silver badges632 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Wiktor Stribiżew Over a year ago

Also, tested here. Please let know if you need more clarifications.

AloneTogether · Accepted Answer · 2022-01-28 16:30:43Z

Something like this should work for you:

import re
import pandas as pd
import numpy as np

data = []
headers = []
averages = []
ids = []
with open('/content/text.txt') as file:
  for l in file.readlines():
      s = re.findall(r'\S+', l.rstrip().split(',')[0])
      if '-' not in s[0]:
        if 'ID' in s[0]: headers.append(s)
        elif 'AVG' in s[0]: averages.append(s)
        else: 
          data.append(s[1:])
          ids.append(s[0])

data = np.hstack([data[:len(data)//len(headers)], data[len(data)//len(headers):]])
data = np.concatenate([np.expand_dims(np.transpose(sorted(set(ids))), axis=-1), data], axis=1)
columns = sorted(set(np.concatenate(headers)))
df = pd.DataFrame(data, columns=columns).astype({'ID': 'int64'})
main_df = df.set_index('ID')

Table 1:

   col_A col_B col_C col_D col_E col_F
ID                                    
1   0.26  0.11  0.18  0.23  0.18  0.20
2   0.27  0.12  0.17  0.24  0.14  0.17
3   0.21  0.10  0.15  0.23  0.10  0.13

Table 2:

average_df = pd.DataFrame(np.expand_dims(list(map(float, [d for d in np.concatenate(averages) if 'AVG' not in d])), axis=0), columns=columns[1:])
average_df = average_df.rename_axis('AVG')

     col_A  col_B  col_C  col_D  col_E  col_F
AVG                                          
0     0.25   0.11   0.17   0.23   0.14   0.17

Collectives™ on Stack Overflow

python regex to read text file and split row to column

2 Answers 2

1 Comment

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Related