2

My text file format is like below

ID  col_A   col_B   col_C
1   0.26    0.11    0.18
2   0.27    0.12    0.17
3   0.21    0.10    0.15
----------------------------
AVG 0.25    0.11    0.17
----------------------------
ID  col_D   col_E   col_F
1   0.23    0.18    0.20
2   0.24    0.14    0.17
3   0.23    0.10    0.13
----------------------------
AVG 0.23    0.14    0.17
----------------------------

I'm attempting to use python and regex to export two separate csv files with the format like below

Table 1

ID col_A col_B col_C col_D col_E col_F
1 0.26 0.11 0.18 0.23 0.18 0.20
2 0.27 0.12 0.17 0.24 0.14 0.17
3 0.21 0.10 0.15 0.23 0.10 0.13

Table 2

col_A col_B col_C col_D col_E col_F
AVG 0.25 0.11 0.17 0.23 0.14 0.17

Here's my code:

import re
import pandas as pd


with open('test.txt') as file:
        lines = file.readlines()
        regex = r'\A(?P<ID>\S+)\s*(?P<COL_A>\S+)\s*(?P<COL_B>\S+)\s*(?P<COL_C>\S+)'
        data = []
        
        for line in lines:
            m = re.search(regex, line)
            if m != None:
                data.append([m.group(1),m.group(2),m.group(3),m.group(4)])
        
        df = pd.DataFrame(data)
        
df.to_csv('test.csv', index = False)

My code would result in a strange format like

0 1 2 3
ID col_A col_B col_C
1 0.26 0.11 0.18
2 0.27 0.12 0.17
3 0.21 0.10 0.15
------ --------- --------- ---------
AVG 0.25 0.11 0.17
------ --------- --------- ---------
ID col_D col_E col_F
1 0.23 0.18 0.20
2 0.24 0.14 0.17
3 0.23 0.10 0.13
------ --------- --------- ---------
AVG 0.23 0.14 0.17
------ --------- --------- ---------

How can I modify my code to achieve my request? Thank you!

2 Answers 2

1

You can use

import re, itertools
import pandas as pd

data = []
colnames = []
avg = []
start = True
is_next_avg = False
id_start_label = ''

with open('test.txt', 'r') as file:
    for line in file:
        line = line.strip()
        if start and line.replace('-', '').strip():
            id_start_label = line.split()[0]
            colnames.extend(line.split()[1:])
            start = False
        elif '---' in line and not line.replace('-', '').strip():
            if start:
                continue
            is_next_avg = not is_next_avg
        elif is_next_avg:
            avg.extend(line.split()[1:])
            is_next_avg = not is_next_avg
            start = not start
        else:
            data.append(line.split())

colnames.insert(0, id_start_label)
data = sorted(data, key=lambda x: x[0])
new_data = []
for key, g in itertools.groupby(data, key=lambda x: x[0]):
    entry = [key]
    entry.extend( [i for x in g for i in x[1:]] )
    new_data.append(entry)

df = pd.DataFrame(new_data, columns=colnames)
df2 = pd.DataFrame(avg, columns=['AVG'])
df2 = df2.transpose()
df2.columns = df.columns[1:]

Output:

>>> df
  ID col_A col_B col_C col_D col_E col_F
0  1  0.26  0.11  0.18  0.23  0.18  0.20
1  2  0.27  0.12  0.17  0.24  0.14  0.17
2  3  0.21  0.10  0.15  0.23  0.10  0.13

>>> df2
    col_A col_B col_C col_D col_E col_F
AVG  0.25  0.11  0.17  0.23  0.14  0.17
Sign up to request clarification or add additional context in comments.

1 Comment

Also, tested here. Please let know if you need more clarifications.
1

Something like this should work for you:

import re
import pandas as pd
import numpy as np

data = []
headers = []
averages = []
ids = []
with open('/content/text.txt') as file:
  for l in file.readlines():
      s = re.findall(r'\S+', l.rstrip().split(',')[0])
      if '-' not in s[0]:
        if 'ID' in s[0]: headers.append(s)
        elif 'AVG' in s[0]: averages.append(s)
        else: 
          data.append(s[1:])
          ids.append(s[0])

data = np.hstack([data[:len(data)//len(headers)], data[len(data)//len(headers):]])
data = np.concatenate([np.expand_dims(np.transpose(sorted(set(ids))), axis=-1), data], axis=1)
columns = sorted(set(np.concatenate(headers)))
df = pd.DataFrame(data, columns=columns).astype({'ID': 'int64'})
main_df = df.set_index('ID')

Table 1:

   col_A col_B col_C col_D col_E col_F
ID                                    
1   0.26  0.11  0.18  0.23  0.18  0.20
2   0.27  0.12  0.17  0.24  0.14  0.17
3   0.21  0.10  0.15  0.23  0.10  0.13

Table 2:

average_df = pd.DataFrame(np.expand_dims(list(map(float, [d for d in np.concatenate(averages) if 'AVG' not in d])), axis=0), columns=columns[1:])
average_df = average_df.rename_axis('AVG')
     col_A  col_B  col_C  col_D  col_E  col_F
AVG                                          
0     0.25   0.11   0.17   0.23   0.14   0.17

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.