- Find the rows that contain a message separator line, e.g. '-----' or '======'.
- Then use np.where(cond, 1, 0).cumsum() to tag every separate message.
- Keep only the lines without '-----' or '======'.
- Group by tag, join each group with the separator '\n', then use str.split to expand the columns.
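The steps above can be tried on a tiny in-memory example first. This is only a sketch: the sample lines and the names `lines`, `is_sep`, `tag` and `messages` are made up for illustration and are not from the original file.

```python
import numpy as np
import pandas as pd

# Hypothetical sample lines, not taken from the original file.
lines = pd.Series([
    "=====",
    "Title: demo",
    "=====",
    "1: Aki",
    "first message",
    "-----",
    "2: Kano",
    "second message",
    "-----",
])

# True on separator rows, False elsewhere.
is_sep = lines.str.contains("-----|=====")

# Every separator bumps the running count, so a separator and the
# rows that follow it (up to the next separator) share one tag.
tag = pd.Series(np.where(is_sep, 1, 0).cumsum(), index=lines.index)

# Drop the separator rows and join each remaining group into one string.
messages = lines[~is_sep].groupby(tag[~is_sep]).apply("\n".join)
print(messages)
```

Tag 1 ends up holding the header block, which is why the full code only keeps tags from 2 upwards.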
import numpy as np
import pandas as pd

# read the file as a single column, one row per line
df = pd.read_csv(file, sep='\n', header=None)
# locate the rows that contain ----- or ======
cond = df[0].str.contains('-----|======')
df['tag'] = np.where(cond, 1, 0).cumsum()
# keep only the message lines: tag 1 is the header block, so start from tag 2
cond2 = df['tag'] >= 2
dfn = df[(~cond & cond2)].copy()
# build the output: one row per message
df_output = (dfn.groupby('tag')[0]
                .apply('\n'.join)
                .str.split('\n', n=2, expand=True))
df_output.columns = ['name', 'date', 'Description']
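If your pandas version rejects sep='\n' in read_csv (newer releases may raise a ValueError for it), one possible workaround is to read the lines with plain Python and build the single-column frame yourself. This is a sketch under that assumption: the `io.StringIO` object stands in for the real file, and `fh` and `rows` are illustrative names.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file; in practice use open(path).
fh = io.StringIO("=====\nTitle: demo\n\n=====\n1: Aki\n")

# One row per non-empty line (read_csv also skips blank lines by default).
rows = [line.rstrip("\n") for line in fh if line.strip()]

# A single column labelled 0, so the rest of the code can keep using df[0].
df = pd.DataFrame({0: rows})
print(df)
```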
output:
              name                            date  \
tag
2.0         1: Aki       2018/12/05 (Wed) 17:33:17
3.0         2: Aki  2018/12/06 (Thursday) 17:14:30
4.0  3: kano, etc.    2018/12/07 (Friday) 11:44:45

                                           Description
tag
2.0                  An approval notice has been sent.
3.0  I was notified by Mr. Id, the agent of the oth...
4.0                                  Please call rito.
df:
                                                    0  tag
0   ==============================================...    1
1                                   Title: Whole case    1
2                                     Location: oyuri    1
3                                           From: Aki    1
4                  Date: 2018/11/30 (Friday) 11:55:29    1
5   ==============================================...    2
6                                              1: Aki    2
7                           2018/12/05 (Wed) 17:33:17    2
8                   An approval notice has been sent.    2
9   ----------------------------------------------...    3
10                                             2: Aki    3
11                     2018/12/06 (Thursday) 17:14:30    3
12  I was notified by Mr. Id, the agent of the oth...    3
13  ----------------------------------------------...    4
14                                      3: kano, etc.    4
15                       2018/12/07 (Friday) 11:44:45    4
16                                  Please call rito.    4
17  ----------------------------------------------...    5
You can then clean up the name column:
obj = df_output['name'].str.strip().str.split(r':\s*')
df_output['name'] = obj.str[-1]
df_output['idx'] = obj.str[0]
df_output = df_output.set_index('idx')
           name                            date  \
idx
1           Aki       2018/12/05 (Wed) 17:33:17
2           Aki  2018/12/06 (Thursday) 17:14:30
3    kano, etc.    2018/12/07 (Friday) 11:44:45

                                           Description
idx
1                    An approval notice has been sent.
2    I was notified by Mr. Id, the agent of the oth...
3                                    Please call rito.
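As an alternative to splitting on the colon, the index and the name could be pulled out in one step with str.extract and named groups. This is a sketch only, not part of the approach above: the sample values are copied from the output, and the regular expression and the names `names` and `parts` are assumptions for illustration.

```python
import pandas as pd

# Sample values copied from the 'name' column shown above.
names = pd.Series(["1: Aki", "2: Aki", "3: kano, etc."])

# Named groups become the column names of the extracted frame:
# digits before the colon go to 'idx', the trimmed rest goes to 'name'.
parts = names.str.extract(r"^\s*(?P<idx>\d+):\s*(?P<name>.*?)\s*$")
print(parts)
```

Only the first colon is treated as the delimiter here, so a name that itself contains a colon would stay in one piece.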
To add the header fields as extra columns:
cond = (df['tag'] == 1) & (df[0].str.contains(':'))
header_dict = dict(df.loc[cond, 0].str.split(': ', n=1).values)
# {'Title': 'Whole case',
# 'Location': 'oyuri',
# 'From': 'Aki ',
# 'Date': '2018/11/30 (Friday) 11:55:29'}
for k, v in header_dict.items():
    df_output[k] = v
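The printed dictionary shows a trailing space in the 'From' value ('Aki '). If that matters, the keys and values can be stripped while the dictionary is built. A minimal sketch, using made-up header lines modelled on the block with tag 1; `header` and `pairs` are illustrative names.

```python
import pandas as pd

# Hypothetical header lines, modelled on the rows with tag 1 above.
header = pd.Series([
    "Title: Whole case",
    "From: Aki ",
    "Date: 2018/11/30 (Friday) 11:55:29",
])

# Split on the first ': ' only, so the colons inside the time survive.
pairs = header.str.split(": ", n=1)

# Strip stray whitespace from both sides of every key and value.
header_dict = {key.strip(): value.strip() for key, value in pairs}
print(header_dict)
```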