I want to convert a text file to a CSV file with the columns name, date, and Description. I'm new to Python and can't find a proper way to do this. Can someone guide me? Below is the sample text file.

================================================== ====
Title: Whole case
Location: oyuri
From: Aki 
Date: 2018/11/30 (Friday) 11:55:29
================================================== =====
1: Aki 
2018/12/05 (Wed) 17:33:17
An approval notice has been sent.
-------------------------------------------------- ------------------
2: Aki
2018/12/06 (Thursday) 17:14:30
I was notified by Mr. Id, the agent of the other party.

-------------------------------------------------- ------------------
3: kano, etc.
2018/12/07 (Friday) 11:44:45
Please call rito.
-------------------------------------------------- ------------------

2 Answers

  1. Find the rows that contain a message separator line, e.g. '-----' or '======'.
  2. Use np.where(cond, 1, 0).cumsum() to tag each separate message.
  3. Keep only the lines without '-----' or '======'.
  4. Group by tag and join with the separator '\n', then use str.split to expand the result into columns.

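import numpy as np
import pandas as pd

# read the file into a single column, one row per non-blank line
# (recent pandas versions may reject pd.read_csv(file, sep='\n'))
with open(file) as f:
    df = pd.DataFrame([line.rstrip('\n') for line in f if line.strip()])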

# locate the rows that contain ----- or ======
cond = df[0].str.contains('-----|======')
df['tag'] = np.where(cond, 1, 0).cumsum()

# keep only the message lines (tag 1 is the file header)
cond2 = df['tag'] >=2
dfn = df[(~cond & cond2)].copy()

# output
df_output = (dfn.groupby('tag')[0]
            .apply('\n'.join)
            .str.split('\n', n=2, expand=True))
df_output.columns = ['name', 'date', 'Description']

output:

              name                            date  \
tag                                                  
2.0        1: Aki        2018/12/05 (Wed) 17:33:17   
3.0         2: Aki  2018/12/06 (Thursday) 17:14:30   
4.0  3: kano, etc.    2018/12/07 (Friday) 11:44:45   

                                           Description  
tag                                                     
2.0                  An approval notice has been sent.  
3.0  I was notified by Mr. Id, the agent of the oth...  
4.0                                  Please call rito.  

df:

                                                    0  tag
0   ==============================================...    1
1                                   Title: Whole case    1
2                                     Location: oyuri    1
3                                          From: Aki     1
4                  Date: 2018/11/30 (Friday) 11:55:29    1
5   ==============================================...    2
6                                             1: Aki     2
7                           2018/12/05 (Wed) 17:33:17    2
8                   An approval notice has been sent.    2
9   ----------------------------------------------...    3
10                                             2: Aki    3
11                     2018/12/06 (Thursday) 17:14:30    3
12  I was notified by Mr. Id, the agent of the oth...    3
13  ----------------------------------------------...    4
14                                      3: kano, etc.    4
15                       2018/12/07 (Friday) 11:44:45    4
16                                  Please call rito.    4
17  ----------------------------------------------...    5
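
The tagging step is the least obvious part, so here is a minimal sketch of it on a toy series. The input values and the names `lines`, `is_sep`, `tag` and `blocks` are made up for illustration and are not part of the code above.

```python
import numpy as np
import pandas as pd

# Toy input: separator lines mark where a new block starts.
lines = pd.Series(["=====", "header", "=====", "a", "b", "-----", "c", "-----"])

# True on separator rows, False elsewhere.
is_sep = lines.str.contains("-----|=====")

# Every separator bumps the running total, so all rows between
# two separators end up with the same tag.
tag = pd.Series(np.where(is_sep, 1, 0).cumsum(), index=lines.index)

# Drop the separators themselves and collect the lines of each block.
blocks = lines[~is_sep].groupby(tag[~is_sep]).apply(list).to_dict()
print(blocks)
```

Tag 1 is the header block here, which is why the answer keeps only tags of 2 and above.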

You can then split the name column into an index and a name:

obj = df_output['name'].str.strip().str.split(r':\s*')
df_output['name'] = obj.str[-1]
df_output['idx'] = obj.str[0]
df_output = df_output.set_index('idx')

output:

           name                            date  \
idx                                               
1           Aki       2018/12/05 (Wed) 17:33:17   
2           Aki  2018/12/06 (Thursday) 17:14:30   
3    kano, etc.    2018/12/07 (Friday) 11:44:45   

                                           Description  
idx                                                     
1                    An approval notice has been sent.  
2    I was notified by Mr. Id, the agent of the oth...  
3                                    Please call rito.
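
If the date column should hold real timestamps rather than strings, one possible approach is to strip the parenthesised weekday and parse what remains. This is only a sketch: the `dates` series stands in for `df_output['date']`, and the format string assumes every row looks like the sample.

```python
import pandas as pd

# Stand-in for df_output['date'], using two values from the sample file.
dates = pd.Series(["2018/12/05 (Wed) 17:33:17", "2018/12/06 (Thursday) 17:14:30"])

# Remove the weekday in parentheses, e.g. " (Wed)", then parse the rest.
cleaned = dates.str.replace(r"\s*\([^)]*\)", "", regex=True)
parsed = pd.to_datetime(cleaned, format="%Y/%m/%d %H:%M:%S")
print(parsed)
```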

To add the header fields as extra columns:

cond = (df['tag'] == 1) & (df[0].str.contains(':'))
header_dict = dict(df.loc[cond, 0].str.split(': ', n=1).values)

    # {'Title': 'Whole case',
    #  'Location': 'oyuri',
    #  'From': 'Aki ',
    #  'Date': '2018/11/30 (Friday) 11:55:29'}

for k,v in header_dict.items():
    df_output[k] = v
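
Since the question asks for a CSV file, the last step would be to_csv. A small sketch follows; the frame below is a hand-built stand-in for df_output with sample values only.

```python
import pandas as pd

# Stand-in for df_output with the columns built above (sample values only).
df_output = pd.DataFrame(
    {
        "name": ["Aki", "kano, etc."],
        "date": ["2018/12/05 (Wed) 17:33:17", "2018/12/07 (Friday) 11:44:45"],
        "Description": ["An approval notice has been sent.", "Please call rito."],
    },
    index=pd.Index(["1", "3"], name="idx"),
)

# With no path, to_csv returns the CSV text; pass a filename to write a file.
csv_text = df_output.to_csv()
print(csv_text)
```

Values that contain a comma, such as "kano, etc.", are quoted automatically, so they stay in one column.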

5 Comments

Thank you for the answer, Ferris. :) It is working completely fine. What if we want to add the header content as well? That is, From (i.e. Aki) in the name column, Date in the date column, and the rest in Description.
I'm trying to do that. Thanks a lot, Ferris.
I am trying to make the index and the name separate columns, but I'm not able to do that using the above code.
You can post a new question and describe the details.
Yes, sure, I will do that. Thank you, Ferris.

I outline below a very simplistic approach to achieving your task. The general idea is to:

  1. Read in your text file using open()
  2. Split the text into a list
  3. Isolate the information in each element of the list
  4. Export the information to a csv using pandas

I would recommend using Jupyter Notebooks to get a better idea of what I have done here.

import pandas as pd

# open file and extract text
text_path = 'text.txt'
with open(text_path) as f:
    text = f.read()

# split text into a list
lines = text.split('\n')

# remove the heading (the first 6 lines of the sample file)
len_heading = 6
lines = lines[len_heading:]

# separate information using divider
divider = '-----'
data = []
start = 0
for i, line in enumerate(lines):
    
    # add elements to data if divider found
    if line.startswith(divider):
        data.append(lines[start:i])
        start = i+1

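# extract name, date and description from data
names, dates, description = [], [], []
for info in data:
    
    # this is a very simplistic approach, please add checks
    # to make sure you are getting the right data
    info = [line for line in info if line.strip()]  # drop blank lines
    if len(info) < 3:
        continue  # skip incomplete records
    name = info[0].split(':', 1)[-1].strip()  # text after "N:"
    date = info[1][:10]                       # YYYY/MM/DD part only
    desc = ' '.join(info[2:])                 # may span several lines
    
    names.append(name)
    dates.append(date)
    description.append(desc)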

# create pandas dataframe
df = pd.DataFrame({'name': names, 'date': dates, 'description': description})

# export dataframe to csv
df.to_csv('converted_text.csv', index=False)

You should get a CSV file that looks like this.

[image: screenshot of the resulting CSV file]
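
For files where the separators or blank lines are less regular, a somewhat more defensive variant can be sketched with only the standard library. The function name `parse_messages`, the separator pattern and the inline `sample` text are assumptions for illustration; adjust them to the real file.

```python
import csv
import io
import re

sample = """\
======================================================
Title: Whole case
Location: oyuri
From: Aki
Date: 2018/11/30 (Friday) 11:55:29
=======================================================
1: Aki
2018/12/05 (Wed) 17:33:17
An approval notice has been sent.
--------------------------------------------------------------------
2: Aki
2018/12/06 (Thursday) 17:14:30
I was notified by Mr. Id, the agent of the other party.

--------------------------------------------------------------------
3: kano, etc.
2018/12/07 (Friday) 11:44:45
Please call rito.
--------------------------------------------------------------------
"""


def parse_messages(text):
    """Return a (name, date, description) tuple for each message block."""
    # Split on lines that start with five or more '-' or '=' characters
    # and contain nothing else except further '-', '=' or spaces.
    blocks = re.split(r"(?m)^[-=]{5,}[-= ]*$", text)
    rows = []
    for block in blocks:
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        # A message block starts with "N: name"; this skips the header block.
        if len(lines) < 3 or not re.match(r"\d+:", lines[0]):
            continue
        name = lines[0].split(":", 1)[1].strip()
        rows.append((name, lines[1], " ".join(lines[2:])))
    return rows


rows = parse_messages(sample)

# Write to an in-memory buffer; use open(path, "w", newline="") for a file.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "date", "description"])
writer.writerows(rows)
print(buf.getvalue())
```

Joining everything after the date line keeps multi-line descriptions intact, and skipping blocks that do not start with a number avoids indexing errors on the header or on stray blank sections.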

1 Comment

Thank you for the help, ALS777. It is partially working. Working: the data is now separated into columns. Not working: 1. the description column is empty; 2. some of the entries are missing from the above.
