Converting a collection of txt files into ONE CSV file by using Python

Question

I have a folder with multiple txt files. Each file contains information about a client of my friend's business, which he entered manually from a hard-copy document. This information can include e-mails, addresses, requestID, etc. Each time he gets a new client, he creates a new txt file in that folder.

Using Python, I want to create a CSV file that contains all the information about all clients from the txt files so that I can open it in Excel. The content of each file looks like this:

Date:24/02/2021
Email:*****@gmail.com
Product:Hard Drives
Type:Sandisk
Size:128GB

Some files have additional information, and each file is labeled by an ID (which is the name of the txt file).

What I'm thinking of is to have the code create a dictionary for each file, named after the txt file. The field names (Date, Email, Product, etc.) will be the keys, and the entries will be the values. Keep in mind that not all files have the same number of fields, as some files have more or less data than others. This collection of dicts would then be converted into one CSV file that, when opened in Excel, should look like this:

FileID	Date	Email	Address	Product	Type	Color	Size
01-2021	02-01-2021		Hard Drive	SanDisk		128GB

Is this a good way to achieve this goal, or is there a shorter and more effective one?

This code by @dukkee seems to logically fulfill the task required:

import os

import pandas as pd

FOLDER_PATH = "folder_path"

raw_data = []

for filename in os.listdir(FOLDER_PATH):
    with open(os.path.join(FOLDER_PATH, filename)) as fp:
        file_data = dict(line.split(":", 1) for line in fp if line)
        file_data["FileID"] = filename

    raw_data.append(file_data)


frame = pd.DataFrame(raw_data)
frame.to_csv("output.csv", index=False)

However, it keeps showing me this error:

The following code by @dm2 should also work, but it also shows me an error that I couldn't figure out:

import pandas as pd
import os

files = os.listdir('test/')

df_list = [pd.read_csv(f'test/{file}', sep = ':', header = None).set_index(0).T for file in files]
df_out = pd.concat(df_list)
# to reindex by filename
df_out.index = [file.strip('.txt') for file in files]

I made sure that none of the txt files has empty lines, but this didn't resolve these errors.

I would read each text file into a dictionary ({"Product": "Hard Drives", ...}) and then create a list of these dictionaries. Then you can create a data frame (pandas.DataFrame(list_of_dicts)) and save it to CSV. It's OK to have more data in some dictionaries, but keys for the same field will have to be identical across dictionaries. E.g., you can make all of the keys lower-case to reduce the chance that keys differ.
– jkr, Commented Mar 26, 2021 at 13:47
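
The list-of-dictionaries idea in this comment can be sketched with only the standard library's csv module instead of pandas. This is a minimal illustration, not code from any of the posters: the function name write_records is hypothetical, and lower-casing plus stripping the keys is just one possible way to make keys for the same field match.

```python
import csv


def write_records(records, out_path):
    """Write a list of dicts with possibly different keys to one CSV file.

    Hypothetical helper for illustration; returns the column names used.
    """
    # Normalise keys so that "Email", "email " and "EMAIL" land in one column.
    normalised = [
        {key.strip().lower(): value for key, value in record.items()}
        for record in records
    ]

    # Collect the union of all keys, keeping first-seen order.
    fieldnames = []
    for record in normalised:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    # newline="" is what the csv module documentation asks for on output files.
    with open(out_path, "w", newline="", encoding="utf-8") as fp:
        # restval fills in the cell when a record lacks one of the columns.
        writer = csv.DictWriter(fp, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(normalised)
    return fieldnames
```

Calling write_records(list_of_dicts, "output.csv") with one dictionary per txt file would then produce a single CSV in which fields missing from a file are left as empty cells.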

dm2 · Accepted Answer · 2021-03-26 14:04:25Z

1

You can actually read these files into pandas DataFrames and then concatenate them into one single DataFrame.

I've made a test folder with 5 slightly different test files (named '1.txt', '2.txt', ...).

Code:

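import pandas as pd
import os

files = os.listdir('test/')

# Each file becomes a one-row DataFrame: column 0 holds the field names,
# so it is set as the index and then transposed into column headers.
df_list = [pd.read_csv(f'test/{file}', sep = ':', header = None).set_index(0).T for file in files]
df_out = pd.concat(df_list)
# To reindex by filename: os.path.splitext() removes only the extension,
# whereas str.strip('.txt') would also remove any leading or trailing
# '.', 't' or 'x' characters from the name itself.
df_out.index = [os.path.splitext(file)[0] for file in files]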

df_out:

0        Date            Email      Product     Type   Size  Size2    Type2   Test
1  24/02/2021  *****@gmail.com  Hard Drives  Sandisk  128GB  128GB      NaN   NaN  
2  24/02/2021  *****@gmail.com  Hard Drives  Sandisk  128GB    NaN  Sandisk   NaN  
3  24/02/2021  *****@gmail.com  Hard Drives  Sandisk  128GB    NaN      NaN   Test  
4  24/02/2021  *****@gmail.com  Hard Drives  Sandisk  128GB    NaN      NaN   2  
5  24/02/2021  *****@gmail.com  Hard Drives  Sandisk  128GB    NaN      NaN   NaN

answered Mar 26, 2021 at 14:04


2 Comments

AhmedA Over a year ago

This code should work fine, but I keep getting the error shown in my main post.

dm2 Over a year ago

@AhmedA it would seem that in that file there's an entry in the format 'Feature:Entry:ExtraInfo' (e.g. 'Product:SanDisk:HardDrive'), i.e. pandas gets 3 columns where it expects to see only 2. As long as there's only one ':' symbol per line, it should work. You could try adding error_bad_lines = False to pd.read_csv() (on_bad_lines = 'skip' in pandas 1.3 and later, where error_bad_lines is deprecated), but this will skip that line entirely (it will be missing in the final DataFrame), or locate the problematic file and see if you can change ':' to something else.
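
If skipping the line or editing the file is not desirable, one alternative is to parse the lines by hand and split on the first ':' only, so that a value such as 'SanDisk:HardDrive' stays in one piece. The sketch below assumes the 'Key:Value' layout from the question; the function name parse_client_text is made up for this example.

```python
def parse_client_text(text):
    """Turn 'Key:Value' lines into a dict, splitting on the first ':' only.

    Hypothetical helper for illustration.
    """
    record = {}
    for line in text.splitlines():
        # partition() always returns three parts, so a value that itself
        # contains ':' (for example a time or a URL) is not split further.
        key, sep, value = line.partition(":")
        if not sep or not key.strip():
            # Blank line, or no usable key before a ':': nothing to store.
            continue
        record[key.strip()] = value.strip()
    return record
```

With this approach, 'Product:SanDisk:HardDrive' yields the key 'Product' with the value 'SanDisk:HardDrive' instead of raising an error or being dropped.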

dukkee · Accepted Answer · 2021-03-26 20:03:18Z

1

You can use something like this:

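import os

import pandas as pd

FOLDER_PATH = "folder_path"

raw_data = []

for filename in os.listdir(FOLDER_PATH):
    # errors="ignore" silently drops any bytes that cannot be decoded.
    with open(os.path.join(FOLDER_PATH, filename), errors="ignore") as fp:
        # Split on the first ":" only and strip the trailing newline from
        # each part; lines without a ":" (including blank ones) are skipped.
        file_data = dict(
            [part.strip() for part in line.split(":", 1)]
            for line in fp
            if ":" in line
        )
        file_data["FileID"] = filename

    raw_data.append(file_data)


# Keys that are missing from a file become empty cells in the CSV.
frame = pd.DataFrame(raw_data)
frame.to_csv("output.csv", index=False)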

edited Mar 26, 2021 at 20:03

answered Mar 26, 2021 at 13:53


1 Comment

AhmedA Over a year ago

Do you think the error shown in the main post might be related to the encoding type of the txt files?
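
The error itself is not reproduced in the post, so whether encoding is the cause cannot be confirmed here. If it is, one way to check is to read each file with an explicit encoding and fall back to another one when decoding fails, rather than discarding bytes with errors="ignore". The following is a rough sketch under that assumption: the function name read_text_with_fallback and the choice of "utf-8-sig" and "cp1252" as candidate encodings are guesses for illustration, not something stated in the thread.

```python
def read_text_with_fallback(path, encodings=("utf-8-sig", "cp1252")):
    """Return the file's text, trying each candidate encoding in turn.

    Hypothetical helper; the default encodings are assumptions.
    """
    for encoding in encodings:
        try:
            with open(path, encoding=encoding) as fp:
                return fp.read()
        except UnicodeDecodeError:
            # This encoding does not fit the file; try the next candidate.
            continue
    # Last resort: latin-1 maps every byte to a character, so it never
    # fails, although some characters may come out wrong.
    with open(path, encoding="latin-1") as fp:
        return fp.read()
```

"utf-8-sig" also removes a byte order mark at the start of the file, which some Windows editors add and which would otherwise end up inside the first key.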
