
I am trying to read performance reports from multiple machines and would like to parse and combine them so I can easily compare machine performance on single plots. Once the reports are divided into multiple CSVs, I plan on reading them with pd.read_csv() and combining the data from multiple tools into single DataFrames.
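For context, the combining step I have in mind would look roughly like this (just a sketch; the per-section file names and the added machine column are placeholders for files this script still has to produce):

    import glob

    import pandas as pd

    # Sketch: read every per-machine "Jam Profile" csv produced by the split
    # and stack them into one DataFrame, tagging each row with its source file
    # so the machines can be compared on a single plot.
    frames = []
    for path in glob.glob('output/*Jam Profile*.csv'):
        df = pd.read_csv(path, sep=';')
        df['machine'] = path  # placeholder: the machine name would be parsed from the file name
        frames.append(df)
    jam_profile = pd.concat(frames, ignore_index=True)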

But in order to do that, I must first deal with and split up rather ugly CSV files with semicolon separators.
The structure of the CSV is like this:

KEYWORD_01;;;...;;
COL_01;COL_02;COL03;...;COL_n;
Line_1;
Line_2;
Line_3;
...
Line_m;
KEYWORD_02;;;...;;
COL_01;COL_02;COL03;...;COL_x;
Line_1;
Line_2;
Line_3;
...
Line_y;
KEYWORD_03;;;...;;
COL_01;COL_02;COL03;...;COL_f;
Line_1;
Line_2;
Line_3;
...
Line_g;

Data csv file available here

The CSV report is made of multiple sections, each beginning with a fixed keyword (or key phrase). Each section has a set number of columns (which can vary from section to section) and a dynamic number of rows, depending on the number of events reported (see the structure above).

  1. I create a list with all my keywords called tpm_sections

    tpm_sections = ['Summary of time consumption',
        'Equipment Indicators',
        'Batch Profile',
        'Jam Profile',
        'Jam Time Profile',
        'Jam Table',
        'Handler Model profile',
        'Miscellaneous Indicators ',
        'Tape Job Profile ']
    tpm_idx = [None]*len(tpm_sections)
    
  2. I read my CSV and use regex to match any element of my tpm_sections list against the rows of the file, and I use enumerate so I can store the row index of each match in a separate list, tpm_idx:

import csv
import os
import re
from datetime import datetime

for file in os.listdir(input_folder):
    input_file = os.path.join(input_folder, file)
    if file.endswith('.csv'):
        # get TPM report date from the file creation timestamp
        tpm_date = datetime.fromtimestamp(os.path.getctime(input_file)).strftime('%Y%m%d')
        with open(input_file, "r") as f:
            reader = csv.reader(f, delimiter=";")
            for i, row in enumerate(reader):
                if 'Machine' in row:
                    # machine name in the report looks like \\7icostNN; strip the leading backslashes
                    mcpat = re.compile(r'\\\\7icost\d\d')
                    mcline = str(row[1])
                    mcname = mcpat.match(mcline).group(0)[2:]
                    mcid = mcname[6:]
                    print('Report date is: ' + tpm_date + '\nMachine Name: ' + mcname + '\nMachine ID: ' + mcid)
                for j in range(len(tpm_sections)):
                    if tpm_sections[j] in row:
                        tpm_idx[j] = i
                        print('Section ' + tpm_sections[j] + ' starts at line: ' + str(tpm_idx[j]))
            # tpm_idx_names: short names for the sections, defined elsewhere
            tpm_dict = {tpm_idx_names[i]: tpm_idx[i] for i in range(len(tpm_idx))}
  1. I now have a list of keywords, a list of matching row indexes and a dictionary linking the two. How should I proceed with splitting the CSV file? Below is my draft code to write one CSV file per section of my reader object for a future pandas import (and, optionally, create sub-folders per section for more structure):

    # note: a csv.reader cannot be sliced, so the rows would first need to be
    # collected into a list, e.g. rows = list(reader), while reading the file
    for j in range(len(tpm_idx_names)):
        output_file = tpm_date + mcname + tpm_idx_names[j]
        with open(output_file, 'w', newline='') as o:
            if j + 1 < len(tpm_idx):
                for line in rows[tpm_idx[j]:tpm_idx[j + 1]]:
                    o.write(';'.join(line) + '\n')
            else:
                for line in rows[tpm_idx[j]:]:
                    o.write(';'.join(line) + '\n')
    
  2. Is there a simpler method of doing this by passing a list of keywords to the split() function? That would be awesome, but I couldn't find any example of this being possible. Or by making better use of regex and then a while "line is not empty" loop? Bear in mind that an empty line in my CSV is made of ;;;;;
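Something along these lines is what I have in mind, if it is even possible (untested sketch; input_file and tpm_sections come from point 2 above, and the capturing group is there so each keyword stays attached to its section):

    import re

    # Untested idea: split the whole file text on any of the section keywords.
    # The capturing group keeps each keyword in the result so it can be paired
    # with the section text that follows it.
    with open(input_file) as f:
        text = f.read()
    pattern = '|'.join(re.escape(k) for k in tpm_sections)
    chunks = re.split(f'({pattern})', text)
    # chunks = [preamble, keyword_1, section_1_text, keyword_2, section_2_text, ...]
    sections = dict(zip(chunks[1::2], chunks[2::2]))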

  3. Should I instead create a list of lists (LoL) using list.append(), or a numpy array, for every line in between matched tpm_sections[j] keywords? Then I could easily add columns for my machine name and ID. I could either create a single LoL/array appending all of my 20 machines, or create one per machine and append them later, either in pandas or before writing my CSV. Code example to add in part 2:

elif tpm_sections[j] in row: TPM_LoL[j].append(row)
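If the LoL route is the better one, I picture it roughly like this (rough sketch; tpm_sections, input_file, mcname and mcid come from the code in point 2, and I'm assuming the line right after each keyword holds that section's column names):

    import pandas as pd

    # Rough sketch: collect the raw rows of each section into its own list,
    # then turn every list into a DataFrame with the machine added as columns.
    section_rows = {name: [] for name in tpm_sections}
    current = None
    with open(input_file) as f:
        for line in f:
            parts = line.strip().split(';')
            if parts[0] in tpm_sections:          # keyword line starts a new section
                current = parts[0]
            elif current is not None and any(parts):  # skip empty ;;;;; lines
                section_rows[current].append(parts)

    section_dfs = {}
    for name, rows in section_rows.items():
        if rows:
            # first collected row is assumed to hold the section's column names
            df = pd.DataFrame(rows[1:], columns=rows[0])
            df['machine_name'] = mcname
            df['machine_id'] = mcid
            section_dfs[name] = df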
  • Please post a testable sample of the data with the expected output, not like this ... Commented Dec 1, 2020 at 3:14
  • Hi, thanks for your advice. I'm rather new to posting on this forum and to coding in general, so I don't always know what the best approach is. That being said, I wanted to keep the topic general so others could benefit from it; the csv file has a really ugly structure, and I've tried to trim it down to a simple example, but even that takes up a whole lot of space... is it ok if I attach a link to it? Commented Dec 1, 2020 at 4:55
  • It's OK to attach a link, but it's still a good idea to provide a sample of the data that illustrates the problem - that way, people don't waste their time answering a question that becomes less meaningful if the example is no longer available (regardless of your intentions to keep it available) Commented Dec 1, 2020 at 6:37
  • It's unclear which sections of the source .csv you're actually interested in. Your description in the question suggests the format is fairly regular, but there are many lines in the data that would likely need to be ignored, and the format of the various sections does not appear to be identical. You've said which sections you want by keyword, but which parts of those sections do you then need to end up in the output? Commented Dec 1, 2020 at 6:43
  • @Grismar thanks for clearing things up. Yes, the format is quite ugly; I also attached the html version, which can help visualize it, but mainly the sections in the report are delimited by the keywords in my tpm_sections list in point 1. After the section name comes a line with the table columns and then a variable number of rows of data. Each section of the report has its own set of columns, which makes it worse. If we zoom out though: I'd like to separate a csv into sections delimited by multiple keywords, where the end of one section is the start of another. I'll clean up later! Thanks! Commented Dec 1, 2020 at 17:31

1 Answer


I think you're making things harder by breaking up the problem into too many smaller problems. Extracting only the data you need from the original HTML (which is a structured data format of sorts as well) would probably have been easiest.
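For example, if the tables in the HTML report are plain table elements, pandas can read them directly (sketch; the .html file name is a guess based on the csv name):

    import pandas as pd

    # read_html returns one DataFrame per <table> element found in the page;
    # the sections could then be picked out by position or by inspecting headers.
    tables = pd.read_html('ICOST_19_TPM_20201124.html')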

However, if you're looking for a way to:

  • split up an existing text file into multiple text files
  • split just before a keyword line
  • only write output for selected keywords

And assuming the text file is a semi-colon-separated file for which any line that only has a single term in the first column is a keyword line, then this should work:

tpm_sections = [
    'Summary of time consumption',
    'Equipment Indicators',
    'Batch Profile',
    'Jam Profile',
    'Jam Time Profile',
    'Jam Table',
    'Handler Model profile',
    'Miscellaneous Indicators ',
    'Tape Job Profile '
]
out_f = None
with open('ICOST_19_TPM_20201124.csv') as f:
    for line in f:
        parts = line.strip().split(';')
        if parts[0] and (parts[1:].count('') == len(parts) - 1):
            # new keyword line, close previous file if any
            if out_f is not None:
                out_f.close()
            if parts[0] in tpm_sections:
                # naming the new file after the section
                out_f = open(f'{parts[0]}.csv', 'w')
            else:
                out_f = None
        # for any line, if an output file is open at this point, write to it
        if out_f is not None:
            out_f.write(line)
    else:
        if out_f is not None:
            out_f.close()

If you don't want to recognise each line with only one value in the first column as a keyword line, but only want lines that have a recognised keyword to cause a split (and include everything after it in that file), you can simply change this:

        if parts[0] and (parts[1:].count('') == len(parts) - 1):
            # new keyword line, close previous file if any
            if out_f is not None:
                out_f.close()
            if parts[0] in tpm_sections:
                # naming the new file after the section
                out_f = open(f'{parts[0]}.csv', 'w')
            else:
                out_f = None

To:

        if (parts[0] and (parts[1:].count('') == len(parts) - 1) and
            (parts[0] in tpm_sections)):
            # new keyword line, close previous file if any
            if out_f is not None:
                out_f.close()
            out_f = open(f'{parts[0]}.csv', 'w')

But it's not entirely clear from the question or the data which it should be. Either does as advertised.
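As a side note, once the per-section files exist, reading one back for your pandas comparison is straightforward, something like this (each generated file starts with the keyword line, then the column header line):

    import pandas as pd

    # the first line of each generated file is the keyword line, so skip it and
    # let pandas take the column names from the second line
    df = pd.read_csv('Jam Profile.csv', sep=';', skiprows=1)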


1 Comment

Thanks, yes indeed, I think I made the issue out to be more complicated than it should have been. This helps a lot, thanks. I'll feed back which of these two methods I end up using!
