I am trying to read performance reports from multiple machines, parse them, and combine them so I can easily compare machine performance on single plots. Once the reports are split into multiple CSVs, I plan to read them with pd.read_csv() and combine multiple tools into single DataFrames.
But to do that I must first deal with and split up rather ugly CSV files with semicolon separators.
The structure of the CSV is like this:
KEYWORD_01;;;...;;
COL_01;COL_02;COL03;...;COL_n;
Line_1;
Line_2;
Line_3;
...
Line_m;
KEYWORD_02;;;...;;
COL_01;COL_02;COL03;...;COL_x;
Line_1;
Line_2;
Line_3;
...
Line_y;
KEYWORD_03;;;...;;
COL_01;COL_02;COL03;...;COL_f;
Line_1;
Line_2;
Line_3;
...
Line_g;
The CSV report consists of multiple sections, each beginning with a fixed keyword (or keyphrase). Each section has a fixed number of columns (which can vary from section to section) and a dynamic number of rows depending on the number of events reported (cf. the structure above).
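To make that structure concrete, here is a minimal, self-contained sketch (with made-up keywords and data) of grouping such a file's rows by section keyword into a dict:

```python
import csv
import io

# Hypothetical two-section sample mimicking the structure described above
sample = (
    "Jam Profile;;;\n"
    "COL_01;COL_02;COL_03;\n"
    "a;1;2;\n"
    "b;3;4;\n"
    "Jam Table;;;\n"
    "COL_01;COL_02;\n"
    "c;5;\n"
)
keywords = ["Jam Profile", "Jam Table"]

sections = {}  # keyword -> list of rows (header row first, then data rows)
current = None
for row in csv.reader(io.StringIO(sample), delimiter=";"):
    if row and row[0] in keywords:       # keyword sits in the first cell
        current = sections.setdefault(row[0], [])
    elif current is not None:
        current.append(row)

print(sections["Jam Table"])
# [['COL_01', 'COL_02', ''], ['c', '5', '']]
```

Note the trailing semicolons produce a trailing empty field in every row, which would need to be dropped before (or after) the pandas import.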
I create a list with all my keywords called tpm_sections, plus a parallel list tpm_idx to hold the matching row indices:

    tpm_sections = ['Summary of time consumption', 'Equipment Indicators', 'Batch Profile', 'Jam Profile', 'Jam Time Profile', 'Jam Table', 'Handler Model profile', 'Miscellaneous Indicators ', 'Tape Job Profile ']
    tpm_idx = [None] * len(tpm_sections)

I read my CSV and use a regex to match elements of my tpm_sections list against the rows of the file, and I use enumerate() so I can store the row index in the separate list tpm_idx:
    import os
    import re
    import csv
    from datetime import datetime

    for file in os.listdir(input_folder):
        input_file = os.path.join(input_folder, file)
        if file.endswith('.csv'):
            tpm_date = datetime.fromtimestamp(os.path.getctime(input_file)).strftime('%Y%m%d')  # get TPM report date from file creation timestamp
            with open(input_file, "r") as f:
                reader = csv.reader(f, delimiter=";")
                for i, row in enumerate(reader):
                    if 'Machine' in row:
                        mcpat = re.compile(r'\\\\7icost\d\d')
                        mcline = str(row[1])
                        mcname = mcpat.match(mcline).group(0)[2:]  # strip the leading backslashes
                        mcid = mcname[6:]
                        print('Report date is: ' + tpm_date + '\nMachine Name: ' + mcname + '\nMachine ID: ' + mcid)
                    for j in range(len(tpm_sections)):
                        if tpm_sections[j] in row:
                            tpm_idx[j] = i
                            print('Section ' + tpm_sections[j] + ' starts at line: ' + str(tpm_idx[j]))
                tpm_dict = {tpm_sections[i]: tpm_idx[i] for i in range(len(tpm_idx))}
I now have a list of keywords, a list of matching row indices, and a dictionary linking both. How should I proceed with splitting the CSV file? Below is my (rough) code to write one CSV file per section of my reader object for a future pandas import; optionally, I could create sub-folders per section for more structure:
    rows = list(reader)  # csv.reader is a one-pass iterator, so materialise the rows first
    for j in range(len(tpm_sections)):
        output_file = tpm_date + mcname + tpm_sections[j]
        with open(output_file, 'w', newline='') as o:
            end = tpm_idx[j + 1] if j + 1 < len(tpm_idx) else len(rows)
            for line in rows[tpm_idx[j]:end]:
                o.write(';'.join(line) + '\n')

Is there a simpler method of doing this by passing a list of keywords to the split() function? That would be awesome, but I couldn't find any example of this being possible. Or by better using a regex and then a while "line is not empty" loop? Bear in mind an empty line in my CSV is made of ;;;;;
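For illustration, here is a self-contained sketch of the slicing idea with made-up data: read all rows into a list once, locate the keyword rows, then slice between consecutive indices (the keyword names and sample content are hypothetical):

```python
import csv
import io

raw = "KEY_A;;\nh1;h2;\n1;2;\nKEY_B;;\nh1;h2;h3;\n3;4;5;\n"  # hypothetical input
rows = list(csv.reader(io.StringIO(raw), delimiter=";"))      # materialise the one-pass reader

tpm_sections = ["KEY_A", "KEY_B"]
tpm_idx = [i for i, row in enumerate(rows) if row and row[0] in tpm_sections]

chunks = {}
for j, start in enumerate(tpm_idx):
    end = tpm_idx[j + 1] if j + 1 < len(tpm_idx) else len(rows)
    # skip the keyword line itself; keep the header row plus the data rows
    chunks[rows[start][0]] = rows[start + 1:end]

print(chunks["KEY_A"])
# [['h1', 'h2', ''], ['1', '2', '']]
```

Each chunk could then be written out with csv.writer, or handed straight to pd.DataFrame without the intermediate files.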
Should I instead create a list of lists (LoL) using list.append(), or a NumPy array, for every line in between matched tpm_sections[j] keywords? Then I could easily add columns for my machine name and ID. I could create a single LoL/array appending all of my 20 machines, or else create one per machine and append them later, either in pandas or before writing my CSV. Code example to add in part 2:

    elif tpm_sections[j] in row:
        TPM_LoL[j].append(row)
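A minimal sketch of what I mean by adding the machine columns (the machine name, section name, and data rows here are all hypothetical):

```python
# Append machine metadata to each data row as it is read, so every
# section list is ready for pd.DataFrame later.
mcname = "7icost03"  # hypothetical machine name matched by the regex above
mcid = "03"

section_lol = {"Jam Profile": []}     # one list of lists per section keyword
data_rows = [["a", "1"], ["b", "2"]]  # rows read for that section

for row in data_rows:
    section_lol["Jam Profile"].append(row + [mcname, mcid])

print(section_lol["Jam Profile"][0])
# ['a', '1', '7icost03', '03']
# then e.g.: pd.DataFrame(section_lol["Jam Profile"], columns=[..., "Machine", "ID"])
```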
...