
I have a folder with multiple files, each containing a varying number of columns per line. I want to go through the directory, open each file, and loop through each line, writing the line to a new CSV file based on the number of columns in that line. I want to end up with one big CSV for all lines with 14 columns, another big CSV for all lines with 18 columns, and a last CSV for all lines with any other column count.

Here's what I have so far.

import glob
import os
import csv


path = r'C:\Users\Vladimir\Documents\projects\ETLassig\W3SVC2'
all_files = glob.glob(os.path.join(path, "*.log"))

for file in all_files:
    # open the file itself; iterating over the path string only yields characters
    with open(file) as f:
        for line in f:
            fields = line.split()
            if len(fields) == 14:
                with open('c14.csv', 'a', newline='') as csvfile:
                    csvwriter = csv.writer(csvfile, delimiter=' ')
                    csvwriter.writerow(fields)
            elif len(fields) == 18:
                # 18-column lines go to their own file
                with open('c18.csv', 'a', newline='') as csvfile:
                    csvwriter = csv.writer(csvfile, delimiter=' ')
                    csvwriter.writerow(fields)
            else:
                with open('misc.csv', 'a', newline='') as csvfile:
                    csvwriter = csv.writer(csvfile, delimiter=' ')
                    csvwriter.writerow(fields)

Can anyone offer any feedback on how to approach this?

2 Answers


You can collect the fields of every line as a list of lists:

l = []
for file in your_files:
    with open(file, 'r') as f:
        for line in f:
            l.append(line.split(" "))

Now you have a list of lists, so just sort it by the length of the sublists and then write it to a new file:

l.sort(key=len)

with open(outputfile, 'w') as f:
    # write the lines here as you want
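
One possible way to finish the write step, grouping the sorted rows by field count and reusing the file names from the question (c14.csv, c18.csv, misc.csv), is a sketch like this:

from itertools import groupby

l.sort(key=len)                                 # order rows by number of fields
for length, rows in groupby(l, key=len):
    # pick an output file per field count; any other count goes to misc.csv
    name = f'c{length}.csv' if length in (14, 18) else 'misc.csv'
    with open(name, 'a') as f:
        for fields in rows:
            # re-join the fields and make sure each row ends with a newline
            f.write(' '.join(fields).rstrip('\n') + '\n')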

2 Comments

Thanks very much - is there a writelines() method for writing to the output file?
Yes, you can do lots of things. @JamesCooper

Beforehand, please note that you can copy the lines as-is from the input files to the output ones; there is no need for the CSV machinery.

That said, I propose using a dictionary of file objects together with the dictionary get method, which lets you specify a default value.

files = {14: open('14.csv', 'w'),
         18: open('18.csv', 'w')}
other = open('other.csv', 'w')

for file in all_files:
    for line in open(file):
        llen = len(line.split())          # number of whitespace-separated fields
        target = files.get(llen, other)   # fall back to the catch-all file
        target.write(line)
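
Note that the snippet above never explicitly closes the output files. If you want that handled automatically, one possible variant (a sketch using the standard-library contextlib.ExitStack) would be:

from contextlib import ExitStack

with ExitStack() as stack:
    # register every output file so it is closed when the block exits
    files = {14: stack.enter_context(open('14.csv', 'w')),
             18: stack.enter_context(open('18.csv', 'w'))}
    other = stack.enter_context(open('other.csv', 'w'))

    for file in all_files:
        with open(file) as src:
            for line in src:
                target = files.get(len(line.split()), other)
                target.write(line)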

If you have to process some millions of records, then note that, because

In [20]: a = 'a '*20                                                                      

In [21]: %timeit len(a.split())                                                           
599 ns ± 1.59 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [22]: %timeit a.count(' ')+1                                                           
328 ns ± 1.28 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
 

you should substitute the for loops above with

for file in all_files:
    for line in open(file):
        fields_count = line.count(' ')+1
        target = files.get(fields_count, other)
        target.write(line)

"Should" because, even if we are speaking of nanoseconds, the file system access is in the same ballpark

In [23]: f = open('dele000', 'w')                                                         

In [24]: %timeit f.write(a)                                                               
508 ns ± 154 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

as splitting/counting.
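
To reproduce these measurements outside IPython, a rough equivalent using the standard timeit module (the absolute numbers will of course vary by machine) is:

import timeit

setup = "a = 'a ' * 20"
n = 1_000_000

split_time = timeit.timeit("len(a.split())", setup=setup, number=n)
count_time = timeit.timeit("a.count(' ') + 1", setup=setup, number=n)

# report nanoseconds per loop, comparable to the %timeit output above
print(f"split: {split_time / n * 1e9:.0f} ns per loop")
print(f"count: {count_time / n * 1e9:.0f} ns per loop")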

