
I'm having trouble calling awk from within Python. Normally, I run the command from the command line as follows:

  1. Open a command prompt (admin mode or not).
  2. Change to the directory containing awk.exe, namely cd R\GnuWin32\bin
  3. Call awk -F "," "{ print > (\"split-\" $10 \".csv\") }" large.csv

The command splits large.csv on the 10th column into a number of files named split-[COL VAL HERE].csv, and it runs without problems from the command line. I tried to run the same command from Python using subprocess.call(), but it does not produce the same result. This is the code I run:

def split_ByInputColumn():
     subprocess.call(['C:/R/GnuWin32/bin/awk.exe', '-F', '\",\"', 
              '\"{ print > (\\"split-\\" $10 \\".csv\\") }\"', 'large.csv'],
                  cwd = 'C:/R/GnuWin32/bin/')

Something clearly runs when I execute the function (CPU usage goes up, etc.), but when I check C:/R/GnuWin32/bin/ there are no split files in the directory. Any idea what's going wrong?

  • Any reason why you don't just do the equivalent in Python, so you don't have to run awk? Commented Nov 3, 2016 at 18:19

2 Answers


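As I stated in my previous answer (which was downvoted), you over-protect the arguments, which makes awk's argument parsing fail. When subprocess.call() is given a list, each item is passed to awk as-is, with no shell to strip quotes, so the extra quote characters become part of the field separator and of the awk program itself.

Since nobody commented, I assumed my answer contained a typo, but it worked when I tested it. So I suppose the downvote came because I should have more strongly suggested a full-fledged Python solution, which is the best thing to do here (as stated in my previous answer).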

Writing the equivalent in Python is not trivial, as we have to emulate the way awk opens each output file once and keeps appending to it afterwards. But it is more integrated, more Pythonic, and handles quoting properly if quoting occurs in the input file.

I took the time to code & test it:

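import csv
import glob
import os

def split_ByInputColumn():
    # get rid of the old data from previous runs
    for f in glob.glob("split-*.csv"):
        os.remove(f)

    # filename -> (file handle, csv.writer) for every output file opened so far
    open_files = dict()

    # newline='' lets the csv module manage line endings (Python 3)
    with open('large.csv', newline='') as f:
        cr = csv.reader(f, delimiter=',')
        for r in cr:
            tenth_column = r[9]
            filename = "split-{}.csv".format(tenth_column)
            if filename not in open_files:
                # Python 3 text mode; on Python 2 use open(filename, "wb") instead
                handle = open(filename, "w", newline='')
                open_files[filename] = (handle, csv.writer(handle, delimiter=','))
            open_files[filename][1].writerow(r)

    for f, _ in open_files.values():
        f.close()

split_ByInputColumn()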

In detail:

  • read the big file as csv (advantage: quoting is handled properly)
  • compute the destination filename from the 10th column
  • if the filename is not in the dictionary yet, open the file and create a csv.writer object for it
  • write the row with the csv.writer stored in the dictionary for that filename
  • in the end, close all file handles
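
The steps above can also be sketched as a parameterised Python 3 variant. This is only a sketch under assumptions: the function name split_by_column and its parameters src_path, column_index and out_dir are hypothetical and not part of the code above, and contextlib.ExitStack is swapped in for the manual close loop so that the output files are closed even if a row raises an error.

```python
import csv
import os
from contextlib import ExitStack


def split_by_column(src_path, column_index=9, out_dir="."):
    """Write each row of src_path to split-<value>.csv, keyed on one column.

    Hypothetical helper: the name and parameters are illustrative only.
    Returns the sorted list of distinct column values that were seen.
    """
    writers = {}  # column value -> csv.writer for that value's output file
    with ExitStack() as stack, open(src_path, newline="") as src:
        for row in csv.reader(src):
            key = row[column_index]
            if key not in writers:
                path = os.path.join(out_dir, "split-{}.csv".format(key))
                # "w" truncates any file left over from a previous run;
                # ExitStack closes every registered handle on exit.
                handle = stack.enter_context(open(path, "w", newline=""))
                writers[key] = csv.writer(handle)
            writers[key].writerow(row)
    return sorted(writers)
```

Like the version above, this keeps one open handle per distinct column value, so a column with a very large number of distinct values could hit the operating system's limit on open files.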

Aside: My old solution, using awk properly:

import subprocess

def split_ByInputColumn():
    subprocess.call(['awk.exe', '-F', ',',
                     '{ print > ("split-" $10 ".csv") }', 'large.csv'],
                    cwd='some_directory')

1 Comment

Thanks for re-answering and providing a great python solution as well!

Someone else posted an answer (and subsequently deleted it), but the issue was that I was over-protecting my arguments. The following code works:

import subprocess

def split_ByInputColumn():
    # each list item reaches awk unchanged, so no extra quoting is needed
    subprocess.call(['C:/R/GnuWin32/bin/awk.exe', '-F', ',',
                     '{ print > ("split-" $10 ".csv") }', 'large.csv'],
                    cwd='C:/R/GnuWin32/bin/')
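
To see what over-protecting does, here is a minimal sketch that does not need awk at all. It starts a child Python process that simply echoes the arguments it received, which is roughly what any program launched through subprocess with a list sees. The helper name received_args and the shortened '{ print }' program are made up for the illustration.

```python
import subprocess
import sys

# Child program: print every argument it received, one per line.
ECHO_ARGS = "import sys; print('\\n'.join(sys.argv[1:]))"


def received_args(args):
    """Return the arguments exactly as a child process sees them.

    Hypothetical helper for illustration; it launches the current Python
    interpreter rather than awk.
    """
    out = subprocess.check_output([sys.executable, "-c", ECHO_ARGS] + args)
    return out.decode().splitlines()


# Quoted the way the question did it: the quotes are part of the strings.
over_quoted = received_args(['\",\"', '\"{ print }\"'])
# Quoted the way the working code does it: plain strings, no extra quotes.
plain = received_args([',', '{ print }'])

print(over_quoted)  # each argument still carries its quote characters
print(plain)        # each argument arrives without surrounding quotes
```

In the over-quoted case the child receives a three-character separator made of a quote, a comma and a quote, and a program text that starts with a quote character instead of a brace, which is why awk never runs the intended print action.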

2 Comments

My answer was downvoted, so I assumed it wasn't working. I just tested it and it works...
@Jean-FrançoisFabre Unsure why you got downvoted, but the new answer is even better - thanks for the help.
