Calling a python script from awk

Question

I have a file with about 150k rows and two columns. I need to run a python script on the first field and save its output as a third column, such that the change looks like this:

Original File:

Col1  Col2 
d     2
e     4
f     6

New file:


Col1  Col2  Col3
d     2     output
e     4     output
f     6     output

I'm not able to run the script from inside awk.

cat original.list | awk -F" " ' {`/homes/script.py $1`}'

If I were able to, I would then want to save it as a variable, and print the new variable, plus $1 and $2 to the new file.

thanks in advance (related question here)

Why don't you do all the tasks with python? Awk is not the only language which can split columns.
– tshiono, Commented Aug 17, 2021 at 7:44

MarcoLucidi · Accepted Answer · 2021-08-19 19:39:21Z

the answer to the "related question" you linked (and the one posted in the comments) actually solves your problem; it just needs to be adapted to your specific case.

cat original.list | awk -F" " ' {`/homes/script.py $1`}'

cat is useless here because awk can open and read the file by itself
you don't need -F" " because awk splits fields on whitespace by default
backticks `` won't run your script: command substitution is a shell feature (and even there $(...) is preferred over backticks), it doesn't exist in awk

we can use command | getline var to execute a command and store its (first line of) output in a variable. from man awk:

command | getline var

pipes a record from command into var.

using your example file:

$ cat original
Col1  Col2
d     2
e     4
f     6
$

and a dummy script.py:

$ cat script.py
#!/bin/python

print("output")
$

we can do something like this:

$ awk '
NR == 1 { print $0, "Col3" }
NR > 1 { cmd="./script.py " $1; cmd | getline out; close(cmd); print $0, out }
' original
Col1  Col2 Col3
d     2 output
e     4 output
f     6 output
$

the first action runs on the first line of input, adds Col3 to the header and avoids passing Col1 to the python script.

in the other action, we first build the command by concatenating $1 to the script's path, then we run it and store the first line of its output in the out variable (I'm assuming your python script's output is just one line). close(cmd) is important because after getline the pipe reading from cmd's output would otherwise remain open, and doing this for many records could lead to errors like "too many open files". at the end we print $0 and cmd's output.
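for comparison, the same per-row pattern (run a command, keep the first line of its output, release the pipe) can be sketched in Python. this is only an illustration, not part of the awk solution: first_line_of is a hypothetical helper name. one difference worth noting is that awk hands the cmd string to the shell, so a $1 containing spaces or shell metacharacters would be interpreted by sh, while passing the arguments as a list, as below, bypasses the shell entirely.

```python
import subprocess

def first_line_of(argv):
    """Run argv (a list of strings, no shell involved) and return the first line of its stdout.

    Hypothetical helper, named only for this sketch.
    """
    # subprocess.run waits for the command to finish and closes the pipe,
    # which is the job close(cmd) does in the awk version.
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    lines = result.stdout.splitlines()
    # Return an empty string when the command prints nothing.
    return lines[0] if lines else ""
```

a call would look like first_line_of(["./script.py", "d"]), assuming script.py is executable and sits in the current directory.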

the third column's formatting looks a bit off; you can improve it either from awk using printf or with an external program like column, e.g.:

$ awk '
NR == 1 { print $0, "Col3" }
NR > 1 { cmd="./script.py " $1; cmd | getline out; close(cmd); print $0, out }
' original | column -t
Col1  Col2  Col3
d     2     output
e     4     output
f     6     output
$

lastly, doing all this on a 150k-row file means starting the python script 150k times, so it will probably be slow. I think performance could be improved by doing everything directly in the python script, as already suggested in the comments, but whether that is applicable to your specific case goes beyond the scope of this question/answer.
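as a rough sketch of that suggestion, the whole job could be done in a single Python process. this assumes the logic of script.py can be imported or rewritten as a function: compute below is a hypothetical placeholder for it, and add_third_column is an illustrative name, not something from the question.

```python
def compute(value):
    """Hypothetical stand-in for whatever script.py does with the first field."""
    return "output"

def add_third_column(lines, func=compute):
    """Yield each non-blank input line with one extra, tab-separated column.

    The first non-blank line is treated as the header and gets "Col3";
    every other line gets func(first_field).
    """
    header_done = False
    for line in lines:
        fields = line.split()  # split on runs of whitespace, like awk's default
        if not fields:
            continue  # skip blank lines
        if not header_done:
            extra = "Col3"
            header_done = True
        else:
            extra = func(fields[0])
        yield "\t".join(fields + [extra])
```

it could be driven with something like `for row in add_third_column(open("original")): print(row)`, and the tab-separated result can still be piped through column -t for alignment. the process is started once instead of once per row, which is where the saving would come from.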
