I have a file with about 150k rows and two columns. I need to run a Python script on the first field and save its output as a third column, so that the change looks like this:

Original File:

Col1  Col2 
d     2
e     4
f     6

New file:


Col1  Col2  Col3
d     2     output
e     4     output
f     6     output

I'm not able to run the script from inside awk.

cat original.list | awk -F" " ' {`/homes/script.py $1`}'

If I were able to, I would then want to save it as a variable, and print the new variable, plus $1 and $2 to the new file.

thanks in advance (related question here)

  • See this: Assigning system command's output to variable Commented Aug 17, 2021 at 6:53
  • Why don't you do all the tasks with python? Awk is not the only language which can split columns. Commented Aug 17, 2021 at 7:44

1 Answer

the answer to the "related question" you linked (and the one posted in the comments) actually solves your problem; it just needs to be adapted to your specific case.

cat original.list | awk -F" " ' {`/homes/script.py $1`}'
  • cat is unnecessary here because awk can open and read the file by itself
  • you don't need -F" " because awk splits fields on whitespace by default
  • backticks `` won't run your script: command substitution is a shell feature (and discouraged even there in favor of $(...)), and it doesn't work inside awk

we can use command | getline var to execute a command and store the first line of its output in a variable. from man awk:

command | getline var

pipes a record from command into var.

using your example file:

$ cat original
Col1  Col2
d     2
e     4
f     6
$

and a dummy script.py:

$ cat script.py
#!/bin/python

print("output")
$

we can do something like this:

$ awk '
NR == 1 { print $0, "Col3" }
NR > 1 { cmd="./script.py " $1; cmd | getline out; close(cmd); print $0, out }
' original
Col1  Col2 Col3
d     2 output
e     4 output
f     6 output
$

the first action runs only on the first line of input: it adds Col3 to the header and avoids passing Col1 to the python script.

in the other action, we first build the command by concatenating $1 to the script's path, then we run it and store the first line of its output in the out variable (I'm assuming your python script prints just one line). close(cmd) is important: after getline, the pipe reading from cmd's output would otherwise stay open, and doing this for many records could lead to errors like "too many open files". finally, we print $0 followed by the command's output.
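for comparison, here is a rough Python sketch of the same "run a command, keep the first line of its output" step. it is only an illustration: first_line and demo_cmd are names I made up, and the demo command is a stand-in for ["./script.py", field] so the snippet can run anywhere. passing the arguments as a list means the field never goes through a shell, so odd characters in Col1 need no quoting.

```python
import subprocess
import sys


def first_line(argv):
    """Run argv (a list of strings, no shell involved) and return the
    first line of its standard output, or "" if it printed nothing."""
    done = subprocess.run(argv, capture_output=True, text=True, check=True)
    lines = done.stdout.splitlines()
    return lines[0] if lines else ""


# Hypothetical stand-in for ["./script.py", "d"]: a tiny command that
# ignores its argument and prints "output", like the dummy script.py.
demo_cmd = [sys.executable, "-c", "print('output')", "d"]
print(first_line(demo_cmd))
```

unlike the awk pipe, there is nothing to close here: subprocess.run waits for the command to finish and releases its pipes before returning.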

the third column's formatting looks a bit off; you can improve it either from awk using printf or with an external program like column, e.g.:

$ awk '
NR == 1 { print $0, "Col3" }
NR > 1 { cmd="./script.py " $1; cmd | getline out; close(cmd); print $0, out }
' original | column -t
Col1  Col2  Col3
d     2     output
e     4     output
f     6     output
$

lastly, doing all this on a 150k-row file means starting the python script 150k times, so it will probably be slow. I think performance could be improved by doing everything directly in the python script, as already suggested in the comments, but whether that is applicable to your specific case goes beyond the scope of this question/answer.
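a minimal sketch of what "everything in python" might look like, assuming the logic of your script can be wrapped in a function. compute and add_column are hypothetical names, and compute just returns "output" like the dummy script.py above; the real per-field logic would go there. the demo reads from an in-memory copy of the example file so it runs as-is.

```python
import io


def compute(value):
    """Hypothetical stand-in for whatever the real script does with one field."""
    return "output"


def add_column(src, dst, func=compute, header="Col3"):
    """Copy whitespace-separated rows from src to dst, appending a new column.

    The first line is treated as the header; every other non-blank line
    gets func(first field) appended. Fields are written tab-separated.
    """
    for lineno, line in enumerate(src, start=1):
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        extra = header if lineno == 1 else func(fields[0])
        dst.write("\t".join(fields + [extra]) + "\n")


# Demo on an in-memory copy of the example file from the question.
sample = io.StringIO("Col1  Col2\nd     2\ne     4\nf     6\n")
result = io.StringIO()
add_column(sample, result)
print(result.getvalue(), end="")
```

for the real file you would pass open file objects instead of the StringIO ones, e.g. open("original") for src and a file opened for writing for dst. this way python starts once instead of once per row.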
