
I am working on plotting extremely large files, each containing some number N of relevant data entries (N varies between files).

In each of these files, comments are automatically generated at the start and end of the file, and I would like to filter these out before recombining the files into one grand data set.

Unfortunately, I am on macOS, where I run into issues when trying to remove the last line of a file. I have read that the most efficient way is to use the head/tail commands to cut off sections of data. Since head -n -1 does not work on macOS, I installed coreutils through Homebrew, where the ghead command works wonderfully. However, the command

tail -n+9 $COUNTER/test.csv | ghead -n -1 $COUNTER/test.csv  >> gfinal.csv

does not work. A less than pleasing workaround was to separate the commands: run ghead > newfile, then run tail on newfile > gfinal. Unfortunately, this will take a while, since the first ghead has to write out an entire new file.

Is there a way to combine the GNU utilities with the standard macOS ones in a single pipeline?

Thanks, Keven

2 Answers


The problem with your command is that you pass the file operand to ghead again, instead of letting it read its input from stdin via the pipe. This causes ghead to ignore stdin entirely, so the first pipeline segment is effectively discarded. Simply omit the file operand from the ghead command:

tail -n+9 "$COUNTER/test.csv" | ghead -n -1 >> gfinal.csv

That said, if you only want to drop the last line, there's no need for GNU head - OS X's own BSD sed will do:

tail -n +9 "$COUNTER/test.csv" | sed '$d' >> gfinal.csv

$ matches the last line, and d deletes it (meaning it won't be output).
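As a quick sanity check (on a hypothetical three-line input), sed '$d' passes everything through except the final line:

```shell
# sed '$d' drops only the last line of its input.
printf 'line1\nline2\nline3\n' | sed '$d'
# Outputs:
# line1
# line2
```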

Finally, as @ghoti points out in a comment, you could do it all using sed:

sed -n '9,$ {$!p;}' file

Option -n tells sed to only produce output when explicitly requested; 9,$ matches everything from line 9 through (,) the end of the file (the last line, $), and {$!p;} prints (p) every line in that range, except (!) the last ($).
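To see the range and the last-line exclusion at work, here is a small demo with 12 numbered lines; only lines 9 through 11 should come out:

```shell
# -n suppresses default output; the range 9,$ selects line 9 onward,
# and $!p prints every selected line except the last one.
seq 12 | sed -n '9,$ {$!p;}'
# Outputs:
# 9
# 10
# 11
```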


7 Comments

@ghoti: Excellent point, thank you; I've updated the answer, though I've chosen a variant that in my mind better expresses the intent.
Ah, I deleted my comment in order to expand it into an answer. :) Your new sed script expresses the goal of the OP with greater poetry, but I don't think it does so more clearly. This way, it says "print lines that meet these criteria" rather than just "delete these ranges of lines from the stream". I would suggest that they are simply different perspectives on the problem. (But +1 for your great explanation, as usual.)
@ghoti: Thanks; point taken re different perspectives. I will say, though, that my sed command more closely resembles the OP's approach.
Hi, thanks for both of your responses! If I were not on macOS, I could just use "tail | head". Why is it a different case for ghead? Also, in the case where the files have ~1 million entries, is there a method of removing the last line without actually reading the data? From what I have read on other Stack Exchange posts, the head | tail method is the most efficient.
Holy, I just realised the error. Thanks! I've been staring at the code for ages wondering why it did not work! (such an embarrassing rookie mistake)

I realize that your question is about using head and tail, but I'll answer as if you're interested in solving the original problem rather than figuring out how to use those particular tools to solve the problem. :)

One method using sed:

sed -e '1,8d;$d' inputfile

At this level of simplicity, GNU sed and BSD sed both work the same way. Our sed script says:

  • 1,8d - delete lines 1 through 8,
  • $d - delete the last line.

If you decide to generate a sed script like this on-the-fly, beware of your quoting; you will have to escape the dollar sign if you put it in double quotes.
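For instance, if the number of header lines lives in a shell variable (a hypothetical `n` here), the last-line address needs its dollar sign escaped inside double quotes, or the shell will try to expand `$d` as a variable:

```shell
# Strip the first $n lines and the last line; note the escaped \$.
n=8
seq 12 | sed -e "1,${n}d;\$d"
# Outputs:
# 9
# 10
# 11
```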

Another method using awk:

awk 'NR>9{print last} NR>1{last=$0}' inputfile

This works a bit differently in order to "recognize" the last line: each record is buffered and printed one line late, with printing starting only once NR exceeds 9, so line 9 is the first line out and the final line is never printed at all.

This awk solution is a bit of a hack, and like the sed solution, relies on the fact that you only want to strip ONE final line of the file.
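To see the one-line delay at work (again on 12 numbered lines):

```shell
# Printing lags assignment by one record, so the final line never appears,
# and the NR>9 guard keeps the first 8 lines from being printed.
seq 12 | awk 'NR>9{print last} NR>1{last=$0}'
# Outputs:
# 9
# 10
# 11
```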

If you want to strip more lines than one off the bottom of the file, you'd probably want to maintain an array that would function sort of as a buffered FIFO or sliding window.

awk -v striptop=8 -v stripbottom=3 '
  { last[NR]=$0; }
  NR > striptop*2 { print last[NR-striptop]; }
  { delete last[NR-striptop]; }
  END { for(r=NR-striptop+1; r<=NR-stripbottom; r++) print last[r]; }
' inputfile

(Note the numeric loop in the END block: for (r in last) would traverse the array in an unspecified order, which could print the remaining lines out of sequence.)

You specify how much to strip in variables. The last array keeps a window of lines in memory, prints from the far end of that window, and deletes lines as they are printed. The END section steps through whatever remains in the array and prints everything not excluded by stripbottom (this assumes striptop is at least as large as stripbottom).
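A quick sanity check of the windowed approach (restated here with an ordered END loop), feeding it 20 numbered lines and expecting lines 9 through 17 back:

```shell
# Strip 8 header lines and 3 trailer lines from 20 numbered lines;
# lines 9 through 17 should survive.
seq 20 | awk -v striptop=8 -v stripbottom=3 '
  { last[NR]=$0; }
  NR > striptop*2 { print last[NR-striptop]; }
  { delete last[NR-striptop]; }
  END { for(r=NR-striptop+1; r<=NR-stripbottom; r++) print last[r]; }
'
# Outputs the numbers 9 through 17, one per line.
```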

3 Comments

Good point re double-quoting; your awk command should say NR>9, and could be optimized by replacing NR>1 with NR>=9, or, more generally: n=9; awk "NR>$n{print last} NR>=$n{last=\$0}" inputfile - but, as you state, this is a bit hacky.
Thanks, fixed the awk one-liner, and yes, that would be an optimization. Re your general approach, notwithstanding other considerations, I don't think I would ever use double quotes to contain an awk script -- I'm afraid of variable expansions inside scripts like that. I would be more inclined to: awk -v n="$n" 'NR>n{print last} ....
Yes, good point re variable passing - using -v is the way to go. I just took a shortcut in this simple case.
