Emulate SAS data step FIRST. processing using Linux command-line tools

Question

Let's say I have the first column of the following dataset in a file, and I want to emulate the flag in the second column so that I export only the rows where flag = 1 (the dataset is pre-sorted by the target column):

I could run awk 'NR==1 {print; next} !seen[$1]++ {print}' dataset, but that becomes a problem for very large files, because the seen array keeps growing. Is there an alternative that handles this without tracking every single unique value of the target column (here column #1)? Thanks.

MrFlick · Accepted Answer · 2014-05-07 03:42:47Z

1

So you only have the first column and would like to generate the second? I think a slightly different awk command could work:

awk '{if (last==$1) {flag=0} else {last=$1; flag=1}; print $0,flag}' file.txt

Basically, you just check whether the first field matches the last one you've seen. Since the file is sorted, you don't have to keep track of everything you've seen, only the previous value, which is enough to tell whether the current value is different.
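
As a rough illustration of the idea above, here is the same command run against a small made-up file (sample.txt and its contents are hypothetical, not the asker's data). The second command is a variant, not part of the original answer, that prints only the first row of each group instead of appending the flag, assuming that is the export being asked for.

```shell
# Hypothetical sample data: one key per line, already sorted by key.
printf '%s\n' a a b c c c > sample.txt

# The command from the answer: append flag=1 on the first row of each run of equal keys.
awk '{if (last==$1) {flag=0} else {last=$1; flag=1}; print $0,flag}' sample.txt > flagged.txt
# flagged.txt:
#   a 1
#   a 0
#   b 1
#   c 1
#   c 0
#   c 0

# Variant (an assumption about the desired export): keep only the first row of each run.
awk 'NR == 1 || $1 != last { print } { last = $1 }' sample.txt > first_rows.txt
# first_rows.txt:
#   a
#   b
#   c
```

Both commands hold a single value in memory (last), so memory use does not depend on the number of distinct keys.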

1 Comment

user2105469 Over a year ago

Simple, straightforward solution, thanks. My data had missing values in $1, so I had to initialize last to -1: if (NR==1) {last=-1} ...
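
One possible way to sidestep the sentinel is to test NR==1 directly, so that no real key value (including -1 or an empty string) can collide with the initial state. The sketch below assumes a comma-delimited file with an empty first field on some rows; the file name missing.csv and its contents are made up for illustration.

```shell
# Hypothetical sample: the first two rows have an empty key in column 1.
printf '%s\n' ',x' ',y' 'a,z' > missing.csv

# NR == 1 forces flag=1 on the first row, whatever its key is;
# after that, a row is flagged only when its key differs from the previous row's key.
awk -F, '{ flag = (NR == 1 || $1 != last); last = $1; print $0 "," flag }' missing.csv > missing_flagged.csv
# missing_flagged.csv:
#   ,x,1
#   ,y,0
#   a,z,1
```

Without the NR == 1 test, the first row here would get flag 0, because an uninitialized awk variable compares equal to an empty field.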

Austin Hastings · Accepted Answer · 2014-05-07 02:30:08Z

0

Seems like grep would be fine for this:

$ grep " 1" dataset
