Awk create a new array of unique values from another array

Question

I have my array:

array = [1:"PLCH2", 2:"PLCH1", 3:"PLCH2"]

I want to loop over array to create a new array, unique, that holds only its unique values, and obtain:

unique = [1:"PLCH2", 2:"PLCH1"]

How can I achieve that?

EDIT: as per @Ed Morton's request, I show below how my array is populated. In fact, this post is the key to solving my previous post.

In my file.txt, I have:

PLCH2:A1007int&PLCH1:D987int&PLCH2:P977L
INTS11:P446P&INTS11:P449P&INTS11:P518P&INTS11:P547P&INTS11:P553P

I use split to obtain array:

awk '{
    split($0,a,"&")
    for ( i in a ) {
        split(a[i], b, ":");
        array[i] = b[1];
    }
}' file.txt

Edit your question to include a small, complete script that shows how your current array is populated, because what you have posted so far could be interpreted in several different ways.
– Ed Morton, Commented Feb 10, 2020 at 19:54
for ( i in a ) will re-arrange your values into a random (hash) order. That's often undesirable, which is why I use for ( i=1; i in a; i++ ) instead to ensure I visit the array indices in the same order they appeared in the string that was split into the array. See gnu.org/software/gawk/manual/gawk.html#Scanning-an-Array.
– Ed Morton, Commented Feb 11, 2020 at 14:24
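To make the comment above concrete, here is a minimal sketch of the numeric loop applied to the first sample line from the question. The shell function name ordered_fields is hypothetical and exists only so the snippet can be run as-is; the awk program inside it is the part that matters.

```shell
# Hypothetical wrapper name; only the awk program inside matters.
ordered_fields() {
    awk 'BEGIN {
        split("PLCH2:A1007int&PLCH1:D987int&PLCH2:P977L", a, "&")
        # Numeric loop: visits indices 1, 2, 3 in that order and stops
        # at the first index that is not present in a.
        for (i = 1; i in a; i++) {
            split(a[i], b, ":")
            printf "%d:%s\n", i, b[1]
        }
    }'
}
ordered_fields
```

With for ( i in a ) instead, the same three lines could come out in any order, depending on the awk in use.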
@EdMorton wow, I didn't know that, thanks for explaining the difference in the loop
– user324810, Commented Feb 11, 2020 at 16:27
@EdMorton Huh, I usually use for (i = 1; i <= length(a); i++) to do the same (assuming the array a has numeric indices, which is true for arrays generated by the split() function). I wasn't sure if the i in a condition is in proper order - do you know if it works with all awks? (Well, I mostly care about BWK AWK, mawk, and gawk.)
– jena, Commented Dec 30, 2022 at 13:27
@jena calling length(a) on each iteration of the loop is time-consuming compared with a hash lookup and is non-portable (per POSIX, length() operates on strings, not arrays). i in a in this context has nothing (directly) to do with order; it's just a hash lookup testing whether i is an index of a, which fails when i has the value 1 more than the max index. Yes, it works in all awks, unlike length(a), which will only work in awks that support calling length on an array.
– Ed Morton, Commented Dec 30, 2022 at 13:43
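A third option, not mentioned in the comments above, is to keep the count that split() returns and loop up to it. This is only a sketch: the function name count_loop and the variable n are made up for the example. It avoids both length(a) and the per-iteration membership test.

```shell
# Hypothetical wrapper name; only the awk program inside matters.
count_loop() {
    awk 'BEGIN {
        # split() returns the number of elements it stored in a.
        n = split("PLCH2 PLCH1 PLCH2", a)
        for (i = 1; i <= n; i++) {
            printf "%s%s", a[i], (i < n ? " " : "\n")
        }
    }'
}
count_loop
```

This only applies when the array came from split() (or you tracked the count yourself); for an array filled some other way, the i in a test from the comment above still works.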

Ed Morton · Accepted Answer · 2020-02-11 14:52:22Z

2

This might be what you're trying to do:

$ cat tst.awk
BEGIN {
    split("PLCH2 PLCH1 PLCH2",array)

    printf "array ="
    for (i=1; i in array; i++) {
        printf " %s:\"%s\"", i, array[i]
    }
    print ""

    for (i=1; i in array; i++) {
        if ( !seen[array[i]]++ ) {
            unique[++j] = array[i]
        }
    }

    printf "unique ="
    for (i=1; i in unique; i++) {
        printf " %s:\"%s\"", i, unique[i]
    }
    print ""
}

$ awk -f tst.awk
array = 1:"PLCH2" 2:"PLCH1" 3:"PLCH2"
unique = 1:"PLCH2" 2:"PLCH1"
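If you need this in more than one place, the same loop can be wrapped in a function. The sketch below is one possible way to package it, not part of the script above: the names dedup and uniq_demo are hypothetical, and it assumes the source array is indexed 1..n as split() produces.

```shell
# Hypothetical wrapper and helper names, for illustration only.
uniq_demo() {
    awk '
    # Copy the first occurrence of each value in src[1..n] into
    # dst[1..m], preserving order, and return m.
    # i, m and seen are local to the function.
    function dedup(src, n, dst,    i, m, seen) {
        for (i = 1; i <= n; i++)
            if (!seen[src[i]]++)
                dst[++m] = src[i]
        return m + 0
    }
    BEGIN {
        n = split("PLCH2 PLCH1 PLCH2", array)
        m = dedup(array, n, unique)
        for (i = 1; i <= m; i++)
            printf "%d:\"%s\"%s", i, unique[i], (i < m ? " " : "\n")
    }'
}
uniq_demo
```

Because seen is a local of the function, it starts empty on every call, so nothing needs to be deleted between calls; the caller is still responsible for passing an empty dst.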

EDIT: given your updated question, here's how I'd really approach that:

$ cat tst.awk
BEGIN { FS="[:&]" }
{
    delete vals    # drop fields left over from a longer previous line
    numVals=0
    for (i=1; i<NF; i+=2) {
        vals[++numVals] = $i
    }

    print "vals =" arr2str(vals)

    delete seen
    delete uniq    # reset per line so earlier results do not leak into this one
    numUniq=0
    for (i=1; i<=numVals; i++) {
        if ( !seen[vals[i]]++ ) {    # true only the first time a value appears
            uniq[++numUniq] = vals[i]
        }
    }

    print "uniq =" arr2str(uniq)
}

function arr2str(arr,    str, i) {
    for (i=1; i in arr; i++) {
        str = str sprintf(" %s:\"%s\"", i, arr[i])
    }
    return str
}

$ awk -f tst.awk file
vals = 1:"PLCH2" 2:"PLCH1" 3:"PLCH2"
uniq = 1:"PLCH2" 2:"PLCH1"
vals = 1:"INTS11" 2:"INTS11" 3:"INTS11" 4:"INTS11" 5:"INTS11"
uniq = 1:"INTS11"

edited Feb 11, 2020 at 14:52

answered Feb 10, 2020 at 20:33

Ed Morton


10 Comments

user324810 Over a year ago

Thanks for proposing an answer, I will try it out and come back to this. Btw, is seen a function ?

Ed Morton Over a year ago

No, seen[] is just the name of an array that I'm using to store the values from array[] as indices, so I can tell which of those values has been seen before by its associated count being zero or non-zero (which I'm using as false or true in the if). seen[] COULD be named anything, but that's the idiomatic name given to an array used for that purpose.

Ed Morton Over a year ago

@Zen others had already rejected your proposed edit before I saw it in the queue but it had several issues anyway so I just posted how to really do this using the sample input you provided read from a file.

Ed Morton Over a year ago

@AS seen[x]++ tests if the index x has been seen before. The first time seen[x]++ is tested the result is false since the value of seen[x] is zero, but the second (and subsequent times) it's true because the ++ that was executed after the condition was evaluated the first time set seen[x] to 1. Any time you see an array named seen[] THAT is what it's being used for (assuming the person who wrote the code understands the idiom).
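To see the counter at work, here is a small sketch that prints the value seen[$0]++ yields for each input line. The function name seen_demo and the variable before are made up for the example.

```shell
# Hypothetical wrapper name; only the awk program inside matters.
seen_demo() {
    printf '%s\n' PLCH2 PLCH1 PLCH2 |
    awk '{
        # Post-increment: the expression yields the old count (0 the
        # first time), and only then is the stored count bumped.
        before = seen[$0]++
        printf "%s before=%d %s\n", $0, before, (before ? "duplicate" : "first")
    }'
}
seen_demo
```

This prints before=0 for the first PLCH2 and for PLCH1, and before=1 for the second PLCH2, which is why !seen[x]++ is true only on the first occurrence.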

Ed Morton Over a year ago

@AS you're welcome. I expect so - post a new question specifically about whatever it is you want to know about and leave me a comment and I'll take a look. Just make sure to make it about performing a specific task and not "can someone explain this code" or similar and include a minimal reproducible example so it doesn't get downvoted and closed as off topic.
