I have my array:

array = [1:"PLCH2", 2:"PLCH1", 3:"PLCH2"]

I want to loop over array to create a new array, unique, containing only its unique values, and obtain:

unique = [1:"PLCH2", 2:"PLCH1"]

How can I achieve that?

EDIT: At @Ed Morton's request, I show below how my array is populated. In fact, solving this is the key to my previous post.

In my file.txt, I have:

PLCH2:A1007int&PLCH1:D987int&PLCH2:P977L
INTS11:P446P&INTS11:P449P&INTS11:P518P&INTS11:P547P&INTS11:P553P

I use split() to populate array:

awk '{
    split($0,a,"&")
    for ( i in a ) {
        split(a[i], b, ":");
        array[i] = b[1];
    }
}' file.txt
  • Edit your question to include a small, complete script that shows how your current array is populated because what you have posted so far could be interpreted in several different ways. Commented Feb 10, 2020 at 19:54
  • for ( i in a ) will re-arrange your values into a random (hash) order. That's often undesirable which is why I use for ( i=1; i in a; i++ ) instead to ensure I visit the array indices in the same order they appeared in the string that was split into the array. See gnu.org/software/gawk/manual/gawk.html#Scanning-an-Array. Commented Feb 11, 2020 at 14:24
  • @EdMorton wow, I didn't know that, thanks for explaining the difference in the loop. Commented Feb 11, 2020 at 16:27
  • @EdMorton Huh, I usually use for (i = 1; i <= length(a); i++) to do the same (assuming the array a has numeric indices, which is true for arrays generated by the split() function). I wasn't sure if the i in a condition is in proper order - do you know if it works with all awks? (Well, I mostly care about BWK AWK, mawk, and gawk.) Commented Dec 30, 2022 at 13:27
  • @jena calling length(a) on each iteration of the loop is time consuming vs a hash lookup and is non-portable (per POSIX, length() operates on strings, not arrays). i in a in this context has nothing (directly) to do with order, it's just a hash lookup testing if i is an index of a which will fail when i has the value 1 more than the max index. Yes, it works in all awks, unlike length(a) which will only work in awks that support calling length on an array. Commented Dec 30, 2022 at 13:43

1 Answer

This might be what you're trying to do:

$ cat tst.awk
BEGIN {
    split("PLCH2 PLCH1 PLCH2",array)

    printf "array ="
    for (i=1; i in array; i++) {
        printf " %s:\"%s\"", i, array[i]
    }
    print ""

    for (i=1; i in array; i++) {
        if ( !seen[array[i]]++ ) {
            unique[++j] = array[i]
        }
    }

    printf "unique ="
    for (i=1; i in unique; i++) {
        printf " %s:\"%s\"", i, unique[i]
    }
    print ""
}

$ awk -f tst.awk
array = 1:"PLCH2" 2:"PLCH1" 3:"PLCH2"
unique = 1:"PLCH2" 2:"PLCH1"

EDIT: given your updated question, here's how I'd really approach that:

$ cat tst.awk
BEGIN { FS="[:&]" }
{
    # Reset the per-line arrays so values from earlier lines don't leak in.
    delete vals
    delete uniq
    delete seen

    # With FS="[:&]" the names are the odd-numbered fields.
    numVals=0
    for (i=1; i<NF; i+=2) {
        vals[++numVals] = $i
    }

    print "vals =" arr2str(vals)

    # Keep only the first occurrence of each value, in input order.
    numUniq=0
    for (i=1; i<=numVals; i++) {
        if ( !seen[vals[i]]++ ) {
            uniq[++numUniq] = vals[i]
        }
    }

    print "uniq =" arr2str(uniq)
}

function arr2str(arr,    str, i) {
    for (i=1; i in arr; i++) {
        str = str sprintf(" %s:\"%s\"", i, arr[i])
    }
    return str
}

$ awk -f tst.awk file
vals = 1:"PLCH2" 2:"PLCH1" 3:"PLCH2"
uniq = 1:"PLCH2" 2:"PLCH1"
vals = 1:"INTS11" 2:"INTS11" 3:"INTS11" 4:"INTS11" 5:"INTS11"
uniq = 1:"INTS11"
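
For comparison, the same per-line deduplication can be sketched while keeping the split()-based parsing from the question. The function name uniq_names and the space-separated output format are assumptions made for this illustration, and it relies on an awk that accepts "delete seen" for a whole array, as the script above already does.

```shell
# Hypothetical helper: prints the unique names of each input line,
# in order of first appearance, separated by spaces.
uniq_names() {
    awk '{
        delete seen                      # forget names from the previous line
        out = ""
        n = split($0, a, "&")            # a[1..n] holds the NAME:VALUE pairs
        for (i = 1; i <= n; i++) {
            split(a[i], b, ":")          # b[1] is the name
            if ( !seen[b[1]]++ ) {       # true only the first time a name appears
                out = (out == "" ? "" : out " ") b[1]
            }
        }
        print out
    }' "$@"
}

printf '%s\n' 'PLCH2:A1007int&PLCH1:D987int&PLCH2:P977L' \
              'INTS11:P446P&INTS11:P449P&INTS11:P518P' | uniq_names
```

Called as uniq_names file.txt, it would read the file instead of standard input.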

Comments

Thanks for proposing an answer, I will try it out and come back to this. Btw, is seen a function?
No, seen[] is just the name of an array that I'm using to store the values from array[] as indices so I can identify which of those values has been seen before by its associated count being zero or non-zero (which I'm using as false or true in the if). seen[] COULD be named anything but that's the idiomatic name given to an array used for that purpose.
@Zen others had already rejected your proposed edit before I saw it in the queue but it had several issues anyway so I just posted how to really do this using the sample input you provided read from a file.
@AS seen[x]++ tests if the index x has been seen before. The first time seen[x]++ is tested the result is false since the value of seen[x] is zero, but the second (and subsequent times) it's true because the ++ that was executed after the condition was evaluated the first time set seen[x] to 1. Any time you see an array named seen[] THAT is what it's being used for (assuming the person who wrote the code understands the idiom).
@AS you're welcome. I expect so - post a new question specifically about whatever it is you want to know about and leave me a comment and I'll take a look. Just make sure to make it about performing a specific task and not "can someone explain this code" or similar and include a minimal reproducible example so it doesn't get downvoted and closed as off topic.
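
The behaviour of !seen[x]++ described in the comments above can be traced value by value. This is a rough sketch only: the function name trace_seen and the variable names before and first are invented for the illustration.

```shell
# Hypothetical helper: shows the count stored in seen[] before and after
# each test, and the result of the !seen[x]++ test itself.
trace_seen() {
    awk 'BEGIN {
        n = split("PLCH2 PLCH1 PLCH2", vals, " ")
        for (i = 1; i <= n; i++) {
            x = vals[i]
            before = seen[x] + 0     # count before this visit (0 if never seen)
            first = !seen[x]++       # 1 on the first visit, 0 afterwards
            print x, "before=" before, "first=" first, "after=" seen[x]
        }
    }'
}

trace_seen
```

The third line of output is the interesting one: the repeated PLCH2 already has a count of 1, so the test yields 0 and the value would be skipped.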