2

Recently, I had to sort several files by record ID. The catch is that there are several record types, and in each type the field I need to sort on sits at a different position. The fields are easy to identify, though, thanks to the key=value structure. A small sample of the general structure:

fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3

I came up with a pipeline as follows, which did the job:

awk -F'[|=]' '{for (i=1; i<NF; i++) if ($i == "id") {print $(i+1) "?" $0; break}}' tester.txt | sort -n | awk -F'?' '{print $2}'

In other words the algorithm is as follows:

  1. Split the record by both field and key-value separators (| and =)
  2. Iterate through the elements and search for the id key
  3. Print the next element (value of id key), a separator, and the whole line
  4. Sort numerically
  5. Remove prepended identifier to preserve records' structure
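To make steps 1 to 3 concrete, here is a minimal sketch of the "decorate" stage on its own, so the intermediate lines can be inspected before sorting. The function name decorate is hypothetical, and the sketch assumes that no value is literally the string id, since it compares every split element against the key.

```shell
# decorate: prefix each record with the value of its id key and a "?" separator.
# Assumes no value is literally "id" and that each record has one id key.
decorate() {
    awk -F'[|=]' '{
        for (i = 1; i < NF; i++)
            if ($i == "id") { print $(i + 1) "?" $0; break }
    }' "$@"
}

printf '%s\n' \
    'fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC' \
    'fieldD=valueD|recordType=B|id=1|fieldE=valueE' |
    decorate
# prints:
# 2?fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
# 1?fieldD=valueD|recordType=B|id=1|fieldE=valueE
```

Piping this through sort -n and then cut -d'?' -f2- completes steps 4 and 5. Using cut for the last step keeps the whole record even if it happens to contain another ? character.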

Processing the sample gives the output:

fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3

Is there a way, though, to do this with a single awk command?

4 Comments

  • Could you please post samples of the expected output in your question to make it better? Thank you. Commented May 18, 2022 at 13:59
  • Thanks for the suggestion. I've added the result of processing the sample. Commented May 18, 2022 at 14:07
  • Thanks for the edit. Could you please explain the logic for getting the expected output in more detail? Thank you. Commented May 18, 2022 at 14:19
  • I had to sort the records by the value of the id field (which doesn't have a fixed position), so I extracted that value by searching for the key, prepended it to the record, sorted the output, and removed the prepended identifier to get clean records. I've added my algorithm to the question; please check if it helps. Commented May 18, 2022 at 14:34

4 Answers

1

You may try this GNU awk code to do this in a single command:

awk -F'|' '{
   for(i=1; i<=NF; ++i)
      if ($i ~ /^id=/) {
         a[gensub(/^id=/, "", 1, $i)] = $0
         break
      }
}
END {
   PROCINFO["sorted_in"] = "@ind_num_asc"
   for (i in a)
      print a[i]
}' file

fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3

We use | as the field delimiter. When a field starts with id=, we store the full record in array a, indexed by the text after the =. Because a later record with the same id overwrites an earlier one, this assumes the ids are unique.

Setting PROCINFO["sorted_in"] = "@ind_num_asc" makes the for loop traverse array a in ascending numeric order of its indices, so printing each value gives the sorted output.
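For awk implementations without PROCINFO["sorted_in"], or when ids may repeat, one possible alternative is to let awk pipe its own output into sort. This is only a sketch: the function name sort_by_id is hypothetical, it assumes records contain no tab characters, and although only one awk process runs, sort and cut are still started as child processes.

```shell
# sort_by_id: print records ordered by the numeric value of their id= field.
# Assumes "|" separates fields and that records contain no tab characters.
sort_by_id() {
    awk -F'|' '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^id=/) {
                # substr($i, 4) drops the 3-character "id=" prefix
                print substr($i, 4) "\t" $0 | "sort -n | cut -f2-"
                break
            }
    }' "$@"
}

printf '%s\n' 'x=1|id=2' 'id=10|y=2' 'z=3|id=1' | sort_by_id
# prints:
# z=3|id=1
# x=1|id=2
# id=10|y=2
```

Since nothing is stored in an array, records that share an id are all kept rather than overwritten.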


1 Comment

Does the PROCINFO["sorted_in"] parameter affect all the arrays within current awk command?
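On the question in this comment: as far as I understand the gawk manual, PROCINFO["sorted_in"] is a global setting, so it governs every for (... in ...) loop that runs after it is set, whichever array is traversed. A small sketch to check this (gawk only; the function name show_sorted_in_scope and the arrays a and b are made up for illustration):

```shell
# show_sorted_in_scope: traverse two unrelated arrays after setting sorted_in once.
show_sorted_in_scope() {
    gawk 'BEGIN {
        a[10]; a[9]; b[11]; b[2]      # referencing an element creates it
        PROCINFO["sorted_in"] = "@ind_num_asc"
        for (k in a) out = out " " k  # first array
        for (k in b) out = out " " k  # second array, same traversal order rule
        print substr(out, 2)          # drop the leading space
    }'
}

# Run only where gawk is installed.
if command -v gawk >/dev/null 2>&1; then
    show_sorted_in_scope   # prints: 9 10 2 11
fi
```

Both arrays come out in ascending numeric index order, which suggests the setting is not tied to a particular array. To limit it to one loop, the value can be saved before the loop and restored afterwards.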
1

Using GNU awk for the 3rd arg to match() and sorted_in:

$ cat tst.awk
match($0,/(^|\|)id=([0-9]+)/,a) {
    ids2vals[a[2]] = $0
}
END {
    PROCINFO["sorted_in"] = "@ind_num_asc"
    for ( id in ids2vals ) {
        print ids2vals[id]
    }
}

$ awk -f tst.awk file
fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3


1

Try Perl: perl -e 'print map { s/^.*? //; $_ } sort { $a <=> $b } map { ($id) = /id=(\d+)/; "$id $_" } <>' file

Some explanation of the code I use:

print #print the resulting list of lines
    map {
        s/^.*? //;
        $_
    } #remove numeric id from start of line
    sort { $a <=> $b } #sort numerically
    map {
        ($id) = /id=(\d+)/;
        "$id $_"
    } # capture id and place it in start of line
    <> # read all lines from file

Or try sed and sort: sed 's/^\(.*id=\([0-9][0-9]*\).*\)$/\2 \1/' file | sort -n | sed 's/^[^ ][^ ]* //'


0

With your shown samples only, please try the following awk + sort + cut solution. It was written and tested in GNU awk and should work in any awk.

awk '
match($0,/id=[0-9]+/){
  print substr($0,RSTART,RLENGTH)";"$0
}
' Input_file | sort -t'=' -k2n | cut -d';' -f2-

Explanation: a detailed, line-by-line explanation of the above code.

awk '                                   ##Start the awk program.
match($0,/id=[0-9]+/){                  ##Use match() to find id= followed by digits.
  print substr($0,RSTART,RLENGTH)";"$0  ##Print the matched substring, a semicolon, and the current line.
}
' Input_file    |                       ##Read Input_file and pipe awk's output to the next command.
sort -t'=' -k2n |                       ##Sort numerically on the 2nd field, using = as the delimiter, and pipe on.
cut -d';' -f2-                          ##Split on ; and print everything from the 2nd field onwards.
