10

I'm using awk to urldecode some text.

If I hard-code the string in the printf statement, as in printf "%s", "\x3D", it correctly outputs =. The same happens if I have the whole escaped string in a variable.

However, if I only have the 3D, how can I prepend the \x so that printf prints = and not \x3D?

I'm using busybox awk 1.4.2 and the ash shell.

6 Answers

4

I don't know how you do this in awk, but it's trivial in perl:

echo "http://example.com/?q=foo%3Dbar" | 
    perl -pe 's/\+/ /g; s/%([0-9a-f]{2})/chr(hex($1))/eig'

4 Comments

Thanks, but perl isn't available.
@zwol This only works on Perl 5 if you escape the + with a backslash! BTW, works fine for me with sample URLs without the s/\+/ /g part at all! The second regex alone will do the trick already.
@syntaxerror You're quite right about the + needing to be escaped, don't know how I missed that. I think the ?q=phrase+separated+by+plus+signs notation has gotten less common since I wrote this but it's still part of the spec for application/x-www-form-urlencoded escaping of form submissions.
Oh, you're right, I forgot about those form submissions. However, since my main aim is fixing "garbled" download links, the most important thing is to get rid of all this %20, %3D and %3F (et al) stuff in the first place.
3

GNU awk

#!/usr/bin/awk -nf
@include "ord"
BEGIN {
  RS = "%.."
}
{
  # Pass the data as an argument; as the format string, a decoded "%" would break printf
  printf "%s", (RT ? $0 chr("0x" substr(RT, 2)) : $0)
}

Or

#!/bin/sh
awk -niord '{printf "%s", (RT ? $0 chr("0x" substr(RT,2)) : $0)}' RS=%..

Decoding URL encoding (percent encoding)

1 Comment

This garbles e.g. UTF-8-encoded non-ASCII characters
2

Since you're using ash and Perl isn't available, I'm assuming that you may not have gawk.

For me, using gawk or busybox awk, your second example works the same as the first (I get "=" from both) unless I use the --posix option (in which case I get "x3D" for both).

If I use --non-decimal-data or --traditional with gawk I get "=".

What version of AWK are you using (awk, nawk, gawk, busybox - and version number)?

Edit:

You can coerce the variable's string value into a numeric one by adding zero:

~/busybox/awk 'BEGIN { string="3D"; pre="0x"; hex=pre string; printf "%c", hex+0}'
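
Whether "0x3D" + 0 really yields 61 depends on how a given awk converts strings to numbers (gawk, for one, only treats such strings as hexadecimal with --non-decimal-data). As a rough sketch that avoids that dependency, the hex digits can be converted by hand. The names hex_to_char and hex2dec are made up for this example, not built-ins:

```sh
#!/bin/sh
# Print the character for a bare hex pair such as "3D" without relying on
# awk understanding "0x..." strings. hex_to_char and hex2dec are
# hypothetical helper names.
hex_to_char() {
    awk -v h="$1" '
    function hex2dec(s,    i, n, d) {
        n = 0
        s = tolower(s)
        for (i = 1; i <= length(s); i++) {
            d = index("0123456789abcdef", substr(s, i, 1)) - 1
            if (d < 0) return -1        # not a hex digit
            n = n * 16 + d
        }
        return n
    }
    BEGIN { n = hex2dec(h); if (n >= 0) printf "%c", n }'
}

hex_to_char 3D; echo    # prints =
```

Only ASCII values are shown here; what %c produces for values above 127 can vary with the awk implementation and the locale.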

2 Comments

You're right, it does work. I asked the wrong question - I'll amend it. (I'm using busybox awk, version 1.4.2)
Took me quite a while to realize this one-liner handles a single variable only, not a whole URL-encoded string (e.g. a web address full of %20 and %3F escapes).
1

This relies on GNU awk's extension to the split function (the fourth argument, seps, which receives the matched separators), but it works:

gawk '{ numElems = split($0, arr, /%../, seps);
        outStr = ""
        for (i = 1; i <= numElems - 1; i++) {
            outStr = outStr arr[i]
            outStr = outStr sprintf("%c", strtonum("0x" substr(seps[i],2)))
        }
        outStr = outStr arr[i]
        print outStr
      }'

1

To start with, I'm aware this is an old question, but none of the answers worked for me (I'm restricted to busybox awk).

Two options. To parse stdin:

awk '{for (y=0;y<127;y++) if (y!=37) gsub(sprintf("%%%02x|%%%02X",y,y), y==38 ? "\\&" : sprintf("%c", y));gsub(/%25/, "%");print}'

To take a command line parameter:

awk 'BEGIN {for (y=0;y<127;y++) if (y!=37) gsub(sprintf("%%%02x|%%%02X",y,y), y==38 ? "\\&" : sprintf("%c", y), ARGV[1]);gsub(/%25/, "%", ARGV[1]);print ARGV[1]}' parameter

%25 has to be handled last; otherwise strings like %253D get decoded twice, which shouldn't happen.

The inline check for y==38 is because gsub treats & as a special character unless you backslash it.
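
Both caveats can be checked in isolation. The sketch below uses made-up sample inputs and hypothetical function names (amp_bare, amp_escaped, percent_first, percent_last) purely for illustration:

```sh
#!/bin/sh
# A bare "&" in a gsub replacement stands for the matched text.
amp_bare()    { awk '{ gsub(/%26/, "&");   print }'; }
amp_escaped() { awk '{ gsub(/%26/, "\\&"); print }'; }

# %25 decoded first versus last.
percent_first() { awk '{ gsub(/%25/, "%"); gsub(/%3D/, "="); print }'; }
percent_last()  { awk '{ gsub(/%3D/, "="); gsub(/%25/, "%"); print }'; }

echo 'a%26b' | amp_bare        # prints a%26b (the match is put straight back)
echo 'a%26b' | amp_escaped     # prints a&b
echo '%253D' | percent_first   # prints = (decoded twice)
echo '%253D' | percent_last    # prints %3D
```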

1

This one is the fastest of them all by a large margin and it doesn't need gawk:

#!/usr/bin/mawk -f

function decode_url(url,            dec, tmp, pre, mid, rep) {
    tmp = url
    while (match(tmp, /%[0-9a-fA-F][0-9a-fA-F]/)) {   # hex digits only
        pre = substr(tmp, 1, RSTART - 1)
        mid = substr(tmp, RSTART + 1, RLENGTH - 1)
        rep = sprintf("%c", ("0x" mid) + 0)
        dec = dec pre rep
        tmp = substr(tmp, RSTART + RLENGTH)
    }
    return dec tmp
}

{
    print decode_url($0)
}

Save it as decode_url.awk, make it executable, and run it like any other script, e.g.:

$ ./decode_url.awk <<< 'Hello%2C%20world%20%21'
Hello, world !

But if you want an even faster version:

#!/usr/bin/mawk -f

function gen_url_decode_array(      i, n, c) {
    delete decodeArray
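    # Cover every ASCII code, not only 32-63, so that escapes such as
    # %41 (A) or %7E (~) are decoded too
    for (i = 1; i < 128; ++i) {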
        c = sprintf("%c", i)
        n = sprintf("%%%02X", i)
        decodeArray[n] = c
        decodeArray[tolower(n)] = c
    }
}

function decode_url(url,            dec, tmp, pre, mid, rep) {
    tmp = url
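    while (match(tmp, /%[0-9a-fA-F][0-9a-fA-F]/)) {
        pre = substr(tmp, 1, RSTART - 1)
        mid = substr(tmp, RSTART, RLENGTH)
        # Leave escapes that are not in the table unchanged instead of dropping them
        rep = (mid in decodeArray) ? decodeArray[mid] : mid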
        dec = dec pre rep
        tmp = substr(tmp, RSTART + RLENGTH)
    }
    return dec tmp
}

BEGIN {
    gen_url_decode_array()
}

{
    print decode_url($0)
}

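Interpreters other than mawk should have no problem with the second version. The first one depends on "0x.." strings converting to numbers as hexadecimal, which not every awk does by default (gawk needs --non-decimal-data for that).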
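
As a sketch of the same single-pass loop that does not depend on how the interpreter converts "0x" strings, the hex pair can be converted by hand. The wrapper name urldecode and the helper hexval are placeholders invented for this example, and only ASCII escapes are exercised; how %c renders values above 127 can depend on the awk and the locale.

```sh
#!/bin/sh
# Single-pass percent-decoder using only POSIX awk features.
# urldecode and hexval are hypothetical names.
urldecode() {
    awk '
    # Value of one hex digit (0-15), or -1 if c is not a hex digit.
    function hexval(c) { return index("0123456789abcdef", tolower(c)) - 1 }

    function decode_url(s,    out, code) {
        out = ""
        while (match(s, /%[0-9a-fA-F][0-9a-fA-F]/)) {
            code = hexval(substr(s, RSTART + 1, 1)) * 16 + hexval(substr(s, RSTART + 2, 1))
            out = out substr(s, 1, RSTART - 1) sprintf("%c", code)
            # Continue after the escape; decoded text is never rescanned.
            s = substr(s, RSTART + RLENGTH)
        }
        return out s
    }
    { print decode_url($0) }'
}

echo 'Hello%2C%20world%20%21' | urldecode   # prints Hello, world !
echo 'q=foo%3Dbar%2542'       | urldecode   # prints q=foo=bar%42
```

Because the remaining input is cut off after each match and the decoded part is only appended to the output, an escaped percent sign such as %25 is decoded exactly once.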

1 Comment

Careful with your decode_url function. If you give it %2542 it'll convert it to %42 on the first pass and then to B on the second pass, which is not correct. %25 should be decoded to %, and the returned result should be %42. See also RFC 3986, section 2.4: "Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding".
