I'm processing a large mbox file in order to analyse the mail traffic of [email protected]. The file has already been converted to CSV with 11 columns. The number of required replacements is large (>25), and it works just fine with awk's `gsub` function. But I just realised that the replacements should be performed only on columns $3, $7 and $9, and I would like to find an optimal way to do that.
The CSV file is delimited with ;
A newline can appear between delimiters. Typically a newline inside a field is indicated by ?= at the end of the line and =? at the beginning of the next line. For example, here is a header line, an empty line and one row of data:
Message-ID;X-GM-THRID;X-Gmail-Labels;X-Google-Original-Date;Date;From;To;Subject;X-Spam-Flag;HasAttachment;AttachmentNames
<[email protected]>;1649279601489016232;"=?UTF-8?Q?Archived,Important,Opened,Category_?=
=?UTF-8?Q?Personal,kupci/cb-ac,naro=C4=8Dila-kupcev?=";;Mon, 4 Nov 2019 14:53:14 +0100;<[email protected]>;=?iso-8859-2?Q?acme_naro=E8ilo?= <[email protected]>;=?iso-8859-2?Q?NARO=C8ILO_7209661?=;;True;ACME 7096_2019.pdf
My task is to clean the data. Specifically, the row above should become:
Message-ID;X-GM-THRID;X-Gmail-Labels;X-Google-Original-Date;Date;From;To;Subject;X-Spam-Flag;HasAttachment;AttachmentNames
[email protected];1649279601489016232;Archived,Important,Opened,Category Personal,kupci/cb-ac,naročila-kupcev;;Mon, 4 Nov 2019 14:53:14 +0100;[email protected];acme naročilo [email protected];NAROČILO 7209661;;True;ACME 7096 2019.pdf
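The line-continuation convention described above could be undone before any per-field work. The following is only a sketch: the three-line sample and the variable names are made up for illustration, and it assumes Unix line endings and that no complete record ends with ?= while the next record starts with =?.

```shell
# Hypothetical sample: a header, then one record split across two lines.
sample='id;labels;subject
<1>;"=?UTF-8?Q?Archived,Category_?=
=?UTF-8?Q?Personal?=";=?UTF-8?Q?hello?='

joined=$(printf '%s\n' "$sample" | awk '
    # Hold each line back until the next line shows whether it continues it.
    NR > 1 {
        if (prev ~ /\?=$/ && $0 ~ /^=\?/) {   # newline inside a field
            prev = prev $0                    # glue the two pieces together
            next
        }
        print prev
    }
    { prev = $0 }
    END { if (NR) print prev }
')
printf '%s\n' "$joined"
```

The pieces are glued without a separator, so the record ends up on one line and can then be split on ; in the usual way.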
Currently I run the command:
awk -f replacements.awk email.csv > newEmail.csv
File replacements.awk looks like this:
{
gsub("_"," ");
gsub("20="," ");
gsub(/=\?/,"");  # regex literal: the string "=?" is read as "optional =" and would delete every =
gsub(/\?=/,"");
gsub("_"," ");
gsub("<","");
gsub(">","");
gsub(/"/,"");
...
print
}
I would like to have replacements.awk written in a way that I don't need to repeat the gsub statements three times in order to replace strings in three columns.
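One possible shape for this, as a sketch rather than a definitive answer: keep all substitutions in a single user-defined function and call it for each wanted field. The function name `clean`, the `cols` array and the sample line are hypothetical. The sketch assumes each record sits on one line and that no field contains a literal ; inside quotes.

```shell
cleaned=$(printf '%s\n' '<id_1>;x;"=?a_b?=";4;5;6;<to_c>;8;"d_e";10;11' | awk '
    BEGIN {
        FS = OFS = ";"
        ncols = split("3 7 9", cols, " ")   # fields to clean
    }
    # All substitutions live in one place; s is a local copy of the field.
    function clean(s) {
        gsub(/=\?/, "", s)
        gsub(/\?=/, "", s)
        gsub(/_/, " ", s)
        gsub(/[<>"]/, "", s)
        # ... the remaining replacements would go here
        return s
    }
    {
        for (i = 1; i <= ncols; i++) {
            c = cols[i] + 0
            if (c <= NF) $c = clean($c)     # skip fields a short line lacks
        }
        print
    }
')
printf '%s\n' "$cleaned"
```

Only fields 3, 7 and 9 change here; field 1 keeps its angle brackets and underscore. Saved without the shell wrapper, the awk program could be run as before with awk -f replacements.awk email.csv > newEmail.csv.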
Thanks
You use `gsub` and DO NOT mention any field number to it, so it simply performs the substitutions on the whole line. So could you please confirm whether you want to perform the substitutions only on 3 fields (3rd, 7th and 9th)?

At first I used `gsub("=C5=BE","ž")`, but later I realised 'ž' can be encoded in a shorter way as well: `gsub("C5BE","ž")`. Then I noticed that 'C5BE' could be part of an ID string, which means that I need to replace only in specific columns (fields). Bottom line: yes, I need to perform the substitutions only on the 3rd, 7th and 9th columns (fields).

`foo 2>=30=` will turn, due to gsub3, into `foo 2>0=` and, due to gsub6, into `foo 20=`, which then, if you run it again, would turn into `foo` due to gsub2. If you ran all the substitutions in one go (as is done by the examples), it would instead turn into `foo 2=`. We need more information to be able to understand what you want.
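The order dependence raised in the last comment can be reproduced with a small sketch. The input string and the two-rule program are made up for illustration; they only mimic the 20= and > rules from the question.

```shell
# Two sequential rules: replace "20=" with a space, then delete ">".
rules='{ gsub(/20=/, " "); gsub(/>/, ""); print }'

first=$(printf '%s\n' 'foo 2>0=' | awk "$rules")    # deleting ">" creates a new "20="
second=$(printf '%s\n' "$first" | awk "$rules")     # a second run now replaces it
printf '[%s]\n[%s]\n' "$first" "$second"
```

The first run leaves foo 20= behind and the second run changes it again, so a chain of gsub calls is not guaranteed to be stable when one replacement can produce text that an earlier rule matches.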