I'm processing a large mbox file in order to analyse the mail traffic of [email protected]. The file has already been converted to CSV with 11 columns. The number of required replacements is large (>25), and it works just fine with awk's gsub function. But I just realised that the replacements should be performed only on columns $3, $7 and $9, and I would like to find an optimal way to do that.

The CSV file is delimited with ; and a newline can appear between delimiters. Typically a newline inside a field is indicated by ?= at the end of the line and =? at the beginning of the next line. For example, here are a header line, an empty line and one row of data:

Message-ID;X-GM-THRID;X-Gmail-Labels;X-Google-Original-Date;Date;From;To;Subject;X-Spam-Flag;HasAttachment;AttachmentNames

<[email protected]>;1649279601489016232;"=?UTF-8?Q?Archived,Important,Opened,Category_?=
=?UTF-8?Q?Personal,kupci/cb-ac,naro=C4=8Dila-kupcev?=";;Mon, 4 Nov 2019 14:53:14 +0100;<[email protected]>;=?iso-8859-2?Q?acme_naro=E8ilo?= <[email protected]>;=?iso-8859-2?Q?NARO=C8ILO_7209661?=;;True;ACME 7096_2019.pdf

My task is to clean the data. Specifically, the row above should become:

Message-ID;X-GM-THRID;X-Gmail-Labels;X-Google-Original-Date;Date;From;To;Subject;X-Spam-Flag;HasAttachment;AttachmentNames

[email protected];1649279601489016232;Archived,Important,Opened,Category Personal,kupci/cb-ac,naročila-kupcev;;Mon, 4 Nov 2019 14:53:14 +0100;[email protected];acme naročilo [email protected];NAROČILO 7209661;;True;ACME 7096 2019.pdf

Currently I run the command:

awk -f replacements.awk email.csv > newEmail.csv

File replacements.awk looks like this:

{
  gsub("_"," ");
  gsub("20="," "); 
  gsub("=?","");   
  gsub(/\?=/,"");  
  gsub("_"," ");
  gsub("<","");
  gsub(">","");
  gsub(/"/,"");
  ...
  print
 }

I would like to write replacements.awk in a way that doesn't require repeating the gsub statements three times in order to replace strings in three columns.

Thanks

  • You mentioned that substitution should happen only on the 3rd, 7th and 9th columns, but when you call gsub WITHOUT passing a field to it, it performs the substitutions on the whole line. Could you please confirm that you want to perform the substitutions only on those 3 fields (3rd, 7th and 9th)? Commented Nov 11, 2019 at 11:27
  • Edit your question to show concise, testable sample input and expected output so we can help you. See How to Ask if that's not clear. Commented Nov 11, 2019 at 12:34
  • Initially I was replacing in the whole file. For instance, diacritical characters like 'ž' are encoded in a specific way: gsub("=C5=BE","ž"). Later I realised that 'ž' can be encoded in a shorter way as well: gsub("C5BE","ž"). Then I noticed that 'C5BE' could be part of an ID string, which means that I need to replace only in specific columns (fields). Bottom line: yes, I need to perform the substitutions only on the 3rd, 7th and 9th columns (fields). Commented Nov 11, 2019 at 12:37
  • Fine. See my earlier comment for how to best get help. Commented Nov 11, 2019 at 12:48
  • You seem to be doing the substitutions one by one. This actually has an important effect. For example, imagine your string is foo 2>=30=. It will turn, due to gsub3, into foo 2>0= and, due to gsub6, into foo 20=, which then, if you ran it again, would turn into foo due to gsub2. Whereas if you ran all the substitutions in one go (as the examples do), it would turn into foo 2=. We need more information to understand what you want. Commented Nov 11, 2019 at 13:35
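
The point about ordering in the last comment can be reproduced with a small experiment. This is only a sketch: the input string `a=<?b` is made up for illustration and does not come from the mail data. It compares one gsub call per pattern against a single call that uses alternation.

```shell
# Hypothetical input chosen so that the two approaches disagree.
input='a=<?b'

# One gsub per pattern: removing "<" first creates a new "=?" for the next call.
sequential=$(printf '%s\n' "$input" | awk '{ gsub(/</, ""); gsub(/=\?/, ""); print }')

# One gsub with alternation: every match is looked up in the original text.
combined=$(printf '%s\n' "$input" | awk '{ gsub(/=\?|</, ""); print }')

printf 'sequential: %s\ncombined:   %s\n' "$sequential" "$combined"
```

The sequential version prints ab, while the combined version leaves a=?b, because the "=?" only comes into existence after the "<" has been removed.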

3 Answers

1

It sounds like this might be what you want:

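awk '
BEGIN {
    FS = OFS = ";"          # the CSV is ";"-delimited; keep ";" on output too
    split("3 7 9", tgts)    # field numbers to clean
}
{
    for (i in tgts) {
        tgt = tgts[i]
        gsub(/_|20=/, " ", $tgt)            # these become a space
        gsub(/=\?|\?=|[<>"]/, "", $tgt)     # these are removed
    }
    print
}
' file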

but without sample input/output it's just an untested guess.
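
The question also mentions that a field may continue on the next line, with ?= ending one line and =? starting the next. Per-field substitution only sees whole records if such lines are rejoined first. The following is a rough sketch of one way to do that, not a tested solution for the real file: the three input lines are invented, and the rule "previous line ends in ?= and this line starts with =?" is an assumption that would wrongly merge two separate records that happen to meet at such a boundary.

```shell
# Made-up three-line input: the first record is split inside its quoted field.
joined=$(printf '%s\n' 'a;"=?Q?one_?=' '=?Q?two?=";c' 'd;e;f' | awk '
{
    # Glue this line onto the held one if the break looks like a continuation.
    if (have && prev ~ /\?=$/ && $0 ~ /^=\?/) { prev = prev $0; next }
    if (have) print prev        # the held record is complete
    prev = $0; have = 1
}
END { if (have) print prev }
')
printf '%s\n' "$joined"
```

The output of this step could then be piped into the field-cleaning script, so that the joined record is split on ; as a single line.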


2 Comments

Be aware that combining the substitutions might lead to different results than having separate substitutions. We do not know if the separate substitutions in the original problem were intended or not.
Right, as I mentioned, without sample input/output it's just an untested guess.
1

Since you haven't shown samples of your Input_file and expected output, I couldn't test this. You have multiple global substitutions that replace a regex/string with either a space or an empty string, so each group can be combined.

I have combined all the regexes that are replaced by a space into one call, and all those replaced by an empty string into another, as follows.

gsub(/_|20=/," "); gsub(/=\?|\?=|<|>|"/,"")

You can use | (OR) to list multiple regexps in a single gsub. I took all the regexps from the samples you showed; if you have more, you can combine them in the same way.
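
One thing to watch when collecting the patterns: ? is a regex quantifier, so the escaped form =\? used above does not behave like the question's gsub("=?",""). A small sketch with a made-up string shows the difference:

```shell
input='a=?b=c'

# The string "=?" is a dynamic regex meaning "an optional =", so every "=" goes.
unescaped=$(printf '%s\n' "$input" | awk '{ gsub("=?", ""); print }')

# /=\?/ matches only the literal two-character sequence "=?".
escaped=$(printf '%s\n' "$input" | awk '{ gsub(/=\?/, ""); print }')

printf 'unescaped: %s\nescaped:   %s\n' "$unescaped" "$escaped"
```

The unescaped call yields a?bc (the question mark survives and the unrelated = is lost), while the escaped one yields ab=c.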



EDIT: Adding an example that performs multiple gsub operations on multiple fields. Let's say the following is the Input_file. This is just an example; you need to adjust it to your own Input_file.

cat Input_file
1 23_?=??": bla bla bla
1 23_?=??": bla bla bla
1 23_?=??": bla bla bla
1 23_?=??": bla bla bla
1 23_?=??": bla bla bla
1 23_?=??": bla bla bla
1 23_?=??": bla bla bla

Now following is the solution.

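awk '
function remove(fields,    num, array, i) {      # num, array, i are locals
  num = split(fields, array, ",")               # "2,3" -> array[1]=2, array[2]=3
  for (i = 1; i <= num; i++) {
    gsub(/=\?|\?=|<|>|"/, "", $(array[i]))      # clean the listed field, not field i
  }
}
{ remove("2,3") }
1
' Input_file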

Above, remove("2,3") calls the function named remove, and 2,3 means the gsub operation is performed on the 2nd and 3rd fields. This is only an example of the substitution; you need to adapt it to your code, or you can take it as a starting point.

1 Comment

Be aware that combining the substitutions might lead to different results than having separate substitutions. We do not know if the separate substitutions in the original problem were intended or not.
1
  • consolidate multiple replacement patterns into a single combination using either regex alternation group ..|.. or character class [...]
  • move common substitutions to a custom function that will accept a column as an argument

BEGIN { FS = OFS = ";" }    # split on ";" and keep ";" when a field is modified
function sub_col(col) {
    gsub(/[<>"]|\?=|=\?/, "", $col)
    gsub(/_|20=/, " ", $col)
}
{
    sub_col(3); sub_col(7); sub_col(9)
    print
}
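
When a function like this is applied to the semicolon-delimited file, note that assigning to a field makes awk rebuild the record using OFS, so both FS and OFS need to be ;. The following is a minimal sketch under stated assumptions: the 11-column row is invented for illustration, and it assumes no ; occurs inside a field.

```shell
# Made-up 11-column row; only columns 3, 7 and 9 should be cleaned.
row='id_1;x;"=?Archived,_Important?=";d;e;f;<acme_order>;h;spam_flag;j;k'

cleaned=$(printf '%s\n' "$row" | awk '
BEGIN { FS = OFS = ";" }    # assumption: no ";" occurs inside a field
function sub_col(col) {
    gsub(/[<>"]|\?=|=\?/, "", $col)
    gsub(/_|20=/, " ", $col)
}
{ sub_col(3); sub_col(7); sub_col(9); print }
')
printf '%s\n' "$cleaned"
```

This prints id_1;x;Archived, Important;d;e;f;acme order;h;spam flag;j;k, so the underscore in the first column is left alone while the three target columns are cleaned.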

1 Comment

Be aware that combining the substitutions might lead to different results than having separate substitutions. We do not know if the separate substitutions in the original problem were intended or not.
