Process non common csv file with awk using field patterns

Question

My bank sends a non common CSV file that uses ; as the field separator and a special byte (hexadecimal a0, octal 240) to enclose the fields in which a ; could occur, as shown below:

Input

Extrait;Date;Date valeur;Compte;Description;Montant;Devise
�2020/0001/0002�;29.02.2020;29.02.2020;-;�28/02/20 Some shop in Antwerp     A Antwerpen (BE)�;-16,50;EUR
�2020/0001/0001�;01.02.2020;01.02.2020;-;�31/01/20 Some shop in Zaventem    Z Zaventem (BE)�;-13,00;EUR

I need to process fields 2, 5 and 6 with AWK.

Desired output

{Date}{Description}{Montant}
{29.02.2020}{28/02/20 Some shop in Antwerp     A Antwerpen (BE)}{-16,50}
{01.02.2020}{31/01/20 Some shop in Zaventem    Z Zaventem (BE)}{-13,00}

So far, as long as the fields enclosed by � do not contain any ;, the script below, which uses the FPAT variable, works:

#!/usr/bin/awk -f
BEGIN { 
  FS=";"
  FPAT="[^;]*"                        # this works but not in all cases
  #FPAT="([^;]*)|(\240[^\240]+\240)"  # this doesn't work
}
{ gsub (/\240/, "", $5)               # I wish I could skip this instruction too
  print "{" $2 "}{" $5 "}{" $6 "}" 
}

I found a similar case (see awk FPAT to ignore commas in csv) but changing the , into ; and the \" into \240 didn't do the trick.

I need help writing an FPAT pattern that parses my CSV file correctly in all cases.

Note that the CSV format isn't a standard, and even if the comma as separator and the double quote as protection character are more usual, there's nothing wrong with using a semicolon and the non-breaking space. Also, be aware that the non-breaking space, and probably the whole file, is encoded in ISO8859-1 and not in UTF-8.
– Casimir et Hippolyte, Commented Apr 14, 2020 at 11:10
@CasimiretHippolyte: OK, I will edit my question in order to change "non standard" into "non common". I don't know whether the file is encoded in UTF-8 or ISO8859-1, because I see no accented letters.
– Pierre François, Commented Apr 14, 2020 at 12:32
@CasimiretHippolyte: indeed, I saw in another file that my bank encodes in ISO8859-1. If I convert the file to UTF-8, I get the sequence \xc2\xa0 instead of \xa0, which I can't use in the FPAT proposed by anubhava. I will have to find a workaround...
– Pierre François, Commented Apr 14, 2020 at 14:16
Nothing prevents you from encoding the output of anubhava's script to UTF-8 afterwards.
– Casimir et Hippolyte, Commented Apr 14, 2020 at 19:53
Also, if you choose to convert your file beforehand, you can change FPAT to [^;\xc2]+(\xc2[^\xa0][^;\xc2]*)*|(\xc2[^\xa0][^;\xc2]*)+ (without the typo)
– Casimir et Hippolyte, Commented Apr 15, 2020 at 11:52

anubhava · Accepted Answer · 2020-04-14 10:55:03Z

You may use this GNU awk command with FPAT:

awk -v FPAT='[^;\xa0]+' '{printf "{%s}{%s}{%s}\n", $2, $5, $6}' file

{Date}{Description}{Montant}
{29.02.2020}{28/02/20 Some shop in Antwerp     A Antwerpen (BE)}{-16,50}
{01.02.2020}{31/01/20 Some shop in Zaventem    Z Zaventem (BE)}{-13,00}

-v FPAT='[^;\xa0]+' sets the field pattern to one or more characters that are neither ; nor \xa0.

3 Comments

Pierre François Over a year ago

It works, thank you, but I still need to add the statement gsub (/\xa0/, "", $5) to get rid of the binary characters. I replaced the + with * in FPAT so that empty fields are matched too.

Pierre François Over a year ago

On reflection, I think that setting FPAT to [^;\xa0]* will split a string enclosed in \xa0 into two parts when it contains a ;, which is what I wanted to avoid.

anubhava Over a year ago

Since the regex is [^;\xa0]+, it treats ;\xa0 as a single delimiter, not two. However, I could not really work out where the \xa0 bytes are in the input shown in your question.
