1

My bank sends a non common CSV file using ; as field separator and a binary code (hexadecimal a0 or octal 240) to enclose the fields where a ; could occur, as below:

Input

Extrait;Date;Date valeur;Compte;Description;Montant;Devise
�2020/0001/0002�;29.02.2020;29.02.2020;-;�28/02/20 Some shop in Antwerp     A Antwerpen (BE)�;-16,50;EUR
�2020/0001/0001�;01.02.2020;01.02.2020;-;�31/01/20 Some shop in Zaventem    Z Zaventem (BE)�;-13,00;EUR

I need to process fields 2, 5 and 6 with AWK.

Desired output

{Date}{Description}{Montant}
{29.02.2020}{28/02/20 Some shop in Antwerp     A Antwerpen (BE)}{-16,50}
{01.02.2020}{31/01/20 Some shop in Zaventem    Z Zaventem (BE)}{-13,00}

Up to now, as long as the fields enclosed by do not contain any ; the script below using the variable FPAT works:

#!/usr/bin/awk -f
BEGIN { 
  FS=";"
  FPAT="[^;]*"                        # this works but not in all cases
  #FPAT="([^;]*)|(\240[^\240]+\240)"  # this doesn't work
}
{ gsub (/\240/, "", $5)               # I wish I could skip this instruction too
  print "{" $2 "}{" $5 "}{" $6 "}" 
}

I found a similar case (see awk FPAT to ignore commas in csv) but changing the , into ; and the \" into \240 didn't do the trick.

I need help for implementing a FPAT pattern to scan correctly my CSV file in all cases.

5
  • Note that the csv format isn't a standard and even if the comma as separator and the double quote as protection character are more usual, there's nothing wrong to use a semi-colon and the non-breakable space. Also, take care the non-breakable space, and probably all the file, is written with an ISO8859-1 encoding and not in UTF-8. Commented Apr 14, 2020 at 11:10
  • @CasimiretHippolyte: OK, I will edit my question in orde to change "non standard" into "non common". I don't know if the file is encoded UTF-8 or ISO8859-1, because I see no letters with accents. Commented Apr 14, 2020 at 12:32
  • @CasimiretHippolyte: indeed, I saw inside of another file that my bank is encoding according to ISO8859-1. If I convert the file to UTF-8, I get the sequence \xc2\xa0 instead of \xa0, which I can't use in the FPAT proposed by anubhava. I will have to find a workaround... Commented Apr 14, 2020 at 14:16
  • 1
    Nothing forbids to encode the result of anubhava script to UTF-8 after. Commented Apr 14, 2020 at 19:53
  • Also if you choose to convert your file before, you can change FPAT to [^;\xc2]+(\xc2[^\xa0][^;\xc2]*)*|(\xc2[^\xa0][^;\xc2]*)+ (without the typo) Commented Apr 15, 2020 at 11:52

1 Answer 1

2

You may use this gnu awk with FPAT:

awk -v FPAT='[^;\xa0]+' '{printf "{%s}{%s}{%s}\n", $2, $5, $6}' file

{Date}{Description}{Montant}
{29.02.2020}{28/02/20 Some shop in Antwerp     A Antwerpen (BE)}{-16,50}
{01.02.2020}{31/01/20 Some shop in Zaventem    Z Zaventem (BE)}{-13,00}

-v FPAT='[^;\xa0]+' sets field pattern as 1+ of any character that is not ; and not \xa0.

Sign up to request clarification or add additional context in comments.

3 Comments

It works, thank you, but I still need to add a statement gsub (/\xa0/, "", $5) to get rid of the binary chars. I changed the + by * in FPAT for matching also empty fields.
After all, I think that setting FPAT to [^;\xa0]* will split a string enclosed into \xa0 in two parts when it contains a ;, which is what I wanted to avoid.
Since regex is [^;\xa0]+ it will make a ;\xa0 a single delimiter not 2 delimiters. However I could not really figure out positions of \xa0 from your input in question.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.