Ignoring commas within fields using AWK when there are multiple field separators

Question

I want to parse CSV records like the one below with awk or gawk.

The fields are separated by commas but the last field ($6) is special because it really consists of subfields. These subfields are separated by # as the field separator (or, to be precise, ". # "). This in itself is not a problem: I can use awk -F'(,)|(. # )' to set alternative field separators.

However, there are stray commas in this last field as well that need to be ignored.

Is there a way to solve this with awk, perhaps using FPAT?

Sample record:

  "http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab","http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab.0002","EU:C:1985:443","61984CJ0239","Gerlach","Judgment of the Court (Third Chamber) of 24 October 1985. # Gerlach & Co. BV, Internationale Expeditie, v Minister van Economische Zaken. # Reference for a preliminary ruling: College van Beroep voor het Bedrijfsleven - Netherlands. # Article 41 ECSC - Anti-dumping duties. # Case 239/84."

That doesn't work, because awk treats every comma as the end of a field. E.g. in the sample record there will only be two subfields in $6, and then the comma after "BV" means there's suddenly a $7, etc.
– Timothy Roes, commented Mar 10, 2021 at 13:19
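
The failure described in the comment above can be reproduced on a much shorter record. This is only an illustrative sketch: the two-field record is made up, and the period in the separator is written as [.] so that it matches a literal dot in any awk.

```shell
# Hypothetical shortened record: one plain field, then a quoted field holding
# three subfields separated by ". # ", with a stray comma inside the second.
rec='"a","x. # y, z. # w"'

# Alternation separator: a comma OR ". # " ends a field, so the stray comma
# after "y" ends one too and the record falls apart into five pieces.
fields=$(printf '%s\n' "$rec" |
   awk -F'(,)|([.] # )' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')
printf '%s\n' "$fields"
```

The intended result would be four pieces (one plain field plus three subfields); the separator alone cannot tell the stray comma from a real one.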

anubhava · Accepted Answer · 2021-03-10 19:42:15Z


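Using the FPAT feature of GNU awk, you can do this. FPAT describes what a field looks like rather than what separates fields; here it matches either a double-quoted string or a run of non-comma characters, so commas inside quotes no longer end a field. The first loop prints every field except the last and leaves i equal to NF. The last field, $NF, is then split on the regex /\. # / and its tokens are printed with the numbering continuing from NF.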

s='"http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab","http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab.0002","EU:C:1985:443","61984CJ0239","Gerlach","Judgment of the Court (Third Chamber) of 24 October 1985. # Gerlach & Co. BV, Internationale Expeditie, v Minister van Economische Zaken. # Reference for a preliminary ruling: College van Beroep voor het Bedrijfsleven - Netherlands. # Article 41 ECSC - Anti-dumping duties. # Case 239/84."'

awk -v FPAT='"[^"]*"|[^,]+' '{
   # loop through all fields except the last one; i equals NF afterwards
   for (i=1; i<NF; ++i)
      print i, $i
   # split the last field once on the /\. # / regex; n is the token count
   n = split($NF, a, /\. # /)
   # print every token, including the final one, numbering on from NF
   for (j=1; j<=n; ++j)
      print i+j-1, a[j]
}' <<< "$s"

1 "http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab"
2 "http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab.0002"
3 "EU:C:1985:443"
4 "61984CJ0239"
5 "Gerlach"
6 "Judgment of the Court (Third Chamber) of 24 October 1985
7 Gerlach & Co. BV, Internationale Expeditie, v Minister van Economische Zaken
8 Reference for a preliminary ruling: College van Beroep voor het Bedrijfsleven - Netherlands
9 Article 41 ECSC - Anti-dumping duties
10 Case 239/84."
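
FPAT is a GNU awk extension, so the command above will not work in other awks. A rough alternative, sketched here under assumptions that go beyond the answer, is to split on the three-character sequence "," instead: this assumes every field is double-quoted and no field contains "," itself. The shortened record is invented for illustration.

```shell
# Hypothetical shortened record in the same shape as the sample: every field
# is double-quoted and the last one holds ". # "-separated subfields.
rec='"u1","id","Judgment of 1985. # Gerlach & Co. BV, Expeditie, v Minister. # Case 239/84."'

out=$(printf '%s\n' "$rec" | awk -F'","' '{
   # strip the opening quote of the first field and the closing quote of the last
   sub(/^"/, "", $1)
   sub(/"$/, "", $NF)
   # print every field except the last one; i equals NF afterwards
   for (i = 1; i < NF; ++i)
      print i, $i
   # split the last field on ". # " once, then print every subfield
   n = split($NF, a, /[.] # /)
   for (j = 1; j <= n; ++j)
      print i + j - 1, a[j]
}')
printf '%s\n' "$out"
```

Unlike the FPAT version, this variant drops the surrounding double quotes from the output, because the quotes are consumed as part of the separator.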

edited Mar 10, 2021 at 19:42

answered Mar 10, 2021 at 15:35

anubhava


6 Comments

Timothy Roes Over a year ago

Thanks, this is very close except in $6 I need the split to happen at the . # and not at the comma after the word "BV", nor at the comma after the word "Expeditie". So $7 should come out as "Gerlach & Co. BV, Internationale Expeditie, v Minister van Economische Zaken".

Timothy Roes Over a year ago

That was quick! However, I still get the same result. I'm on macOS GNU awk 5.1.0

anubhava Over a year ago

I am also on macOS using GNU Awk 5.1.0. Can you show your command and output in question?

Timothy Roes Over a year ago

I stand corrected, it does work now! Could you add a line of explanation, particularly how you use NF?
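
On the NF question raised in this comment, a small sketch may help. It uses a made-up, unquoted record so that a plain comma separator is enough; the variable names n, i and j are only illustrative.

```shell
# Hypothetical record with no quoting, so -F, is sufficient here.
rec='a,b,c. # d. # e'

out=$(printf '%s\n' "$rec" | awk -F, '{
   print "NF=" NF                 # number of fields in this record
   print "last=" $NF              # $NF is the content of the last field
   for (i = 1; i < NF; ++i) { }   # stops at i == NF, skipping the last field
   print "i=" i
   n = split($NF, a, /[.] # /)    # split returns the number of pieces
   print "n=" n
   for (j = 1; j <= n; ++j)
      print i + j - 1, a[j]       # numbering continues from NF
}')
printf '%s\n' "$out"
```

NF is the field count, $NF is the last field itself, and because the first loop ends with i equal to NF, the expression i + j - 1 numbers the subfields starting at NF.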

captain-yossarian from Ukraine Over a year ago

@Timothy Roes don't forget to upvote the answer
