
I want to parse CSV records like the one below with awk or gawk.

The fields are separated by commas, but the last field ($6) is special: it consists of subfields that are separated by ". # " (a period, a space, a hash, and a space). That in itself is not a problem: I can set alternative field separators with awk -F'(,)|(. # )'.

However, the last field also contains stray commas, and these must not be treated as field separators.
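To make the failure concrete, here is a minimal sketch with a shortened, made-up record (the record and the helper name count_fields are illustrative only, not part of my data). It only counts the fields that the -F approach produces:

```shell
# Hypothetical, shortened record: three quoted fields; the last one holds
# three ". # "-separated subfields, with a stray comma in the second subfield.
rec='"a","b","x. # y, z. # w"'

# The comma alternative also fires on the comma inside the quoted field,
# and the unescaped dot in ". # " matches any character, not just a period.
count_fields() {
  awk -F'(,)|(. # )' '{ print NF }'
}

printf '%s\n' "$rec" | count_fields   # prints 6, although only 5 fields are wanted
```

The five wanted fields are a, b, x, "y, z" and w; the extra field comes from the comma between y and z.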

Is there a way to solve this with awk, perhaps using FPAT?

Sample record:

  "http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab","http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab.0002","EU:C:1985:443","61984CJ0239","Gerlach","Judgment of the Court (Third Chamber) of 24 October 1985. # Gerlach & Co. BV, Internationale Expeditie, v Minister van Economische Zaken. # Reference for a preliminary ruling: College van Beroep voor het Bedrijfsleven - Netherlands. # Article 41 ECSC - Anti-dumping duties. # Case 239/84."
  • Doesn't work, because awk treats every comma as the end of a field. E.g. in the sample record there will be only two subfields in $6, and then the comma after "BV" means there's suddenly a $7, etc. Commented Mar 10, 2021 at 13:19

1 Answer


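You can do this with the FPAT feature of GNU awk. FPAT describes what a field looks like rather than what separates fields: here, a field is either a double-quoted string or a run of non-comma characters. Because the last field is quoted, the commas inside it do not start new fields, so the whole field is available as $NF. Finally, we split that last field on the regex /\. # /.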

s='"http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab","http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab.0002","EU:C:1985:443","61984CJ0239","Gerlach","Judgment of the Court (Third Chamber) of 24 October 1985. # Gerlach & Co. BV, Internationale Expeditie, v Minister van Economische Zaken. # Reference for a preliminary ruling: College van Beroep voor het Bedrijfsleven - Netherlands. # Article 41 ECSC - Anti-dumping duties. # Case 239/84."'

awk -v FPAT='"[^"]*"|[^,]+' '{
   # loop through all fields except last one
   for (i=1; i<NF; ++i)
      print i, $i
   # split last field once using /\. # / regex; n is the number of tokens
   n = split($NF, a, /\. # /)
   # print each token, continuing the numbering (i equals NF here)
   for (j=1; j<=n; ++j)
      print i+j-1, a[j]
}' <<< "$s"

1 "http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab"
2 "http://publications.europa.eu/resource/cellar/3befa3c3-a9af-4dac-baa2-92e95cb6e3ab.0002"
3 "EU:C:1985:443"
4 "61984CJ0239"
5 "Gerlach"
6 "Judgment of the Court (Third Chamber) of 24 October 1985
7 Gerlach & Co. BV, Internationale Expeditie, v Minister van Economische Zaken
8 Reference for a preliminary ruling: College van Beroep voor het Bedrijfsleven - Netherlands
9 Article 41 ECSC - Anti-dumping duties
10 Case 239/84."
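
As a further sketch of how NF comes into play (assuming GNU awk is installed as gawk; the shortened record and the function name split_record are made up for illustration): with this FPAT every quoted string is a single field, so NF is 6 for a record of this shape and $NF is the whole title field. The variant below also strips the enclosing double quotes before splitting.

```shell
# Hypothetical shortened record with the same shape as the sample.
rec='"u1","u2","EU:C:1985:443","61984CJ0239","Gerlach","Judgment. # Gerlach & Co. BV, Expeditie. # Case 239/84."'

# Requires GNU awk: FPAT is a gawk extension.
split_record() {
  gawk -v FPAT='"[^"]*"|[^,]+' '{
     print "NF=" NF                  # commas inside quotes do not split fields
     for (i = 1; i < NF; ++i) {      # every field except the last
        f = $i
        gsub(/^"|"$/, "", f)         # drop the enclosing quotes
        print i, f
     }
     last = $NF                      # the whole quoted title field
     gsub(/^"|"$/, "", last)
     n = split(last, part, /\. # /)  # n is the number of subfields
     for (j = 1; j <= n; ++j)
        print NF + j - 1, part[j]    # continue numbering after field NF-1
  }'
}

printf '%s\n' "$rec" | split_record
```

For this record it should report NF=6 and number the three subfields 6 through 8, keeping "Gerlach & Co. BV, Expeditie" together as one subfield.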

6 Comments

Thanks, this is very close except in $6 I need the split to happen at the . # and not at the comma after the word "BV", nor at the comma after the word "Expeditie". So $7 should come out as "Gerlach & Co. BV, Internationale Expeditie, v Minister van Economische Zaken".
That was quick! However, I still get the same result. I'm on macOS GNU awk 5.1.0
I am also on macOS using GNU Awk 5.1.0. Can you show your command and output in question?
I stand corrected, it does work now! Could you add a line of explanation, particularly how you use NF?
@Timothy Roes, don't forget to upvote the answer.