Parse multiple line CSV using PySpark , Python or Shell

Question

Input (2 columns) :

col1 , col2
David, 100
"Ronald
Sr, Ron , Ram" , 200
Harry
potter
jr" , 200
Prof.
Snape" , 100

Note: Harry and Prof. does not have starting quotes

Output (2 columns)

col1 | col2
David | 100
Ronald Sr , Ron , Ram| 200
Harry potter jr| 200 
Prof. Snape| 100

What I tried (PySpark) ?

df = spark.read.format("csv").option("header",True).option("multiLine",True).option("escape","\'")

Issue The above code worked fine where multiline had both start and end double quotes (For eg: row starting with Ronald)

But it didnt work with rows where we only have end quotes but no start quotes (like Harry and Prof)

Even if we add start quotes with Harry and Prof that will solve the issue

Any idea using Pyspark , Python or Shell , etc are welcome !!

Can your input contain | characters?

Ed Morton
– Ed Morton

2022-05-17 01:49:18 +00:00
Commented May 17, 2022 at 1:49 — Ed Morton
– Ed Morton, Commented May 17, 2022 at 1:49
@EdMorton No - Data wont have contain "|" (pipe)

Debaditya
– Debaditya

2022-05-17 01:51:55 +00:00
Commented May 17, 2022 at 1:51 — Debaditya
– Debaditya, Commented May 17, 2022 at 1:51

markp-fuso · Accepted Answer · 2022-05-17 13:32:36Z

1

Based solely on the small sample provided:

remove all double quotes
there are two comma-delimited fields; 1st field is a string, 2nd field is a number
the 1st field may contain commas and may be broken across multiple lines
replace the comma delimiter with a pipe (|)
OP's expected output is inconsistent with regards to spacing before the newly inserted pipe (|); sometimes a space is removed, sometimes a space is inserted; for now we won't worry about spacing

One awk idea:

awk -F, '
             { gsub(/"/,"") }                      # remove double quotes
FNR==1 ||                                          # if 1st line or last field is a number then ...
($NF+0)==$NF { print prev gensub(FS,"|",(NF-1))    # print any previous line(s) data plus current line, replacing last comma with a pipe
               prev=""                             # clear previous line(s) data
               next                                # skip to next line of input
             }
             { prev= prev $0 " " }                 # if we get here then this is a broken line so save contents for later printing
' sample.csv

This generates:

col1 | col2
David| 100
Ronald Sr, Ron , Ram | 200
Harry potter jr | 200
Prof. Snape | 100

edited May 17, 2022 at 13:32

answered May 16, 2022 at 23:20

markp-fuso

38.7k5 gold badges24 silver badges50 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Debaditya Over a year ago

Apology for my initial ask.. I changed my input / output . Actually , I have "comma" whenever the row starts with double quotes ...

Collectives™ on Stack Overflow

Parse multiple line CSV using PySpark , Python or Shell

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related