Input (2 columns):
col1 , col2
David, 100
"Ronald
Sr, Ron , Ram" , 200
Harry
potter
jr" , 200
Prof.
Snape" , 100
Note: the Harry and Prof. rows do not have starting quotes, only closing ones.
Output (2 columns):
col1 | col2
David | 100
Ronald Sr , Ron , Ram| 200
Harry potter jr| 200
Prof. Snape| 100
What I tried (PySpark):
df = spark.read.format("csv").option("header",True).option("multiLine",True).option("escape","\'").load("input.csv")  # "input.csv" is a placeholder path
Issue: The above code works fine where a multiline value has both opening and closing double quotes (e.g. the row starting with Ronald).
But it doesn't work for rows that have only a closing quote and no opening quote (like Harry and Prof.).
Even just adding the missing opening quotes to the Harry and Prof. rows would solve the issue.
Any ideas using PySpark, Python, shell, etc. are welcome!
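To make the pre-processing idea concrete, here is a rough sketch in plain Python of what I have in mind. It rests on an assumption that happens to hold for my sample but may not hold in general: a record always ends on a line whose last field is an integer (col2), and no continuation line inside col1 ends that way. The names `repair_rows` and `RECORD_END` are made up for illustration.

```python
import re

# Assumption: a logical record ends on a line whose last field is an integer (col2).
RECORD_END = re.compile(r",\s*\d+\s*$")


def repair_rows(lines):
    """Group physical lines into logical (col1, col2) records.

    Lines are buffered until one ends with ", <integer>". The buffered lines
    are then joined with spaces, the last comma-separated field becomes col2,
    and stray double quotes are stripped from both ends of col1.
    """
    rows, buffer = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        buffer.append(line)
        if RECORD_END.search(line):
            record = " ".join(buffer)
            buffer = []
            col1, col2 = record.rsplit(",", 1)
            rows.append((col1.strip().strip('"').strip(), col2.strip()))
    return rows


sample = """\
col1 , col2
David, 100
"Ronald
Sr, Ron , Ram" , 200
Harry
potter
jr" , 200
Prof.
Snape" , 100
"""

# Skip the header line, which does not end in an integer.
rows = repair_rows(sample.splitlines()[1:])
for col1, col2 in rows:
    print(f"{col1} | {col2}")
```

The repaired rows could then be written back out with Python's `csv.writer` (which quotes fields containing commas) and read by Spark as a normal CSV, without needing `multiLine` at all. This would obviously break if a continuation line inside col1 ever ended with a comma followed by a number.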