I have a CSV file with multiple fields. A few of the string fields contain data that spans multiple lines. I want to collapse each of those multi-line values into a single line.
Input Data:
1, "asdsdsdsds", "John"
2, "dfdhifdkinf
dfjdfgkdnjgknkdjgndkng
dkfdkjfnjdnf", "Roy"
3, "dfjfdkgjfgn", "Rahul"
Expected Output:
1, "asdsdsdsds", "John"
2, "dfdhifdkinf dfjdfgkdnjgknkdjgndkng dkfdkjfnjdnf", "Roy"
3, "dfjfdkgjfgn", "Rahul"
The same question was asked on SO earlier, but the solution there used PowerShell. Is it possible to achieve the same result using Python, pandas, or PySpark?
Whenever a value spans multiple lines, it is guaranteed to be enclosed in double quotes.
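Since multi-line values are always double-quoted, one possible approach is Python's built-in csv module, which already understands quoted line breaks. The following is only a minimal sketch under stated assumptions: the helper name flatten_multiline_fields is hypothetical, the sample text is copied from the input above, and skipinitialspace=True is assumed to be needed because the sample has a space after each comma.

```python
import csv
import io


def flatten_multiline_fields(src, dst):
    """Copy CSV rows from src to dst, replacing line breaks inside fields with a space.

    Hypothetical helper for illustration; src and dst are open text file objects.
    """
    # skipinitialspace=True lets the reader recognise a quote that follows ", "
    reader = csv.reader(src, skipinitialspace=True)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        # splitlines() handles \n, \r\n and \r; the parts are rejoined with one space
        writer.writerow([" ".join(field.splitlines()) for field in row])


# Demo on the sample data from the question (held in memory instead of a file)
sample = (
    '1, "asdsdsdsds", "John"\n'
    '2, "dfdhifdkinf\n'
    'dfjdfgkdnjgknkdjgndkng\n'
    'dkfdkjfnjdnf", "Roy"\n'
    '3, "dfjfdkgjfgn", "Rahul"\n'
)
out = io.StringIO()
flatten_multiline_fields(io.StringIO(sample, newline=""), out)
print(out.getvalue())
```

Because every field of every row is processed, this should work for any number of fields, whichever one contains the line breaks. Note that the writer uses the csv module's default quoting, so quotes appear in the output only where a field needs them. For a real file, both files would typically be opened with newline="" as the csv documentation recommends.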
What I tried:
I am able to read the data without any issues using both pandas and PySpark, even though some fields span multiple lines.
Pandas:
import pandas as pd
pandas_df = pd.read_csv("file.csv")
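Once the data is in a DataFrame, the embedded line breaks could probably be removed with a regex replace over the whole frame. This is a rough sketch, not a confirmed solution: the sample is the input from above read from memory, and header=None and skipinitialspace=True are assumptions based on that sample (the real file may have a header row).

```python
import io

import pandas as pd

# Sample data from the question; the real code would pass "file.csv" instead
sample = (
    '1, "asdsdsdsds", "John"\n'
    '2, "dfdhifdkinf\n'
    'dfjdfgkdnjgknkdjgndkng\n'
    'dkfdkjfnjdnf", "Roy"\n'
    '3, "dfjfdkgjfgn", "Rahul"\n'
)

# Assumption: no header row, and a space follows each comma
pandas_df = pd.read_csv(io.StringIO(sample), header=None, skipinitialspace=True)

# Replace every line break inside string values with a single space;
# numeric columns are left untouched by a regex replace
flat_df = pandas_df.replace(r"\r?\n", " ", regex=True)
print(flat_df)
```

The result could then be written back out with flat_df.to_csv(...), using header=False and index=False if the file has no header row.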
PySpark:
df = (sqlContext.read.format('com.databricks.spark.csv')
      .options(header='true', inferschema='true')
      .option("delimiter", ",").option("escape", '\\').option("escape", ':')
      .option("parserLib", "univocity").option("multiLine", "true")
      .load("file.csv"))
Edit:
The CSV file can have any number of fields, and the multi-line data can occur in any of them.