5

I have a CSV file with multiple fields. In a few of the string fields, the data spans multiple lines. I want to merge those lines into a single line.

Input Data:

1, "asdsdsdsds", "John"
2, "dfdhifdkinf
dfjdfgkdnjgknkdjgndkng
dkfdkjfnjdnf", "Roy"
3, "dfjfdkgjfgn", "Rahul"

Expected Output:

1, "asdsdsdsds", "John"
2, "dfdhifdkinf dfjdfgkdnjgknkdjgndkng dkfdkjfnjdnf", "Roy"
3, "dfjfdkgjfgn", "Rahul"

The same question was asked on SO earlier, but the solution there used PowerShell. Is it possible to achieve the same using Python, pandas, or PySpark?

Whenever the data spans multiple lines, it is guaranteed to be enclosed in double quotes.

What I tried

I am able to read the data without any issues using pandas and PySpark, even though some fields span multiple lines.

Pandas:

import pandas as pd

pandas_df = pd.read_csv("file.csv")

PySpark:

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true') \
        .option("delimiter", ",").option("escape", '\\').option("escape", ':').\
    option("parserLib", "univocity").option("multiLine", "true").load("file.csv")

Edit:

There can be any number of fields in the CSV file, and the multi-line data can occur in any field.

2
  • "I can able to read the the data without any issues using pandas and pyspark even though there are fields whose got spanned to multiple lines." Then what exactly is the issue? Commented Feb 19, 2018 at 7:43
  • I want the cleansed data in a new CSV file, with each multi-line value collapsed into one line. Commented Feb 19, 2018 at 8:01

2 Answers

2
def weird_gen(s):
    s = [s]
    while s:
        *x, a = s[0].split(',', 2)
        y, *s = a.split('\n', 1)
        yield ', '.join(z.strip().replace('\n', ' ') for z in x + [y])

print('\n'.join(weird_gen(open('bad.csv').read())))

1, "asdsdsdsds", "John"
2, "dfdhifdkinf dfjdfgkdnjgknkdjgndkng dkfdkjfnjdnf", "Roy"
3, "dfjfdkgjfgn", "Rahul"

2 Comments

One should not use .split(',') to parse CSV files, because a comma may be within a quoted field.
You are right. This is the best I came up with for now. If someone posts any reasonable answer, I'll likely delete this.
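
Following up on the comment above, here is a minimal sketch that lets Python's built-in csv module do the parsing instead of .split(','). It assumes the quoting shown in the question (double quotes, with a space after each comma). The names flatten_multiline_fields and sample are made up for illustration and are not part of either answer.

```python
import csv
import io


def flatten_multiline_fields(src, dst):
    """Copy CSV rows from src to dst, collapsing line breaks inside fields."""
    # skipinitialspace=True lets the reader recognise a quote that follows ", "
    reader = csv.reader(src, skipinitialspace=True)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        # str.split() with no arguments splits on any run of whitespace,
        # so embedded newlines (and stray padding) become single spaces
        writer.writerow([" ".join(field.split()) for field in row])


# Sample data from the question, held in memory for the demonstration
sample = (
    '1, "asdsdsdsds", "John"\n'
    '2, "dfdhifdkinf\n'
    'dfjdfgkdnjgknkdjgndkng\n'
    'dkfdkjfnjdnf", "Roy"\n'
    '3, "dfjfdkgjfgn", "Rahul"\n'
)
out = io.StringIO()
flatten_multiline_fields(io.StringIO(sample, newline=""), out)
print(out.getvalue(), end="")
```

For real files, both handles would be opened with newline="" as the csv module documentation recommends. Note that csv.writer only quotes fields when it has to, so the output drops the original quotes and the space after each comma; commas inside quoted fields are still handled correctly, and this works for any number of fields.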
0

This might help. I am using a simple for loop and negative indexing to get your required result.

s = '''1, "asdsdsdsds", "John"
2, "dfdhifdkinf
dfjdfgkdnjgknkdjgndkng
dkfdkjfnjdnf", "Roy"
3, "dfjfdkgjfgn", "Rahul"'''

res = []

for i in s.split("\n"):
    if i[0].isdigit():
        # a line starting with a digit begins a new record
        res.append(i)
    else:
        # any other line continues the previous record
        res[-1] = res[-1] + " " + i

for i in res:
    print(i)

Output:

1, "asdsdsdsds", "John"
2, "dfdhifdkinf dfjdfgkdnjgknkdjgndkng dkfdkjfnjdnf", "Roy"
3, "dfjfdkgjfgn", "Rahul"

4 Comments

Hi Rakesh, Thanks for the suggested answer. I updated the question. There can be n number of fields and this data span can be in any field. Could you please suggest the solution.
Can you provide an example data?
1, "asdsdsdsds", "John",3,4,5,"Hi","This is new line data" 2, "dfdhifdkinf dfjdfgkdnjgknkdjgndkng dkfdkjfnjdnf", "Roy",6,7,8,"Hellooo ooooooo ooooooo", "This" 3, "dfjfdkgjfgn", "Rahul",1,2,3,"Hi this is new line data","This is another new line data"
The assumption that the first character of every new record is a digit might fail if the text that spans multiple lines itself contains a line starting with a digit, for example: 1, "some line\n1 plus one equals 2"
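
One way around the digit assumption raised in the last comment is to track whether a quoted field is still open, rather than looking at the first character. The sketch below is only an illustration under the question's guarantee that multi-line values are always double-quoted. It also assumes quotes inside a value are escaped by doubling ("") rather than with a backslash. The names join_quoted_lines and raw are hypothetical.

```python
def join_quoted_lines(lines):
    """Yield logical records, merging physical lines until quotes balance."""
    pending = []
    quote_count = 0
    for line in lines:
        pending.append(line.strip())
        quote_count += line.count('"')
        # An odd running total means a quoted field is still open.
        # A doubled quote ("") adds two, so it does not change the parity.
        if quote_count % 2 == 0:
            yield " ".join(pending)
            pending = []
            quote_count = 0
    if pending:  # unterminated quote at end of input: emit what was collected
        yield " ".join(pending)


# The first record contains a continuation line that starts with a digit
raw = '''1, "some line
1 plus one equals 2", "John"
2, "dfdhifdkinf
dfjdfgkdnjgknkdjgndkng
dkfdkjfnjdnf", "Roy"
3, "dfjfdkgjfgn", "Rahul"'''

records = list(join_quoted_lines(raw.split("\n")))
for record in records:
    print(record)
```

Because the lines are joined as text and never split on commas, the original quoting and spacing are preserved, and the approach does not depend on how many fields a record has or which field contains the line break.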
