PySpark/Python: Converting a CSV file with multiline rows to a file with single-line rows

Question

I have a CSV file whose records are spread across multiple lines, like this:

id1,id2,id3,id4,id5,id6,id7
1,2,3,4,5,6,7

1,2,3,4

,5,6,

7

1,2

3,4

,5,6,


7

I want to change the file so that it looks like this:

id1,id2,id3,id4,id5,id6,id7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7

I know PySpark can read such a file with the multiLine=True option, but the business use case requires converting the file itself to single-line rows. How can I do this? The technologies to be used are either PySpark or Python (pandas). Thanks in advance.

Lukas S · Accepted Answer · 2020-10-12 22:13:31Z

3

Did you have something like this in mind?

import re

import pandas as pd

items  = re.findall("[^ ,\n]+", """id1,id2,id3,id4,id5,id6,id7
1,2,3,4,5,6,7

1,2,3,4

,5,6,

7

1,2

3,4

,5,6,


7""")

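# Regroup the flat list of values into rows of 7 (the number of columns).
rows = [items[i:i+7] for i in range(0, len(items), 7)]
# The first row holds the column names; the remaining rows are the data.
pd.DataFrame(rows[1:], columns=rows[0])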

Output:

  id1 id2 id3 id4 id5 id6 id7
0   1   2   3   4   5   6   7
1   1   2   3   4   5   6   7
2   1   2   3   4   5   6   7
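Since the question asks for a converted file rather than a dataframe, here is a minimal sketch of the same regrouping idea using only the standard library. The function names `regroup` and `to_csv_text` and the three-column sample are made up for illustration. It assumes, like the regular expression above, that no field is empty and that no field contains spaces, commas, or newlines.

```python
import csv
import io
import re


def regroup(text, n_cols):
    """Split text into non-empty values and regroup them n_cols per row."""
    # Same pattern as above: a value is any run without spaces, commas, or newlines.
    items = re.findall(r"[^ ,\n]+", text)
    if len(items) % n_cols:
        raise ValueError(f"{len(items)} values do not fill rows of {n_cols}")
    return [items[i:i + n_cols] for i in range(0, len(items), n_cols)]


def to_csv_text(rows):
    """Serialise rows as CSV text with one record per line."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Hypothetical three-column sample with records broken across lines.
sample = "id1,id2,id3\n1,2\n\n,3\n\n1\n2,3,\n"
rows = regroup(sample, n_cols=3)
flat = to_csv_text(rows)
print(flat)
```

For a file on disk, the text could be read with `open(...).read()` and `flat` written back out in the same way. Note that this approach cannot recover genuinely empty fields, because empty values are skipped when the text is split.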

Since it has been requested, here is a no-loop version of the second part:

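import numpy as np

# Reshape the flat list into an (n_rows, 7) array; len(items) must be a multiple of 7.
rows = np.array(items).reshape(len(items)//7, 7)
pd.DataFrame(rows[1:], columns=rows[0])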

I tested whether it actually saves time using Jupyter's %%timeit. It turns out that:

the regular expression part takes 6.66 µs ± 43.8 ns,
the old loop version, including turning the result into a dataframe, takes 759 µs ± 2.81 µs,
and the new numpy version of the same step takes 149 µs ± 4.82 µs.
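For anyone who wants to repeat this kind of measurement outside Jupyter, here is a rough sketch with the standard library's `timeit` module. It times only the tokenizing step and the list-slicing step, not the dataframe construction, so the numbers are not directly comparable to the ones above and will vary by machine. The helper names `tokenize` and `chunk` are made up for this example.

```python
import re
import timeit

# Header plus three records, each broken across several lines.
text = "id1,id2,id3,id4,id5,id6,id7\n" + "1,2,3,4\n\n,5,6,\n\n7\n" * 3


def tokenize():
    """Extract every non-empty value from the text."""
    return re.findall(r"[^ ,\n]+", text)


items = tokenize()


def chunk():
    """Regroup the flat list of values into rows of 7."""
    return [items[i:i + 7] for i in range(0, len(items), 7)]


# timeit.repeat returns the total seconds for `number` calls, once per repeat;
# the minimum is usually the least noisy figure.
runs = 1000
t_tokenize = min(timeit.repeat(tokenize, number=runs, repeat=5)) / runs
t_chunk = min(timeit.repeat(chunk, number=runs, repeat=5)) / runs
print(f"tokenize: {t_tokenize * 1e6:.2f} µs per call")
print(f"chunk:    {t_chunk * 1e6:.2f} µs per call")
```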

edited Oct 12, 2020 at 22:13

answered Oct 12, 2020 at 16:07


3 Comments

Codegator Over a year ago

Can there be a better solution that uses no loop and no column names? I have about 156 columns, and it is a large file of 1 M records. Doing this for that file would be difficult. This is just a sample data file I gave.

Aditya Vikram Singh Over a year ago

Will you have strings with double quotes in the records as well?

Lukas S Over a year ago

@Codegator I have replaced the loop like you requested. I do not understand why you would not want to tell pandas what the column names are. That should not significantly improve performance.
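For a wide, large file like the one described in the first comment, one possible variation is to take the column count from the header instead of hard-coding it, and to process the input line by line so the whole file never has to sit in memory. The following is only a sketch under assumptions: the header is assumed to sit complete on the first non-blank line, and fields are assumed to be non-empty and free of quotes, spaces, commas, and newlines (so the quoted strings asked about in the second comment would not be handled). The names `iter_tokens` and `iter_rows` are hypothetical.

```python
import re
from itertools import islice

TOKEN = re.compile(r"[^ ,\n]+")


def iter_tokens(lines):
    """Yield every non-empty value, line by line."""
    for line in lines:
        yield from TOKEN.findall(line)


def iter_rows(lines):
    """Yield the header, then complete rows with as many values as the header."""
    lines = iter(lines)
    header = []
    # Assumption: the first non-blank line is the complete header.
    for line in lines:
        header = TOKEN.findall(line)
        if header:
            break
    if not header:
        return
    yield header
    n_cols = len(header)
    # The remaining lines continue from where the header search stopped.
    tokens = iter_tokens(lines)
    while True:
        row = list(islice(tokens, n_cols))
        if not row:
            return
        if len(row) < n_cols:
            raise ValueError(f"incomplete final row: {row}")
        yield row


# Hypothetical four-column sample; an open file handle could be passed instead.
sample_lines = ["id1,id2,id3,id4\n", "1,2,3,4\n", "\n", "1,2\n", "\n", ",3,\n", "4\n"]
result = list(iter_rows(sample_lines))
print(result)
```

The rows produced this way could be passed straight to `csv.writer(...).writerows(...)` to produce the single-line file.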
