
I have a CSV file whose records are split across multiple lines, like this:

id1,id2,id3,id4,id5,id6,id7
1,2,3,4,5,6,7

1,2,3,4

,5,6,

7

1,2

3,4

,5,6,


7

I want to convert the file so that it looks like this:

id1,id2,id3,id4,id5,id6,id7
1,2,3,4,5,6,7
1,2,3,4,5,6,7
1,2,3,4,5,6,7

I know PySpark can read such a file with the multiLine: True option, but the business use case requires converting the file so that each record sits on a single line. How can I do that? The technology must be either PySpark or Python (pandas). Thanks in advance.

1 Answer


Did you have something like this in mind?

import re
import pandas as pd

items  = re.findall("[^ ,\n]+", """id1,id2,id3,id4,id5,id6,id7
1,2,3,4,5,6,7

1,2,3,4

,5,6,

7

1,2

3,4

,5,6,


7""")

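# Group the flat token list into rows of 7 fields (the number of columns).
rows = [items[i:i+7] for i in range(0, len(items), 7)]
# The first row holds the header; the remaining rows are the data.
pd.DataFrame(rows[1:], columns=rows[0])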

Output:

  id1 id2 id3 id4 id5 id6 id7
0   1   2   3   4   5   6   7
1   1   2   3   4   5   6   7
2   1   2   3   4   5   6   7
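For a real file with many columns, the same idea can be applied line by line so that the column count does not have to be hard-coded. The following is only a minimal sketch under stated assumptions: the header is intact on the first non-empty line, no field is empty, and no field contains spaces, commas, or quoted line breaks. The names `regroup` and `convert_file` are hypothetical helpers introduced here for illustration.

```python
import csv
import re
from typing import Iterable, Iterator, List

# Same pattern as in the answer: runs of characters that are not separators.
TOKEN = re.compile(r"[^ ,\n]+")


def regroup(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield complete rows; the row width is taken from the header line."""
    width = None
    buffer: List[str] = []
    for line in lines:
        tokens = TOKEN.findall(line)
        if not tokens:
            continue  # blank line or a line holding only separators
        if width is None:
            # Assumption: the first non-empty line is the complete header.
            width = len(tokens)
            yield tokens
            continue
        buffer.extend(tokens)
        # Emit every full row currently sitting in the buffer.
        while len(buffer) >= width:
            yield buffer[:width]
            buffer = buffer[width:]
    if buffer:
        raise ValueError(f"{len(buffer)} leftover fields do not fill a row")


def convert_file(src: str, dst: str) -> None:
    """Stream src to dst, writing one record per line (hypothetical helper)."""
    with open(src) as fin, open(dst, "w", newline="") as fout:
        writer = csv.writer(fout, lineterminator="\n")
        writer.writerows(regroup(fin))
```

Because rows are produced one at a time, the whole file never has to fit in memory. If genuinely empty fields can occur, this token-based approach would shift values into the wrong columns, so it should be checked against the real data first.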

Since it was requested, here is a loop-free version of the second part:

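import numpy as np
# Reshape the flat token list into rows of 7 fields; the first row is the header.
rows = np.array(items).reshape(len(items)//7, 7)
pd.DataFrame(rows[1:], columns=rows[0])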

I checked whether this actually saves time using Jupyter's %%timeit. It turns out that:

  • the regular expression part takes 6.66 µs ± 43.8 ns,
  • the original loop version that builds the DataFrame takes 759 µs ± 2.81 µs,
  • and the new NumPy version of the same step takes 149 µs ± 4.82 µs.
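To repeat this comparison outside Jupyter, something like the following could be used. It is a rough sketch built on the standard `timeit` module rather than `%%timeit`; the function names `with_loop` and `with_numpy` are made up for illustration, and the numbers it prints will depend on the machine and library versions.

```python
import re
import timeit

import numpy as np
import pandas as pd

SAMPLE = """id1,id2,id3,id4,id5,id6,id7
1,2,3,4,5,6,7

1,2,3,4

,5,6,

7

1,2

3,4

,5,6,


7"""

# Tokenize once so that only the DataFrame construction is timed.
items = re.findall(r"[^ ,\n]+", SAMPLE)


def with_loop() -> pd.DataFrame:
    # List-comprehension version from the answer.
    rows = [items[i:i + 7] for i in range(0, len(items), 7)]
    return pd.DataFrame(rows[1:], columns=rows[0])


def with_numpy() -> pd.DataFrame:
    # Reshape version from the answer; -1 lets NumPy infer the row count.
    rows = np.array(items).reshape(-1, 7)
    return pd.DataFrame(rows[1:], columns=rows[0])


for fn in (with_loop, with_numpy):
    runs = 200
    seconds = timeit.timeit(fn, number=runs) / runs
    print(f"{fn.__name__}: {seconds * 1e6:.1f} µs per call")
```

On a sample this small the timings mostly reflect DataFrame construction overhead, so it is worth rerunning with data closer to the real file size before drawing conclusions.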

3 Comments

Is there a better solution that uses no loop and no column names? I have 156 columns, and it is a large file of 1 M records, so doing this for that file would be difficult. This is just a sample data file.
Will the records also contain strings in double quotes?
@Codegator I have replaced the loop as you requested. I do not understand why you would not want to tell pandas what the column names are. That should not significantly improve performance.
