
I'm trying to read multiple files whose names start with 'site_%'. For example, file names like site_1 and site_a. Each file has data like:

Login_id, Web
1,http://www.x1.com
2,http://www.x1.com,as.php

I need two columns in my pandas df: Login_id and Web.

I am facing an error when I try to read records like record 2.

df_0 = pd.read_csv('site_1',sep='|')
df_0[['Login_id, Web','URL']] = df_0['Login_id, Web'].str.split(',',expand=True)

I am facing the following error: ValueError: Columns must be same length as key.

Please let me know where I am making a serious mistake, and any good approach to solving the problem. Thanks

  • You may want to take a look at: stackoverflow.com/questions/52428968/… Commented Jul 22, 2019 at 15:14
  • Thanks Pedro, I checked. It is a different problem: I am trying to read multiple files that have extra commas in the second column. Commented Jul 22, 2019 at 15:16
  • Why do you use | as separator? Commented Jul 22, 2019 at 15:17
  • Hi rafaelc, I thought of reading it as a single column and then splitting it into two columns. Commented Jul 22, 2019 at 15:18
  • Hm, I see, so you have an uneven csv file. How do you believe the data frame should look when you have more than one website in a row? Commented Jul 22, 2019 at 15:19

1 Answer


Solution 1: use split with the arguments n=1 and expand=True. With n=1 the string is split only at the first comma, so every row yields exactly two columns. That avoids the ValueError: your default split produced three columns for rows with two commas, but you assigned only two column names.

result = df['Login_id, Web'].str.split(',', n=1, expand=True)
result.columns = ['Login_id', 'Web']

That results in a dataframe with two columns, so if you have more columns in your dataframe, you need to concat it with your original dataframe (that also applies to the next method).
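As a minimal, self-contained sketch (the sample frame below is reconstructed from the question's data, not from the original post), the split-and-concat step could look like:

```python
import pandas as pd

# Sample frame mimicking the single-column read from the question
df = pd.DataFrame({'Login_id, Web': ['1,http://www.x1.com',
                                     '2,http://www.x1.com,as.php']})

# n=1 splits only at the first comma; extra commas stay in the second part
result = df['Login_id, Web'].str.split(',', n=1, expand=True)
result.columns = ['Login_id', 'Web']

# Glue the new columns back onto the original frame
df = pd.concat([df.drop(columns=['Login_id, Web']), result], axis=1)
```

Here the original 'Login_id, Web' column is dropped before concatenating, leaving just the two clean columns.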

EDIT Solution 2: there is a nicer regex-based solution which uses a pandas function:

result = df['Login_id, Web'].str.extract(r'^\s*(?P<Login_id>[^,]*),\s*(?P<URL>.*)', expand=True)

This splits the field and uses the names of the matching groups to create columns with their content. The output is:

  Login_id                       URL
0        1         http://www.x1.com
1        2  http://www.x1.com,as.php

Solution 3: conventional version with a regex. You could do something customized, e.g. with a compiled regular expression:

import re
sp_re = re.compile(r'([^,]*),(.*)')

aux_series = df['Login_id, Web'].map(lambda val: sp_re.match(val).groups())
df['Login_id'] = aux_series.str[0]
df['URL'] = aux_series.str[1]

The result on your example data is:

                Login_id, Web Login_id                       URL
0         1,http://www.x1.com        1         http://www.x1.com
1  2,http://www.x1.com,as.php        2  http://www.x1.com,as.php

Now you could drop the column 'Login_id, Web'.
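Putting the pieces together for the multiple site_* files from the question, an end-to-end sketch could look as follows (the file-writing loop at the top just fabricates sample input in a temporary directory for illustration; the actual file names and paths are assumptions):

```python
import glob
import os
import tempfile

import pandas as pd

# Fabricate two sample files matching the question's layout (illustration only)
workdir = tempfile.mkdtemp()
for name, row in {'site_1': '1,http://www.x1.com',
                  'site_a': '2,http://www.x1.com,as.php'}.items():
    with open(os.path.join(workdir, name), 'w') as f:
        f.write('Login_id, Web\n' + row + '\n')

frames = []
for path in sorted(glob.glob(os.path.join(workdir, 'site_*'))):
    raw = pd.read_csv(path, sep='|')  # '|' never occurs, so one column
    parts = raw['Login_id, Web'].str.split(',', n=1, expand=True)
    parts.columns = ['Login_id', 'Web']
    frames.append(parts)

df = pd.concat(frames, ignore_index=True)
```

Each file is read as a single column via the unused '|' separator, split once at the first comma, and the per-file frames are stacked with ignore_index=True so the combined frame gets a fresh index.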


4 Comments

Are you sure of that regex? Shouldn't the first * be inside the group? Otherwise it would match only the last character of the first field... >>> re.match(r'([^,])*,(.*)', 'abc,123').groups() ('c', '123') while having * inside the first group gives the expected ('abc', '123').
Thank you @jottbe. I am trying to read multiple files with millions of records in each. Is there any optimized solution, like a customized function in read_csv itself, so that it can be computationally efficient?
I am not aware of another method. Have you tried this method? I guess because it is implemented in a pandas function it might be implemented in C or C++. If it is not fast enough maybe you could change the process that writes the file, so it uses ; or | as separator, so the separator does not interfere with the commas in the URL column? If you can't change the writing process, you could still try to preprocess the file using commands like sed (if you work on a unix-like system). e.g. sed -e's/^\([^,]*\),/\1;/g' yourfile > newfile (version for just two columns). sed is quite fast.
Oh, I just found another method :-) Somehow I had misunderstood what split does when n is passed at first (I thought it works from the end of the string, but it actually works from the beginning, as we want here), so I updated my answer. Now you have a lot of methods from which to choose the fastest.
