
I'm trying to read multiple files whose names start with 'site_%'. For example, file names like site_1 and site_a. Each file has data like:

Login_id, Web
1,http://www.x1.com
2,http://www.x1.com,as.php

I need two columns in my pandas df: Login_id and Web.

I am facing an error when I try to read records like record 2.

df_0 = pd.read_csv('site_1',sep='|')
df_0[['Login_id, Web','URL']] = df_0['Login_id, Web'].str.split(',',expand=True)

I am facing the following error: ValueError: Columns must be same length as key.

Please let me know where I am making a serious mistake, and any good approach to solving the problem. Thanks

  • You may want to take a look at: stackoverflow.com/questions/52428968/… Commented Jul 22, 2019 at 15:14
  • Thanks Pedro, I checked. It is a different problem: I am trying to read multiple files that have extra commas in the second column. Commented Jul 22, 2019 at 15:16
  • Why do you use | as separator? Commented Jul 22, 2019 at 15:17
  • Hi rafaelc, I thought of reading it as a single column and then splitting it into two columns. Commented Jul 22, 2019 at 15:18
  • Hm, I see, so you have an uneven csv file. How do you believe the data frame should look when you have more than one website in a row? Commented Jul 22, 2019 at 15:19

1 Answer


Solution 1: use split with the arguments n=1 and expand=True. With n=1 the string is split only at the first comma, so every row yields exactly two columns. That avoids the ValueError: your default split produced three columns for rows with two commas, but you assigned only two column names.

result = df['Login_id, Web'].str.split(',', n=1, expand=True)
result.columns = ['Login_id', 'Web']

That results in a dataframe with two columns, so if you have more columns in your dataframe, you need to concat it with your original dataframe (that also applies to the next method).
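As a minimal, self-contained sketch (the sample frame below is reconstructed from the question's data, not from the original post), the split-and-concat step could look like:

```python
import pandas as pd

# Sample frame mimicking the single-column read from the question
df = pd.DataFrame({'Login_id, Web': ['1,http://www.x1.com',
                                     '2,http://www.x1.com,as.php']})

# n=1 splits only at the first comma; extra commas stay in the second part
result = df['Login_id, Web'].str.split(',', n=1, expand=True)
result.columns = ['Login_id', 'Web']

# Glue the new columns back onto the original frame
df = pd.concat([df.drop(columns=['Login_id, Web']), result], axis=1)
```

Here the original 'Login_id, Web' column is dropped before concatenating, leaving just the two clean columns.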

EDIT Solution 2: there is a nicer regex-based solution which uses a pandas function:

result = df['Login_id, Web'].str.extract(r'^\s*(?P<Login_id>[^,]*),\s*(?P<URL>.*)', expand=True)

This splits the field and uses the names of the matching groups to create columns with their content. The output is:

  Login_id                       URL
0        1         http://www.x1.com
1        2  http://www.x1.com,as.php

Solution 3: conventional version with a regex. You could do something customized, e.g. with a compiled regular expression:

import re
sp_re = re.compile(r'([^,]*),(.*)')

aux_series = df['Login_id, Web'].map(lambda val: sp_re.match(val).groups())
df['Login_id'] = aux_series.str[0]
df['URL'] = aux_series.str[1]

The result on your example data is:

                Login_id, Web Login_id                       URL
0         1,http://www.x1.com        1         http://www.x1.com
1  2,http://www.x1.com,as.php        2  http://www.x1.com,as.php

Now you could drop the column 'Login_id, Web'.
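Putting the pieces together for the multiple site_* files from the question, an end-to-end sketch could look as follows (the file-writing loop at the top just fabricates sample input in a temporary directory for illustration; the actual file names and paths are assumptions):

```python
import glob
import os
import tempfile

import pandas as pd

# Fabricate two sample files matching the question's layout (illustration only)
workdir = tempfile.mkdtemp()
for name, row in {'site_1': '1,http://www.x1.com',
                  'site_a': '2,http://www.x1.com,as.php'}.items():
    with open(os.path.join(workdir, name), 'w') as f:
        f.write('Login_id, Web\n' + row + '\n')

frames = []
for path in sorted(glob.glob(os.path.join(workdir, 'site_*'))):
    raw = pd.read_csv(path, sep='|')  # '|' never occurs, so one column
    parts = raw['Login_id, Web'].str.split(',', n=1, expand=True)
    parts.columns = ['Login_id', 'Web']
    frames.append(parts)

df = pd.concat(frames, ignore_index=True)
```

Each file is read as a single column via the unused '|' separator, split once at the first comma, and the per-file frames are stacked with ignore_index=True so the combined frame gets a fresh index.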


4 Comments

Are you sure of that regex? Shouldn't the first * be inside the group? Otherwise it would match only the last character of the first field... >>> re.match(r'([^,])*,(.*)', 'abc,123').groups() ('c', '123') while having * inside the first group gives the expected ('abc', '123').
Thank you @jottbe. I am trying to read multiple files with millions of records in each. Is there any optimized solution, like a customized function in read_csv itself, so that it can be computationally efficient?
I am not aware of another method. Have you tried this method? I guess because it is implemented in a pandas function it might be implemented in C or C++. If it is not fast enough maybe you could change the process that writes the file, so it uses ; or | as separator, so the separator does not interfere with the commas in the URL column? If you can't change the writing process, you could still try to preprocess the file using commands like sed (if you work on a unix-like system). e.g. sed -e's/^\([^,]*\),/\1;/g' yourfile > newfile (version for just two columns). sed is quite fast.
Oh, I just found another method :-) Somehow I had misunderstood what split does when n is passed at first (I thought it works from the end of the string, but it actually works from the beginning, as we want here), so I updated my answer. Now you have a lot of methods from which to choose the fastest.
