str.split() issue with pandas DF

Question

I've seen other posts about this but I'm running into an issue trying to follow the solutions. I am trying to split a column of scores (as strings) that are listed like this:

1-0
2-3
0-3
...

The code I'm trying to use:

df[['Home G', 'Away G']] = df['Score'].str.split('-', expand=True)

Error I am getting:

ValueError: Columns must be same length as key

Every game has a score, though, so shouldn't the column lengths match up? One thought I had is that the 0s are producing some weird None values, or something like that.

I have tried your code, defining df as df = pd.DataFrame({'Score': ['1-0', '2-3', '0-3']}) and it works for me.
– Lorena Gil, Commented Oct 27, 2020 at 14:55
Perhaps one of the rows doesn't have a '-' character? Try the solutions in this post.
– Collin Heist, Commented Oct 27, 2020 at 14:57
Make sure df[~df['Score'].str.contains('-')] is an empty DataFrame.
– Collin Heist, Commented Oct 27, 2020 at 15:02
@CollinHeist I think not having a "-" character should not be an issue. See, for example: df = pd.DataFrame({'Score': ['1-0', '2-3', '0-3', np.nan, '32', 3]}) and then df.Score.str.split('-', expand=True) (which returns 2 columns). But having multiple "-" characters could be problematic if you don't specify how many splits to make.
– tania, Commented Oct 27, 2020 at 15:10
@LorenaGil That would require me to manually type in the scores of every single game as the season goes on, which is not a very practical option in my case because of the time and space it would take.
– kevin41, Commented Oct 27, 2020 at 16:41

tania · Accepted Answer · 2020-10-27 16:58:00Z

2

This most likely happens if you have more than 1 possible split in a string. For example, perhaps you have a value somewhere like:

"1-2-3"

So, the expansion in this case would return 3 columns, but you would be trying to assign them to 2 columns ('Home G', 'Away G').
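A minimal sketch of that failure mode, using made-up sample data (the third value is hypothetical and only there to force a third column):

```python
import pandas as pd

# Hypothetical sample data: the last value contains two hyphens.
df = pd.DataFrame({'Score': ['1-0', '2-3', '1-2-3']})

parts = df['Score'].str.split('-', expand=True)
print(parts.shape)  # three columns, because one row split into three pieces

raised = False
try:
    # Two target column names, three columns of data
    df[['Home G', 'Away G']] = parts
except ValueError as exc:
    raised = True
    print(exc)
```

The same mismatch appears in the opposite direction: if no row contains the separator at all, the expansion yields a single column, which also cannot be assigned to two names.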

To fix it, limit each string to a single split by setting the n argument to 1, as explained in the pandas documentation:

df[['Home G', 'Away G']] = df['Score'].str.split(pat='-', n=1, expand=True)

By default, n=-1, which means "split as many times as possible". By setting it to 1, you only split once.
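As a small illustration of the effect of n=1, here is a sketch on made-up data (the '1-2-3' value is hypothetical):

```python
import pandas as pd

# Hypothetical sample data with one value that contains two hyphens.
df = pd.DataFrame({'Score': ['1-0', '2-3', '1-2-3']})

# n=1 stops after the first hyphen, so every row yields at most two pieces.
limited = df['Score'].str.split(pat='-', n=1, expand=True)
print(limited)

df[['Home G', 'Away G']] = limited
```

Note that everything after the first hyphen stays together in the second column, so a malformed value such as '1-2-3' ends up as '2-3' there and still needs cleaning.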

EDIT

An alternative solution, if you are unsure of the number or type of hyphens or other symbols, is to extract with regex the two groups of numbers from each string. For example:

df[['Home G', 'Away G']] = pd.DataFrame(df['Score'].str.findall("([0-9]+)").tolist(), index=df.index)

So, for data that looks like

0   12‒0
1   2–3
2   0–3

You will end up with a df like

    Score   Home G  Away G
0   12‒0    12      0
1   2–3     2       3
2   0–3     0       3
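A possible variant of the same idea uses str.extract with two capture groups, which returns a DataFrame directly. This is only a sketch on the sample data above; it assumes every value has exactly two runs of digits separated by at least one non-digit character, and it converts the results to integers, which the original line does not do.

```python
import pandas as pd

# Same sample data as above: a figure dash and two en dashes.
df = pd.DataFrame({'Score': ['12‒0', '2–3', '0–3']})

# \D+ matches any run of non-digit characters, so the kind of dash does not matter.
goals = df['Score'].str.extract(r'(\d+)\D+(\d+)')

# astype(int) assumes every row matched; unmatched rows would be NaN and fail here.
df[['Home G', 'Away G']] = goals.astype(int)
print(df)
```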

edited Oct 27, 2020 at 16:58

answered Oct 27, 2020 at 15:05

tania


8 Comments

kevin41 Over a year ago

This still resulted in the same error. I can see all the values of the column, the table isn't very big, and all of the scores are in the same format and length so I'm not sure what else may be causing the issue.

tania Over a year ago

@kr419 so when you just apply df['Score'].str.split("-", n=1), does every list returned only have 2 elements?

kevin41 Over a year ago

When I just apply df['Score'].str.split("-", n=1) it returns the score just like in the DF except each is now a list like this: [1-0] [2-3] [0-3]

kevin41 Over a year ago

df.Score.dtype says Object. type(df.Score[0]) says String

tania Over a year ago

@kr419 is it possible that your "hyphen" is actually an en-dash (–) or a figure-dash (‒)? Those are different symbols and would not be picked up by splitting on the common hyphen ("-"). Perhaps that's why the option you suggested below with just taking the 1st and 3rd element of the string works.

Cliff Chew · Accepted Answer · 2020-10-27 15:43:07Z

0

It seems your data needs some cleaning. If I were you, I would run some checks to see where the problem lies. You are probably hitting rows that contain either too many - characters or none at all. I would run the following:

df['check'] = [len(i) for i in df['Score'].str.findall(r'(-)')]
df[df['check'] != 1]

The code counts the - characters in each row and flags any row where that count isn't 1. Hope this helps pinpoint the issue.
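If a check like this flags rows that visibly contain a dash, one way to follow up is to list which non-digit characters the column really contains. The sketch below uses made-up data that mixes an ordinary hyphen with an en dash and a figure dash; the variable names separators and normalized are illustrative only.

```python
import unicodedata
import pandas as pd

# Hypothetical data: ASCII hyphen, en dash (U+2013), figure dash (U+2012).
df = pd.DataFrame({'Score': ['1-0', '2\u20133', '0\u20123']})

# Strip the digits and collect whatever characters remain.
leftovers = df['Score'].dropna().str.replace(r'\d', '', regex=True)
separators = set(''.join(leftovers))
for ch in sorted(separators):
    print(f'U+{ord(ch):04X}', unicodedata.name(ch, 'UNKNOWN'))

# Map the dash look-alikes found above to a plain hyphen, then split once.
normalized = df['Score'].str.replace(r'[\u2012\u2013]', '-', regex=True)
parts = normalized.str.split('-', n=1, expand=True)
```

Only the two dash variants mentioned in this thread are replaced here; other look-alike characters would need to be added to the character class.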

answered Oct 27, 2020 at 15:43

Cliff Chew

1 Comment

kevin41 Over a year ago

When I run this it returns all 57 rows again. I can see all 57 rows though and none of them are missing the -

kevin41 · Accepted Answer · 2020-10-27 16:26:01Z

0

Got it working using this:

df['Home G'] = 0
df['Away G'] = 0
for index, row in df.iterrows():
    # .loc avoids chained assignment; the positions assume single-digit scores
    df.loc[index, 'Home G'] = int(row['Score'][0])
    df.loc[index, 'Away G'] = int(row['Score'][2])

Though I'm sure there is still a better way to do it.
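One possible loop-free version of the same positional approach is sketched below on made-up data. Like the loop, it assumes every score is a single digit, one separator character, and another single digit, so a value such as '12–0' would be read incorrectly.

```python
import pandas as pd

# Hypothetical sample data using en dashes as separators.
df = pd.DataFrame({'Score': ['1\u20130', '2\u20133', '0\u20133']})

# .str[i] picks the character at position i from every string in the column.
df['Home G'] = df['Score'].str[0].astype(int)
df['Away G'] = df['Score'].str[2].astype(int)
print(df)
```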

answered Oct 27, 2020 at 16:26

kevin41

1 Comment

tania Over a year ago

Please see the alternative solution I added to my answer, inspired by this solution and avoiding having to deal with the hyphens or relying on score length.
