2

I have a DataFrame and want to go through every cell in column c2 and count how many times that exact string appears in column c1 (zero if it never appears), then print the results.

Example df:

id     c1                c2
0      luke skywalker    han solo
1      leia organa       r2d2
2      darth vader       finn
3      han solo          the emporer
4      han solo          c3po
5      finn              leia organa
6      r2d2              darth vader

Example printed result:

han solo      2
r2d2          1
finn          1
the emporer   0
c3po          0
leia organa   1
darth vader   1

I'm using a Jupyter notebook with Python and pandas. Thanks!

1
  • I have NaN values in c2 which changes some of the solutions below as indicated by @wen. Commented Feb 16, 2018 at 3:45

3 Answers

3

You can use some NumPy magic.
Use count and broadcasting to compare each combination.

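import pandas as pd
# Also exposed publicly as numpy.char.count
from numpy.core.defchararray import count

c1 = df.c1.values.astype(str)
c2 = df.c2.values.astype(str)

# c2[:, None] broadcasts against c1: one row per c2 value, one column per c1 value
pd.Series(
    count(c1, c2[:, None]).sum(1),
    c2
)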

han solo       2
r2d2           1
finn           1
the emporer    0
c3po           0
leia organa    1
darth vader    1
dtype: int64
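One caveat worth checking: `count` counts substring occurrences, so a c2 value such as "han" would also be found inside "han solo". If only whole-string matches should count, a plain equality comparison with the same broadcasting pattern is one option. The sketch below rebuilds the question's example frame; the names `exact`, `substring_hits` and `exact_hits` are illustrative, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Example frame from the question.
df = pd.DataFrame({
    "c1": ["luke skywalker", "leia organa", "darth vader",
           "han solo", "han solo", "finn", "r2d2"],
    "c2": ["han solo", "r2d2", "finn", "the emporer",
           "c3po", "leia organa", "darth vader"],
})

c1 = df["c1"].to_numpy().astype(str)
c2 = df["c2"].to_numpy().astype(str)

# Rows are c2 values, columns are c1 values; a cell is True only on exact equality.
exact = pd.Series((c2[:, None] == c1).sum(axis=1), index=c2)

# Substring counting finds "han" inside both "han solo" cells,
# while exact comparison finds no cell equal to "han".
substring_hits = int(np.char.count(c1, "han").sum())
exact_hits = int((c1 == "han").sum())

print(exact)
```

For the example data both approaches give the same counts, because no c2 value is a proper substring of a c1 value.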

3 Comments

Out of curiosity, do you know what the "def" stands for? Searching for numpy.core.defchararray throws up all the methods, numpy.core.chararray throws up all those methods, and searching for the difference is obscured by the fact that this is a thing :(
@roganjosh hah! No, you've got me there. I have no idea.
I have NaN values in c2. Is there a way to remove them AFTER the pd.Series is created with this method? I can't remove the rows before because those rows in c1 could contain the strings for which I'm trying to match.
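On dropping the NaN entries after the Series is created: one possible approach is to keep the original c2 values (NaN included) as the index and filter on `Index.notna()` afterwards. This is only a sketch. The frame below is a hypothetical variant of the question's data with two NaN cells in c2, and it assumes that `astype(str)` turns those cells into the string 'nan' for the counting step.

```python
import numpy as np
import pandas as pd

# Hypothetical variant of the example frame with NaN in c2.
df = pd.DataFrame({
    "c1": ["luke skywalker", "leia organa", "darth vader",
           "han solo", "han solo", "finn", "r2d2"],
    "c2": ["han solo", np.nan, "finn", "the emporer",
           np.nan, "leia organa", "darth vader"],
})

c1 = df["c1"].to_numpy().astype(str)
c2 = df["c2"].to_numpy().astype(str)  # NaN cells become the string 'nan' here

# Same counting as in the answer, but indexed by the original c2 values
# so that the NaN labels are still recognisable as missing.
counts = pd.Series(
    np.char.count(c1, c2[:, None]).sum(axis=1),
    index=df["c2"].to_numpy(),
)

# Drop the entries whose label is NaN; all rows of c1 were still searched.
result = counts[counts.index.notna()]
print(result)
```

No rows are removed from the frame itself, so every c1 value still takes part in the matching.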
2

You can convert c1 to a categorical whose categories are the c2 values, then use value_counts

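# astype('category', categories=...) only works on older pandas; newer versions need a CategoricalDtype
df.c1.astype(pd.CategoricalDtype(categories=df.c2.tolist())).value_counts(sort=False)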
Out[572]: 
han solo       2
r2d2           1
finn           1
the emporer    0
c3po           0
leia organa    1
darth vader    1
Name: c1, dtype: int64

Or you can do

pd.crosstab(df.c2,df.c1).sum().reindex(df.c2,fill_value=0)
Out[592]: 
c2
han solo       2
r2d2           1
finn           1
the emporer    0
c3po           0
leia organa    1
darth vader    1

6 Comments

When I try the 'category' option I get an error: ValueError: Categorial categories cannot be null
@mapk It works fine on my side. Do you have NaN in c2?
Also, on the second option, the fill_value=0 seems to just fill them all with 0.
@mapk If you have NaN, then it is a totally different story.
I understand. Can you help me through it?
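For the NaN case, one way that might work is to count c1 once with `value_counts` and then look up only the non-null c2 values with `reindex`. This is a sketch under assumptions, not the answerer's method: the frame is a hypothetical variant of the question's data with NaN in c2, and it relies on the c1 counts having a unique index, which `value_counts` provides.

```python
import numpy as np
import pandas as pd

# Hypothetical variant of the example frame with NaN in c2.
df = pd.DataFrame({
    "c1": ["luke skywalker", "leia organa", "darth vader",
           "han solo", "han solo", "finn", "r2d2"],
    "c2": ["han solo", "r2d2", np.nan, "the emporer",
           "c3po", np.nan, "darth vader"],
})

# Exact-match frequency of every distinct c1 value (NaN is skipped by default).
counts = df["c1"].value_counts()

# Only the NaN cells of c2 are dropped; no rows are removed from c1.
targets = df["c2"].dropna()

# Look up each remaining c2 value; values absent from c1 get 0.
result = counts.reindex(targets, fill_value=0)
print(result)
```

Because the lookup is by label, this counts whole-string matches only.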
0
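# Column names must be quoted, and Series.count() does not count occurrences of a value,
# so compare c1 against each c2 value and sum the matches instead.
df['c3'] = [(df['c1'] == n).sum() for n in df['c2']]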

1 Comment

This is a syntax error since you don't complete your list comprehension. It's also not a good use of pandas at all since you create a whole new DataFrame rather than work with the one that exists. Two good answers were posted 20 mins prior to this; it would be worth reading those and understanding how they used the libraries before posting an answer.
