I have a numpy.ndarray of shape (61000L, 2L) whose elements are strings.

I split each string into a list of its words, keeping the lists inside the numpy.ndarray, with the following code:

words_data = np.char.split(string_data)
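
For reference, here is a small sketch of what np.char.split returns. The array `sample` and its four strings are made up for illustration and stand in for the real 61000-row data. The result is an object array of the same shape whose elements are Python lists of words:

```python
import numpy as np

# Hypothetical 2x2 stand-in for the real string array.
sample = np.array([["hello my name is nick", "hello my name is Java"],
                   ["hello my name is Nick", "hello my name is C++"]])

# Split every element on whitespace; each element becomes a list of words.
words = np.char.split(sample)

print(words.shape)   # same shape as the input: (2, 2)
print(words.dtype)   # object
print(words[0, 0])   # ['hello', 'my', 'name', 'is', 'nick']
```

Because each element is a list rather than a string, anything that needs hashable items (such as a Counter fed the elements directly) will not accept them as-is.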

I tried to make a double for-loop that counts the unique words found within each list.

from collections import Counter
counts = Counter()
for i in range(words_data.shape[0]):
    for j in range(words_data[1]):
        counts.update(words_data[i])

counts

The code above raises the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-39-680a0105eebd> in <module>()
      1 counts = Counter()
      2 for i in range(words_data.shape[0]):
----> 3     for j in range(words_data[1]):
      4         counts.update(words_data[i])
      5 

TypeError: only size-1 arrays can be converted to Python scalars

Here are the first 8 rows of my data:

x = np.array([["hello my name is nick", "hello my name is Nick", "hello my name is Carly",
               "hello my name is Ashley", "hello my name is Java", "hello my name is C++",
               "hello my name is Ruby", "hello my name is Python"],
              ["hello my name is Java", "hello my name is C++", "hello my name is Ruby",
               "hello my name is Python", "hello my name is nick", "hello my name is Nick",
               "hello my name is Carly", "hello my name is Ashley"]])

# Transpose from shape (2, 8) to (8, 2): 8 rows, 2 columns.
x = x.transpose()

1 Answer

No loops required here. Here is one solution:

from collections import Counter
from itertools import chain
import numpy as np

string_data = np.array([["hello my name is nick", "hello my name is Nick", "hello my name is Carly",
                         "hello my name is Ashley", "hello my name is Java", "hello my name is C++",
                         "hello my name is Ruby", "hello my name is Python"],
                         ["hello my name is Java", "hello my name is C++", "hello my name is Ruby",
                          "hello my name is Python", "hello my name is nick", "hello my name is Nick",
                          "hello my name is Carly", "hello my name is Ashley"]])

word_count = Counter(' '.join(chain.from_iterable(string_data)).split())

# Counter({'Ashley': 2,
#          'C++': 2,
#          'Carly': 2,
#          'Java': 2,
#          'Nick': 2,
#          'Python': 2,
#          'Ruby': 2,
#          'hello': 16,
#          'is': 16,
#          'my': 16,
#          'name': 16,
#          'nick': 2})
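
For comparison, the loop from the question can also be made to work. The TypeError there comes from range(words_data[1]), which hands range an entire row of the array instead of an integer; the number of columns is words_data.shape[1]. A minimal sketch, assuming words_data is the object array of word lists produced by np.char.split, and using a shortened made-up sample in place of the real data:

```python
from collections import Counter
import numpy as np

# Shortened, illustrative sample (not the full data set).
string_data = np.array([["hello my name is nick", "hello my name is Java"],
                        ["hello my name is Nick", "hello my name is C++"]])
words_data = np.char.split(string_data)

counts = Counter()
for i in range(words_data.shape[0]):        # rows
    for j in range(words_data.shape[1]):    # columns: an int, not a row of the array
        counts.update(words_data[i, j])     # add the words of one cell

print(counts["hello"])  # 4
```

The loop visits each cell exactly once, so every word list is counted a single time.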
10 Comments

When I run the code above, it gives me a TypeError: unhashable type: 'list' @jp_data_analysis
@JayganeshKalla, works for me on python 3.6 / numpy 1.11. Are you testing exactly what I've posted?
I tried your code on a separate jupyter notebook and it worked, but when I tried to use my data it outputs the error I mentioned. I am using Python 2.7.14/ numpy 1.14.0, @jp_data_analysis
In that case, you have to show a sample of your data so we have a reproducible example.
I have added the first 8 rows.
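
One possible explanation for the unhashable type: 'list' error is that the array being counted had already been split with np.char.split, so its elements are lists rather than strings. This is an assumption, since the failing call is not shown in the thread. The sketch below reproduces that error on a shortened made-up sample and then counts the words from the split array by flattening it first:

```python
from collections import Counter
from itertools import chain
import numpy as np

# Shortened, illustrative sample (not the full data set).
string_data = np.array([["hello my name is nick", "hello my name is Java"],
                        ["hello my name is Nick", "hello my name is C++"]])
words_data = np.char.split(string_data)   # object array of word lists

error_message = None
try:
    # Chaining the rows yields the cells, and each cell is a list,
    # which Counter cannot hash.
    Counter(chain.from_iterable(words_data))
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Flatten to a 1-D sequence of lists first, then chain the lists into words.
word_count = Counter(chain.from_iterable(words_data.ravel()))
print(word_count["hello"])  # 4
```

If the input is still the original array of strings, the answer's join-and-split one-liner applies unchanged.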