1

I am currently using this method to convert a set of integers (the variable words) to a numpy array:

    wordMask = np.asarray( [ int(x not in words)  for x in xrange(0,nwords) ] ).reshape(nwords,1)

Here nwords can be as large as 10000.

Instead of recomputing wordMask every time I could keep it as a separate variable and whenever I add/remove an element from words I make the corresponding change to wordMask, but I'm wondering if there is a reasonably efficient way to recompute wordMask.

Edit: My main concern is that the list comprehension:

[ x not in words for x in xrange(0,nwords) ]

will be slow, and I'm looking for a faster way to perform that iteration in the context of creating a numpy array.

4
  • What does words look like? Why do you need .reshape()? Any reason not to us a numpy.ma.array? Commented Aug 12, 2017 at 16:38
  • x is an integer. What is words? x not in words will just be False for everything right? Commented Aug 12, 2017 at 16:39
  • words is a set() of integers. I could use the intset module for better performance I guess, but I'm not doing that at the moment. e.g. words = set(); a.add(3); a.add(6) Commented Aug 12, 2017 at 17:48
  • @AChampion - I'll look into numpy.ma.array. I need a column vector for wordMask. AFAIK, .reshape() is very performant in that it doesn't make a copy of the array. Commented Aug 12, 2017 at 17:50

2 Answers 2

1

Let's take this sample code to begin:

import numpy as np

words = set([1, 3, 5, 7, 9])
nwords = len(words)
wordMask = np.asarray([ [int(x not in words)] for x in xrange(0, nwords) ])

I changed the asarray code to automatically reshape it from the start. Now that you have the word mask, it is simple to update whenever you add or remove a word.

Adding a word:

newWord = 4
words.add(newWord)
nwords = len(words)
wordMask = np.concatenate((wordMask, [[1]])) # Resize the mask
if newWord < nwords:                         # Update the mask
    wordMask[newWord][0] = 0 

Removing a word:

oldWord = 3
if oldWord < nwords:                         # Update the mask
    wordMask[oldWord][0] = 1
words.discard(oldWord)
nwords = len(words)
wordMask = np.resize(wordMask, (nwords, 1))  # Resize the mask

And there you have it!

I would have personally used a dictionary, as they are easier to deal with. I'm guessing you chose to use NumPy for memory efficiency? NumPy always copies arrays on resize, so the concatenate() and resize() methods will not necessarily be quick. One option is to set the mask to a fixed size (upper bound) and update any time something changes. That would be much more efficient.

If it is guaranteed that the max size is 10000, then remove the size updates from the above code, and set nwords to 10000.

Sign up to request clarification or add additional context in comments.

2 Comments

Thanks - I already know how to keep wordMask and words in sync. I'm really asking if there is a good way to recompute wordMask from words on the fly. My main concern is that the list comprehension will be slow.
If you're only running the list comprehension once, what is your concern for speed? 10000 will run quite quickly either way.
1

You could try using numpy.bincount. Here's an example usage:

>>> nwords = 10
>>> words = {1, 5, 3, 8, 6}
>>> import numpy
>>> numpy.bincount(list(s), minlength=nwords)  # words present
array([0, 1, 0, 1, 0, 1, 1, 0, 1, 0])
>>> 1 - numpy.bincount(list(s), minlength=nwords)  # words absent
array([1, 0, 1, 0, 1, 0, 0, 1, 0, 1])

Unfortunately this still involves a Python-level iteration over the elements of s (implicit in the list call). I couldn't find a quicker way to get NumPy to interpret the set here than converting it to a list first.

Some timings (with Python 3.6, NumPy 1.13.1) show that this solution is faster, but not incredibly so:

In [13]: %timeit numpy.asarray( [ int(x not in words)  for x in range(0,nwords) ] )
6.17 µs ± 25.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [14]: %timeit 1 - numpy.bincount(list(s), minlength=nwords)
3.97 µs ± 228 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.