4

I want to keep only the numbers from a NumPy array of strings, not all of which are valid numbers. My code looks like the following:

age = train['age'].to_numpy() # 200k values
set(age)
# {'1', '2', '3', '7-11', np.nan...} 

age = np.array(['1', '2', '3', '7-11', np.nan])

Desired output: np.array([1, 2, 3]). Ideally, '7-11' would become 7, but that is not simple, so losing it is tolerable.

np.isfinite(age) raises a TypeError: "ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''"

x = [num for num in age if isinstance(num, (int, float))] returns []

4 Comments
  • This is in essence a follow-up to the following question, stackoverflow.com/q/11620914/5728614 Commented Jan 21, 2022 at 2:11
  • Please provide the desired output, which I assume is an array containing just [1, 2]? It should presumably return any integers and floating-point numbers, but what about a string that is a number, e.g. '1'? Should that be included as an integer type? A little more context about edge cases, etc. would be helpful. Commented Jan 21, 2022 at 2:26
  • Maybe this SO answer is helpful? If you are going to have to evaluate x like shown, with mixed types including strings, then the output x array will all be strings, and you can use isnumeric to filter the array for numbers, see docs Commented Jan 21, 2022 at 2:29
  • @frederick-douglas-pearce I have made a large edit and corrected the data types. I attempted one of the solutions from the answer you linked. I will give isnumeric a try. Commented Jan 21, 2022 at 2:58

2 Answers

2

Here's an option that will split strings on '-' first, and only take the first value, so '7-11' is converted to 7:

age = np.array(['1', '2', '3', '7-11', np.nan])
age_int = np.array([int(x[0]) for x in np.char.split(age, sep='-') if x[0].isdecimal()])

Output: array([1, 2, 3, 7])

There is a more efficient way to do this if you don't care about cases like '7-11':

age_int2 = age[np.char.isdecimal(age)].astype(int)

Output2: array([1, 2, 3])
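
One caveat, offered as a hedged sketch rather than part of the original answer: both snippets above rely on `age` having a string dtype, so that `np.nan` is stored as the string 'nan'. An array taken straight from a DataFrame column via `to_numpy()` will likely have object dtype and hold a real float NaN, and as far as I know the `np.char` functions reject non-string arrays. Assuming that is the situation, casting to `str` first should make both approaches work. The names `age_obj`, `age_str`, and `first_parts` are illustrative only.

```python
import numpy as np

# Hypothetical stand-in for train['age'].to_numpy():
# object dtype, with a real float NaN instead of the string 'nan'.
age_obj = np.array(['1', '2', '3', '7-11', np.nan], dtype=object)

# Cast every element to str so the np.char functions accept the array;
# the float NaN becomes the string 'nan'.
age_str = age_obj.astype(str)

# Keep only entries made up entirely of decimal digits.
age_int2 = age_str[np.char.isdecimal(age_str)].astype(int)

# Same idea, but keep the first part of ranges such as '7-11'.
first_parts = [parts[0] for parts in np.char.split(age_str, sep='-')]
age_int = np.array([int(p) for p in first_parts if p.isdecimal()])

print(age_int2)
print(age_int)
```

Note that splitting on '-' also discards negative numbers such as '-5', because the first piece is then an empty string; for ages that is probably acceptable.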

1

You could do something like the following:

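import numpy as np

age = np.array(['1', '2', '3', '7-11', np.nan])  # string dtype, so np.nan is stored as 'nan'

for pos, val in enumerate(age):
    try:
        new_val = int(val)
    except ValueError:
        # Not a plain integer (e.g. '7-11' or 'nan'): mark the entry for removal
        new_val = np.nan
    age[pos] = new_val  # written back as a string, so np.nan becomes 'nan'

# Drop the marked entries and convert the rest to integers
age = age[age != "nan"].astype(int)

print(age)
# [1 2 3]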
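
If overwriting `age` in place is undesirable, the same try/except idea can be wrapped in a small helper that builds a new array instead. This is only a sketch under that assumption; `to_int_or_none` and `parsed` are hypothetical names, not part of NumPy or of the original answer.

```python
import numpy as np


def to_int_or_none(val):
    """Return int(val), or None when val cannot be parsed as an integer."""
    try:
        return int(val)
    except (ValueError, TypeError):
        # ValueError covers strings like '7-11' or 'nan' (and a float NaN);
        # TypeError covers values such as None.
        return None


age = np.array(['1', '2', '3', '7-11', np.nan])

# Parse every entry, then keep only the ones that converted cleanly.
parsed = [to_int_or_none(v) for v in age]
age_int = np.array([v for v in parsed if v is not None], dtype=int)

print(age_int)
```

Catching only `ValueError` and `TypeError` keeps unrelated errors, such as a `KeyboardInterrupt`, from being silently swallowed by a bare `except`.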

2 Comments

After a small change, which I edited in, this code solves my problem. Thank you! I am going to wait a few days before accepting this, in case there is a nice one-liner or 'better' solution.
@frederick-douglas-pearce has a great one-liner!
