
I have an RDD with string columns, and I want to know whether a given string column holds numeric values. I'm looking for a very inexpensive way to do this, since I have many tables with millions of records.

For example, I've tried casting the column to int, float, etc., but I get all null values, so the count is always zero:

spark.sql('''select count('Violations')
             from tmp
             where cast('Violations' as int) is not null''').show()

returns a count of zero. I know for a fact that this column contains the string '9' in at least one of its rows. I've tried variations of this with the count() function and with cast() before the from clause. Is this a pipe dream?

I saw the Stack Overflow post with the UDF using isdigit, but it looks awfully expensive.

1 Answer


If your code is as shown, then you're literally counting the string literal 'Violations' instead of referring to a column named Violations. Try removing the single quotes around Violations.


2 Comments

I spent too long on this; I knew this worked 4 days ago! Do you think there is an even less expensive way to produce the same result? I am wondering whether 'select count(cast(Violations as int)) from tmp' is inherently faster. I'll definitely test this, but I didn't see an answer on Stack Overflow.
Hi @spacedustpi, I'm not sure whether it would be faster, but it's definitely worth testing.
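On the follow-up question: count(cast(Violations as int)) and filtering on cast(...) is not null count exactly the same rows, because count() skips nulls; both still have to attempt the cast on every row. A plain-Python sketch of that equivalence, using a hypothetical helper that mimics Spark's null-on-failure cast semantics (for illustration only, not Spark's actual implementation):

```python
def cast_to_int(value):
    """Mimic Spark SQL's cast(... as int): return None on failure instead of raising."""
    try:
        return int(value.strip())
    except (ValueError, AttributeError):
        return None

rows = ["9", "Violations", " 42 ", None]

# ... where cast(Violations as int) is not null
count_where = sum(1 for v in rows if cast_to_int(v) is not None)

# count(cast(Violations as int)) -- count() ignores nulls
count_cast = sum(1 for v in rows if cast_to_int(v) is not None)

print(count_where, count_cast)  # both count only the castable rows
```

Since both formulations do one full scan plus one cast per row, any speed difference is likely down to the query plan rather than the amount of work, so testing on real data is the right call.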
