1

I loaded a JSON document in Spark, roughly, it looks like:

root
 |-- datasetid: string (nullable = true)
 |-- fields: struct (nullable = true)
...
 |    |-- type_description: string (nullable = true)

My DF is turning it into:

df = df.withColumn("desc", df.col("fields.type_description"));

All fine, but type_description's value looks like: "1 - My description type".

Ideally, I'd like my df to contain only the textual part, e.g. "My description type". I know how to do that, but how can I make it through Spark?

I was hoping some along the line of:

df = df.withColumn("desc", df.col("fields.type_description").call(/* some kind of transformation class / method*/));

Thanks!

2
  • So what exactly are you looking for? Regexp? Substring? Could you update the question to reflect that? Commented Jul 4, 2016 at 17:44
  • Ideally it could be anything... in this situation, i would manage with a substring and trim (there is never more than 2 digits)... but I have other situations that are more interesting like parsing, concatenation of values between columns, call to joda times, etc. Commented Jul 4, 2016 at 17:54

1 Answer 1

2

In general Spark provides a broad set of SQL functions which vary from basic string processing utilities, through date / time processing functions, to different statistical summaries. This are part of o.a.s.sql.functions. In this particular case you probably want something like this:

import static org.apache.spark.sql.functions.*;

df.withColumn("desc",
  regexp_replace(df.col("fields.type_description"), "^[0-9]*\\s*-\\s*", "")
);

Generally speaking these functions should be your first choice when working with Spark SQL. There are backed by Catalyst expressions and typically provide codegen utilities. It means you can fully benefit from different Spark SQL optimizations.

Alternative, but less efficient approach, is to implement custom UDF. See for example Creating a SparkSQL UDF in Java outside of SQLContext

Sign up to request clarification or add additional context in comments.

1 Comment

Awesome - I saw we could do UDF with Python, but I am really glad we can do that with Java too! Tx!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.