
I need to cast the columns of a data frame, which all contain string values, to the data types defined in a schema. While doing the cast, corrupt records (values of the wrong data type) should be captured in a separate column.

Example of Dataframe

+---+----------+-----+
|id |name      |class|
+---+----------+-----+
|1  |abc       |21   |
|2  |bca       |32   |
|3  |abab      | 4   |
|4  |baba      |5a   |
|5  |cccca     |     |
+---+----------+-----+

JSON schema of the file:

 {
   "definitions": {},
   "$schema": "http://json-schema.org/draft-07/schema#",
   "$id": "http://example.com/root.json",
   "type": ["object", "null"],
   "required": ["id", "name", "class"],
   "properties": {
     "id":    { "$id": "#/properties/id",    "type": ["integer", "null"] },
     "name":  { "$id": "#/properties/name",  "type": ["string", "null"] },
     "class": { "$id": "#/properties/class", "type": ["integer", "null"] }
   }
 }

Here, row 4 is a corrupt record because the class column is of type Integer. Only this record should end up in the corrupt-records column, not the 5th row (where class is null).

  • Is the number of columns fixed? Do you decide each column's schema dynamically, or is it fixed, e.g. the Class column will always be Int? Commented Jun 21, 2019 at 7:58
  • Can you be a bit clearer about the schema of the input data frame? Initially every column will be String, so how do you decide the target type for each column? Commented Jun 21, 2019 at 8:08
  • We have a JSON schema from which we extract the column names and their data types. The data types of the columns are Id: Integer, Name: String, Class: Integer. Commented Jun 21, 2019 at 9:21
  • You can write a simple UDF that type-checks each row against your business rule: if it finds corrupt data it returns false, otherwise true. Run that UDF over the df and keep the result in a new column; the rows with a false value are the ones with something incorrect. If you can provide the JSON file, I can try to write it. :) Commented Jun 21, 2019 at 9:28
  • JSON Schema: {"definitions":{},"$schema":"json-schema.org/draft-07/…":{"id":{"$id":"#/properties/id","type":["integer","null"]},"name":{"$id":"#/properties/name","type":["string","null"]},"class":{"$id":"#/properties/class","type":["integer","null"]}}} Commented Jun 21, 2019 at 9:46
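The per-row type check suggested in the comments above can be sketched in plain Scala; `parsesAsInt` and `rowIsValid` are hypothetical helpers that, wrapped in a Spark UDF, would produce the true/false validity column (assuming the schema Id: Integer, Name: String, Class: Integer, with nulls allowed everywhere):

```scala
import scala.util.Try

// True when a string parses as an integer (e.g. "21", " 4 "), false otherwise.
def parsesAsInt(s: String): Boolean = Try(s.trim.toInt).isSuccess

// A row is valid when every non-null value matches its declared type;
// name is declared String, so any value is acceptable there.
def rowIsValid(id: String, name: String, cls: String): Boolean =
  (id == null || parsesAsInt(id)) && (cls == null || parsesAsInt(cls))
```

With the example data, `rowIsValid("4", "baba", "5a")` is false while `rowIsValid("5", "cccca", null)` is true, matching the expected corrupt-record classification.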

1 Answer


Just check whether the value is NOT NULL before casting and NULL after casting:

import org.apache.spark.sql.functions.when
import spark.implicits._ // enables the $"colName" column syntax

df
  .withColumn("class_integer", $"class".cast("integer"))
  .withColumn(
    "class_corrupted", 
    when($"class".isNotNull and $"class_integer".isNull, $"class"))

Repeat for each column / cast you need.
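The NULL-before/NULL-after rule can be modelled in plain Scala (outside Spark) to see why row 4 is flagged while a null class is not; `castToInt` and `isCorrupt` are hypothetical helpers mimicking the fact that `cast("integer")` returns null for an unparseable string:

```scala
import scala.util.Try

// Hypothetical model of Spark's cast("integer") on a string column:
// a present value that fails to parse becomes None (Spark: null).
def castToInt(value: Option[String]): Option[Int] =
  value.flatMap(s => Try(s.trim.toInt).toOption)

// The answer's rule: corrupt iff the original value is present
// but the cast result is not.
def isCorrupt(value: Option[String]): Boolean =
  value.nonEmpty && castToInt(value).isEmpty
```

Under this model `isCorrupt(Some("5a"))` holds, `isCorrupt(None)` does not, which is exactly the `isNotNull and isNull` condition in the snippet above.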
