
I have a DataFrame that contains several records:

(Screenshot showing that the DataFrame contains data.)

I want to iterate over each row of this DataFrame to validate the data in each of its columns, doing something like the following:

val validDF = dfNextRows.map {
    x => ValidateRow(x)
}

def ValidateRow(row: Row): Boolean = {
    val nC = row.getString(0)
    val si = row.getString(1)
    val iD = row.getString(2)
    val iH = row.getString(3)
    val sF = row.getString(4)

    // Validate the data of each field of the row
    validateNC(nC)
    validateSI(si)
    validateID(iD)
    validateIH(iH)
    validateSF(sF)
    true
}

But, doing some tests, if I try to print the value of the val nC (to be sure that I'm sending the correct information to each function), nothing is printed:

def ValidateRow(row: Row): Boolean = {
    val nC = row.getString(0)
    val si = row.getString(1)
    val iD = row.getString(2)
    val iH = row.getString(3)
    val sF = row.getString(4)

    println(nC)

    validateNC(nC)
    validateSI(si)
    validateID(iD)
    validateIH(iH)
    validateSF(sF)
    true
}


How can I verify that I'm sending the correct information to each function (that is, that I'm reading the data of each column of the row correctly)?

Regards.

2 Answers


Spark's built-in DataFrame functions should give you a good start.

If your validation functions are simple enough (like checking for null values), then you can embed them directly:

dfNextRows.withColumn("num_cta", when(col("num_cta").isNotNull, col("num_cta")).otherwise(lit(0)))

You can do the same for other columns in the same manner, just by using the appropriate Spark DataFrame functions, as in the sketch below.
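A minimal sketch of that idea, chaining one withColumn per column (the si and sf column names and the specific rules here are hypothetical, just for illustration):

import org.apache.spark.sql.functions.{col, lit, trim, when}

// Hypothetical cleanup rules; substitute your own columns and checks.
val cleanedDF = dfNextRows
  .withColumn("num_cta", when(col("num_cta").isNotNull, col("num_cta")).otherwise(lit(0)))
  .withColumn("si", trim(col("si")))                                         // strip stray whitespace
  .withColumn("sf", when(col("sf") === "", lit("N/A")).otherwise(col("sf"))) // default empty strings

Each withColumn replaces the column with its validated version, so the result is a DataFrame with the same shape as the input.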

If your validation rules are complex, then you can use udf functions:

def validateNC = udf((num_cta: Long) => {
   // define your rules here, for example:
   if (num_cta < 0) 0L else num_cta
})

You can call the udf function using withColumn:

dfNextRows.withColumn("num_cta", validateNC(col("num_cta")))

You can do the same for the rest of your validation rules, as in the example below.
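For instance, assuming one udf per column defined along the lines of validateNC above (validateSI and validateSF here are hypothetical names):

import org.apache.spark.sql.functions.col

// Chain one withColumn per validated column; each udf rewrites its column in place.
val validatedDF = dfNextRows
  .withColumn("num_cta", validateNC(col("num_cta")))
  .withColumn("si", validateSI(col("si")))
  .withColumn("sf", validateSF(col("sf")))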

I hope to see your problem resolved soon.


2 Comments

And with the approach that you're mentioning, how could I save the rows that don't satisfy the UDF? Sorry for the dummy question. I'm so new to these Spark and Scala things.
You can define an if/else statement in the udf function: the if branch for rows that satisfy the rules, and the else branch for rows that don't.
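As a rough sketch of that suggestion: if the udf returns a Boolean instead of a rewritten value, you can use it with filter to split the DataFrame into valid and invalid rows (the rule shown is only an illustration):

import org.apache.spark.sql.functions.{col, udf}

// Hypothetical rule: a num_cta value is valid when it is positive.
val isValidNC = udf((num_cta: Long) => num_cta > 0)

val validRows   = dfNextRows.filter(isValidNC(col("num_cta")))
val invalidRows = dfNextRows.filter(!isValidNC(col("num_cta")))
// The rejected rows can then be written out for inspection, e.g.
// invalidRows.write.csv("/path/to/rejected")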

map is a transformation; you need to apply an action. For instance, you could do dfNextRows.map(x => ValidateRow(x)).first. Spark operates lazily, much like the Stream class in the standard collections.
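A minimal sketch of that point, assuming a SparkSession named spark (its implicits provide the Encoder[Boolean] that map needs):

import spark.implicits._

val validDF = dfNextRows.map(x => ValidateRow(x)) // transformation: nothing executes yet
validDF.show()                                    // action: the plan runs and ValidateRow is invoked

Once an action runs, println output from inside map appears on the driver console in local mode, but goes to the executor logs when running on a cluster.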

