
I have a DataFrame that contains several records:

(Screenshot showing that the DataFrame contains data.)

I want to iterate over each row of this DataFrame to validate the data in each of its columns, doing something like the following:

val validDF = dfNextRows.map {
    x => ValidateRow(x)
}

def ValidateRow(row: Row): Boolean = {
    val nC = row.getString(0)
    val si = row.getString(1)
    val iD = row.getString(2)
    val iH = row.getString(3)
    val sF = row.getString(4)

    // Validate the data of each field of the row
    validateNC(nC)
    validateSI(si)
    validateID(iD)
    validateIH(iH)
    validateSF(sF)
    true
}

But, doing some tests, if I try to print the value of the val nC (to be sure that I'm sending the correct information to each function), nothing is printed:

def ValidateRow(row: Row): Boolean = {
    val nC = row.getString(0)
    val si = row.getString(1)
    val iD = row.getString(2)
    val iH = row.getString(3)
    val sF = row.getString(4)

    println(nC)

    validateNC(nC)
    validateSI(si)
    validateID(iD)
    validateIH(iH)
    validateSF(sF)
    true
}


How can I verify that I'm sending the correct information to each function (that is, that I'm reading the data of each column of the row correctly)?

Regards.

2 Answers


Spark's built-in DataFrame functions should give you a good start.

If your validation functions are simple enough (like checking for null values), then you can embed them directly:

dfNextRows.withColumn("num_cta", when(col("num_cta").isNotNull, col("num_cta")).otherwise(lit(0)))

You can do the same for other columns in the same manner, just by using the appropriate Spark DataFrame functions, as in the sketch below.
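A minimal sketch of that idea, chaining one withColumn per column (the si and sf column names and the specific rules here are hypothetical, just for illustration):

import org.apache.spark.sql.functions.{col, lit, trim, when}

// Hypothetical cleanup rules; substitute your own columns and checks.
val cleanedDF = dfNextRows
  .withColumn("num_cta", when(col("num_cta").isNotNull, col("num_cta")).otherwise(lit(0)))
  .withColumn("si", trim(col("si")))                                         // strip stray whitespace
  .withColumn("sf", when(col("sf") === "", lit("N/A")).otherwise(col("sf"))) // default empty strings

Each withColumn replaces the column with its validated version, so the result is a DataFrame with the same shape as the input.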

If your validation rules are complex, then you can use udf functions:

def validateNC = udf((num_cta: Long) => {
   // define your rules here, for example:
   if (num_cta < 0) 0L else num_cta
})

You can call the udf function using withColumn:

dfNextRows.withColumn("num_cta", validateNC(col("num_cta")))

You can do the same for the rest of your validation rules, as in the example below.
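For instance, assuming one udf per column defined along the lines of validateNC above (validateSI and validateSF here are hypothetical names):

import org.apache.spark.sql.functions.col

// Chain one withColumn per validated column; each udf rewrites its column in place.
val validatedDF = dfNextRows
  .withColumn("num_cta", validateNC(col("num_cta")))
  .withColumn("si", validateSI(col("si")))
  .withColumn("sf", validateSF(col("sf")))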

I hope to see your problem resolved soon.


2 Comments

And with the approach that you're mentioning, how could I save the rows that don't satisfy the UDF? Sorry for the dummy question. I'm so new to these Spark and Scala things.
You can define an if/else statement in the udf function: the if branch for rows that satisfy the rules, and the else branch for rows that don't.
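As a rough sketch of that suggestion: if the udf returns a Boolean instead of a rewritten value, you can use it with filter to split the DataFrame into valid and invalid rows (the rule shown is only an illustration):

import org.apache.spark.sql.functions.{col, udf}

// Hypothetical rule: a num_cta value is valid when it is positive.
val isValidNC = udf((num_cta: Long) => num_cta > 0)

val validRows   = dfNextRows.filter(isValidNC(col("num_cta")))
val invalidRows = dfNextRows.filter(!isValidNC(col("num_cta")))
// The rejected rows can then be written out for inspection, e.g.
// invalidRows.write.csv("/path/to/rejected")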

map is a transformation; you need to apply an action. For instance, you could do dfNextRows.map(x => ValidateRow(x)).first. Spark operates lazily, much like the Stream class in the standard collections.
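A minimal sketch of that point, assuming a SparkSession named spark (its implicits provide the Encoder[Boolean] that map needs):

import spark.implicits._

val validDF = dfNextRows.map(x => ValidateRow(x)) // transformation: nothing executes yet
validDF.show()                                    // action: the plan runs and ValidateRow is invoked

Once an action runs, println output from inside map appears on the driver console in local mode, but goes to the executor logs when running on a cluster.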

