I have a pandas dataframe with 56 columns and about 120,000 rows.
I would like to implement validation only on some columns and not for all of them.
I followed the article at https://tmiguelt.github.io/PandasSchema/
When I run something like the function below, it raises this error:
"Invalid number of columns. The schema specifies 2, but the data frame has 56"
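
    import numpy as np
    import pandas_schema
    from pandas_schema import Column
    from pandas_schema.validation import CustomElementValidation

    def DoValidation(self, df):
        # Fails for any cell that is NaN
        null_validation = [CustomElementValidation(lambda d: d is not np.nan, 'this field cannot be null')]
        # Schema covering only two of the 56 columns
        schema = pandas_schema.Schema([Column('ItemId', null_validation),
                                       Column('ItemName', null_validation)])
        errors = schema.validate(df)
        if len(errors) > 0:
            for error in errors:
                print(error)
            return False
        return True
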
Am I doing something wrong?
What is the correct way to validate specific columns in a dataframe?
Note: I have to implement different types of validation (decimal, length, null checks, etc.) on different columns, not just the null check shown in the function above.
Your schema only has two columns in the list. As with pyspark, you need to define all 56 columns in the schema before passing the dataframe to the function.
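
If listing all 56 columns is impractical, another option is to validate only the columns of interest. Below is a minimal sketch in plain pandas, not PandasSchema's API. The rule functions (`not_null`, `max_length`, `is_decimal`), the helper `validate_columns`, and the `Price` and `Other` columns are hypothetical names made up for illustration.

```python
import pandas as pd

# Hypothetical per-column rules: each rule maps a Series to a boolean
# Series that is True where the value is valid.
def not_null(series: pd.Series) -> pd.Series:
    return series.notna()

def max_length(limit: int):
    def rule(series: pd.Series) -> pd.Series:
        # Nulls are left to not_null; only present values are length-checked.
        return series.map(lambda v: pd.isna(v) or len(str(v)) <= limit)
    rule.__name__ = f"max_length({limit})"
    return rule

def is_decimal(series: pd.Series) -> pd.Series:
    # Values that cannot be parsed as numbers become NaN under errors="coerce".
    return series.isna() | pd.to_numeric(series, errors="coerce").notna()

def validate_columns(df: pd.DataFrame, rules: dict) -> list:
    """Return error strings for the columns named in `rules` only."""
    errors = []
    for column, checks in rules.items():
        if column not in df.columns:
            errors.append(f"missing column: {column}")
            continue
        for check in checks:
            valid = check(df[column]).astype(bool).to_numpy()
            for row in df.index[~valid]:
                errors.append(f"row {row}, column {column}: failed {check.__name__}")
    return errors

# Example frame with an extra column ("Other") that has no rules.
df = pd.DataFrame({
    "ItemId": [1, 2, None],
    "ItemName": ["bolt", None, "a-very-long-item-name"],
    "Price": ["1.50", "abc", "3"],
    "Other": ["x", "y", "z"],
})
rules = {
    "ItemId": [not_null],
    "ItemName": [not_null, max_length(10)],
    "Price": [is_decimal],
}
errors = validate_columns(df, rules)
for error in errors:
    print(error)
```

Columns without an entry in `rules` are ignored, so the other columns never need to be declared. With PandasSchema itself, the same idea would presumably be to pass only the schema's columns, e.g. `schema.validate(df[['ItemId', 'ItemName']])`, so that the column counts match. That is an assumption worth checking against the library's documentation.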