Validating pandas dataframe columns

Question

I have a dataframe with columns as below -

u'wellthie_issuer_identifier', u'issuer_name', u'service_area_identifier', u'hios_plan_identifier', u'plan_year', u'type'

I need to validate values in each column and finally have a dataframe which is valid.

For example, I need to check if plan_year column satisfies below validation

presence: true, numericality: true, length: { is: 4 }

hios_plan_identifier column satisfies below regex.

          format: /\A(\d{5}[A-Z]{2}[a-zA-Z0-9]{3,7}-TMP|\d{5}[A-Z]{2}\d{3,7}(\-?\d{2})*)\z/,
          presence: true, length: { minimum: 10 },

type column contains,

in: ['MetalPlan', 'MedicarePlan', 'BasicHealthPlan', 'DualPlan', 'MedicaidPlan', 'ChipPlan']

There are lot of columns which I need to validate. I have tried to give an example data.

I am able to check regex with str.contains('\A(\d{5}[A-Z]{2}[a-zA-Z0-9]{3,7}-TMP|\d{5}[A-Z]{2}\d{3,7}(\-?\d{2})*)\Z', regex=True)

Similary I can check other validation as well individually. I am confused as to how to put all the validation together. Should I put all in a if loop with and conditions. Is there a easy way to validate the dataframe columns ? Need help here

Kevin Fang · Accepted Answer · 2018-11-14 06:50:37Z

6

There are multiple pandas functions you could use of. Basically the syntax you could use to filter your dataframe by content is:

df = df[(condition1) & (condition2) & ...] # filter the df and assign to the same df

Specifically for your case, you could replace condition with following functions(expressions):

df[some_column] == some_value 
df[some_column].isin(some_list_of_values) # This check whether the value of the column is one of the values in the list
df[some_column].str.contains() # You can use it the same as str.contains()
df[some_column].str.isdigit() # Same usage as str.isdigit(), check whether string is all digits, need to make sure column type is string in advance
df[some_column].str.len() == 4 # Filter string with length of 4

Finally, if you want to reset the index, you could use df = df.reset_index(drop=True) to reset your output df index to 0,1,2,...

Edit: To check for NaN, NaT, None values you could use

df[some_column].isnull()

For multiple columns, you could use

df[[col1, col2]].isin(valuelist).all(axis=1)

edited Nov 14, 2018 at 6:50

answered Nov 14, 2018 at 6:05

Kevin Fang

2,0124 gold badges18 silver badges33 bronze badges

Sign up to request clarification or add additional context in comments.

7 Comments

user1896796 Over a year ago

How to check if particualar column is not empty? using is not null?

Kevin Fang Over a year ago

df[some_column].isnull() can check for NaN, NaT, and None, but cannot check for inf and empty string, for empty string you can check the length instead

user1896796 Over a year ago

Hi Kevin, ((df['plan_year'].notnull().any()) & (df['plan_year'].str.isdigit()) & (df['plan_year'].str.len() == 4)) . is this the correct way to do it or can I group it somehow . Also this is throwing me Can only use .str accessor with string values, error.

Kevin Fang Over a year ago

@user1896796 You don't need the .any(), and you have to make sure your column type is string before using .str.isdigit(), if you're not sure you can do .astype(str).str.isdigit()

user1896796 Over a year ago

Thanks Kevin . One more help pls. I have same set of validations on multiple columns . Can I somehow group all columns together or I will have to validate again and again for all?

|

Vinicius Magalhães · Accepted Answer · 2024-01-24 17:56:59Z

follow the python script:

import pandas as pd
from typing import Optional, List, Tuple, Dict, Union

def validate_dataframe(df: pd.DataFrame, 
                        n_cols: Optional[int] = None, 
                        n_rows: Optional[Tuple[int, int]] = None, 
                        columns: Optional[List[str]] = None, 
                        column_types: Optional[Dict[str, type]] = None,
                        check_duplicates: Optional[bool] = False,
                        check_null_values: Optional[bool] = False,
                        unique_columns: Optional[List[str]] = None, 
                        column_ranges: Optional[Dict[str, Tuple[Union[int, float], Union[int, float]]]] = None,
                        date_columns: Optional[List[str]] = None,
                        categorical_columns: Optional[Dict[str, List[Union[str, int, float]]]] = None
                        ) -> Tuple[bool, str]:
    """
    Validates a Pandas DataFrame based on specified criteria.

    Parameters:
    - df (pd.DataFrame): The DataFrame to be validated.
    - n_cols (int, optional): Number of expected columns in the DataFrame.
    - n_rows (tuple, optional): Tuple (min_rows, max_rows) specifying the expected range of rows.
    - columns (list, optional): List of column names that should be present in the DataFrame.
    - column_types (dict, optional): Dictionary mapping column names to the expected data types.
    - check_duplicates (bool, optional): Check for the presence of duplicate rows in the DataFrame.
    - check_null_values (bool, optional): Check for the presence of null values in the DataFrame.
    - unique_columns (list, optional): List of columns that should have only unique values.
    - column_ranges (dict, optional): Dictionary mapping numeric columns to the allowed ranges.
    - date_columns (list, optional): List of columns containing date values to validate the format.
    - categorical_columns (dict, optional): Dictionary mapping categorical columns to allowed values.

    Returns:
    - tuple: (bool, str) indicating success or failure, and an optional description of the problem.
    """
    
    # Validate number of columns
    if n_cols is not None and len(df.columns) != n_cols:
        return (False, f"Error: Expected {n_cols} columns, but found {len(df.columns)} columns.")

    # Validate row range
    if n_rows is not None:
        min_rows, max_rows = n_rows
        if not (min_rows <= len(df) <= max_rows):
            return (False, f"Error: Number of rows should be between {min_rows} and {max_rows}.")

    # Validate columns
    if columns is not None:
        if not set(columns).issubset(df.columns):
            missing_columns = set(columns) - set(df.columns)
            return (False, f"Error: Missing columns: {missing_columns}.")

    # Validate column types
    if column_types is not None:
        if set(column_types.keys()).issubset(df.columns):
            for col, expected_type in column_types.items():
                if not df[col].dtype == expected_type:
                    return (False, f"Error: Column '{col}' should have type {expected_type}.")
        else:
            return (False, "Error: Some columns in column_types are not present in the DataFrame.")

    # Validate duplicates ££ ONLY SPECIFIC COLUMNS
    if check_duplicates and df.duplicated().any():
        return (False, "Duplicates found in the DataFrame.")

    # Validate null values ££ ONLY SPECIFIC COLUMNS
    if check_null_values and df.isnull().any().any():
        return (False, "DataFrame contains null values.")

    # Validate unique values in specific columns
    if unique_columns is not None:
        if set(unique_columns).issubset(df.columns):
            if df[unique_columns].nunique().ne(df[unique_columns].count()).any():
                return (False, "Some columns should have only unique values.")

    # Validate values in a specific range
    if column_ranges is not None:
        for col, value_range in column_ranges.items():
            if col in df.columns and not df[col].between(*value_range).all():
                return (False, f"Values in '{col}' should be between {value_range[0]} and {value_range[1]}.")

    # Validate date format (assuming 'date_columns' are date columns)
    if date_columns is not None:
        for col in date_columns:
            if col in df.columns:
                try:
                    pd.to_datetime(df[col], errors='raise')
                except ValueError:
                    return (False, f"'{col}' should be in a valid date format.")

    # Validate categorical values
    if categorical_columns is not None:
        for col, allowed_values in categorical_columns.items():
            if col in df.columns and not df[col].isin(allowed_values).all():
                return (False, f"Values in '{col}' should be {allowed_values}.")

    # If all validations pass, return True
    return (True, "DataFrame has passed all validations.")

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.

Collectives™ on Stack Overflow

Validating pandas dataframe columns

2 Answers 2

7 Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

7 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related