0

CONTEXT

I have a DataFrame I'm trying to filter using multiple different conditions / "rulesets". There are 3 different rules with a set of conditions as follows:

Rule_DF    Variable    Operator    Value
Rule1      Gender      ==          M
Rule1      DOB         >           01/01/2000
Rule1      Height      >           72    <-- "inches"
Rule2      Gender      ==          F
Rule2      DOB         >=          06/10/1998
Rule2      Height      <           70
etc...

EXAMPLE DATA

Name    Gender    DOB          Height(in)
Bob     M         01/01/2001   74
Jane    F         01/01/1999   64
etc...

EXPECTED RESULT

rule1_df should contain a list of all the names of people that fit rule 1's conditions. rule2_df should contain a list of all the names of people that fit rule 2's conditions.

CURRENT CODE

I am at the point where the rules are all grouped by the rule name and the query is written as such:

for item in grouped_rule_df:
    query = ' & '.join(f'{i} {j} {repr(k)}' for i, j, k in zip(column, equal, condition))
    query = "Gender == 'M' & Height > 72 & DOB > '01/01/2000'"
    test = df.query(query)

PROBLEM

The query still returns a list of all the heights. If I run the query with just the height condition, it returns the correct list of people. Can you not combine data types in pandas .query? Or is PandaSQL/another package a better solution here?

1 Answer 1

2

Assuming rule_df the rules DataFrame and df the data one:

# first ensure the correct column names are used
# and that dates are in a YYYY-MM-DD format
tmp = (df.rename(columns={'Height(in)': 'Height'})
         .assign(DOB=lambda d: pd.to_datetime(d['DOB']).dt.strftime('%Y-%m-%d'))
      )

for rule, g in rule_df.groupby('Rule_DF'):
    # convert dates to YYYY-MM-DD
    # keep numbers as numeric, quote strings
    date = (pd.to_datetime(g['Value'], errors='coerce')
              .dt.strftime('%Y-%m-%d')
            )
    value = (pd.to_numeric(g['Value'], errors='coerce')
               .fillna('"'+date.fillna(g['Value'])+'"')
               .astype(str)
            )

    # craft query string and select
    query = ' & '.join('('+g['Variable']+g['Operator']+value+')')

    out[rule] = tmp.query(query)

Output:

{'Rule1':    Name Gender         DOB  Height
          0   Bob      M  2001-01-01      74,
 'Rule2':    Name Gender         DOB  Height
          1  Jane      F  1999-01-01      64,
}

Queries:

# Rule1
(Gender=="M") & (DOB>"2000-01-01") & (Height>72.0)

# Rule 2
(Gender=="F") & (DOB>="1998-06-10") & (Height<70.0)
Sign up to request clarification or add additional context in comments.

2 Comments

Thanks for the quick response but why did Rule2 not return the second person?
@KL_ because of the dates. They should be formatted as YYYY-MM-DD to allow proper lexicographic comparison. See update

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.