6

I have the following model defined with Flask-SQLAlchemy:

"""models.py"""

from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()

skill_candidate = db.Table(
    'SkillCandidate',
    db.Column('skill_id', db.String, db.ForeignKey('skill.id')),
    db.Column('candidate_id', db.Integer, db.ForeignKey('candidate.id')))

class Candidate(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    skills = db.relationship("Skill", secondary=skill_candidate)

class Skill(db.Model):
    id = db.Column(db.String, primary_key=True)
    name = db.Column(db.String, nullable=False, unique=True)

What I am trying to achieve is the following: I want to return all the candidates who possess every skill in a given input list (ideally, a list of skill_id values).

I tried the following:

def get_skilled_candidates(skill_ids):
    return Candidate.query.join(skill_candidate).\
       filter(and_(*[skill_candidate.c.skill_id == skill_id for skill_id in skill_ids])).\
            all()

The aim was to build one filter per skill and combine them with an and_ statement.

It works well with a list of one item (it returns all candidates that possess that skill), but it returns nothing as soon as I add more skills to the input list, even though I have candidates in the database that fit the criteria.
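
A minimal sketch that reproduces this behaviour with the standard library's sqlite3 module instead of Flask-SQLAlchemy. The table layout mirrors the models above; the sample rows and the helper name `candidates_matching_all_equalities` are made up for illustration. Each joined row carries exactly one skill_id, so a condition that requires it to equal two different values at once can never hold:

```python
import sqlite3

def make_db():
    # Hypothetical sample data; the table layout mirrors the models above.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE candidate (id INTEGER PRIMARY KEY);
        CREATE TABLE SkillCandidate (skill_id TEXT, candidate_id INTEGER);
        INSERT INTO candidate (id) VALUES (1), (2);
        INSERT INTO SkillCandidate VALUES ('py', 1), ('sql', 1), ('py', 2);
    """)
    return conn

def candidates_matching_all_equalities(conn, skill_ids):
    # Same shape as the and_(...) filter: one equality per requested skill,
    # all of them tested against the same joined row.
    where = " AND ".join("sc.skill_id = ?" for _ in skill_ids)
    sql = (
        "SELECT c.id FROM candidate c "
        "JOIN SkillCandidate sc ON sc.candidate_id = c.id "
        f"WHERE {where} ORDER BY c.id"
    )
    return [row[0] for row in conn.execute(sql, list(skill_ids))]

conn = make_db()
one_skill = candidates_matching_all_equalities(conn, ["py"])
two_skills = candidates_matching_all_equalities(conn, ["py", "sql"])
print(one_skill, two_skills)
```

Candidate 1 holds both skills in this sample data, yet the two-skill call finds no rows, which matches the symptom described above.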

5
  • Could you show us what exactly constraint_item_candidate and constraint_item_candidate.c are in your query? Commented Sep 24, 2019 at 20:51
  • my mistake, it's a typo. constraint_item_candidate is actually meant to be skill_candidate, the association table of Skill and Candidate. skill_candidate.c is the way of accessing column fields for a db.Table instance Commented Sep 25, 2019 at 11:05
  • You need relational division / "for all", which translates to "not exists skill id that not exists in skill_candidate". Some examples: stackoverflow.com/questions/49438529/…, stackoverflow.com/questions/42673699/… Commented Sep 25, 2019 at 20:18
  • Awesome pinpoint, but I cannot totally wrap my head around that double negation... I'll try to write an expression as soon as I handle it Commented Sep 26, 2019 at 9:12
  • I thought I had it, I tried first sending the raw query: SQL select * from SkillCandidate where not (exists (select * from SkillCandidate where SkillCandidate.skill_id not in (1, 2))) But it results in returning an empty result (1 and 2 are the ids of the required skills) Commented Sep 26, 2019 at 13:07

2 Answers

10

Be careful with this answer: shorter is not always better. The answer by @IljaEverilä using relational division will likely perform much better in many cases.

You could query all candidates that have any of the skills in your list and then filter the result with a list comprehension. This moves the final filtering into Python, so it will often be slower than the relational division approach, but it keeps the query itself simple.

skill_ids = ['id_1', 'id_2']
# Fetch every candidate that has at least one of the requested skills
candidates = db.session.query(Candidate).\
    filter(Candidate.skills.any(Skill.id.in_(skill_ids))).\
    all()

candidates = [
    c for c in candidates
    if set(s.id for s in c.skills).issuperset(skill_ids)
]
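
The same two-step idea can be sketched without the ORM, using the standard library's sqlite3 module. This is only an illustration under assumptions: the sample rows and the function name `candidates_with_all_skills` are hypothetical, and an empty skill list is assumed to match nobody. The SQL step loads the skills of every candidate that has at least one requested skill, and the Python step keeps those whose skill set is a superset of the request:

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SkillCandidate (skill_id TEXT, candidate_id INTEGER);
    INSERT INTO SkillCandidate VALUES
        ('py', 1), ('sql', 1), ('py', 2), ('go', 3);
""")

def candidates_with_all_skills(conn, skill_ids):
    wanted = set(skill_ids)
    if not wanted:
        return []  # assumption: an empty request matches nobody
    marks = ", ".join("?" for _ in wanted)
    # Step 1 (SQL): all skills of candidates having ANY requested skill.
    sql = (
        "SELECT candidate_id, skill_id FROM SkillCandidate "
        "WHERE candidate_id IN ("
        "SELECT candidate_id FROM SkillCandidate "
        f"WHERE skill_id IN ({marks}))"
    )
    owned = defaultdict(set)
    for candidate_id, skill_id in conn.execute(sql, list(wanted)):
        owned[candidate_id].add(skill_id)
    # Step 2 (Python): keep candidates that hold ALL requested skills.
    return sorted(cid for cid, skills in owned.items()
                  if skills.issuperset(wanted))

print(candidates_with_all_skills(conn, ["py", "sql"]))
```

The cost of this approach is that every partially matching candidate is transferred to Python before being discarded.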

4 Comments

I could go for this, but I am concerned about the performance impact of letting Python handle the filtering instead of the SQL engine
@AugBar - yeah, just realized that I think I missed an important part of your question where you are wanting to get a list of only those candidates that possess all of the skills in the list, correct?
They have to possess at least all the skills specified in the list
@AugBar - edited to produce the desired results, but I would guess that the relational division approach mentioned in the question comments would be more performant.
6
+100

As noted in the comments, what you'd need is a FORALL operation (universal quantifier), or relational division.

FORALL x ( p(x) )

can be expressed as

NOT ( EXISTS x ( NOT ( p(x) ) ) )

which is a bit unwieldy and hard to reason about if you are not familiar with FORALL and how it relates to EXISTS. Given your models it could look like:

def get_skilled_candidates(skill_ids):
    # Form a temporary derived table using unions
    skills = db.union_all(*[
        db.select([db.literal(sid).label('skill_id')])
        for sid in skill_ids]).alias()

    return Candidate.query.\
        filter(
            ~db.exists().select_from(skills).where(
                ~db.exists().
                    where(db.and_(skill_candidate.c.skill_id == skills.c.skill_id,
                                  skill_candidate.c.candidate_id == Candidate.id)).
                    correlate_except(skill_candidate))).\
        all()
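
To see the double negation at work outside the ORM, here is a rough sketch in plain SQL through the standard library's sqlite3 module. The sample rows and the function names `get_skilled_candidates_sql` and `get_skilled_candidates_py` are hypothetical, and returning every candidate for an empty skill list is an assumption (FORALL over an empty set is vacuously true). The second function states the same predicate directly with all(), as a cross-check:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE candidate (id INTEGER PRIMARY KEY);
    CREATE TABLE SkillCandidate (skill_id TEXT, candidate_id INTEGER);
    INSERT INTO candidate (id) VALUES (1), (2), (3);
    INSERT INTO SkillCandidate VALUES
        ('py', 1), ('sql', 1), ('py', 2), ('py', 3), ('sql', 3), ('go', 3);
""")

def get_skilled_candidates_sql(conn, skill_ids):
    skill_ids = list(skill_ids)
    if not skill_ids:
        # Assumption: with nothing required, every candidate qualifies.
        return [r[0] for r in conn.execute("SELECT id FROM candidate ORDER BY id")]
    # Derived table of required skills, built with UNION ALL.
    derived = " UNION ALL ".join(
        ["SELECT ? AS skill_id"] + ["SELECT ?"] * (len(skill_ids) - 1))
    sql = f"""
        SELECT c.id FROM candidate c
        WHERE NOT EXISTS (              -- no required skill ...
            SELECT 1 FROM ({derived}) AS wanted
            WHERE NOT EXISTS (          -- ... that the candidate lacks
                SELECT 1 FROM SkillCandidate sc
                WHERE sc.skill_id = wanted.skill_id
                  AND sc.candidate_id = c.id))
        ORDER BY c.id
    """
    return [row[0] for row in conn.execute(sql, skill_ids)]

def get_skilled_candidates_py(conn, skill_ids):
    # The same predicate written as FORALL, using all().
    owned = {cid: set() for (cid,) in conn.execute("SELECT id FROM candidate")}
    for sid, cid in conn.execute("SELECT skill_id, candidate_id FROM SkillCandidate"):
        owned[cid].add(sid)
    return sorted(cid for cid, skills in owned.items()
                  if all(s in skills for s in skill_ids))

print(get_skilled_candidates_sql(conn, ["py", "sql"]))
```

Reading the SQL from the inside out helps: the inner NOT EXISTS finds a required skill the candidate lacks, and the outer NOT EXISTS rejects any candidate for whom such a skill exists.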

There are of course other ways to express the same query, such as:

def get_skilled_candidates(skill_ids):
    return Candidate.query.\
        join(skill_candidate).\
        filter(skill_candidate.c.skill_id.in_(skill_ids)).\
        group_by(Candidate.id).\
        having(db.func.count(skill_candidate.c.skill_id.distinct()) ==
               len(set(skill_ids))).\
        all()

which essentially checks by count that all skill ids were matched.
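
A small sketch of why both the DISTINCT and the set() matter in the count check, again using the standard library's sqlite3 module with made-up sample rows and a hypothetical function name `get_skilled_candidates_by_count`. The association table has no primary key, so duplicate rows are possible, and the input list may itself contain duplicates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE candidate (id INTEGER PRIMARY KEY);
    CREATE TABLE SkillCandidate (skill_id TEXT, candidate_id INTEGER);
    INSERT INTO candidate (id) VALUES (1), (2);
    -- Candidate 2 has the same skill recorded twice.
    INSERT INTO SkillCandidate VALUES
        ('py', 1), ('sql', 1), ('py', 2), ('py', 2);
""")

def get_skilled_candidates_by_count(conn, skill_ids, distinct=True):
    unique_ids = sorted(set(skill_ids))
    if not unique_ids:
        return []  # assumption: an empty request matches nobody
    marks = ", ".join("?" for _ in unique_ids)
    counted = "DISTINCT sc.skill_id" if distinct else "sc.skill_id"
    sql = f"""
        SELECT c.id FROM candidate c
        JOIN SkillCandidate sc ON sc.candidate_id = c.id
        WHERE sc.skill_id IN ({marks})
        GROUP BY c.id
        HAVING COUNT({counted}) = ?
        ORDER BY c.id
    """
    return [row[0] for row in conn.execute(sql, unique_ids + [len(unique_ids)])]

print(get_skilled_candidates_by_count(conn, ["py", "py", "sql"]))
print(get_skilled_candidates_by_count(conn, ["py", "sql"], distinct=False))
```

With `distinct=False`, candidate 2 is wrongly included because the duplicated 'py' row is counted twice.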

If using PostgreSQL you could also do:

from sqlalchemy.dialects.postgresql import array_agg

def get_skilled_candidates(skill_ids):
    # The double filtering may seem redundant, but the WHERE ... IN allows
    # the query to use indexes, while the HAVING ... @> does the final filtering.
    return Candidate.query.\
        join(skill_candidate).\
        filter(skill_candidate.c.skill_id.in_(skill_ids)).\
        group_by(Candidate.id).\
        having(array_agg(skill_candidate.c.skill_id).contains(skill_ids)).\
        all()

This is somewhat equivalent to the partly Python solution from the other answer.

Also, the aggregate EVERY could be used:

def get_skilled_candidates(skill_ids):
    # Form a temporary derived table using unions
    skills = db.union_all(*[
        db.select([db.literal(sid).label('skill_id')])
        for sid in skill_ids]).alias()

    # Perform a CROSS JOIN between candidate and skills
    return Candidate.query.\
        join(skills, db.true()).\
        group_by(Candidate.id).\
        having(db.func.every(
            db.exists().
                where(db.and_(skill_candidate.c.skill_id == skills.c.skill_id,
                              skill_candidate.c.candidate_id == Candidate.id)).
                correlate_except(skill_candidate))).\
        all()

1 Comment

Excellent. That's what I was missing: the initial condition ~db.exists().select_from(skills). I was also considering the count condition part. I am going to test all those solutions performance-wise
