
I have a DuckDB table objects with an int32 column type and a custom (Python) scalar UDF type_str that converts the enum value to a human-readable string.
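
A minimal sketch of how such a UDF is registered with DuckDB's Python API (the mapping and database file below are placeholders, not my real type_str):

import duckdb

# Hypothetical enum-to-name mapping; the real type_str logic is not shown here.
TYPE_NAMES = {1: "CAN_FRAME", 2: "ETH_FRAME"}

def type_str(type_id: int) -> str:
    # Fall back to a placeholder name for unknown enum values.
    return TYPE_NAMES.get(type_id, f"UNKNOWN_{type_id}")

con = duckdb.connect("objects.db")  # assumed database file name
con.create_function("type_str", type_str,
                    [duckdb.typing.INTEGER], duckdb.typing.VARCHAR)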

This query is very fast:

select type_str(type) as name, type, count(*) as count from objects
group by type 
having count > 1000
order by count desc;

which means the type_str function is not called for every row.

However, this query is very slow:

select type_str(type) as name, type, count(*) as count from objects
group by type 
having count > 1000 and name[0:3] = 'CAN'
order by count desc;

The documentation of HAVING says

The HAVING clause can be used after the GROUP BY clause to provide filter criteria after the grouping has been completed.

So I don't understand why this second query is much slower. It shouldn't have to do more work.
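
One way to check where the filter is actually evaluated is to look at the plan with EXPLAIN; a sketch, assuming the UDF was registered on a connection con as above:

# Print the query plan to see whether the filter on name is applied
# per row (via the UDF) or after aggregation.
for _, plan in con.execute("""
    explain
    select type_str(type) as name, type, count(*) as count from objects
    group by type
    having count > 1000 and name[0:3] = 'CAN'
    order by count desc
""").fetchall():
    print(plan)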

  • EXPLAIN suggests that it is running the UDF on all rows first: FILTER (array_slice(type_str(..))) -> FILTER (count ...). There seem to be a couple of similar issues on GitHub (github.com/duckdb/duckdb/issues?q=udf+filter), but not this specific case. You may need to ask DuckDB directly if you don't get a reply here. Commented Nov 14, 2024 at 18:09
  • Even putting the whole first query as a subquery and then selecting from that is slow. I may have to materialize it as a new table in order to make this fast. Commented Nov 15, 2024 at 9:52

1 Answer


Here is a work-around:

create temp table type_count as
select type_str(type) as name, type, count(*) as count from objects
group by type 
order by count desc;

select name, type, count from type_count
where name[0:3] = 'CAN'
order by count desc;

This is fast.
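
This presumably works because the temp table is materialized once, with type_str evaluated only per group (as the fast first query in the question suggests), and the string filter then runs over the small type_count table without invoking the UDF again. If you are driving this from Python, a sketch of the same workaround through the connection that registered the UDF (con as assumed in the question):

con.execute("""
    create temp table type_count as
    select type_str(type) as name, type, count(*) as count from objects
    group by type
    order by count desc
""")

rows = con.execute("""
    select name, type, count from type_count
    where name[0:3] = 'CAN'
    order by count desc
""").fetchall()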
