Right now I'm having a performance issue with this query:

select userid from table_x inner join table_y on array_contains(split(table_y.userids,','),cast(table_x.userid as string))

The userids column in table_y is stored as a string of numbers such as "123, 134, 156", which represents three user IDs: 123, 134, and 156. Table_x has a userid column and holds the personal information of each user. I want to select every userid from table_x that is contained in the userids column of table_y.

Am I right in assuming that the performance issue comes from having to convert the userids in table_y to an array of strings with split(table_y.userids,',') and then calling array_contains on strings? If so, does anyone know how to convert the string of userids into an array of integers?

Thank you!

1 Answer

It seems that you are doing a Cartesian product join. Hive cannot use array_contains as a join key, so the condition is applied as a filter only after Hive has generated every possible combination of rows from the two tables.
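To make the cost easier to see, here is a rough sketch of what the original query amounts to under that explanation. It is an illustration of the behavior described above, not the exact plan Hive produces; the aliases x and y are added for readability, and the actual plan for your Hive version can be inspected by prefixing the query with EXPLAIN.

```sql
-- Conceptual equivalent of the original query (illustration only):
-- every row of table_x is paired with every row of table_y,
-- and array_contains is evaluated once per pair as a filter.
SELECT x.userid
FROM table_x x
CROSS JOIN table_y y
WHERE array_contains(split(y.userids, ','), CAST(x.userid AS STRING));
```

With N rows in table_x and M rows in table_y, this evaluates the split and the array lookup N × M times, which is why the work grows so quickly.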

To get a true join, use explode(split(table_y.userids,',')) to turn each ID into its own row, and then do a regular equality join:

select x.uid
from (select cast(table_x.userid as string) as uid from table_x) x
inner join
(select explode(split(table_y.userids,',')) as uid from table_y) y
on x.uid = y.uid;
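If you would rather compare numbers than strings, as the question asks, a possible variant is to cast each exploded element instead of casting table_x.userid. This is only a sketch under assumptions: that table_x.userid is an integer type, that every element of userids is numeric, and that the stored strings may contain spaces after the commas (as in "123, 134, 156"), which is why trim is applied. The names ids and uid_str are made up for the example.

```sql
-- Sketch: explode the comma-separated list into one row per ID,
-- strip surrounding whitespace, and cast each piece to a number
-- so the join compares integers instead of strings.
SELECT x.userid
FROM table_x x
INNER JOIN (
    SELECT CAST(trim(uid_str) AS BIGINT) AS uid
    FROM table_y
    LATERAL VIEW explode(split(userids, ',')) ids AS uid_str
) y
ON x.userid = y.uid;
```

Like the query above, this returns one row per match, so a userid that appears in several table_y rows shows up several times. Elements that are not valid numbers become NULL after the cast and never match.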
1 Comment

Thanks for the help and explanation! This actually gave me much faster performance.
