I have basically two tables, Orders and Items. Because these tables are imported from Google Cloud Datastore backup files, references are not stored as a simple ID field. A one-to-one relationship is stored as a <STRUCT> whose id field holds the actual unique ID I want to match. A one-to-many relationship (REPEATED) is stored as an ARRAY of <STRUCT>.

I can query the one-to-one relationships with a LEFT OUTER JOIN. I also know how to join a non-repeated struct with a repeated string or int, but I am having trouble writing a similar join query against a repeated struct.

One Order with one item:

#standardSQL
WITH Orders AS (
  SELECT 1 AS __oid__, STRUCT(STRUCT(2 AS id, "default" AS ns) AS key) AS item UNION ALL 
  SELECT 2 AS __oid__, STRUCT(STRUCT(4 AS id, "default" AS ns) AS key) AS item UNION ALL 
  SELECT 3 AS __oid__, STRUCT(STRUCT(6 AS id, "default" AS ns) AS key) AS item
),
Items AS (
  SELECT STRUCT(1 AS id, "default" AS ns) AS key, "#1.1" AS title UNION ALL
  SELECT STRUCT(2 AS id, "default" AS ns) AS key, "#1.2" AS title UNION ALL
  SELECT STRUCT(3 AS id, "default" AS ns) AS key, "#1.3" AS title UNION ALL
  SELECT STRUCT(4 AS id, "default" AS ns) AS key, "#1.4" AS title UNION ALL
  SELECT STRUCT(5 AS id, "default" AS ns) AS key, "#1.5" AS title UNION ALL
  SELECT STRUCT(6 AS id, "default" AS ns) AS key, "#1.6" AS title
)

SELECT
   __oid__
  ,Order_item AS item
FROM Orders  

LEFT OUTER JOIN(
  SELECT
     key
    ,title
  FROM Items
) Order_item
ON Order_item.key.id = item.key.id

Result (works as expected):

+-----+---------+--------------+-------------+------------+
| Row | __oid__ |  item.key.id | item.key.ns | item.title |
+-----+---------+--------------+-------------+------------+
|   1 |       1 |            2 |     default |       #1.2 |
+-----+---------+--------------+-------------+------------+
|   2 |       2 |            4 |     default |       #1.4 |
+-----+---------+--------------+-------------+------------+
|   3 |       3 |            6 |     default |       #1.6 |
+-----+---------+--------------+-------------+------------+

Similar query, but this time one order with many items:

#standardSQL
WITH Orders AS (
  SELECT 1 AS __oid__, ARRAY[STRUCT(STRUCT(1 AS id, "default" AS ns) AS key), STRUCT(STRUCT(2 AS id, "default" AS ns) AS key)] AS items UNION ALL 
  SELECT 2 AS __oid__, ARRAY[STRUCT(STRUCT(3 AS id, "default" AS ns) AS key), STRUCT(STRUCT(4 AS id, "default" AS ns) AS key)] AS items UNION ALL 
  SELECT 3 AS __oid__, ARRAY[STRUCT(STRUCT(5 AS id, "default" AS ns) AS key), STRUCT(STRUCT(6 AS id, "default" AS ns) AS key)] AS items
),
Items AS (
  SELECT STRUCT(1 AS id, "default" AS ns) AS key, "#1.1" AS title UNION ALL
  SELECT STRUCT(2 AS id, "default" AS ns) AS key, "#1.2" AS title UNION ALL
  SELECT STRUCT(3 AS id, "default" AS ns) AS key, "#1.3" AS title UNION ALL
  SELECT STRUCT(4 AS id, "default" AS ns) AS key, "#1.4" AS title UNION ALL
  SELECT STRUCT(5 AS id, "default" AS ns) AS key, "#1.5" AS title UNION ALL
  SELECT STRUCT(6 AS id, "default" AS ns) AS key, "#1.6" AS title
)

SELECT
   __oid__
  ,Order_items AS items
FROM Orders  

LEFT OUTER JOIN(
  SELECT
     key
    ,title
  FROM Items
) Order_items
ON Order_items.key.id IN (SELECT item.key.id FROM UNNEST(items) AS item)

Error: IN subquery is not supported inside join predicate.

I actually expected this result:

+-----+---------+--------------+-------------+------------+
| Row | __oid__ |  item.key.id | item.key.ns | item.title |
+-----+---------+--------------+-------------+------------+
|   1 |       1 |            1 |     default |       #1.1 |
|     |         |            2 |     default |       #1.2 |
+-----+---------+--------------+-------------+------------+
|   2 |       2 |            3 |     default |       #1.3 |
|     |         |            4 |     default |       #1.4 |
+-----+---------+--------------+-------------+------------+
|   3 |       3 |            5 |     default |       #1.5 |
|     |         |            6 |     default |       #1.6 |
+-----+---------+--------------+-------------+------------+

How do I change the second query to get the expected result?

2 Answers


An alternative is to use a CROSS JOIN instead of a LEFT JOIN and move the IN condition into the WHERE clause:

#standardSQL
WITH Orders AS (
  SELECT 1 AS __oid__, ARRAY[STRUCT(STRUCT(1 AS id, "default" AS ns) AS key), STRUCT(STRUCT(2 AS id, "default" AS ns) AS key)] AS items UNION ALL 
  SELECT 2 AS __oid__, ARRAY[STRUCT(STRUCT(3 AS id, "default" AS ns) AS key), STRUCT(STRUCT(4 AS id, "default" AS ns) AS key)] AS items UNION ALL 
  SELECT 3 AS __oid__, ARRAY[STRUCT(STRUCT(5 AS id, "default" AS ns) AS key), STRUCT(STRUCT(6 AS id, "default" AS ns) AS key)] AS items
),
Items AS (
  SELECT STRUCT(1 AS id, "default" AS ns) AS key, "#1.1" AS title UNION ALL
  SELECT STRUCT(2 AS id, "default" AS ns) AS key, "#1.2" AS title UNION ALL
  SELECT STRUCT(3 AS id, "default" AS ns) AS key, "#1.3" AS title UNION ALL
  SELECT STRUCT(4 AS id, "default" AS ns) AS key, "#1.4" AS title UNION ALL
  SELECT STRUCT(5 AS id, "default" AS ns) AS key, "#1.5" AS title UNION ALL
  SELECT STRUCT(6 AS id, "default" AS ns) AS key, "#1.6" AS title
)

SELECT
   __oid__
  ,ARRAY_AGG(Order_items) AS items
FROM Orders  

CROSS JOIN(
  SELECT
     key
    ,title
  FROM Items
) Order_items
WHERE Order_items.key.id IN (SELECT item.key.id FROM UNNEST(items) AS item)
GROUP BY __oid__
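
One caveat worth checking against your own data: because the IN filter is applied in the WHERE clause after the cross join, the query above behaves like an inner join rather than a LEFT OUTER JOIN. The following is only a minimal sketch with made-up data (order 2 and item id 99 are hypothetical, chosen so that one order has no matching row in Items); it is not taken from the question.

```sql
#standardSQL
-- Hypothetical data: order 2 references item id 99, which has no row in Items.
WITH Orders AS (
  SELECT 1 AS __oid__, ARRAY[STRUCT(STRUCT(1 AS id, "default" AS ns) AS key), STRUCT(STRUCT(2 AS id, "default" AS ns) AS key)] AS items UNION ALL
  SELECT 2 AS __oid__, ARRAY[STRUCT(STRUCT(99 AS id, "default" AS ns) AS key)] AS items
),
Items AS (
  SELECT STRUCT(1 AS id, "default" AS ns) AS key, "#1.1" AS title UNION ALL
  SELECT STRUCT(2 AS id, "default" AS ns) AS key, "#1.2" AS title
)

SELECT
   __oid__
  ,ARRAY_AGG(Order_items) AS items
FROM Orders

CROSS JOIN Items AS Order_items
-- The filter runs on the cross-joined pairs, so an order for which no
-- pair survives produces no output row at all.
WHERE Order_items.key.id IN (SELECT item.key.id FROM UNNEST(items) AS item)
GROUP BY __oid__
```

With this sample only order 1 comes back; order 2 is dropped. If every order in your data references existing items, as in the question, the two behaviours coincide.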

1 Comment

Although the solution suggested by Elliott does return the same result, the CROSS JOIN approach performed significantly faster for this sample and with my production data. So I have marked this answer as the correct one.
1

The problem is that BigQuery can't hash-partition the join keys from the two sides (since the join is expressed as an IN condition). You can make this work by flattening the array on the left-hand side and then aggregating the items from the right:

#standardSQL
WITH Orders AS (
  SELECT 1 AS __oid__, ARRAY[STRUCT(STRUCT(1 AS id, "default" AS ns) AS key), STRUCT(STRUCT(2 AS id, "default" AS ns) AS key)] AS items UNION ALL 
  SELECT 2 AS __oid__, ARRAY[STRUCT(STRUCT(3 AS id, "default" AS ns) AS key), STRUCT(STRUCT(4 AS id, "default" AS ns) AS key)] AS items UNION ALL 
  SELECT 3 AS __oid__, ARRAY[STRUCT(STRUCT(5 AS id, "default" AS ns) AS key), STRUCT(STRUCT(6 AS id, "default" AS ns) AS key)] AS items
),
Items AS (
  SELECT STRUCT(1 AS id, "default" AS ns) AS key, "#1.1" AS title UNION ALL
  SELECT STRUCT(2 AS id, "default" AS ns) AS key, "#1.2" AS title UNION ALL
  SELECT STRUCT(3 AS id, "default" AS ns) AS key, "#1.3" AS title UNION ALL
  SELECT STRUCT(4 AS id, "default" AS ns) AS key, "#1.4" AS title UNION ALL
  SELECT STRUCT(5 AS id, "default" AS ns) AS key, "#1.5" AS title UNION ALL
  SELECT STRUCT(6 AS id, "default" AS ns) AS key, "#1.6" AS title
)

SELECT
   __oid__
  ,ARRAY_AGG(Order_items) AS items
FROM Orders,
UNNEST(items) AS item

LEFT OUTER JOIN(
  SELECT
     key
    ,title
  FROM Items
) Order_items
ON Order_items.key.id = item.key.id
GROUP BY __oid__

This looks like what you wanted in any case, since your original query would have had items just as a struct rather than as an array of structs.
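
To make the "flatten, then aggregate" idea concrete, here is a small sketch of the intermediate step alone: the same join as above, but without GROUP BY and ARRAY_AGG. The column alias item_id and the ORDER BY are additions for readability and are not part of the original answer.

```sql
#standardSQL
WITH Orders AS (
  SELECT 1 AS __oid__, ARRAY[STRUCT(STRUCT(1 AS id, "default" AS ns) AS key), STRUCT(STRUCT(2 AS id, "default" AS ns) AS key)] AS items UNION ALL
  SELECT 2 AS __oid__, ARRAY[STRUCT(STRUCT(3 AS id, "default" AS ns) AS key), STRUCT(STRUCT(4 AS id, "default" AS ns) AS key)] AS items UNION ALL
  SELECT 3 AS __oid__, ARRAY[STRUCT(STRUCT(5 AS id, "default" AS ns) AS key), STRUCT(STRUCT(6 AS id, "default" AS ns) AS key)] AS items
),
Items AS (
  SELECT STRUCT(1 AS id, "default" AS ns) AS key, "#1.1" AS title UNION ALL
  SELECT STRUCT(2 AS id, "default" AS ns) AS key, "#1.2" AS title UNION ALL
  SELECT STRUCT(3 AS id, "default" AS ns) AS key, "#1.3" AS title UNION ALL
  SELECT STRUCT(4 AS id, "default" AS ns) AS key, "#1.4" AS title UNION ALL
  SELECT STRUCT(5 AS id, "default" AS ns) AS key, "#1.5" AS title UNION ALL
  SELECT STRUCT(6 AS id, "default" AS ns) AS key, "#1.6" AS title
)

SELECT
   __oid__
  ,item.key.id AS item_id   -- one output row per element of Orders.items
  ,Order_items.title        -- NULL if the id has no row in Items
FROM Orders,
UNNEST(items) AS item       -- flatten the repeated struct

LEFT OUTER JOIN Items AS Order_items
ON Order_items.key.id = item.key.id   -- plain equality join on the flattened id
ORDER BY __oid__, item_id
```

This returns six rows, one per (order, item) pair, which is the flat form of the table the question expects. Note that the comma join with UNNEST skips orders whose items array is empty; writing LEFT JOIN UNNEST(items) AS item instead would keep such orders, at the cost of a NULL item row.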
