0

How can I aggregate into distinct repeated fields?

Imagine this data:

WITH data as (
 select '5a' as room_id, 'george' as name_student, 13 as age_student, 'Mr. Smith' as name_teacher, 43 as id_teacher
union all 
 select '5a' as room_id, 'george' as name_student, 13 as age_student, 'Mr. Climp' as name_teacher, 38 as id_teacher
union all 
 select '5a' as room_id, 'jane' as name_student, 14 as age_student , 'Mr. Smith' as name_teacher, 43 as id_teacher
union all 
 select '5a' as room_id,  'jane' as name_student, 14 as age_student, 'Mr. Climp' as name_teacher, 38 as id_teacher
)

I'd like to have the room id and two sets of repeated fields: students and teachers. But when I run the query below, each array contains 4 entries, and every attempt to plug in DISTINCT returns an error.

SELECT room_id, 
        struct(array_agg(name_student) as name, array_agg(age_student) as age) as students,
        struct(array_agg(name_teacher) as name, array_agg(id_teacher) as id) as teachers,

from data
group by 1

How could I achieve unique arrays for students and for teachers?

The output should look like this: [image: expected output]

Thanks!

4 Comments
  • When you say "two sets of repeated fields", do you mean you want two rows in your output? That is, the student name would be repeated? Commented Feb 4, 2020 at 11:23
  • Repeated fields as defined in BQ table structures. Commented Feb 4, 2020 at 11:42
  • Ok, I see. You can add another field to the GROUP BY aggregation. It is also possible to add another nesting level with STRUCT; however, I do not understand what you want your output to look like. Can you elaborate on that in your question? Commented Feb 4, 2020 at 12:15
  • I see you updated your question; now it is clearer what the output should look like. However, your output ignores the following row: '5a' as room_id, 'george' as name_student, 13 as age_student, 'Mr. Climp' as name_teacher. Is that intentional? Also, when student.name = mick, do you want to treat it as a new piece of data or nest it inside room_id = 5a? Commented Feb 4, 2020 at 12:54

2 Answers

2

This answer is a little more verbose, but it should work for your needs. I prefer ARRAY_AGG(STRUCT()) over STRUCT(ARRAY_AGG(), ARRAY_AGG()) because it keeps the 'George is 13' and 'Jane is 14' relationships together. Imagine adding a 14-year-old George to your list: with two separate arrays, how would you tell which is which?

WITH data as (
 select '5a' as room_id, 'george' as name_student, 13 as age_student, 'Mr. Smith' as name_teacher, 43 as id_teacher
union all 
 select '5a' as room_id, 'george' as name_student, 13 as age_student, 'Mr. Climp' as name_teacher, 38 as id_teacher
union all 
 select '5a' as room_id, 'jane' as name_student, 14 as age_student , 'Mr. Smith' as name_teacher, 43 as id_teacher
union all 
 select '5a' as room_id,  'jane' as name_student, 14 as age_student, 'Mr. Climp' as name_teacher, 38 as id_teacher
),
students_distinct as (
  select distinct room_id, name_student as name, age_student as age from data
),
students_agg as (
  select room_id,array_agg(struct(name,age)) as student from students_distinct group by 1
),
teachers_distinct as (
  select distinct room_id, name_teacher as name, id_teacher as id from data
),
teachers_agg as (
  select room_id,array_agg(struct(name,id)) as teacher from teachers_distinct group by 1
)
select room_id, s.student, t.teacher
from students_agg s
inner join teachers_agg t using(room_id)

0

I ran your query with DISTINCT added inside all the ARRAY_AGG calls, and it works fine.

WITH data as (
 select '5a' as room_id, 'george' as name_student, 13 as age_student, 'Mr. Smith' as name_teacher, 43 as id_teacher
union all 
 select '5a' as room_id, 'george' as name_student, 13 as age_student, 'Mr. Climp' as name_teacher, 38 as id_teacher
union all 
 select '5a' as room_id, 'jane' as name_student, 14 as age_student , 'Mr. Smith' as name_teacher, 43 as id_teacher
union all 
 select '5a' as room_id,  'jane' as name_student, 14 as age_student, 'Mr. Climp' as name_teacher, 38 as id_teacher
)
SELECT room_id, 
        struct(array_agg(distinct name_student) as name, array_agg(distinct age_student) as age) as students,
        struct(array_agg(distinct name_teacher) as name, array_agg(distinct id_teacher) as id) as teachers
from data
group by 1

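However, I am not sure this will work correctly on a real dataset if you want a list of students with their ages and a list of teachers with their IDs. Each array is deduplicated independently, so the pairing between a name and its age (or ID) is not preserved. For example, adding select '5a' as room_id, 'george' as name_student, 20 as age_student, 'Mr. Climp' as name_teacher, 38 as id_teacher to the data table shows the issue: the name array still holds only george and jane, while the age array holds 13, 14 and 20, so the tuple (george, 20) is lost.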
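If the pairs have to survive, one possible approach is to aggregate STRUCTs and deduplicate them afterwards, since (as far as I know) BigQuery does not accept DISTINCT on STRUCT arguments inside ARRAY_AGG. The following is only a sketch under that assumption; the CTE name agg and the aliases s and t are illustrative and not taken from the question, and the sample data includes the extra george, 20 row from the example above.

```sql
-- Sketch only: `agg`, `s` and `t` are illustrative names, not from the original posts.
WITH data AS (
  SELECT '5a' AS room_id, 'george' AS name_student, 13 AS age_student, 'Mr. Smith' AS name_teacher, 43 AS id_teacher
  UNION ALL
  SELECT '5a', 'george', 13, 'Mr. Climp', 38
  UNION ALL
  SELECT '5a', 'jane', 14, 'Mr. Smith', 43
  UNION ALL
  SELECT '5a', 'jane', 14, 'Mr. Climp', 38
  UNION ALL
  -- extra row from the example above: a second george, aged 20
  SELECT '5a', 'george', 20, 'Mr. Climp', 38
),
agg AS (
  -- Step 1: collect (name, age) and (name, id) pairs per room, duplicates included
  SELECT
    room_id,
    ARRAY_AGG(STRUCT(name_student AS name, age_student AS age)) AS students,
    ARRAY_AGG(STRUCT(name_teacher AS name, id_teacher AS id)) AS teachers
  FROM data
  GROUP BY room_id
)
-- Step 2: deduplicate whole pairs by grouping on every field of the struct
SELECT
  room_id,
  ARRAY(SELECT AS STRUCT s.name, s.age FROM UNNEST(students) AS s GROUP BY s.name, s.age) AS students,
  ARRAY(SELECT AS STRUCT t.name, t.id FROM UNNEST(teachers) AS t GROUP BY t.name, t.id) AS teachers
FROM agg
```

With this shape, (george, 13) and (george, 20) should remain two separate entries in students, because deduplication happens on the whole pair rather than on each column. The order of the array elements is not guaranteed unless an ORDER BY is added inside the ARRAY subqueries.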

3 Comments

That's exactly the problem: george, 20 is broken up and lost.
It was not clear from the question that you wanted to do that; I guessed because it seemed the most reasonable thing. @rtenha's answer is perfect.
However, you need a student_id, because "name, age" is not very safe as a unique ID. Age also changes from one year to the next. What would happen next year? You would have the same student with a different "key". If you need to run analytics across years, this will make the results confusing, especially if age is updated on the student's actual birthday. In that case, even queries across months or weeks can return confusing results.
