In BigQuery, remove duplicate strings in string column

Question

Here is a short example of a messy CSV file that is loaded into Bigquery each day:

select 'Basketball' as sport, 'Main League' as league union all
select 'Basketball' as sport, 'Main League' as league union all
select 'BasketballBaseball' as sport, 'Second LeagueSecond League' as league union all
select 'Basketball' as sport, 'Second League' as league union all
select 'Basketball' as sport, 'Third League' as league union all
select 'BasketballFootball' as sport, 'Third LeagueThird League' as league union all
select 'Basketball' as sport, 'Third League' as league

The issue is that when a player is involved in multiple sports, the sports and leagues are concatenated into the same column. Assuming we know that Basketball would always be the first sport, we can check the sport column for extra sports in a few ways:

right(10) = 'basketball'
length(sport) = 10

What is tougher is to clean up the league column. The lack of space between the duplicate league names is tricky to deal with. Our desired output is:

select 'Basketball' as sport, 'Main League' as league union all
select 'Basketball' as sport, 'Main League' as league union all
select 'Basketball' as sport, 'Second League' as league union all
select 'Basketball' as sport, 'Second League' as league union all
select 'Basketball' as sport, 'Third League' as league union all
select 'Basketball' as sport, 'Third League' as league union all
select 'Basketball' as sport, 'Third League' as league

Yes, and the 3rd sport would be shown and the league name would be duplicated 3 times — Canovice
– Canovice, Commented Mar 3, 2022 at 10:36

Daniel Zagales · Accepted Answer · 2022-03-03 12:12:05Z

Try the following:

with sample_data as (
    select 'Basketball' as sport, 'Main League' as league union all
    select 'Basketball' as sport, 'Main League' as league union all
    select 'BasketballBaseball' as sport, 'Second LeagueSecond League' as league union all
    select 'Basketball' as sport, 'Second League' as league union all
    select 'Basketball' as sport, 'Third League' as league union all
    select 'BasketballFootball' as sport, 'Third LeagueThird League' as league union all
    select 'BasketballFootballSoccer' as sport, 'Third LeagueThird LeagueFourth League' as league union all
    select 'Basketball' as sport, 'Third League' as league
)

select 
    *
from sample_data
, UNNEST(regexp_extract_all(sport, r'([A-Z][a-z]+)')) as split_sport with offset as ss_offset
left join UNNEST(regexp_extract_all(league, r'([A-Z][a-z]+ League)')) as split_league with offset as sl_offset
 on ss_offset=sl_offset

Using regex functions allows you to get the elements you want into an array. Then you can join on the offsets to match the sport with the league.

This produces the following with the sample data provided, as a note i added a 3 sport scenario.

At this point you can add additional filter criteria to include just the 0 offset, a specific sport, or league.

Collectives™ on Stack Overflow

In BigQuery, remove duplicate strings in string column

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related