Here is a short example of a messy CSV file that is loaded into BigQuery each day:

select 'Basketball' as sport, 'Main League' as league union all
select 'Basketball' as sport, 'Main League' as league union all
select 'BasketballBaseball' as sport, 'Second LeagueSecond League' as league union all
select 'Basketball' as sport, 'Second League' as league union all
select 'Basketball' as sport, 'Third League' as league union all
select 'BasketballFootball' as sport, 'Third LeagueThird League' as league union all
select 'Basketball' as sport, 'Third League' as league

The issue is that when a player is involved in multiple sports, the sports and leagues are concatenated into the same column. Assuming we know that Basketball would always be the first sport, we can check the sport column for extra sports in a few ways:

  • right(sport, 10) = 'Basketball' (true only when nothing is appended after Basketball)
  • length(sport) = 10
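
As a minimal sketch of those two checks, assuming the data sits in a table or CTE called `sample_data` (a placeholder name) and that Basketball is always the first sport, something like this would flag the rows that carry extra sports:

```sql
with sample_data as (
    select 'Basketball' as sport, 'Main League' as league union all
    select 'BasketballBaseball' as sport, 'Second LeagueSecond League' as league
)

select
    sport,
    league,
    -- true when the last 10 characters are 'Basketball'; since Basketball is
    -- assumed to come first, that means no other sport was appended
    right(sport, 10) = 'Basketball' as ends_with_basketball,
    -- true when the value is exactly as long as 'Basketball'
    length(sport) = 10 as is_single_sport
from sample_data
```

Both flags should come back true for the first row and false for the second. They only detect the problem; they do not fix the league column.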

Cleaning up the league column is tougher, because the duplicated league names are concatenated with no separator between them. Our desired output is:

select 'Basketball' as sport, 'Main League' as league union all
select 'Basketball' as sport, 'Main League' as league union all
select 'Basketball' as sport, 'Second League' as league union all
select 'Basketball' as sport, 'Second League' as league union all
select 'Basketball' as sport, 'Third League' as league union all
select 'Basketball' as sport, 'Third League' as league union all
select 'Basketball' as sport, 'Third League' as league
  • Can someone be involved in 3 sports? Commented Mar 3, 2022 at 10:22
  • Yes, and the 3rd sport would be shown and the league name would be duplicated 3 times Commented Mar 3, 2022 at 10:36

1 Answer

Try the following:

with sample_data as (
    select 'Basketball' as sport, 'Main League' as league union all
    select 'Basketball' as sport, 'Main League' as league union all
    select 'BasketballBaseball' as sport, 'Second LeagueSecond League' as league union all
    select 'Basketball' as sport, 'Second League' as league union all
    select 'Basketball' as sport, 'Third League' as league union all
    select 'BasketballFootball' as sport, 'Third LeagueThird League' as league union all
    select 'BasketballFootballSoccer' as sport, 'Third LeagueThird LeagueFourth League' as league union all
    select 'Basketball' as sport, 'Third League' as league
)

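select
    *
from sample_data
-- split each concatenated sport value into an array, one output row per element
, unnest(regexp_extract_all(sport, r'([A-Z][a-z]+)')) as split_sport with offset as ss_offset
-- pair each sport with the league at the same array position
left join unnest(regexp_extract_all(league, r'([A-Z][a-z]+ League)')) as split_league with offset as sl_offset
    on ss_offset = sl_offset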

Using regex functions allows you to get the elements you want into an array. Then you can join on the offsets to match the sport with the league.
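
To see what the two patterns extract on their own, here is a small sketch run against the three-sport literal from the sample data. The column aliases `sports` and `leagues` are just illustrative names:

```sql
select
    -- each match is one capital letter followed by lowercase letters
    regexp_extract_all('BasketballFootballSoccer', r'([A-Z][a-z]+)') as sports,
    -- each match is one capitalized word followed by ' League'
    regexp_extract_all('Third LeagueThird LeagueFourth League', r'([A-Z][a-z]+ League)') as leagues
-- sports  → ['Basketball', 'Football', 'Soccer']
-- leagues → ['Third League', 'Third League', 'Fourth League']
```

Note that these patterns assume every sport is a single capitalized word and every league is a single capitalized word followed by " League". A multi-word sport or league name would need a different pattern.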

With the sample data provided, this returns one row per sport in each input row, paired with the league at the same offset. Note that I added a three-sport scenario to the sample data.

At this point you can add additional filter criteria to include just the 0 offset, a specific sport, or league.
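
For example, a sketch of the "just the 0 offset" filter, which should give the output asked for in the question (one row per input row, keeping only the first sport and its league). It uses a shortened copy of the sample data and assumes the same single-word naming as the patterns above:

```sql
with sample_data as (
    select 'Basketball' as sport, 'Main League' as league union all
    select 'BasketballBaseball' as sport, 'Second LeagueSecond League' as league union all
    select 'BasketballFootball' as sport, 'Third LeagueThird League' as league
)

select
    split_sport as sport,
    split_league as league
from sample_data
, unnest(regexp_extract_all(sport, r'([A-Z][a-z]+)')) as split_sport with offset as ss_offset
left join unnest(regexp_extract_all(league, r'([A-Z][a-z]+ League)')) as split_league with offset as sl_offset
    on ss_offset = sl_offset
-- offset 0 is the first element of each array, i.e. the first sport and its league
where ss_offset = 0
```

Swapping the `where` clause for something like `where split_sport = 'Football'` would filter to a specific sport instead.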
