Here is a short example of the data from a messy CSV file that is loaded into BigQuery each day:
select 'Basketball' as sport, 'Main League' as league union all
select 'Basketball' as sport, 'Main League' as league union all
select 'BasketballBaseball' as sport, 'Second LeagueSecond League' as league union all
select 'Basketball' as sport, 'Second League' as league union all
select 'Basketball' as sport, 'Third League' as league union all
select 'BasketballFootball' as sport, 'Third LeagueThird League' as league union all
select 'Basketball' as sport, 'Third League' as league
The issue is that when a player is involved in multiple sports, the sport names are concatenated into the sport column and the league names into the league column. Assuming we know that Basketball is always the first sport, we can check the sport column for extra sports in a few ways:
right(sport, 10) = 'Basketball'
length(sport) = 10
Cleaning up the league column is tougher, because there is no separator between the duplicated league names. Our desired output is:
select 'Basketball' as sport, 'Main League' as league union all
select 'Basketball' as sport, 'Main League' as league union all
select 'Basketball' as sport, 'Second League' as league union all
select 'Basketball' as sport, 'Second League' as league union all
select 'Basketball' as sport, 'Third League' as league union all
select 'Basketball' as sport, 'Third League' as league union all
select 'Basketball' as sport, 'Third League' as league
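
One possible way to get there is sketched below. It assumes that a bad league value is always exactly two back-to-back copies of the same name (which is what the sample shows), and that Basketball is always the first sport. The CTE name `raw` is only a stand-in for the real table. Since BigQuery's regular expressions (RE2) do not support backreferences, the sketch compares the two halves of the string instead of using a pattern like `^(.+)\1$`.

```sql
with raw as (
  -- Stand-in for the loaded table; a few rows from the sample above.
  select 'Basketball' as sport, 'Main League' as league union all
  select 'BasketballBaseball' as sport, 'Second LeagueSecond League' as league union all
  select 'Basketball' as sport, 'Third League' as league union all
  select 'BasketballFootball' as sport, 'Third LeagueThird League' as league
)
select
  -- Assumption: Basketball is always listed first, so keep only that prefix.
  if(starts_with(sport, 'Basketball'), 'Basketball', sport) as sport,
  -- If the value has even length and its first half equals its second half,
  -- treat it as a doubled name and keep a single copy; otherwise leave it alone.
  if(
    mod(length(league), 2) = 0
      and substr(league, 1, div(length(league), 2))
        = substr(league, div(length(league), 2) + 1),
    substr(league, 1, div(length(league), 2)),
    league
  ) as league
from raw
```

This would not handle a league name repeated three or more times, or two different leagues joined together, and it would wrongly shorten a legitimate name that happens to consist of two identical halves.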
