
I have to make multiple SQL queries of entire tables and concatenate them into one big data table.

I have a dictionary where the key is a team name and the value is an acronym that serves as the prefix of the corresponding MySQL table:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine('mysql+mysqlconnector://%s:%s@%s/%s' % (mysql_user, mysql_password, mysql_host, mysql_dbname), echo=False, pool_recycle=1800)

    mysql_conn = engine.connect()

    team_dfs = []

    nba_dict = {'New York Knicks': 'nyk',
                'Boston Celtics': 'bos',
                'Golden State Warriors': 'gsw'}

    for name, abbr in nba_dict.items():
        query = f'''
        SELECT *
        FROM {abbr}_record
        '''
        df = pd.read_sql_query(query, mysql_conn)

        df['team_name'] = name

        team_dfs.append(df)

    team_dfs = pd.concat(team_dfs)

Is there a better way to refactor this code and make it more efficient?

  • Don't separate records per team if you are going to query them together. Avoid SELECT *. Don't try to preemptively optimize your code around the data; get the data into a properly structured database and the performance and simple code will follow from that. Commented Dec 9, 2021 at 3:49
  • 100% agree. I'm not the DB admin, and the goal of the above task is to create a gold layer table where all this data is combined in one place. Commented Dec 9, 2021 at 3:58
  • Since this is a one-time task to fix a broken schema, does efficiency matter? Commented Dec 10, 2021 at 6:12

2 Answers


Your database layout, with a separate table for each team, is doomed to inefficiency whenever you need to retrieve data for more than one team at a time. You would be much better off putting all that data in one table, with a column identifying the team for each row.

Why inefficient? More tables mean more work, and more queries mean more work.

I suggest you push back, hard, on the designer of this database table structure. Its design is, bluntly, wrong.
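For illustration, a consolidated table might look roughly like this. This is a hypothetical sketch reusing the connection from the question's code; the non-key columns are assumptions, since the real per-team schema isn't shown:

    # Hypothetical sketch of the single-table layout suggested above: one table
    # holding every team's rows, keyed by a team column, instead of one
    # {abbr}_record table per team. Non-key columns are assumptions.
    from sqlalchemy import text

    ddl = '''
    CREATE TABLE team_record (
        team      VARCHAR(3) NOT NULL,   -- 'nyk', 'bos', 'gsw', ...
        game_date DATE       NOT NULL,
        wins      INT,
        losses    INT,
        PRIMARY KEY (team, game_date)
    )
    '''
    mysql_conn.execute(text(ddl))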

If you must live with this structure, I suggest you create the following view. It will fake the single-table approach and give you your "gold layer". You get away with this because pro sports franchises don't come and go that often. You do this just once in your database.

CREATE OR REPLACE VIEW teams_record AS
SELECT 'nyk' AS team, nyk_record.* FROM nyk_record
UNION ALL
SELECT 'bos' AS team, bos_record.* FROM bos_record
UNION ALL
SELECT 'gsw' AS team, gsw_record.* FROM gsw_record
UNION ALL .... the other teams

Then you can do SELECT * FROM teams_record ORDER BY team to get all your data.
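With the view in place, the per-team loop from the question collapses to a single read. A minimal sketch, reusing mysql_conn from the question's code:

    # Minimal sketch: one query against the view replaces the per-team loop.
    import pandas as pd

    all_teams = pd.read_sql_query('SELECT * FROM teams_record ORDER BY team', mysql_conn)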


If your nba_dict is fixed, you can use UNION to combine the per-team result sets directly in SQL.

abbr = list(nba_dict.values())
names = list(nba_dict.keys())

query = f'''
SELECT '{names[0]}' AS team_name, t.* FROM {abbr[0]}_record t UNION
SELECT '{names[1]}' AS team_name, t.* FROM {abbr[1]}_record t UNION
SELECT '{names[2]}' AS team_name, t.* FROM {abbr[2]}_record t
'''
df = pd.read_sql_query(query, mysql_conn)
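If the dictionary isn't fixed, one variation (my sketch, not part of the answer above) is to build the statement from nba_dict, tagging each SELECT with its team name and using UNION ALL so identical rows from different tables aren't silently dropped:

    # Sketch: generate the combined query from nba_dict so it works for any
    # number of teams. UNION ALL keeps every row; plain UNION deduplicates.
    union_query = ' UNION ALL '.join(
        f"SELECT '{name}' AS team_name, t.* FROM {abbr}_record t"
        for name, abbr in nba_dict.items()
    )
    df = pd.read_sql_query(union_query, mysql_conn)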

4 Comments

Is this a more efficient method, performance-wise? Since in this case you're making one big query rather than 7 separate queries.
What are your definitions of efficiency and performance?
My definition is speed
UNION deduplicates. UNION ALL doesn't. Deduplication is probably not the right way to go here.
