SQL query optimization (nested subqueries)

Question

I need to write a query:

Find the difference between the average rating of movies released before 1980 and the average rating of movies released after 1980. (Make sure to calculate the average rating for each movie, then the average of those averages for movies before 1980 and movies after. Don't just calculate the overall average rating before and after 1980.)

The schema is as follows:

Movie ( mID, title, year, director )
English: There is a movie with ID number mID, a title, a release year, and a director.

Reviewer ( rID, name )
English: The reviewer with ID number rID has a certain name.

Rating ( rID, mID, stars, ratingDate )
English: The reviewer rID gave the movie mID a number of stars rating (1-5) on a certain ratingDate.

The following is the query I came up with. The result is correct but is definitely not a very good query:

    select t1.p1-t2.p2 from
    (select avg(average) as p1  from 
    (select g.mid,g.average, year from
    (select mid, avg(stars) as average from rating
    group by mid) g, movie
    where g.mid=movie.mid) j 
    where year >= 1980) t1,

    (select avg(average) as p2  from 
    (select g.mid,g.average, year from
    (select mid, avg(stars) as average from rating
    group by mid) g, movie
    where g.mid=movie.mid) j 
    where year < 1980) t2;

The following is how I arrived at this query. First of all, I wrote this subquery that retrieves movie id, average rating for that movie, the year of the movie:

    select g.mid,g.average, year from
    (select mid, avg(stars) as average from rating
    group by mid) g, movie
    where g.mid=movie.mid

Now I need to use the same subquery twice to build two derived tables: the first contains the average rating for movies released in or after 1980, and the second contains the average rating for movies released before 1980. In the top-level query, I subtract these two values.

The problem is I am duplicating the same code in two places. Can you please help optimize the code from a code duplication standpoint, as well as performance?

What do you want to optimize, efficiency or elegance? If you want efficiency, we need to know which DBMS you are using. Not all SQL dialects are alike.
– ypercubeᵀᴹ, Commented Dec 2, 2012 at 20:15

Laurence · Accepted Answer · 2012-12-02 20:45:55Z

2

You can do it without the duplication like this:

Select
  Avg(Case When m.Year >= 1980 Then a.stars Else Null End) -
  Avg(Case When m.Year < 1980 Then a.stars Else Null End)
From (
    Select
      mid,
      avg(stars) stars
    From 
      rating
    Group By
      mid
  ) a 
    inner join
  movie m
    on m.mid = a.mid

You might want to move the inner query to a view or a common table expression (CTE). Depending on which dbms you are using, you might need to cast stars to a decimal type, or you might get everything in integer arithmetic.
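To make the CTE suggestion concrete, here is a minimal sketch that runs the same conditional-aggregation query against an in-memory SQLite database from Python. The sample rows, the CTE name movie_avg and the column name avg_stars are made up for illustration, and SQLite is only a convenient stand-in for whatever DBMS you actually use. SQLite's AVG already returns a floating-point value, so the CAST is redundant there; it is included to show where the cast would go on a DBMS that averages integers in integer arithmetic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie  (mID INTEGER PRIMARY KEY, title TEXT, year INTEGER, director TEXT);
    CREATE TABLE rating (rID INTEGER, mID INTEGER, stars INTEGER, ratingDate TEXT);
    -- Made-up sample rows, for illustration only.
    INSERT INTO movie VALUES
        (1, 'A', 1970, 'd1'), (2, 'B', 1975, 'd2'),
        (3, 'C', 1985, 'd3'), (4, 'D', 1990, 'd4');
    INSERT INTO rating VALUES
        (1, 1, 2, NULL), (2, 1, 4, NULL),                   -- movie 1: average 3.0
        (1, 2, 5, NULL),                                    -- movie 2: average 5.0
        (1, 3, 4, NULL), (2, 3, 4, NULL), (3, 3, 1, NULL),  -- movie 3: average 3.0
        (2, 4, 2, NULL);                                    -- movie 4: average 2.0
""")

QUERY = """
WITH movie_avg AS (
    -- One row per movie; the CAST guards against integer averaging.
    SELECT mID, AVG(CAST(stars AS REAL)) AS avg_stars
    FROM rating
    GROUP BY mID
)
SELECT AVG(CASE WHEN m.year >= 1980 THEN a.avg_stars END)
     - AVG(CASE WHEN m.year <  1980 THEN a.avg_stars END) AS diff
FROM movie_avg a
JOIN movie m ON m.mID = a.mID
"""

diff = conn.execute(QUERY).fetchone()[0]
print(diff)
```

With these rows the per-movie averages are 3.0 and 5.0 before 1980 and 3.0 and 2.0 from 1980 on, so the query computes 2.5 - 4.0. The rating table is read only once, which is the point of the CASE approach.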

An index on (mid, stars) for the rating table will help on the performance side.
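As a rough way to check that such an index is actually picked up, the sketch below creates it in an in-memory SQLite database and asks for the query plan of the per-movie aggregation. The index name ix_rating_mid_stars is hypothetical, and the plan text is specific to SQLite; other systems have their own EXPLAIN output and may choose differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rating (rID INTEGER, mID INTEGER, stars INTEGER, ratingDate TEXT);
    -- Hypothetical index name; (mID, stars) covers the GROUP BY and the AVG.
    CREATE INDEX ix_rating_mid_stars ON rating (mID, stars);
""")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT mID, AVG(stars) FROM rating GROUP BY mID"
).fetchall()

# The last column of each plan row is the human-readable detail string.
plan_text = " ".join(str(row[-1]) for row in plan)
print(plan_text)
```

If the index name shows up in the plan, the aggregation can be answered from the index alone, without touching the table or sorting by mID first.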

Example Fiddle

edited Dec 2, 2012 at 20:45

answered Dec 2, 2012 at 20:28

Laurence


Ian Yates · Accepted Answer · 2012-12-02 20:30:34Z

Taking a punt and assuming SQL Server, there are a couple of things. Indices are pretty important, as is the way the query's written.

Some CREATE TABLE statements

create table Movie ( mID int primary key clustered, title varchar(100), year int, director varchar(100) ) 

create table Reviewer ( rID int primary key clustered, name varchar(100) ) 

create table Rating ( rID int, mID int, stars int, ratingDate datetime , primary key clustered (rID, mID) )

I've clustered on the mID in the Movie table and have clustered, poorly for your query, on the rID and mID fields in the rating table.

Indexing: SQL Server needs to get all ratings for a movie, so a better clustered key for the Rating table would lead with mID:

create table Rating ( rID int, mID int, stars int, ratingDate datetime , primary key clustered (mID, rID) )

If you can't change such things, then at least create a covering index that indexes by mID and includes the stars column.

Next, your query. There are a few ways to write it, so it is best to compare the query plan output for each. Here's one way of writing the query:

with 
    MovieAverage as (
        select mID, movieAvgStars = avg(stars * 1.0)  -- * 1.0 forces decimal averaging; avg() of an int column returns an int in SQL Server
        from Rating
        group by mID
        ),

    Pre1980 as (
        select MovieAvgStars = avg(  movieAvgStars )
        from MovieAverage
            inner join Movie
                on MovieAverage.mID = Movie.mID
        where Movie.year < 1980
        ),

    IncAndPost1980 as (
        select MovieAvgStars = avg(  movieAvgStars )
        from MovieAverage
            inner join Movie 
                on MovieAverage.mID = Movie.mID
        where Movie.year >= 1980
        )

select IncAndPost1980.MovieAvgStars - Pre1980.MovieAvgStars
from IncAndPost1980 cross JOIN Pre1980

There are probably other ways of tweaking it, but without sample data, etc., it's hard to judge properly.

ypercubeᵀᴹ · Accepted Answer · 2012-12-02 20:34:01Z

0

Without any efficiency considerations, and with no particular DBMS in mind (very few have both NATURAL joins and CTEs anyway):

; WITH g AS
    ( SELECT mid, AVG(stars) AS average 
      FROM rating
      GROUP BY mid
    ) 
  , j AS
    ( SELECT mid, average, year 
      FROM g NATURAL JOIN movie
    )
  , t1 AS
    ( SELECT AVG(average) AS p1 
      FROM j
      WHERE year >= 1980
    )
  , t2 AS
    ( SELECT AVG(average) AS p2 
      FROM j
      WHERE year < 1980
    )
  SELECT t1.p1 - t2.p2 AS result
  FROM t1 CROSS JOIN t2 
;
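SQLite happens to be one of the systems that accepts both NATURAL JOIN and CTEs, so the query above can be tried from Python as a quick check. This is only a sketch: the sample rows are made up, and the columns are spelled mID throughout so that the NATURAL JOIN has exactly one shared column name to match on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie  (mID INTEGER PRIMARY KEY, title TEXT, year INTEGER, director TEXT);
    CREATE TABLE rating (rID INTEGER, mID INTEGER, stars INTEGER, ratingDate TEXT);
    -- Made-up sample rows, for illustration only.
    INSERT INTO movie VALUES
        (1, 'A', 1970, 'd1'), (2, 'B', 1975, 'd2'),
        (3, 'C', 1985, 'd3'), (4, 'D', 1990, 'd4');
    INSERT INTO rating VALUES
        (1, 1, 2, NULL), (2, 1, 4, NULL), (1, 2, 5, NULL),
        (1, 3, 4, NULL), (2, 3, 4, NULL), (3, 3, 1, NULL), (2, 4, 2, NULL);
""")

QUERY = """
WITH g  AS (SELECT mID, AVG(stars) AS average FROM rating GROUP BY mID),
     j  AS (SELECT mID, average, year FROM g NATURAL JOIN movie),
     t1 AS (SELECT AVG(average) AS p1 FROM j WHERE year >= 1980),
     t2 AS (SELECT AVG(average) AS p2 FROM j WHERE year <  1980)
SELECT t1.p1 - t2.p2 AS result
FROM t1 CROSS JOIN t2
"""

# t1 and t2 each yield exactly one row, so the CROSS JOIN yields one row too.
result = conn.execute(QUERY).fetchone()[0]
print(result)
```

Two things to keep in mind: NATURAL JOIN matches on every column name the two inputs share, so it silently changes meaning if g and movie ever gain another common column, and if one side of 1980 has no rated movies its AVG is NULL, which makes the whole difference NULL.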

edited Dec 2, 2012 at 20:34

answered Dec 2, 2012 at 20:28

ypercubeᵀᴹ
