I wrote the following UPDATE command, but there's redundancy in the sub-selects. I'm not an expert in SQL and would appreciate help in making this query more efficient. Thanks ahead of time.

update trips
  set origin = 
  (select stop_name 
    from stops 
    inner join stop_times
    on stops.stop_id = stop_times.stop_id
    where stop_times.trip_id = trips.trip_id
    order by stop_sequence asc
    limit 1) 
  ,
  destination = 
  (select stop_name 
    from stops 
    inner join stop_times
    on stops.stop_id = stop_times.stop_id
    where stop_times.trip_id = trips.trip_id
    order by stop_sequence desc
    limit 1)
  ,
  starts = 
  (select arrival_time
    from stop_times
    where stop_times.trip_id = trips.trip_id
    order by stop_sequence asc
    limit 1) 
  ,
  ends = 
  (select arrival_time
    from stop_times
    where stop_times.trip_id = trips.trip_id
    order by stop_sequence desc
    limit 1)
;

Below are the relevant table definitions. There are approximately 72K trips, 8K stops, and 2 million stop_times. I'd guess an average of about 20 stops per trip.

create table stop_times (
  trip_id varchar(255),
  arrival_time time,
  stop_id varchar(255),
  stop_sequence int unsigned
) type=MyISAM;

alter table stop_times add index stop_id (stop_id(5));
alter table stop_times add index trip_id (trip_id(5));

create table stops (
  stop_id varchar(255),
  stop_name varchar(255),
  stop_lat float,
  stop_lon float,
  primary key (stop_id)
) type=MyISAM;

create table trips (
  route_id varchar(255),
  trip_id varchar(255), /* primary key is here */
  /* denormalized fields */
  origin varchar(255),
  destination varchar(255),
  starts time,
  ends time,
  primary key(trip_id)
) type=MyISAM;
alter table trips add index route_id (route_id(5));
3 Comments

  • How have you measured that it is inefficient? Commented Jan 14, 2011 at 23:49
  • Please post your table definitions, as well as how many stops each trip would have. Commented Jan 14, 2011 at 23:53
  • I have no idea whether it is inefficient compared to a better solution, if there is one. I'm running this over a large dataset and it is taking many minutes. As far as SQL goes, does it look OK to you? Commented Jan 14, 2011 at 23:54

1 Answer

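First, add an index on stop_times that covers both the trip_id and stop_sequence columns. The statement below does this with a primary key, which assumes each (trip_id, stop_sequence) pair is unique:

ALTER TABLE stop_times ADD PRIMARY KEY(trip_id, stop_sequence);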

Then, try running this update:

update trips t JOIN (
    SELECT trip_id, MIN(stop_sequence) minS, MAX(stop_sequence) maxS 
    FROM stop_times
    GROUP BY trip_id
) tg ON t.trip_id = tg.trip_id
JOIN stop_times stFirst ON tg.trip_id = stFirst.trip_id AND stFirst.stop_sequence = tg.minS
JOIN stop_times stLast ON tg.trip_id = stLast.trip_id AND stLast.stop_sequence = tg.maxS
JOIN stops stFirstStop ON stFirst.stop_id = stFirstStop.stop_id
JOIN stops stLastStop ON stLast.stop_id = stLastStop.stop_id
SET t.origin = stFirstStop.stop_name,
    t.destination = stLastStop.stop_name,
    t.starts = stFirst.arrival_time,
    t.ends = stLast.arrival_time
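
Before running the update on the full table, it may help to preview what it would write. The following is a minimal sketch that reuses the same joins in a plain SELECT, so nothing in trips is modified. The column aliases (new_origin and so on) and the LIMIT of 20 are arbitrary choices for illustration, not part of the original query.

```sql
-- Preview the values the UPDATE would write, without modifying trips.
-- Same derived table and joins as the UPDATE; only the SELECT list is new.
SELECT t.trip_id,
       stFirstStop.stop_name AS new_origin,
       stLastStop.stop_name  AS new_destination,
       stFirst.arrival_time  AS new_starts,
       stLast.arrival_time   AS new_ends
FROM trips t
JOIN (
    -- One row per trip: its lowest and highest stop_sequence.
    SELECT trip_id, MIN(stop_sequence) AS minS, MAX(stop_sequence) AS maxS
    FROM stop_times
    GROUP BY trip_id
) tg ON t.trip_id = tg.trip_id
JOIN stop_times stFirst ON tg.trip_id = stFirst.trip_id AND stFirst.stop_sequence = tg.minS
JOIN stop_times stLast  ON tg.trip_id = stLast.trip_id  AND stLast.stop_sequence = tg.maxS
JOIN stops stFirstStop ON stFirst.stop_id = stFirstStop.stop_id
JOIN stops stLastStop  ON stLast.stop_id = stLastStop.stop_id
LIMIT 20;  -- arbitrary sample size
```

Prefixing the same SELECT with EXPLAIN should show whether MySQL picks up the (trip_id, stop_sequence) key. Note that these are inner joins, so a trip with no rows in stop_times, or whose stop_id has no match in stops, would not appear in the preview and would not be touched by the update either.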

Note: changing trip_id to an INT should give you better performance.

Also, the trips table should store origin_id and destination_id, which can later be joined to the stops table to look up the names, instead of storing the name in every row.
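
As a rough sketch of that suggestion: the column names origin_id and destination_id come from the sentence above, but their types and the lookup query are assumptions for illustration, chosen to match the existing stop_id definition.

```sql
-- Hypothetical schema change: keep stop ids on trips instead of stop names.
-- varchar(255) is assumed here only because stops.stop_id uses that type.
ALTER TABLE trips
    ADD COLUMN origin_id varchar(255),
    ADD COLUMN destination_id varchar(255);

-- Resolve the names at read time by joining stops twice,
-- once for the origin and once for the destination.
SELECT t.trip_id,
       o.stop_name AS origin,
       d.stop_name AS destination
FROM trips t
JOIN stops o ON t.origin_id = o.stop_id
JOIN stops d ON t.destination_id = d.stop_id;
```

With this layout a renamed stop only has to be changed in stops. The trade-off is two extra joins on every read, which is the cost the denormalized origin and destination columns were avoiding.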

5 Comments

Is that JOIN an inner join or a left outer join? Sorry if this is a beginner's question. Could you also explain a little bit why this might be faster?
I can't change trip_id to INT because some of the ids might be real strings. Could you offer a short explanation why, aside from the additional index, using these joins is faster than the original 4 subselects? Thanks.
@dan The subqueries execute once for each row, whereas the joins execute once at the beginning and the result is kept in memory or in a temp table. Joins have several advantages here: fewer index lookups and less random I/O. Also, the subqueries you were using (with the LIMIT 1) involve multiple tables, which is expensive.
I haven't tried it yet. I'm still executing my original update command and timing it. That command has already been running for over 15 minutes.
@scrummeister it worked great. It took 20 seconds, compared to over 2 hours for the original query. Thank you.
