
I have one database that's not normalized:

disciplinabd.movies:

CREATE TABLE dbo.movies
    (
    movieid      VARCHAR (20) NULL,
    title        VARCHAR (400) NULL,
    mvyear       VARCHAR (100) NULL,
    actorid      VARCHAR (20) NULL,
    actorname    VARCHAR (250) NULL,
    sex          CHAR (1) NULL,
    as_character VARCHAR (1500) NULL,
    languages    VARCHAR (1500) NULL,
    genres       VARCHAR (100) NULL
    )

And I have my own database, labbd11, where I'm going to normalize the data from disciplinabd. So I'm trying to execute this query:

INSERT INTO labbd11..movie_actor(idMovie, idActor, idCharacter) 
SELECT CASE 
         WHEN IsNumeric(movies.movieid+ '.0e0') <> 1  THEN NULL 
         ELSE CAST (movies.movieid AS INT) 
       END, 
       CASE WHEN IsNumeric(movies.actorid+ '.0e0') <> 1  THEN NULL 
            ELSE CAST (movies.actorid AS INT) 
       END, 
      (SELECT id FROM actor_character WHERE character = movies.as_character) 
FROM disciplinabd..movies

It executes normally, but there's a huge amount of data to process: about 14 million rows in disciplinabd.movies.

My questions are:

  1. Is there a way to improve my insert?
  2. Can I insert in batches, something like inserting rows (1, 1000), then after that finishes changing the values to (1001, 2000), and so on? In other words, is there any way to insert into my database little by little? That way I can avoid the rollback operation if the connection breaks. Yesterday this insert query ran for 16 hours, then the connection broke and I lost all the work.

UPDATE

CREATE TABLE movie(
    id INT PRIMARY KEY,
    title VARCHAR(400) NOT NULL,
    year INT
)

CREATE TABLE actor (
    id INT PRIMARY KEY,
    name VARCHAR(250) NOT NULL, 
    sex CHAR(1) NOT NULL
)

CREATE TABLE actor_character(
    id INT PRIMARY KEY IDENTITY,
    character VARCHAR(1000)
)

CREATE TABLE movie_actor(
    idMovie INT,
    idActor INT,
    idCharacter INT,
    CONSTRAINT fk_movie_actor_1 FOREIGN KEY (idMovie) REFERENCES movie(id) ON DELETE CASCADE ON UPDATE CASCADE,
    CONSTRAINT fk_movie_actor_2 FOREIGN KEY (idActor) REFERENCES actor(id) ON DELETE CASCADE ON UPDATE CASCADE,
    CONSTRAINT fk_movie_actor_3 FOREIGN KEY (idCharacter) REFERENCES actor_character(id) ON DELETE CASCADE ON UPDATE CASCADE,
    CONSTRAINT pk_movie_actor PRIMARY KEY (idMovie,idActor, idCharacter)
)
  • What is the definition of the labbd11..movie_actor table? I would have thought inserting 14 million rows of datatype int,int,int should take much less than an hour even on my laptop. Do you have an index on actor_character.character? Commented Apr 14, 2011 at 11:01
  • I have a feeling poor hardware, combined with doing that join to a possibly very large actor_character table, could make what should take a short amount of time take forever. He would be performing that query 14 million times. Valter, do you have a query plan? Commented Apr 14, 2011 at 11:20
  • @JStead, I'm using SQLDBX Personal as my client; I'm not sure if it can generate the query plan. Commented Apr 14, 2011 at 11:33
  • @Martin, I updated my post; it shows that I'm trying to store only the IDs of the movie, the actor, and that actor's character in the movie. Commented Apr 14, 2011 at 11:37

2 Answers


You didn't say which RDBMS you're using, which would help us answer your question more accurately, but to answer your second question: you can most likely add a filter to your SELECT query to limit the amount of data inserted per statement. For example,

INSERT INTO labbd11..movie_actor(idMovie, idActor, idCharacter) 
SELECT CASE 
         WHEN IsNumeric(movies.movieid+ '.0e0') <> 1  THEN NULL 
         ELSE CAST (movies.movieid AS INT) 
       END, 
       CASE WHEN IsNumeric(movies.actorid+ '.0e0') <> 1  THEN NULL 
            ELSE CAST (movies.actorid AS INT) 
       END, 
      (SELECT id FROM actor_character WHERE character = movies.as_character) 
FROM disciplinabd..movies
WHERE movieid >= 1000 and movieid < 2000

If you don't have a continuous ID range, you could possibly generate one, but the method will depend on the particular database you're using.
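On SQL Server, for instance, one way to manufacture a contiguous key is ROW_NUMBER(). This is only a sketch; the #batch temp table and the 100,000-row slice size are illustrative choices, not something from your schema:

```sql
-- Sketch: build a dense row number over the source so batches have no gaps.
-- #batch and the slice size below are illustrative assumptions.
SELECT ROW_NUMBER() OVER (ORDER BY movieid) AS rn,
       movieid, actorid, as_character
INTO   #batch
FROM   disciplinabd..movies

-- Each pass would then insert one slice from #batch:
--   WHERE rn >= @start AND rn < @start + 100000
```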

As for your initial question on how to improve performance, I would start by moving the subselect out to a JOIN and ensuring there's a proper index on actor_character. For example:

INSERT INTO labbd11..movie_actor(idMovie, idActor, idCharacter) 
SELECT CASE 
         WHEN IsNumeric(movies.movieid+ '.0e0') <> 1  THEN NULL 
         ELSE CAST (movies.movieid AS INT) 
       END, 
       CASE WHEN IsNumeric(movies.actorid+ '.0e0') <> 1  THEN NULL 
            ELSE CAST (movies.actorid AS INT) 
       END, 
      actor_character.id 
FROM disciplinabd..movies
LEFT JOIN actor_character ON movies.as_character = actor_character.character
WHERE movieid >= 1000 and movieid < 2000
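For the join above to be cheap, actor_character.character needs an index. If one doesn't already exist, something like the following should help. The index name is my own invention, and note that with VARCHAR(1000) any values longer than 900 bytes cannot be indexed on older versions of SQL Server:

```sql
-- Assumption: no index exists yet; the name is arbitrary.
-- INCLUDE (id) lets the lookup be satisfied from the index alone.
CREATE NONCLUSTERED INDEX IX_actor_character_character
    ON actor_character ([character])
    INCLUDE (id)
```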

Again, if you can explicitly state which database you're using, we can provide a more tailored answer. If I were writing something similar, I wouldn't expect 14 million rows to take more than a few minutes to execute on server-class hardware.


2 Comments

Moving the sub select into a JOIN might change the semantics. It would need to be an OUTER JOIN to preserve rows in disciplinabd..movies that don't join to any in actor_character. In general it might add additional rows as well that would need to be removed with DISTINCT, but I assume the cardinality is at most 1 matching row or the OP's original query would fail. The OP is on SQL Server, judging from dbo and IsNumeric.
Moving to a join is exactly the right thing to do: you change from something that runs row-by-row (which is why correlated subqueries are bad) to a set-based solution.

16 hours seems like an awfully long time for inserting only 14 million rows. I don't know what your hardware is like, so I will just answer the question at hand. With 14 million rows it is going to be much slower if you open up a connection for every 1000, so I would suggest a larger, adjustable batch size.

I also suggest adding an index to movieid if you can.

create nonclustered index IX_movies on movies(movieid)

You can use a while loop to accomplish what you are looking for.

DECLARE @loopMax int, @bottomRange int, @topRange int, @rangeSize int

-- movieid is stored as VARCHAR, so cast before taking the maximum
SELECT @loopMax = MAX(CAST(movies.movieid AS INT))
FROM disciplinabd..movies
WHERE IsNumeric(movies.movieid + '.0e0') = 1

SET @rangeSize = @loopMax / 20
SET @bottomRange = 0
SET @topRange = @rangeSize

-- <= so the final range (the one containing @loopMax itself) is not skipped
WHILE @bottomRange <= @loopMax
BEGIN
    INSERT INTO labbd11..movie_actor(idMovie, idActor, idCharacter)
    SELECT CASE
             WHEN IsNumeric(movies.movieid + '.0e0') <> 1 THEN NULL
             ELSE CAST(movies.movieid AS INT)
           END,
           CASE
             WHEN IsNumeric(movies.actorid + '.0e0') <> 1 THEN NULL
             ELSE CAST(movies.actorid AS INT)
           END,
           actor_character.id
    FROM disciplinabd..movies
    LEFT JOIN actor_character ON movies.as_character = actor_character.character
    -- note: this relies on implicit VARCHAR-to-INT conversion of movieid
    WHERE movieid >= @bottomRange AND movieid < @topRange

    SET @bottomRange = @topRange
    SET @topRange = @topRange + @rangeSize
END
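Because each pass through the loop is its own statement (and, assuming no surrounding explicit transaction, commits on its own), a dropped connection now loses at most one range instead of 16 hours of work. A rough sketch of how you might resume, assuming the ranges are processed in ascending idMovie order:

```sql
-- Sketch: pick up where the last completed range left off.
-- Assumes batches were inserted in ascending idMovie order; an interrupted
-- statement rolls back on its own, so only completed ranges are present.
DECLARE @resumeFrom int
SELECT @resumeFrom = ISNULL(MAX(idMovie), 0) FROM labbd11..movie_actor
-- then initialize @bottomRange from @resumeFrom before re-entering the loop
```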

2 Comments

I tried to run your SP, but there's something wrong with it. I updated my post to show my tables; could you update your stored procedure please?
Give it a shot now; I had a different table name for actor_character.
