
I have one database that's not normalized:

disciplinabd.movies:

CREATE TABLE dbo.movies
    (
    movieid      VARCHAR (20) NULL,
    title        VARCHAR (400) NULL,
    mvyear       VARCHAR (100) NULL,
    actorid      VARCHAR (20) NULL,
    actorname    VARCHAR (250) NULL,
    sex          CHAR (1) NULL,
    as_character VARCHAR (1500) NULL,
    languages    VARCHAR (1500) NULL,
    genres       VARCHAR (100) NULL
    )

And I have my own database, labbd11, where I'm going to normalize the data from disciplinabd. So I'm trying to execute this query:

INSERT INTO labbd11..movie_actor(idMovie, idActor, idCharacter) 
SELECT CASE 
         WHEN IsNumeric(movies.movieid+ '.0e0') <> 1  THEN NULL 
         ELSE CAST (movies.movieid AS INT) 
       END, 
       CASE WHEN IsNumeric(movies.actorid+ '.0e0') <> 1  THEN NULL 
            ELSE CAST (movies.actorid AS INT) 
       END, 
      (SELECT id FROM actor_character WHERE character = movies.as_character) 
FROM disciplinabd..movies

It executes normally, but there's a huge amount of data to process: about 14 million rows in disciplinabd.movies.

My questions are:

  1. Is there a way to improve my insert?
  2. Can I insert in batches, something like inserting rows (1, 1000), then after that finishes changing the values to (1001, 2000), and so on? In other words, is there any way to insert into my database little by little? That way I can avoid the rollback operation if the connection breaks. Yesterday this insert query ran for 16 hours, then the connection broke and I lost all the work.

UPDATE

CREATE TABLE movie(
    id INT PRIMARY KEY,
    title VARCHAR(400) NOT NULL,
    year INT
)

CREATE TABLE actor (
    id INT PRIMARY KEY,
    name VARCHAR(250) NOT NULL, 
    sex CHAR(1) NOT NULL
)

CREATE TABLE actor_character(
    id INT PRIMARY KEY IDENTITY,
    character VARCHAR(1000)
)

CREATE TABLE movie_actor(
    idMovie INT,
    idActor INT,
    idCharacter INT,
    CONSTRAINT fk_movie_actor_1 FOREIGN KEY (idMovie) REFERENCES movie(id) ON DELETE CASCADE ON UPDATE CASCADE,
    CONSTRAINT fk_movie_actor_2 FOREIGN KEY (idActor) REFERENCES actor(id) ON DELETE CASCADE ON UPDATE CASCADE,
    CONSTRAINT fk_movie_actor_3 FOREIGN KEY (idCharacter) REFERENCES actor_character(id) ON DELETE CASCADE ON UPDATE CASCADE,
    CONSTRAINT pk_movie_actor PRIMARY KEY (idMovie,idActor, idCharacter)
)
  • What is the definition of the labbd11..movie_actor table? I would have thought inserting 14 million rows of datatype int,int,int should take much less than an hour even on my laptop. Do you have an index on actor_character.character? Commented Apr 14, 2011 at 11:01
  • I have a feeling poor hardware, combined with doing that join to a possibly very large actor_character table, could make what should take a short amount of time take forever. He would be performing that query 14 million times. Valter, do you have a query plan? Commented Apr 14, 2011 at 11:20
  • @JStead, I'm using SQLDBX Personal as my client; I'm not sure if it can generate the query plan. Commented Apr 14, 2011 at 11:33
  • @Martin, I updated my post; it shows that I'm trying to store only the IDs of the movie, the actor, and that actor's character in the movie. Commented Apr 14, 2011 at 11:37

2 Answers


You didn't say which RDBMS you're using, which would help us answer your question more accurately, but to answer your second question: you can most likely add a filter to your SELECT query to limit the amount of data inserted per statement. For example,

INSERT INTO labbd11..movie_actor(idMovie, idActor, idCharacter) 
SELECT CASE 
         WHEN IsNumeric(movies.movieid+ '.0e0') <> 1  THEN NULL 
         ELSE CAST (movies.movieid AS INT) 
       END, 
       CASE WHEN IsNumeric(movies.actorid+ '.0e0') <> 1  THEN NULL 
            ELSE CAST (movies.actorid AS INT) 
       END, 
      (SELECT id FROM actor_character WHERE character = movies.as_character) 
FROM disciplinabd..movies
WHERE movieid >= 1000 and movieid < 2000

If you don't have a continuous ID range, you could possibly generate one, but the method will depend on the particular database you're using.
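On SQL Server, for instance, one way to manufacture a contiguous key is ROW_NUMBER(). This is only a sketch; the #batch temp table and the 100,000-row slice size are illustrative choices, not something from your schema:

```sql
-- Sketch: build a dense row number over the source so batches have no gaps.
-- #batch and the slice size below are illustrative assumptions.
SELECT ROW_NUMBER() OVER (ORDER BY movieid) AS rn,
       movieid, actorid, as_character
INTO   #batch
FROM   disciplinabd..movies

-- Each pass would then insert one slice from #batch:
--   WHERE rn >= @start AND rn < @start + 100000
```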

As for your initial question on how to improve performance, I would start by moving the subselect out to a JOIN and ensuring there's a proper index on actor_character. For example:

INSERT INTO labbd11..movie_actor(idMovie, idActor, idCharacter) 
SELECT CASE 
         WHEN IsNumeric(movies.movieid+ '.0e0') <> 1  THEN NULL 
         ELSE CAST (movies.movieid AS INT) 
       END, 
       CASE WHEN IsNumeric(movies.actorid+ '.0e0') <> 1  THEN NULL 
            ELSE CAST (movies.actorid AS INT) 
       END, 
      actor_character.id 
FROM disciplinabd..movies
LEFT JOIN actor_character ON movies.as_character = actor_character.character
WHERE movieid >= 1000 and movieid < 2000
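For the join above to be cheap, actor_character.character needs an index. If one doesn't already exist, something like the following should help. The index name is my own invention, and note that with VARCHAR(1000) any values longer than 900 bytes cannot be indexed on older versions of SQL Server:

```sql
-- Assumption: no index exists yet; the name is arbitrary.
-- INCLUDE (id) lets the lookup be satisfied from the index alone.
CREATE NONCLUSTERED INDEX IX_actor_character_character
    ON actor_character ([character])
    INCLUDE (id)
```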

Again, if you can explicitly state which database you're using, we can provide a more tailored answer. If I were writing something similar, I wouldn't expect 14 million rows to take more than a few minutes to execute on server-class hardware.


2 Comments

Moving the sub select into a JOIN might change the semantics. It would need to be an OUTER JOIN to preserve rows in disciplinabd..movies that don't join to any in actor_character. In general it might add additional rows as well that would need to be removed with DISTINCT, but I assume the cardinality is at most 1 matching row or the OP's original query would fail. The OP is on SQL Server, judging from dbo and IsNumeric.
Moving to a join is exactly the right thing to do: you change from something that runs row-by-row (which is why correlated subqueries are bad) to a set-based solution.

16 hours seems like an awfully long time for inserting only 14 million rows. I don't know what your hardware is like, so I will just answer the question at hand. With 14 million rows it is going to be much slower if you open up a connection for every 1000, so I would suggest a larger, adjustable batch size.

I also suggest adding an index to movieid if you can.

create nonclustered index IX_movies on movies(movieid)

You can use a while loop to accomplish what you are looking for.

DECLARE @loopMax int, @bottomRange int, @topRange int, @rangeSize int

-- movieid is stored as VARCHAR, so cast before taking the maximum
SELECT @loopMax = MAX(CAST(movies.movieid AS INT))
FROM disciplinabd..movies
WHERE IsNumeric(movies.movieid + '.0e0') = 1

SET @rangeSize = @loopMax / 20
SET @bottomRange = 0
SET @topRange = @rangeSize

-- <= so the final range (the one containing @loopMax itself) is not skipped
WHILE @bottomRange <= @loopMax
BEGIN
    INSERT INTO labbd11..movie_actor(idMovie, idActor, idCharacter)
    SELECT CASE
             WHEN IsNumeric(movies.movieid + '.0e0') <> 1 THEN NULL
             ELSE CAST(movies.movieid AS INT)
           END,
           CASE
             WHEN IsNumeric(movies.actorid + '.0e0') <> 1 THEN NULL
             ELSE CAST(movies.actorid AS INT)
           END,
           actor_character.id
    FROM disciplinabd..movies
    LEFT JOIN actor_character ON movies.as_character = actor_character.character
    -- note: this relies on implicit VARCHAR-to-INT conversion of movieid
    WHERE movieid >= @bottomRange AND movieid < @topRange

    SET @bottomRange = @topRange
    SET @topRange = @topRange + @rangeSize
END
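Because each pass through the loop is its own statement (and, assuming no surrounding explicit transaction, commits on its own), a dropped connection now loses at most one range instead of 16 hours of work. A rough sketch of how you might resume, assuming the ranges are processed in ascending idMovie order:

```sql
-- Sketch: pick up where the last completed range left off.
-- Assumes batches were inserted in ascending idMovie order; an interrupted
-- statement rolls back on its own, so only completed ranges are present.
DECLARE @resumeFrom int
SELECT @resumeFrom = ISNULL(MAX(idMovie), 0) FROM labbd11..movie_actor
-- then initialize @bottomRange from @resumeFrom before re-entering the loop
```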

2 Comments

I tried to run your SP, but there's something wrong with it. I updated my post to show my tables; could you update your stored procedure please?
Give it a shot now; I had a different table name for actor_character.
