How to check if record exists in database - fastest method

Question

I have a table where I store unique text strings, and I check whether a given string exists in the database with a SELECT:

String checkIfAlreadyScanned = "SELECT id FROM \"STRINGS_DB\"  where STR ='" + mystring + "'";

Then I check whether a value came back. My database has around 5 million records; can I improve my method?

Maybe there is a way to create a new attribute (hashedSTR, for example), convert each string into some unique numerical value, and then look up these numbers instead of the strings? Will that work faster? (Will that work at all?)

Other than putting an index on the STR field, there's not much you can do.
– Marc B, Commented Jul 12, 2012 at 19:08
What's up with the query string, out of curiosity? Why is there a \"STRINGS_DB\"?
– MethodMan, Commented Jul 12, 2012 at 19:09
You also have a possible SQL injection going on there, just saying.
– Thousand, Commented Jul 12, 2012 at 19:11
If the strings table is not updated, then packing the pages (fill factor = 100%) and making the string the primary key (with the associated unique index) is the fastest method. In some databases, the EXISTS () clauses mentioned below might be marginally faster in this case.
– Gordon Linoff, Commented Jul 12, 2012 at 19:19
It is useless to check; the next time you try to insert (or update or delete), it might have changed. Even when inside a transaction, you could add a WHERE (NOT) EXISTS(...) term to the query.
– wildplasser, Commented Jul 12, 2012 at 22:55

Larry · Accepted Answer · 2012-07-12 19:43:51Z

4

To ensure the fastest processing, make sure:

The field you are searching on is indexed. You mentioned "unique" strings, so I suppose that is already the case, and "limit 1" is therefore not necessary. Otherwise, it should be added.
You are using the ExecuteScalar() method of your Command object.
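
A minimal sketch of the two points above, written in Python with the standard-library sqlite3 module as a stand-in for the real database and driver (both are assumptions, as are the table and function names). The UNIQUE constraint supplies the index, and fetchone() plays roughly the role that ExecuteScalar() plays in the original setup:

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real server (assumption).
conn = sqlite3.connect(":memory:")
# UNIQUE creates an index on str, so equality lookups avoid a full table scan.
conn.execute(
    "CREATE TABLE strings_db (id INTEGER PRIMARY KEY, str TEXT NOT NULL UNIQUE)"
)
conn.executemany("INSERT INTO strings_db (str) VALUES (?)", [("alpha",), ("beta",)])


def find_id(conn, value):
    """Return the id of the matching row, or None if the string is absent."""
    # fetchone() reads at most one row, similar in spirit to ExecuteScalar().
    row = conn.execute(
        "SELECT id FROM strings_db WHERE str = ?", (value,)
    ).fetchone()
    return row[0] if row else None


def lookup_plan(conn):
    """Return the planner's description of how the lookup is executed."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM strings_db WHERE str = ?", ("alpha",)
    ).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " ".join(r[-1] for r in rows)
```

Printing lookup_plan(conn) is a quick way to confirm that the lookup searches through the index instead of scanning the table.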

edited Jul 12, 2012 at 19:43

answered Jul 12, 2012 at 19:15


wildplasser · Accepted Answer · 2012-07-13 16:52:19Z

2

Testing makes no sense; just include the "test" in the WHERE clause:

INSERT INTO silly_table(the_text)
SELECT 'literal_text'
WHERE NOT EXISTS (
    SELECT *
    FROM silly_table
    WHERE the_text = 'literal_text'
    );

Now you make the test only when it is needed: at the end of the statement, the row will exist. There is no such thing as try.

For those who don't understand why testing makes no sense: testing would only make sense if the situation could not change after the test. That would need a test-and-lock scenario or, even worse, a test inside a transaction.

UPDATE: version that works (basically the same):

DROP TABLE exitsnot CASCADE;
CREATE TABLE exitsnot
        ( id SERIAL NOT NULL PRIMARY KEY
        , val INTEGER -- REFERENCES something
        , str varchar -- REFERENCES something
        );

INSERT INTO exitsnot (val)
SELECT 42
WHERE NOT EXISTS (
        SELECT * FROM exitsnot
        WHERE val = 42
        );
INSERT INTO exitsnot (str)
SELECT 'silly text'
WHERE NOT EXISTS (
        SELECT * FROM exitsnot
        WHERE str = 'silly text'
        );
SELECT version();

Output:

DROP TABLE
NOTICE:  CREATE TABLE will create implicit sequence "exitsnot_id_seq" for serial column "exitsnot.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "exitsnot_pkey" for table "exitsnot"
CREATE TABLE
INSERT 0 1
INSERT 0 1
                                           version                                            
----------------------------------------------------------------------------------------------
 PostgreSQL 9.1.2 on i686-pc-linux-gnu, compiled by gcc (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 32-bit
(1 row)
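
As a hedged illustration of the same insert-if-absent statement driven from application code, here is a small Python sketch using the standard-library sqlite3 module in place of PostgreSQL (an assumption, as is the helper name insert_if_absent). The value is bound twice as a parameter, and the cursor's rowcount tells the caller whether a row was actually added:

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE silly_table (id INTEGER PRIMARY KEY, the_text TEXT)")


def insert_if_absent(conn, text):
    """Insert text unless it is already stored; return True if a row was added."""
    cur = conn.execute(
        "INSERT INTO silly_table (the_text) "
        "SELECT ? WHERE NOT EXISTS "
        "(SELECT 1 FROM silly_table WHERE the_text = ?)",
        (text, text),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted and 0 when it already existed.
    return cur.rowcount == 1
```

With concurrent writers, a unique constraint on the column is still what rejects a duplicate that slips in between two sessions; the statement above only removes the separate round trip for the test.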

edited Jul 13, 2012 at 16:52

answered Jul 12, 2012 at 23:07


6 Comments

wildplasser Over a year ago

I just added some text, since I appear to be a bit cryptic sometimes. The whole topic is a DBA vs. application programmer kind of thing (and you probably know the outcome of that ;-). BTW: this is a 2x speed improvement, as the test will be as costly as the I/U/D.

Andrew Over a year ago

Are you sure that this is correct syntax? My Postgres shows me an error at WHERE.

wildplasser Over a year ago

Sorry, I should remove the values(xxx) construct and just use the string literal. Even simpler ...

Andrew Over a year ago

@wildplasser, it still gives me an incorrect syntax error. Are you sure that this should work with Postgres and that this is not stored procedure syntax?

Andrew Over a year ago

INSERT INTO table (text) (SELECT ('12345') as text WHERE NOT EXISTS (SELECT text FROM table WHERE text = '12345'))


Rab Khan · Accepted Answer · 2012-07-12 19:09:23Z

1

String checkIfAlreadyScanned = "SELECT 1 FROM \"STRINGS_DB\"  where STR ='" + mystring + "'";

If your result set contains a row, then you have a record.

answered Jul 12, 2012 at 19:09


1 Comment

Clodoaldo Neto Over a year ago

This will return all the rows matching the condition, which is not good for performance.

Jonathan Hall · Accepted Answer · 2012-07-12 19:14:23Z

1

How long are these text strings? If they are very long, you might get a performance improvement by storing a hash of the strings (along with the original strings).

CREATE TABLE strings_db (
    id       INT PRIMARY KEY,
    text     TEXT,
    hash     TEXT
);

Your hash column could store MD5 sums, CRC32s, or the output of any other hash algorithm you choose. And it should be indexed.

Then modify your query to something like:

SELECT id FROM strings_db WHERE hash=calculate_hash(?)

If the average size of your text fields is sufficiently larger than the size of your hashes, doing the search on the shorter field will help with disk I/O. This also means additional CPU overhead when inserting and selecting, to calculate the hash, and additional disk space to store the hash. So all of these factors must be taken into consideration.

P.S. Always use prepared statements to avoid SQL injection attacks!
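
To make the idea concrete, here is a rough Python sketch using the standard-library sqlite3 and hashlib modules in place of PostgreSQL (an assumption). The calculate_hash helper mirrors the placeholder name in the query above and is hypothetical, as are add_string and find_id; MD5 is used only as a lookup key, not for security. The lookup also compares the original text, so a hash collision cannot produce a false match:

```python
import hashlib
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE strings_db (id INTEGER PRIMARY KEY, text TEXT, hash TEXT)")
# The short hash column is the one that gets indexed and searched.
conn.execute("CREATE INDEX strings_db_hash_idx ON strings_db (hash)")


def calculate_hash(text):
    """Hex MD5 digest of the UTF-8 bytes (hypothetical helper)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def add_string(conn, text):
    """Store the text together with its precomputed hash."""
    conn.execute(
        "INSERT INTO strings_db (text, hash) VALUES (?, ?)",
        (text, calculate_hash(text)),
    )


def find_id(conn, text):
    """Search by the indexed hash, then confirm the text to rule out collisions."""
    row = conn.execute(
        "SELECT id FROM strings_db WHERE hash = ? AND text = ?",
        (calculate_hash(text), text),
    ).fetchone()
    return row[0] if row else None
```

Whether this beats a plain index on the text column depends on the string lengths, as discussed above, so it is worth measuring on real data.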

answered Jul 12, 2012 at 19:14


1 Comment

Jonathan Hall Over a year ago

@Clodoaldo: PostgreSQL B-tree indexes don't do this (and I don't know of any lossless indexing engine that does, but there may be some). If disk I/O were instantaneous, there would be no advantage to my method over PostgreSQL's default B-tree, since the same number of index records must be traversed in the B-tree. However, with each node in the B-tree containing less data, it is faster to read it off the disk.

Clodoaldo Neto · Accepted Answer · 2012-07-12 19:43:50Z

1

Limit the result set to 1:

String checkIfAlreadyScanned = @"
    SELECT id 
    FROM ""STRINGS_DB""  
    where STR ='" + mystring + @"'
    limit 1";

This, an index on that column, and @Laurent's suggestion of ExecuteScalar() will yield the best result.

Also, if mystring has any chance of having been touched by the user, then parameterize the query to avoid SQL injection.

A cleaner version:

String checkIfAlreadyScanned = @"
    SELECT id 
    FROM ""STRINGS_DB""  
    where STR = '@mystring'
    limit 1
    ".Replace("@mystring", mystring);

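The string substitution above still places the value inside the SQL text. As a hedged sketch of what parameterizing the query looks like, here is the same lookup in Python with the standard-library sqlite3 module standing in for the real driver (an assumption; already_scanned is a hypothetical helper name). The value is bound through a placeholder, so quote characters in it are treated as data:

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "STRINGS_DB" (id INTEGER PRIMARY KEY, STR TEXT UNIQUE)')
conn.execute('INSERT INTO "STRINGS_DB" (STR) VALUES (?)', ("known string",))


def already_scanned(conn, mystring):
    """True if mystring is stored; the value is bound, never spliced into the SQL."""
    row = conn.execute(
        'SELECT id FROM "STRINGS_DB" WHERE STR = ? LIMIT 1', (mystring,)
    ).fetchone()
    return row is not None
```

A classic injection payload such as x' OR '1'='1 is simply looked up as a literal string here and finds nothing.
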
edited Jul 12, 2012 at 19:43

answered Jul 12, 2012 at 19:12


1 Comment

Erwin Brandstetter Over a year ago

First sentence in the question: "unique text strings". No multiples.

Erwin Brandstetter · Accepted Answer · 2012-07-12 22:24:59Z

Actually, there is just such a thing as you ask for, but it has some limitations. PostgreSQL supports a hash index type:

CREATE INDEX strings_hash_idx ON "STRINGS_DB" USING hash (str);

It works for simple equality searches with =, just like you have it. I quote the manual on the limitations:

Hash index operations are not presently WAL-logged, so hash indexes might need to be rebuilt with REINDEX after a database crash. They are also not replicated over streaming or file-based replication. For these reasons, hash index use is presently discouraged.

A quick test on a real life table, 433k rows, 59 MB total:

SELECT * FROM tbl WHERE email = '[email protected]'

-- No index, sequential scan: Total runtime: 188 ms
-- B-tree index (default):    Total runtime:   0.046 ms
-- Hash index:                Total runtime:   0.032 ms

That's not huge, but something. The difference will be more substantial with longer strings than the email address in my test. Index creation was a matter of 1 or 2 sec. with either index.

D'Arcy Rittich · Accepted Answer · 2012-07-12 19:13:50Z

0

Assuming you don't actually need the id column, I think this gives the query optimizer the best chance to optimize:

select 1
where exists(
    select 1 
    from STRINGS_DB
    where STR = 'MyString'
)
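
For illustration only, a close variant of this query that returns a single boolean can be driven from code as follows. This is a Python sketch with the standard-library sqlite3 module as a stand-in database (an assumption), and exists is a hypothetical helper name:

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE strings_db (id INTEGER PRIMARY KEY, str TEXT UNIQUE)")
conn.execute("INSERT INTO strings_db (str) VALUES (?)", ("MyString",))


def exists(conn, value):
    """Return True or False without fetching any column of the matching row."""
    (flag,) = conn.execute(
        "SELECT EXISTS (SELECT 1 FROM strings_db WHERE str = ?)", (value,)
    ).fetchone()
    # SQLite reports the result of EXISTS as the integer 0 or 1.
    return bool(flag)
```

Unlike the query above, this form always yields exactly one row, so the caller does not need to distinguish an empty result set from a match.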

answered Jul 12, 2012 at 19:13


1 Comment

Erwin Brandstetter Over a year ago

An EXISTS semi-join could help with duplicates, but will not with "unique text strings".

Steve · Accepted Answer · 2012-07-12 19:15:35Z

0

While all the answers here have their merit, I wish to mention another aspect.

Building your query in this way and passing a string will not help the database engine optimize your query. Instead, you should write a stored procedure, call it with a single parameter, and let the database engine build a query plan and reuse your command.

Of course, the field should be indexed.

answered Jul 12, 2012 at 19:15


Brandon · Accepted Answer · 2012-07-12 19:27:40Z

0

[Edit] Limit the result to the first record that meets the criteria. For SQL Server: select TOP 1 ...; for MySQL/Postgres: select ... LIMIT 1;

If there can be multiples, perhaps adding a "TOP 1" to your select statement could return faster.

String checkIfAlreadyScanned = "SELECT TOP 1 id FROM \"STRINGS_DB\"  where STR ='" + mystring + "'";

That way, it only has to find the first instance of the string.

But, if you don't have multiples, you'll not likely see much benefit with this approach.

Like others have said, putting an index on it may help.

edited Jul 12, 2012 at 19:27

answered Jul 12, 2012 at 19:12


2 Comments

Brandon Over a year ago

just add "limit 1" at the end if you're using postgres: forums.devshed.com/postgresql-help-21/…

Erwin Brandstetter Over a year ago

First sentence in the question: "unique text strings". No multiples.
