4

I have a table where I store unique text strings and then I check if that string exists in the database by doing select

String checkIfAlreadyScanned = "SELECT id FROM \"STRINGS_DB\"  where STR ='" + mystring + "'";

then I check if value exists. My database has around 5mil records; can I improve my method?

Maybe there is a way of creating a new attribute (hashedSTR, for example), converting each string into some unique numerical value, and then comparing these numbers instead of strings? Will that work faster? (Will that work at all?)

5
  • 5
    other than putting an index on the STR field, there's not much you can do. Commented Jul 12, 2012 at 19:08
  • Out of curiosity, what's up with the query string? Why is there a \"STRINGS_DB\"? Commented Jul 12, 2012 at 19:09
  • 3
    you also have a possible sql injection going on there, just saying. Commented Jul 12, 2012 at 19:11
  • If the strings table is not updated, then packing the pages (fill factor = 100%) and making the string the primary key (with the associated unique index) is the fastest method. In some databases, the exists () clauses mentioned below might be marginally faster in this case. Commented Jul 12, 2012 at 19:19
  • It is useless to check; the next time you try to insert (or update or delete), it might have changed, even inside a transaction. You could add a WHERE (NOT) EXISTS(...) term to the query instead. Commented Jul 12, 2012 at 22:55

9 Answers

4

To ensure the fastest processing, make sure:

  • The field you are searching on is indexed. You mentioned "unique" strings, so I suppose that is already the case, and "limit 1" is therefore not necessary; otherwise, it should be added.
  • You are using the ExecuteScalar() method of your Command object
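
To make the two points above concrete, here is a minimal sketch in Python. It uses the standard-library sqlite3 module in place of PostgreSQL so that it runs on its own; the table name strings_db, the helper already_scanned and the sample rows are assumptions for illustration only. fetchone() plays roughly the role that ExecuteScalar() has in .NET: it reads a single result and stops.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# UNIQUE makes the engine create an index on str, so equality lookups
# do not have to scan the whole table.
conn.execute("CREATE TABLE strings_db (id INTEGER PRIMARY KEY, str TEXT NOT NULL UNIQUE)")
conn.executemany("INSERT INTO strings_db (str) VALUES (?)", [("alpha",), ("beta",)])

def already_scanned(conn, value):
    """Return True if value is stored; reads at most one row."""
    row = conn.execute("SELECT id FROM strings_db WHERE str = ?", (value,)).fetchone()
    return row is not None

# Ask SQLite how it would run the lookup; the last column describes the plan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM strings_db WHERE str = ?", ("alpha",)
).fetchall()
plan_detail = " ".join(row[3] for row in plan)

print(already_scanned(conn, "alpha"), already_scanned(conn, "gamma"))
print(plan_detail)
```

On PostgreSQL the equivalent check would be EXPLAIN on the same SELECT, where an index scan rather than a sequential scan is what you want to see.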

2

Testing makes no sense, just include the "test" in the where clause:

INSERT INTO silly_table(the_text)
SELECT 'literal_text'
WHERE NOT EXISTS (
    SELECT *
    FROM silly_table
    WHERE the_text = 'literal_text'
    );

Now, you'll make the test only when it is needed: at the end of the statement the row will exist. There is no such thing as try.

For those who don't see why testing makes no sense: a separate test would only be meaningful if the situation could not change after the test. That would require a test-and-lock scenario or, even worse, a test inside a transaction.
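
From application code, the same single-statement idea might look like the sketch below. It is Python with the standard-library sqlite3 module standing in for PostgreSQL so that it runs without a server; the helper name insert_if_missing is an assumption, and the value is bound as a parameter instead of being written as a literal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE silly_table (id INTEGER PRIMARY KEY, the_text TEXT)")

def insert_if_missing(conn, text):
    """Insert text unless an equal row exists; return the number of rows inserted."""
    cur = conn.execute(
        "INSERT INTO silly_table (the_text) "
        "SELECT ? WHERE NOT EXISTS "
        "(SELECT 1 FROM silly_table WHERE the_text = ?)",
        (text, text),
    )
    conn.commit()
    return cur.rowcount

first = insert_if_missing(conn, "literal_text")   # the row is new
second = insert_if_missing(conn, "literal_text")  # the row already exists
total = conn.execute("SELECT COUNT(*) FROM silly_table").fetchone()[0]
print(first, second, total)
```

The returned row count tells the caller whether the string was new, so no separate SELECT is needed. With several concurrent writers, a unique constraint on the column is still what ultimately guarantees that no duplicate gets in.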

UPDATE: version that works (basically the same):

DROP TABLE exitsnot CASCADE;
CREATE TABLE exitsnot
        ( id SERIAL NOT NULL PRIMARY KEY
        , val INTEGER -- REFERENCES something
        , str varchar -- REFERENCES something
        );

INSERT INTO exitsnot (val)
SELECT 42
WHERE NOT EXISTS (
        SELECT * FROM exitsnot
        WHERE val = 42
        );
INSERT INTO exitsnot (str)
SELECT 'silly text'
WHERE NOT EXISTS (
        SELECT * FROM exitsnot
        WHERE str = 'silly text'
        );
SELECT version();

Output:

DROP TABLE
NOTICE:  CREATE TABLE will create implicit sequence "exitsnot_id_seq" for serial column "exitsnot.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "exitsnot_pkey" for table "exitsnot"
CREATE TABLE
INSERT 0 1
INSERT 0 1
                                           version                                            
----------------------------------------------------------------------------------------------
 PostgreSQL 9.1.2 on i686-pc-linux-gnu, compiled by gcc (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 32-bit
(1 row)

6 Comments

I just added some text, since I appear to be a bit cryptic sometimes. The whole topic is a DBA vs application programmer kind of thing. (and you probably know the outcome of that ;-) BTW: this is a /2 speed improvement, as the test will be as costly as the I/U/D.
Are you sure that this is correct syntax? My PostgreSQL shows me an error at WHERE.
Sorry, I should remove the values(xxx) construct and just use the string literal. Even simpler ...
@wildplasser, it still gives me an incorrect syntax error. Are you sure that this should work with PostgreSQL and that this is not stored procedure syntax?
INSERT INTO table (text) (SELECT ('12345') as text WHERE NOT EXISTS (SELECT text FROM table WHERE text = '12345'))
1
String checkIfAlreadyScanned = "SELECT 1 FROM \"STRINGS_DB\"  where STR ='" + mystring + "'";

If your result set contains a row then you have a record

1 Comment

This will return all the rows matching the condition, which is not good for performance.
1

How long are these text strings? If they are very long, you might get a performance improvement by storing a hash of the strings (along with the original strings).

CREATE TABLE strings_db (
    id       INT PRIMARY KEY,
    text     TEXT,
    hash     TEXT
);

Your hash column could store MD5 sums, CRC32s, or any other hash algorithm you choose. And it should be indexed.

Then modify your query to something like:

SELECT id FROM strings_db WHERE hash=calculate_hash(?)

If the average size of your text fields is sufficiently larger than the size of your hashes, doing the search on the shorter field will help with disk I/O. This also means additional CPU overhead when inserting and selecting, to calculate the hash, and additional disk space to store the hash. So all of these factors must be taken into consideration.
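
As a rough sketch of this scheme, the Python below keeps a hash next to each string and searches on the hash. It uses the standard-library sqlite3 and hashlib modules so that it runs on its own; SHA-256 is an arbitrary pick here (MD5 or CRC32 would slot in the same way), and the helper names digest, add_string and find_id are assumptions. Comparing the text as well as the hash rules out false matches from hash collisions.

```python
import hashlib
import sqlite3

def digest(text):
    """Hex SHA-256 digest of the UTF-8 encoded text (64 characters)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE strings_db (id INTEGER PRIMARY KEY, text TEXT, hash TEXT)")
# The index is on the short hash column, not on the long text column.
conn.execute("CREATE INDEX strings_db_hash_idx ON strings_db (hash)")

def add_string(conn, text):
    """Store the text together with its precomputed hash."""
    conn.execute(
        "INSERT INTO strings_db (text, hash) VALUES (?, ?)", (text, digest(text))
    )

def find_id(conn, text):
    """Look up by the indexed hash, then compare the text to rule out collisions."""
    row = conn.execute(
        "SELECT id FROM strings_db WHERE hash = ? AND text = ?",
        (digest(text), text),
    ).fetchone()
    return row[0] if row else None

long_string = "x" * 10_000
add_string(conn, long_string)
print(find_id(conn, long_string), find_id(conn, "not stored"))
```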

P.S. Always use prepared statements to avoid SQL injection attacks!
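
For illustration, here is what binding the value as a parameter looks like in a small Python sketch. The standard-library sqlite3 module stands in for PostgreSQL so that the example runs on its own, and the helper name find_id and the sample data are assumptions. The ? placeholder is sqlite3's parameter style; other drivers use a different marker, but the principle is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "STRINGS_DB" (id INTEGER PRIMARY KEY, str TEXT UNIQUE)')
conn.execute('INSERT INTO "STRINGS_DB" (str) VALUES (?)', ("O'Brien",))

def find_id(conn, mystring):
    """Return the id for mystring, or None. The value is bound, never spliced into the SQL."""
    row = conn.execute(
        'SELECT id FROM "STRINGS_DB" WHERE str = ?', (mystring,)
    ).fetchone()
    return row[0] if row else None

found = find_id(conn, "O'Brien")        # a quote inside the data is harmless
attack = find_id(conn, "x' OR '1'='1")  # treated as a plain string, matches nothing
print(found, attack)
```

With string concatenation, the second call would have changed the meaning of the WHERE clause; with a bound parameter it is just a string that is not in the table.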

1 Comment

@Clodoaldo: PostgreSQL B-Tree indexes don't do this (and I don't know of any lossless indexing engine that does, but there may be some). The same number of index records must be traversed in either B-tree, but with each node containing less data, it is faster to read off the disk. If disk I/O were free, there would be no advantage to my method over PostgreSQL's default B-Tree.
1

Limit the result set to 1:

String checkIfAlreadyScanned = @"
    SELECT id 
    FROM ""STRINGS_DB""  
    where STR ='" + mystring + @"'
    limit 1";

This, an index on that column, and the @Laurent suggestion for ExecuteScalar() will yield the best result.

Also, if there is any chance that mystring has been touched by the user, parameterize the query to avoid SQL injection.

A cleaner version:

String checkIfAlreadyScanned = @"
    SELECT id 
    FROM ""STRINGS_DB""  
    where STR = '@mystring'
    limit 1
    ".Replace("@mystring", mystring);

1 Comment

First sentence in the question: "unique text strings". No multiples.
1

Actually, there is just such a thing as you are asking for, but it has some limitations. PostgreSQL supports a hash index type:

CREATE INDEX strings_hash_idx ON "STRINGS_DB" USING hash (str);

It works for simple equality searches with =, just like the one you have. I quote the manual on the limitations:

Hash index operations are not presently WAL-logged, so hash indexes might need to be rebuilt with REINDEX after a database crash. They are also not replicated over streaming or file-based replication. For these reasons, hash index use is presently discouraged.


A quick test on a real life table, 433k rows, 59 MB total:

SELECT * FROM tbl WHERE email = '[email protected]'
-- No index, sequential scan: Total runtime: 188     ms
-- B-tree index (default):    Total runtime:   0.046 ms
-- Hash index:                Total runtime:   0.032 ms

That's not huge, but something. The difference will be more substantial with longer strings than the email address in my test. Index creation was a matter of 1 or 2 sec. with either index.

0

Assuming you don't actually need the id column, I think this gives the query planner the best chance to optimize:

select 1
where exists(
    select 1 
    from STRINGS_DB
    where STR = 'MyString'
)
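
A closely related variant, sketched here in Python with the standard-library sqlite3 module standing in for PostgreSQL, is SELECT EXISTS (...). Unlike the query above, it always returns exactly one row holding a true/false value, which is convenient to read as a scalar. The helper name exists and the sample data are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE strings_db (id INTEGER PRIMARY KEY, str TEXT UNIQUE)")
conn.execute("INSERT INTO strings_db (str) VALUES (?)", ("MyString",))

def exists(conn, value):
    """Run SELECT EXISTS (...) and convert the single result value to bool."""
    (flag,) = conn.execute(
        "SELECT EXISTS (SELECT 1 FROM strings_db WHERE str = ?)", (value,)
    ).fetchone()
    return bool(flag)

print(exists(conn, "MyString"), exists(conn, "Other"))
```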

1 Comment

An EXISTS semi-join could help with duplicates, but will not with "unique text strings".
0

While all the answers here have their merit, I wish to mention another aspect.

Building your query this way, by concatenating the string into the SQL text, does not help the database engine optimize it. Instead, you should write a stored procedure, call it with a single parameter, and let the database engine build a query plan and reuse it for each call.

Of course, the field should be indexed.

0

[Edit] Limit the result to the first record that meets the criteria. For SQL Server: select TOP 1 ...; for MySQL/PostgreSQL: select ... LIMIT 1;

If there can be multiples, perhaps adding a "TOP 1" to your select statement could return faster.

String checkIfAlreadyScanned = "SELECT TOP 1 id FROM \"STRINGS_DB\"  where STR ='" + mystring + "'";

That way, it only has to find the first instance of the string.

But, if you don't have multiples, you'll not likely see much benefit with this approach.

Like others have said, putting an index on it may help.

2 Comments

just add "limit 1" at the end if you're using postgres: forums.devshed.com/postgresql-help-21/…
First sentence in the question: "unique text strings". No multiples.
