I have this current setup:
product
product_id | product_name | category_id
category
category_id | category_name
vendor
vendor_id | vendor_name | vendor_status
vendor_price
vendor_id | product_id | vendor_price
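
For reference, here is the setup as rough MySQL DDL. This is just a sketch: the column types and the 1 = active convention for vendor_status are guesses on my part, and the ON DELETE CASCADE matches the cascading delete I describe in the update below.

    CREATE TABLE category (
        category_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        category_name VARCHAR(255) NOT NULL
    );

    CREATE TABLE product (
        product_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        product_name VARCHAR(255) NOT NULL,
        category_id  INT UNSIGNED NOT NULL,
        FOREIGN KEY (category_id) REFERENCES category (category_id)
    );

    CREATE TABLE vendor (
        vendor_id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        vendor_name   VARCHAR(255) NOT NULL,
        vendor_status TINYINT NOT NULL  -- assumed: 1 = active, 0 = suspended
    );

    CREATE TABLE vendor_price (
        vendor_id    INT UNSIGNED NOT NULL,
        product_id   INT UNSIGNED NOT NULL,
        vendor_price DECIMAL(10,2) NOT NULL,
        PRIMARY KEY (vendor_id, product_id),
        FOREIGN KEY (vendor_id)  REFERENCES vendor (vendor_id),
        -- cascade so deleting a product removes its prices too
        FOREIGN KEY (product_id) REFERENCES product (product_id) ON DELETE CASCADE
    );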
As I understand it, according to the "rules" of normalization, there should be two more tables declaring the relationship, like this:
rel_product_vendor_price
product_id | vendor_price_id
rel_vendor_price_vendor
vendor_price_id | vendor_id
The vendor_price table above would then have product_id removed and a vendor_price_id added.
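
In other words, something like this (again a sketch with assumed types; this vendor_price definition would replace the one above):

    CREATE TABLE vendor_price (
        vendor_price_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        vendor_price    DECIMAL(10,2) NOT NULL
    );

    CREATE TABLE rel_product_vendor_price (
        product_id      INT UNSIGNED NOT NULL,
        vendor_price_id INT UNSIGNED NOT NULL,
        PRIMARY KEY (product_id, vendor_price_id),
        FOREIGN KEY (product_id)      REFERENCES product (product_id),
        FOREIGN KEY (vendor_price_id) REFERENCES vendor_price (vendor_price_id)
    );

    CREATE TABLE rel_vendor_price_vendor (
        vendor_price_id INT UNSIGNED NOT NULL,
        vendor_id       INT UNSIGNED NOT NULL,
        PRIMARY KEY (vendor_price_id, vendor_id),
        FOREIGN KEY (vendor_price_id) REFERENCES vendor_price (vendor_price_id),
        FOREIGN KEY (vendor_id)       REFERENCES vendor (vendor_id)
    );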
I fail to see the point in creating two more tables just to tie things together, as it complicates the queries. The INSERTs in particular become complicated and must be performed in transactions.
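
To illustrate: under the normalized layout, inserting a single price would take three statements in one transaction (a sketch; 99.95, 42, and 7 are placeholder values):

    START TRANSACTION;
    INSERT INTO vendor_price (vendor_price) VALUES (99.95);
    SET @vp_id = LAST_INSERT_ID();      -- id of the price row just inserted
    INSERT INTO rel_product_vendor_price (product_id, vendor_price_id)
        VALUES (42, @vp_id);
    INSERT INTO rel_vendor_price_vendor (vendor_price_id, vendor_id)
        VALUES (@vp_id, 7);
    COMMIT;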
Currently the tables hold more than 300,000 products, each with several vendors offering different prices, which adds up to more than 1.5 million documents in Sphinx.
Am I wrong in my design, or would there be any advantage in changing it to a more normalized design?
UPDATE
I have one more table that holds all the product categories. I have updated the schema above; I forgot it in the initial post.
Generally I split the queries by category and query each category for all the products that belong to it. When a user clicks a product, I query all the prices for that particular product and display them in descending order.
Because a vendor can be suspended (vendor.vendor_status), all queries must be performed with joins leading back to the vendor table.
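
A rough sketch of the per-product price query (assuming vendor_status = 1 means active, and 42 as a placeholder product id):

    SELECT v.vendor_name, vp.vendor_price
    FROM vendor_price AS vp
    JOIN vendor AS v ON v.vendor_id = vp.vendor_id
    WHERE vp.product_id = 42         -- the clicked product
      AND v.vendor_status = 1        -- skip suspended vendors
    ORDER BY vp.vendor_price DESC;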
On inserts, I delete everything in product from a particular vendor; all vendor prices from that vendor get deleted as well due to the foreign key constraint. Then I insert the new rows into product and vendor_price.
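
As a sketch, the refresh for one vendor looks something like this (the schema above does not show how product rows map to a vendor, so going through vendor_price here is an assumption on my part; all ids and values are placeholders):

    START TRANSACTION;
    -- Delete the vendor's products; the ON DELETE CASCADE on
    -- vendor_price.product_id removes the matching prices as well.
    DELETE p
    FROM product AS p
    JOIN vendor_price AS vp ON vp.product_id = p.product_id
    WHERE vp.vendor_id = 7;
    -- Re-insert the fresh data.
    INSERT INTO product (product_id, product_name, category_id)
        VALUES (1001, 'Example product', 3);
    INSERT INTO vendor_price (vendor_id, product_id, vendor_price)
        VALUES (7, 1001, 99.95);
    COMMIT;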
Hope this makes sense.
UPDATE 2
Having run a lot of query tests tonight, I have discovered that keeping vendor_status only in the vendor table slows things down a LOT.
This is because the database has to join vendor_price against vendor every time it selects a price, which matters a great deal when computing, for example:
MIN(vendor_price) AS min_vendor_price, MAX(vendor_price) AS max_vendor_price
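
In context, the aggregate looks roughly like this (a sketch, same status convention assumed as above):

    SELECT vp.product_id,
           MIN(vp.vendor_price) AS min_vendor_price,
           MAX(vp.vendor_price) AS max_vendor_price
    FROM vendor_price AS vp
    JOIN vendor AS v ON v.vendor_id = vp.vendor_id  -- the join that costs so much
    WHERE v.vendor_status = 1
    GROUP BY vp.product_id;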
Keeping a duplicate of vendor_status in each vendor_price row means a LOT of redundant data, but it really speeds up the SELECTs.
From "Query took 7.8040 sec" to "Query took 3.1640 sec".
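
With vendor_status duplicated into vendor_price, the same aggregate can skip the join entirely (sketch, same assumptions):

    SELECT product_id,
           MIN(vendor_price) AS min_vendor_price,
           MAX(vendor_price) AS max_vendor_price
    FROM vendor_price
    WHERE vendor_status = 1   -- duplicated column; no join to vendor needed
    GROUP BY product_id;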
When data sets get this large, I guess it's a matter of balancing query optimization against heavy caching. Normalization really gets in the way of speed, even on today's hardware.