I need to design a database schema for the following problem. Consider this simplified grammatical 'analysis' of an example phrase:
- 'Extraordinarily incompetent taxi-driver'
- Extra-₁ ordinari₂ -ly₃
- in-₁ competent₂
- taxi-₁ driv₂ -er₃
In this model a sentence consists of an array of words, and a word is made up of an array of word parts (morphemes). Relational databases are, as I am learning, notoriously unhappy about arrays of arrays.
I see two solutions and am unsure how to choose between them. The first, 'dirty' solution: a single intermediary table which links sentences with morphemes and stores both array indices. This leaves lots of identical entries in the columns, since every morpheme of a word repeats that word's sentence_id and sentence_order.
CREATE TABLE word ( -- pseudo-SQL
    sentence_id FOREIGN KEY,
    sentence_order INTEGER,
    morpheme_id FOREIGN KEY,
    morpheme_order INTEGER );
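To make the first option concrete, here is a minimal sketch using Python's built-in sqlite3 module. The sentence and morpheme tables, their columns (label, form), the composite primary key and the load_sentence helper are assumptions I made up for illustration; only the four columns of word come from the pseudo-SQL above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sentence (id INTEGER PRIMARY KEY, label TEXT);       -- assumed
CREATE TABLE morpheme (id INTEGER PRIMARY KEY, form TEXT UNIQUE); -- assumed
CREATE TABLE word (
    sentence_id    INTEGER NOT NULL REFERENCES sentence (id),
    sentence_order INTEGER NOT NULL,  -- position of the word in the sentence
    morpheme_id    INTEGER NOT NULL REFERENCES morpheme (id),
    morpheme_order INTEGER NOT NULL,  -- position of the morpheme in the word
    PRIMARY KEY (sentence_id, sentence_order, morpheme_order)     -- assumed
);
""")

# The example phrase as an array of arrays: words, each a list of morphemes.
PHRASE = [["Extra-", "ordinari", "-ly"], ["in-", "competent"], ["taxi-", "driv", "-er"]]

conn.execute("INSERT INTO sentence (id, label) VALUES (1, 'example phrase')")
for word_pos, morphemes in enumerate(PHRASE, start=1):
    for morph_pos, form in enumerate(morphemes, start=1):
        # Reuse an existing morpheme row if this form was stored before.
        conn.execute("INSERT OR IGNORE INTO morpheme (form) VALUES (?)", (form,))
        (morpheme_id,) = conn.execute(
            "SELECT id FROM morpheme WHERE form = ?", (form,)
        ).fetchone()
        conn.execute(
            "INSERT INTO word VALUES (?, ?, ?, ?)",
            (1, word_pos, morpheme_id, morph_pos),
        )


def load_sentence(conn, sentence_id):
    """Rebuild the nested list of morphemes from the flat rows."""
    rows = conn.execute(
        """
        SELECT w.sentence_order, m.form
        FROM word AS w
        JOIN morpheme AS m ON m.id = w.morpheme_id
        WHERE w.sentence_id = ?
        ORDER BY w.sentence_order, w.morpheme_order
        """,
        (sentence_id,),
    )
    words = {}
    for word_pos, form in rows:
        words.setdefault(word_pos, []).append(form)
    return [words[pos] for pos in sorted(words)]
```

In this sketch the phrase takes eight rows in word, one per morpheme, and load_sentence(conn, 1) gives back the same nested list as PHRASE. The repeated sentence_id and sentence_order values are the duplication I mean.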
The second, 'clean' solution: three (!) intermediary tables, which I suspect would be slow and uncomfortable to use. Note how the word table only serves up IDs for the two foreign-key tables to reference.
CREATE TABLE sentence_word (
    sentence_id FOREIGN KEY,
    word_id FOREIGN KEY,
    order INTEGER );
CREATE TABLE word ( id );
CREATE TABLE morpheme_word (
    morpheme_id INTEGER FOREIGN KEY,
    word_id INTEGER FOREIGN KEY,
    order INTEGER );
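For comparison, a minimal sketch of the second option, again using Python's sqlite3 module. The sentence and morpheme tables, the form column, the primary keys and the load_sentence helper are assumptions for illustration. I also named the index column position instead of order, because ORDER is a reserved word in SQL and would need quoting.

```python
import sqlite3
from itertools import groupby

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE sentence (id INTEGER PRIMARY KEY);                   -- assumed
CREATE TABLE morpheme (id INTEGER PRIMARY KEY, form TEXT UNIQUE); -- assumed
CREATE TABLE word (id INTEGER PRIMARY KEY);  -- only hands out word IDs
CREATE TABLE sentence_word (
    sentence_id INTEGER NOT NULL REFERENCES sentence (id),
    word_id     INTEGER NOT NULL REFERENCES word (id),
    position    INTEGER NOT NULL,  -- 'order' in the pseudo-SQL
    PRIMARY KEY (sentence_id, position)
);
CREATE TABLE morpheme_word (
    morpheme_id INTEGER NOT NULL REFERENCES morpheme (id),
    word_id     INTEGER NOT NULL REFERENCES word (id),
    position    INTEGER NOT NULL,  -- 'order' in the pseudo-SQL
    PRIMARY KEY (word_id, position)
);
""")

PHRASE = [["Extra-", "ordinari", "-ly"], ["in-", "competent"], ["taxi-", "driv", "-er"]]

db.execute("INSERT INTO sentence (id) VALUES (1)")
for word_pos, morphemes in enumerate(PHRASE, start=1):
    # One word row per word; its generated ID ties the two link tables together.
    word_id = db.execute("INSERT INTO word DEFAULT VALUES").lastrowid
    db.execute("INSERT INTO sentence_word VALUES (1, ?, ?)", (word_id, word_pos))
    for morph_pos, form in enumerate(morphemes, start=1):
        db.execute("INSERT OR IGNORE INTO morpheme (form) VALUES (?)", (form,))
        (morpheme_id,) = db.execute(
            "SELECT id FROM morpheme WHERE form = ?", (form,)
        ).fetchone()
        db.execute(
            "INSERT INTO morpheme_word VALUES (?, ?, ?)",
            (morpheme_id, word_id, morph_pos),
        )


def load_sentence(db, sentence_id):
    """Rebuild the nested list of morphemes with two joins."""
    rows = db.execute(
        """
        SELECT sw.position, m.form
        FROM sentence_word AS sw
        JOIN morpheme_word AS mw ON mw.word_id = sw.word_id
        JOIN morpheme AS m ON m.id = mw.morpheme_id
        WHERE sw.sentence_id = ?
        ORDER BY sw.position, mw.position
        """,
        (sentence_id,),
    )
    # Rows arrive sorted by word position, so consecutive grouping is enough.
    return [[form for _, form in group] for _, group in groupby(rows, key=lambda r: r[0])]
```

Here the word position is stored once per word (three rows in sentence_word) rather than once per morpheme, at the cost of an extra join when reading a sentence back.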
I would normally prefer a clean solution, but here the clean solution has a kludgy feel to it. I'm trying to do this with a web framework's ORM (Django), by the way.