How to represent an array of arrays in a relational database

Question

I need to design a database schema for the following problem. Consider this simplified grammatical 'analysis' of some example phrase:

'Extraordinarily incompetent taxi-driver'
1. Extra-₁ ordinari₂ -ly₃
2. in-₁ competent₂
3. taxi-₁ driv₂ -er₃

In this model a sentence consists of an array of words, and a word is made up of an array of word parts/morphemes. Relational databases are – as I am learning, notoriously – not very happy about arrays of arrays.

I see two solutions and am unsure how to make the right decision. The first, 'dirty' solution: a single intermediary table which links sentences with morphemes, and stores the array indices. Lots of identical entries in columns.

CREATE TABLE word (           -- pseudo-SQL
  sentence_id FOREIGN KEY,
  sentence_order INTEGER,
  morpheme_id FOREIGN KEY,
  morpheme_order INTEGER );

The second, 'clean' solution: Three (!) intermediary tables, probably slow and uncomfortable to use? Note how the word table only serves up IDs for the two foreign key tables to use.

CREATE TABLE sentence_word (
  sentence_id FOREIGN KEY,
  word_id FOREIGN KEY,
  order INTEGER );
CREATE TABLE word ( id );
CREATE TABLE morpheme_word (
  morpheme_id INTEGER FOREIGN KEY,
  word_id INTEGER FOREIGN KEY,
  order INTEGER );

I would normally prefer a clean solution but here the clean solution has a kludgy feel to it. I'm trying to do this with a web framework ORM, by the way (Django).

To be honest, my 2cts is that a relational database is a poor starting point for anything involving NLP (Natural Language Processing). Otherwise your second design makes sense (somehow).
– Joannes Vermorel, Commented Jan 24, 2012 at 21:27
Some relational databases (e.g. Oracle) have the concept of a nested table (orafaq.com/wiki/NESTED_TABLE), which can be used to implement an array of arrays.
– Raihan, Commented Jan 24, 2012 at 21:40
Yes, some databases support array types; MongoDB allows similar stuff. But given the framework I am working with, I'm looking for a more low-tech/truly relational solution (MySQL or PostgreSQL will do).
– glts, Commented Jan 24, 2012 at 21:45

mklauber · Accepted Answer · 2012-01-24 22:09:46Z

2

Your second solution is the technically correct one. The kludge you're feeling is in fact not due to the array-of-arrays problem but to the fact that you have a many-to-many relationship between sentence and word, and between morpheme and word. (Any given sentence can contain one or many words, and any word can be part of one or many sentences.) This is a normal kludge that is an (unfortunate?) side effect of SQL, where every many-to-many relationship needs its own intermediary table.
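To make the many-to-many point concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than Django. The `label` columns, the name `position` (chosen because `order` is a reserved word in SQL) and the second phrase are assumptions added for illustration; they are not part of the question's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sentence (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE word     (id INTEGER PRIMARY KEY, label TEXT);
    -- Intermediary table: one row per word slot in a sentence.
    CREATE TABLE sentence_word (
        sentence_id INTEGER NOT NULL REFERENCES sentence (id),
        word_id     INTEGER NOT NULL REFERENCES word (id),
        position    INTEGER NOT NULL,
        PRIMARY KEY (sentence_id, position)
    );
""")

conn.executemany(
    "INSERT INTO sentence (id, label) VALUES (?, ?)",
    [(1, "phrase from the question"), (2, "hypothetical second phrase")],
)
conn.executemany(
    "INSERT INTO word (id, label) VALUES (?, ?)",
    [(1, "Extraordinarily"), (2, "incompetent"), (3, "taxi-driver"), (4, "polite")],
)
# Word 3 ("taxi-driver") is stored once but linked to both sentences.
conn.executemany(
    "INSERT INTO sentence_word (sentence_id, word_id, position) VALUES (?, ?, ?)",
    [(1, 1, 1), (1, 2, 2), (1, 3, 3), (2, 4, 1), (2, 3, 2)],
)


def words_of(sentence_id):
    """Return the labels of a sentence's words, in sentence order."""
    rows = conn.execute(
        """
        SELECT w.label
        FROM sentence_word AS sw
        JOIN word AS w ON w.id = sw.word_id
        WHERE sw.sentence_id = ?
        ORDER BY sw.position
        """,
        (sentence_id,),
    )
    return [label for (label,) in rows]


print(words_of(1))
print(words_of(2))
```

The same pattern would apply between word and morpheme. The cost is one extra JOIN per level; the gain is that a shared word exists as a single row that other data can point at.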

Since you mention Django: it attempts to abstract some of this for you with many-to-many fields.

I think as a basic model for your Django app you're looking at something like this:

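from django.db import models

class Sentence(models.Model):
    # String references let a field name a model that is defined further down.
    words = models.ManyToManyField('Word', through='SentenceWord')

class Word(models.Model):
    morphemes = models.ManyToManyField('Morpheme', through='MorphemeWord')

class Morpheme(models.Model):
    pass


#--- Intermediate Tables ------------
class SentenceWord(models.Model):
    # on_delete is a required argument since Django 2.0.
    sentence = models.ForeignKey(Sentence, on_delete=models.CASCADE)
    word = models.ForeignKey(Word, on_delete=models.CASCADE)
    position = models.IntegerField()  # index of the word within the sentence

class MorphemeWord(models.Model):
    word = models.ForeignKey(Word, on_delete=models.CASCADE)
    morpheme = models.ForeignKey(Morpheme, on_delete=models.CASCADE)
    position = models.IntegerField()  # index of the morpheme within the word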

Admittedly, I just typed up those models here, but that should get you close to where you need to be.

EDIT: Introduced Word Model.

edited Jan 24, 2012 at 22:09

answered Jan 24, 2012 at 21:46

mklauber


2 Comments

ypercubeᵀᴹ Over a year ago

Your Django model seems a bit different than both OP's 1st and 2nd model. Your model does not have a Word relation.

glts Over a year ago

For a first design, going without a word relation would actually work in my case. Thanks! But I'm also thinking ahead: as soon as the 'word' concept gets fleshed out a bit, it will certainly need its own table.

Branko Dimitrijevic · Accepted Answer · 2012-01-24 21:38:53Z

1

You'll have a hard time enforcing proper morpheme order in the first design, and for that reason I like the second design better. However, if performance is the issue, the first design might allow you to do less JOINing.
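As one possible reading of "enforcing order" in the second design, the sketch below puts a UNIQUE constraint on (word_id, position) in the morpheme-to-word table, so the database itself rejects two morphemes in the same slot of a word. It uses Python's sqlite3 module; the `form` column, the name `position` and the `try_insert` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE word     (id INTEGER PRIMARY KEY);
    CREATE TABLE morpheme (id INTEGER PRIMARY KEY, form TEXT);
    CREATE TABLE morpheme_word (
        word_id     INTEGER NOT NULL REFERENCES word (id),
        morpheme_id INTEGER NOT NULL REFERENCES morpheme (id),
        position    INTEGER NOT NULL,
        UNIQUE (word_id, position)   -- at most one morpheme per slot of a word
    );
    INSERT INTO word (id) VALUES (1);
    INSERT INTO morpheme (id, form) VALUES (1, 'in-'), (2, 'competent');
    -- in-(1) competent(2)
    INSERT INTO morpheme_word (word_id, morpheme_id, position)
        VALUES (1, 1, 1), (1, 2, 2);
""")


def try_insert(word_id, morpheme_id, position):
    """Return True if the row was stored, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO morpheme_word (word_id, morpheme_id, position) "
                "VALUES (?, ?, ?)",
                (word_id, morpheme_id, position),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Slot 2 of word 1 is already taken by 'competent', so this is rejected.
duplicate_slot_accepted = try_insert(1, 1, 2)
print(duplicate_slot_accepted)
```

Note that this constraint only prevents collisions. It does not prevent gaps in the numbering (positions 1, 2, 5), which would still have to be checked in application code.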

If you happen to use Oracle, you might be able to have your cake and eat it too, by marrying the second design for "cleanliness" with materialized views for performance.

answered Jan 24, 2012 at 21:38

Branko Dimitrijevic


1 Comment

ypercubeᵀᴹ Over a year ago

The other disadvantage of the 1st design is that you can't store a word without storing a sentence that it belongs to.

glts · Accepted Answer · 2012-01-29 15:26:40Z

If we take the data structure involved as an array of array values, there is a simple alternative solution that is clean, efficient, and intuitive to work with:

CREATE TABLE Sentence (...);
CREATE TABLE Word     ( sentence_id FOREIGN KEY,
                        order INTEGER );
CREATE TABLE Morpheme ( word_id FOREIGN KEY,
                        order INTEGER );

This is simply a 1-to-N relationship, twice. (With Django's ORM you could then simply say word.sentence to access the sentence a Word instance belongs to, or sentence.word_set.order_by('order') to get the ordered set of words in some sentence.)

The drawback of this design is that array items which occur many times, such as -ly in extraordinari-ly, are stored many times in the database, once for each occurrence. Because each occurrence is a separate row, it is not possible to associate additional data with all -ly morphemes in one place.
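A minimal sketch of this design with Python's sqlite3 module, to show both the round trip and the duplication. The `form` column, the name `position` (used because `order` is a reserved word in SQL), the `store` and `load` helpers and the second phrase are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sentence (id INTEGER PRIMARY KEY);
    CREATE TABLE word (
        id          INTEGER PRIMARY KEY,
        sentence_id INTEGER NOT NULL REFERENCES sentence (id),
        position    INTEGER NOT NULL
    );
    CREATE TABLE morpheme (
        id       INTEGER PRIMARY KEY,
        word_id  INTEGER NOT NULL REFERENCES word (id),
        position INTEGER NOT NULL,
        form     TEXT NOT NULL
    );
""")


def store(sentence):
    """Insert a list of words (each a list of morpheme strings); return its id."""
    sentence_id = conn.execute("INSERT INTO sentence DEFAULT VALUES").lastrowid
    for word_pos, morphemes in enumerate(sentence, start=1):
        word_id = conn.execute(
            "INSERT INTO word (sentence_id, position) VALUES (?, ?)",
            (sentence_id, word_pos),
        ).lastrowid
        conn.executemany(
            "INSERT INTO morpheme (word_id, position, form) VALUES (?, ?, ?)",
            [(word_id, pos, form) for pos, form in enumerate(morphemes, start=1)],
        )
    return sentence_id


def load(sentence_id):
    """Rebuild the array of arrays for one sentence, in stored order."""
    rows = conn.execute(
        """
        SELECT w.position, m.form
        FROM word AS w
        JOIN morpheme AS m ON m.word_id = w.id
        WHERE w.sentence_id = ?
        ORDER BY w.position, m.position
        """,
        (sentence_id,),
    )
    words = {}
    for word_pos, form in rows:
        words.setdefault(word_pos, []).append(form)
    return [words[pos] for pos in sorted(words)]


phrase = [["Extra-", "ordinari", "-ly"], ["in-", "competent"], ["taxi-", "driv", "-er"]]
first_id = store(phrase)
second_id = store([["quick", "-ly"]])  # hypothetical second phrase

# The morpheme '-ly' now exists as two unrelated rows, one per occurrence.
ly_rows = conn.execute(
    "SELECT COUNT(*) FROM morpheme WHERE form = ?", ("-ly",)
).fetchone()[0]
print(load(first_id), ly_rows)
```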
