How to store tree structure in relational database.

2019-11-11 · updated 2019-12-15

Probably you can use non relational database, have graph databese, but say, you have mostly relational data, with some data tree like? For instance there is system with multi-level of categories.

Silly solutions

Table per each level of the tree

Let’s consider naive solution - every level in separate table. That is far from perfect - first every, additional level requires additional table. More - if we wish to have items in intermediate categories? Every level has to be referenced by foreign key in items’ table.

When you would consider it as valid? Well, let’s say it is company structure, and business processes are involved, every level of the structure has it’s own meaning, properties. In that case it looks like changing database is smallest problem, when changing the structure. On the other hand having this, particular structure has benefits - explicit constraints at database level. Sample structure with tree levels:

create table business_unit (
  id serial,
  name varchar(200) unique,
  primary key (id)
);

create table department (
  id serial,
  business_unit_id integer,
  name varchar(200) unique,
  primary key (id),
  foreign key (business_unit_id) references business_unit(id) deferrable initially deferred
);

create table employee (
  id serial,
  department_id integer,
  name varchar(200) unique,
  primary key (id),
  foreign key (department_id) references department(id) deferrable initially deferred
);

To populate with test data, you can use something like this:

insert into business_unit(name)
    (select md5(random()::text) as name from generate_series(1, 5));
insert into department(business_unit_id, name)
    (select b.id, md5(random()::text) as name from business_unit b, generate_series(1, 5));
insert into employee(department_id, name)
    (select d.id, md5(random()::text) as name from department d, generate_series(1, 50));

To check if it worked:

select department_id, count(*) from employee group by department_id order by department_id;

That is quite ok to have something like that to populate test database in order to perform integration tests. However from database perspective, amount of data is quite small. To see how this structure behaves much more data is needed. Sure, some zeros could be added to generate_series, but question is if that is still real case, and as programmer, I can rely on results.

Table with whole tree

Getting back to initial example. Much easier it would be to have single table - self referenced. Issue is that we need many queries. Starting from root, we need it’s children, and then for every child we need to fetch children, and that way till the end. Other option would be to query all the categories, and then build up the tree. But what if structure is much bigger - we have nestled posts like in our mailbox? Fetching all messages to see that only some of them are worth sending to the customer, because rest is not visible? Sounds like something wrong here.

create table category (
  id serial,
  name varchar(200) unique,
  parent_category_id integer,
  level integer,
  primary key (id)
);

insert into category(name, level)
    (select md5(random()::text), 1 from generate_series(1, 10));
insert into category(name, level, parent_category_id)
    (select md5(random()::text), 2, c.id from category c, generate_series(1, 10) where c.level = 1);
insert into category(name, level, parent_category_id)
    (select md5(random()::text), 3, c.id from category c, generate_series(1, 10) where c.level = 2);
insert into category(name, level, parent_category_id)
    (select md5(random()::text), 4, c.id from category c, generate_series(1, 10) where c.level = 3);

Something more sophisticated

We can store path of the posts as string / varchar. With separator, and index on that path - we can fetch what is needed. Well it sounds reasonably. There is one counter argument - database, in case of varchars indexing only first part of field - say 20 characters.

create table category (
  id serial,
  name varchar(200) unique,
  parent_category_id integer,
  primary key (id)
);

What if that is not enough?

There is another way. For every node in the tree we have two additional columns. First, let’s call it - left - will have smallest number in subtree, Second - right - will have biggest number in the tree. The numbers can’t repeat and are growing. In this case - having a node, we can get subtree in a very efficient way. Where is the catch? Well, changes requires more updates then in previous cases.

create table category (
  id serial,
  name varchar(200) unique,
  parent_category_id integer,
  primary key (id)
);

Could recursive query be used?

Yes. Modern databases can use recursive queries.

create table category (
  id serial,
  name varchar(200) unique,
  parent_category_id integer,
  primary key (id)
);