Most companies have a large set of structured data. That is why it is important to keep a close eye on the consistency of that data, to make sure you have good backups and that the data is always accurate, available and actionable.
One of the most frustrating questions a support engineer faces is “but can you tell us what was there before?” Many people seem to assume that structured data also implies you can tell what the data was at any point in time. As admin wizards, we pull our backups, restore them quickly on a new machine and report what the value was at the particular moment the backup was made. But who made the change, when exactly was it made, or worse, what if the update happened outside the backup scope?…
Rather than relying on such a stop-gap solution, an alternative is a database that also maintains a journal of all changes. Let’s have a look at how exactly that could work.
Building up a journal
First things first: let’s assume we have a simple database running on Postgres, containing a table of all our valued customers.
compose=> \d contact
Column | Type | Nullable | Default
id | integer | not null | nextval('contact_seq')
uid | integer | not null |
changed | timestamp without time zone | not null | now()
name | character varying | not null |
company | character varying | not null |
street | character varying | not null |
street2 | character varying | |
street3 | character varying | |
postalcode | character varying(16) | not null |
city | character varying | not null |
state | character varying | |
country_id | character varying(2) | not null |
phone | character varying(25) | |
fax | character varying(25) | |
vat | character varying | not null |
email | character varying | not null |
status | integer | not null |
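For reference, the psql listing above corresponds to a table definition along these lines (a sketch reconstructed from the output; the primary key on id is an assumption, as the excerpt does not show the table’s indexes):

```sql
CREATE SEQUENCE contact_seq;

CREATE TABLE contact (
    id          integer     NOT NULL DEFAULT nextval('contact_seq'),
    uid         integer     NOT NULL,
    changed     timestamp   NOT NULL DEFAULT now(),
    name        varchar     NOT NULL,
    company     varchar     NOT NULL,
    street      varchar     NOT NULL,
    street2     varchar,
    street3     varchar,
    postalcode  varchar(16) NOT NULL,
    city        varchar     NOT NULL,
    state       varchar,
    country_id  varchar(2)  NOT NULL,
    phone       varchar(25),
    fax         varchar(25),
    vat         varchar     NOT NULL,
    email       varchar     NOT NULL,
    status      integer     NOT NULL,
    PRIMARY KEY (id)
);
```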
As we are trying to create an overview of all changes and would like to be able to search through that data too, the obvious choice is to build a new table. This new table looks exactly the same as the original table but adds some additional columns. To maintain history, we need to define the action that changed the data, the period for which the data was valid and the interaction that created the change. So, we’ll start with an event table that describes the interaction.
Column | Type | Nullable | Default
id | bigint | not null | nextval('event_seq')
who | bigint | not null |
remote_ip | character varying(50) | |
parameter | text | not null |
In short, the table contains a reference number (id), the user that interacted with the system, the IP address from where the action was started and the parameters that were passed at the moment the interaction took place. With this we can build our journal table copy of our contact table.
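The event table can be created as follows (again a sketch; the primary key is an assumption):

```sql
CREATE SEQUENCE event_seq;

CREATE TABLE event (
    id        bigint NOT NULL DEFAULT nextval('event_seq'),
    who       bigint NOT NULL,
    remote_ip varchar(50),
    parameter text   NOT NULL,
    PRIMARY KEY (id)
);
```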
Column | Type | Nullable | Default
operation | character varying(10) | not null |
from | timestamp with time zone | not null | now()
event_id | bigint | |
id | integer | |
uid | integer | |
changed | timestamp without time zone | |
<COPY OF REMAINING CONTACT TABLE COLUMNS>
We add the operation, which can be ‘INSERT’, ‘UPDATE’ or ‘DELETE’. The ‘from’ field records the moment from which the journalled row version was current. The event_id references the id in the event table.
By doing so, we can track any data change: when it happened, how long it was valid and who made it. As an added bonus, if a single interaction causes multiple rows in multiple tables to be updated, they all share the same event_id. This way we can see exactly what happened and link multiple database changes to the same event.
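One way to build this journal table without repeating every column definition is Postgres’ LIKE clause, which copies the column definitions of the main table (note that "from" is a reserved word in SQL and therefore has to be quoted):

```sql
CREATE TABLE contact_j (
    operation varchar(10) NOT NULL,
    "from"    timestamptz NOT NULL DEFAULT now(),
    event_id  bigint,
    LIKE contact
);
```

LIKE copies the columns but, by default, not the defaults or constraints of the source table, which is exactly what we want here: the journal columns should simply store whatever value the main table held.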
The first thing to do is to make sure an event gets created for every entry in the journal. This is easily enforced by declaring event_id as not null, so a value must be present for every journal entry. To make life a bit easier, we set its default to the current value of the event sequence, i.e. the id of the latest event inserted in this session. Adding a foreign key constraint that the referenced value must exist makes it impossible to create a journal entry without referencing an existing event entry.
Column | Type | Nullable | Default
event_id | bigint | not null | currval('event_seq')
"contact_event_fk" FOREIGN KEY (event_id) REFERENCES event(id)
Note that this is where the Postgres functionality starts kicking in. We are using a predefined sequence. By creating an event, the next value of the sequence is generated and can be accessed over and over again. Creating a new row in the table will automatically fill in the event_id and reference to the latest event. If no event has been created in the current session, inserting an entry in the table will be impossible as the sequence will not have a current value to access.
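The constraint and default described above can be wired up with an ALTER TABLE, after which a quick session demonstrates the sequence mechanics (the values for who and parameter are made up for illustration):

```sql
ALTER TABLE contact_j
    ALTER COLUMN event_id SET NOT NULL,
    ALTER COLUMN event_id SET DEFAULT currval('event_seq'),
    ADD CONSTRAINT contact_event_fk
        FOREIGN KEY (event_id) REFERENCES event(id);

-- Creating an event advances the sequence...
INSERT INTO event (who, parameter) VALUES (42, 'update address form');

-- ...and currval('event_seq') now returns that event's id for the rest
-- of the session, so every journal row defaults to it.
SELECT currval('event_seq');
```

If no event has been inserted in the session, currval() raises an error, which is precisely the safeguard described above.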
But that is only part of the story. Enforcing the creation of journal entries upon insert, update and delete has to be mandatory. Most modern databases including Postgres allow you to do exactly that. In Postgres this functionality is called “trigger functions”.
CREATE TRIGGER journal_contact AFTER INSERT OR DELETE OR UPDATE ON contact FOR EACH ROW EXECUTE PROCEDURE jnl_contact();
On any INSERT, DELETE or UPDATE on the table, the function jnl_contact is called for each affected row. The function, written in PL/pgSQL, makes sure that for every such action on the main table, the journal gets a copy of the data and the additional fields are filled in.
CREATE OR REPLACE FUNCTION jnl_contact() RETURNS trigger AS $$
BEGIN
    IF (TG_OP = 'DELETE') THEN
        INSERT INTO contact_j SELECT TG_OP, now(), currval('event_seq'), OLD.*;
        RETURN OLD;
    ELSIF (TG_OP = 'UPDATE') THEN
        INSERT INTO contact_j SELECT TG_OP, now(), currval('event_seq'), NEW.*;
        RETURN NEW;
    ELSIF (TG_OP = 'INSERT') THEN
        INSERT INTO contact_j SELECT TG_OP, now(), currval('event_seq'), NEW.*;
        RETURN NEW;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;
Inside the trigger, the variable TG_OP holds the current operation type, and the existing or new content of the current row can be accessed through the OLD.<fieldname> and NEW.<fieldname> variables.
Now let’s have a look at the result of all our hard work. We have created a shadow table contact_j next to our contact table, which receives a copy of every change to the main table. As soon as we insert something into the main table, the entry is copied into the shadow table with the operation set to ‘INSERT’ and a reference to the event table filled in. If an update is performed, the same function is triggered, but now an additional row with the new data is stored in the shadow table with the operation set to ‘UPDATE’. Lastly, when a row is deleted, a final entry is written to the shadow table referencing the existing data just before it is eliminated from the database. In all cases, the reference to the id in the event table is set automatically. If such a reference does not exist, the addition to the shadow table fails and hence the operation on the main table fails too, forcing the entire operation to be undone by a rollback.
The result is that when you want to change the database, you have to create an entry in the event table first. Once that is done, all transactions on tables with the journal trigger active automatically create copies. Every single insert, update and delete generates a shadow copy. All actions in a single transaction are grouped and linked to the same event.
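Put together, a typical interaction looks like this (assuming the tables, default and trigger from above are in place; all values are made up for illustration):

```sql
BEGIN;
-- 1. Register the interaction first; without it, currval('event_seq')
--    has no value and any journalled write would fail.
INSERT INTO event (who, remote_ip, parameter)
VALUES (7, '192.0.2.10', 'POST /contacts');

-- 2. Perform the actual change; the trigger copies the row into
--    contact_j with operation 'INSERT' and the current event id.
INSERT INTO contact (uid, name, company, street, postalcode, city,
                     country_id, vat, email, status)
VALUES (7, 'Jane Doe', 'Acme', 'Main St 1', '1000', 'Brussels',
        'BE', 'BE0123456789', 'jane@example.com', 1);
COMMIT;

-- Afterwards, the journal shows what happened and who caused it:
SELECT j.operation, j."from", e.who, e.remote_ip
FROM contact_j j
JOIN event e ON e.id = j.event_id;
```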
With minimal effort and using modern database technology it is possible to create a framework that protects programmers from unaudited database actions. Furthermore, this framework allows detailed auditing. It records exactly when, how and by whom particular changes were made, and why they were made. The framework can also be used to recreate a view on the data for any given point in time.
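As a sketch of that last point, the following query reconstructs the state of the contact table at a given moment: for each id it takes the most recent journal entry at or before that time, and filters out rows whose last operation was a delete (the timestamp is, of course, just an example):

```sql
SELECT *
FROM (
    SELECT DISTINCT ON (id) *
    FROM contact_j
    WHERE "from" <= TIMESTAMP '2009-06-01 12:00:00'
    ORDER BY id, "from" DESC
) AS snapshot
WHERE operation <> 'DELETE';
```

DISTINCT ON combined with the ORDER BY keeps only the latest version per id, which is what makes this point-in-time view cheap to express in Postgres.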
If you are interested in keeping track of your data or have specific auditing needs for your database, Nexperteam can help you bring your data to life or put it in its historic context.