Using Django's GeneratedField

Django (the python web framework) version 5.0, released in December 2023, introduced a new field type for models - the GeneratedField. This is potentially a very handy tool - it allows you to define fields that are based on other fields, but gives you the flexibility and speed that comes with a field belonging to the database, without the overhead of needing to store duplicate data.

I recently had an opportunity to use this field in a real life project, so I wanted to test how well it worked.

My use case

I work a lot with financial data that comes from regulators, and that data is often wrong or needs cleaning in some way. However, I'm also consuming a "firehose" of data from those regulators, with updates coming every day.

What I need is a way of storing both the original value that comes from the regulator, along with the ability to overwrite it with a manually checked value. That way I can keep storing the original value from regulator data without needing to re-enter the manual value every time.

So if I'm gathering an organisation's turnover in a given year I end up with two fields in the database:

  • turnover_regulator - the value of Turnover from the regulator's data. This could be null if not provided, but would typically be available for most organisations. This field is updated by a nightly data fetching process - any data is this field is overwritten every night.
  • turnover_manual - this is a field that is null by default but can be manually changed to a correct value if needed. This field isn't touched by the data fetching process so any changes will be kept.

Both these fields are integer, float or decimal depending on the exact need, and they can both have null values. So in a django model they would look something like:

from django.db import models

class FinancialYear(models.Model):
    organisation = ...
    financial_year_end = ...
    turnover_regulator = models.DecimalField(null=True)
    turnover_manual = models.DecimalField(null=True)

Once this data is stored, I need some way of finding the best available estimate of turnover for an organisation. This is found by checking whether turnover_manual has a value in it - if it does then use that, if not then use the value in turnover_regulator.

The old way

With this setup I would have two options for calculating the final turnover figure:

1. Django model property

I could set up a property on the model which fetches both values from the database and then uses Python to choose which one to use. It might look something like:

class FinancialYear(models.Model):
    ...

    @property
    def turnover(self):
        if self.turnover_manual is not None:
            return self.turnover_manual
        return self.turnover_regulator

This works fine for looking at a single record, and has the advantage of not needing unnecessary storage in the database. However, when dealing with thousands of rows it requires every row to be fetched from the database and then processed in Python, so you can't take advantage of the optimisations that come with calculating things in the database. For example, I wouldn't be able to create a sum of turnover for a set of organisations using the database.

2. Updating a separate field

The other way to store this is to create a separate turnover field on the model which you thne need to make sure is updated with the right value.

You could update it by patching into the models' save() method, although this isn't always trigged when the model is updated in bulk. Or you could have some kind of bulk update process that runs periodically - although this would mean that there's times when the field is out of sync. Or you may be able to set up a trigger to update the data in the database - but this would decouple the database code from Django and could have performance concerns.

Using GeneratedField instead

The new GeneratedField gives us a third option. Instead, we can use GeneratedField to define an SQL function, and then the database will make sure that the value is always available to use as if it was any normal field.

I did this using the following definition:

class FinancialYear(models.Model):
    ...
    turnover = models.GeneratedField(
        expression=models.functions.Coalesce("turnover_manual", "turnover_regulator"),
        output_field=models.DecimalField(),
        db_persist=True,
    )

What I'm doing here is defining a field called turnover that uses the Coalesce SQL function to choose the first non-null value from the two fields. It's a pretty simple use-case, but it is possible to do more complex expression there too.

I've also defined an output_field which tells Django/the database what type of data will be returned, and db_persist which sets whether it's permanently stored in the database or not (I only used this when it was set to True so I haven't seen what the performance impact might be).

I found this worked really well - I could access the value in Django, for example adding it as a (read-only) field in admin pages, but it was also available in more complex SQL queries that I was creating to aggregate and filter data. I haven't done any proper benchmarking but I didn't notice any big impacts on performance when using it. And it seemed to work - Django and the databases take care of the details of keeping the field up-to-date, so I didn't need to worry about out-of-date data being used.

Limitations

My experience was generally great, but there were a few limitations I found. I've also only tested it in a fairly narrow way - I only used it with Sqlite, and only with a narrow set of generated field expressions so there may be other issues.

Bugs

This is quite a new feature, so I imagine there are a few bugs around that might cause issues when using it.

I particularly found an issue with creating new objects when the primary key was defined rather than generated - this happened to match my exact use-case! Fortunately, this bug has now been fixed in version 5.0.5 (released this week), but as with any new features there may be others out there.

Generating migrations

At the moment, GeneratedField will work correctly when first created in a migration, but changing the GeneratedField will result in a migration that doesn't work. There is a simple workaround for this - simply replace the migrations.AlterField operation that is created with two new migrations - one that does migrations.RemoveField and one that does migrations.AddField (because the field is generated from other data you don't need to worry about losing data).

It isn't ideal to have to do this manually every time you change the GeneratedField, but it's not a big problem.

Using with relationships

As the docs explain, GeneratedFields can only access fields from the same model, so it's not possible to access fields in related tables. This does block of some of the potential use cases for a generated field in aggregating data from a related model. But again, not a major blocker.

Further reading