Why PostgreSQL + Django is my data pipeline of choice

This blog post is based partly on an interesting conversation at Open Data Camp 9 in Manchester this year, where Giles from Open Innovations led a session called "DIY FTW! What's in YOUR toolbox?". As well as showcasing a bunch of tools I'd not come across, the session made me think about what my typical workflow looks like.

A large proportion of my work at the moment involves creating data pipelines. These pipelines start with "inputs" - generally some public data source, like the Charity Commission Register of Charities. Then there is some cleaning or processing - merging data together, manually adding or amending data. And finally there are some "outputs" - these might be reports on an individual charity, or a set of aggregate data about a particular group of charities.

There are a few particular requirements for these pipelines - I generally need them to be repeatable, including re-running with updated data. This means I need to keep any manual cleaning and additions carefully separated from the imported data, so that they are not lost when new data comes in.

For a small-scale data pipeline - maybe a quick piece of analysis on data fetched with an SQL query - I'll generally turn to a Python notebook. I've also experimented with newer tools like DuckDB for fetching and manipulating data.

But for anything larger than an afternoon's work, I've found that I'm reaching for Django, the Python web framework. This might seem like an unintuitive choice at first - Django is a big, sprawling piece of software, and creating data pipelines is not what it is designed for. But I've found that its "batteries-included" approach means it has all the pieces I need in place.

What you get out of the box

A big attraction of using Django is that it comes fully loaded - a lot of the components I might need in a data pipeline are included. Even if you don't need every feature for every project, there's a lot of value in having them at your fingertips anyway. I often find that I start a project thinking, for example, that I won't need user authentication - but if the requirements change it's easy to add.

The out of the box parts I find most useful are:

Data model

A Django app is structured around the database models you create. It's easy to define your models in code, and there are powerful features for managing relationships between them.
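To illustrate, here's a minimal sketch of the kind of models a charity data pipeline might define - the model and field names are hypothetical, not my actual schema:

```python
# models.py - a hypothetical sketch of a simple charity data model
from django.db import models


class Charity(models.Model):
    # Registered charity number, used as a stable identifier for imports
    reg_number = models.CharField(max_length=20, unique=True)
    name = models.CharField(max_length=255)
    date_registered = models.DateField(null=True, blank=True)

    def __str__(self):
        return self.name


class Classification(models.Model):
    # Each charity can have many classifications; Django manages the join
    charity = models.ForeignKey(
        Charity, on_delete=models.CASCADE, related_name="classifications"
    )
    category = models.CharField(max_length=100)
```

Running python manage.py makemigrations and migrate then creates and updates the underlying PostgreSQL tables for you.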

A key positive is the flexibility of Django migrations: I can change the data model, test the effect the changes will have on the data, and then deploy them.

If the rigid structure of a data model ends up being too prescriptive for the data you're dealing with, there are more flexible options. I've recently been using a pattern of storing raw data in JSON fields, and then using generated fields to access the values inside.
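Here's a rough sketch of that pattern - it assumes Django 5.0+ (for GeneratedField), and the model and JSON key names are hypothetical:

```python
# A sketch of the "store JSON, query generated columns" pattern (Django 5.0+)
from django.db import models
from django.db.models.fields.json import KeyTextTransform


class CharityRecord(models.Model):
    # The raw record from the source, stored exactly as it arrived
    data = models.JSONField(default=dict)

    # A column computed by PostgreSQL from the JSON payload, so it can be
    # filtered and indexed like any other field
    name = models.GeneratedField(
        expression=KeyTextTransform("name", "data"),
        output_field=models.TextField(),
        db_persist=True,
    )
```

This keeps the import simple - dump the whole source record into the JSON field - while still giving you ordinary columns to query.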

Admin area

After you've defined your database models, Django makes it quick and easy to add an admin interface, accessed from the web, for viewing the data. You can make some fairly complex views with the built-in admin tools - adding filters, search, and the ability to link records together. You can also add and edit data - though you do need to think carefully about what happens to those manual edits when a new data import overwrites existing records.
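A minimal sketch, reusing the hypothetical Charity model from above:

```python
# admin.py - a minimal sketch of an admin view with search and filters
from django.contrib import admin

from .models import Charity


@admin.register(Charity)
class CharityAdmin(admin.ModelAdmin):
    list_display = ("reg_number", "name", "date_registered")
    list_filter = ("date_registered",)
    search_fields = ("name", "reg_number")
    date_hierarchy = "date_registered"
```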

User access/authentication

When working with clients, being able to set up user accounts out of the box is invaluable for letting them see the data. This might mean giving them access to the admin area, or setting up some views so they can see what the outputs will look like, or filter and download the data themselves.
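For example, a simple authenticated view that lets a client filter the data might look something like this (the view, template and field names are hypothetical):

```python
# views.py - a hedged sketch of a client-facing, login-protected listing
from django.contrib.auth.decorators import login_required
from django.shortcuts import render

from .models import Charity


@login_required
def charity_list(request):
    charities = Charity.objects.all()
    # Allow simple filtering via a query-string parameter
    q = request.GET.get("q")
    if q:
        charities = charities.filter(name__icontains=q)
    return render(request, "charities/list.html", {"charities": charities})
```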

Adding user authentication, particularly when data can be edited in the admin area, also means I can easily add an audit trail to see who changed the data and why.
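Django's admin already keeps a LogEntry row for every add, change and delete made through it, so a quick sketch of reviewing recent edits is just a query:

```python
# A quick sketch: list the 20 most recent edits made through the admin
from django.contrib.admin.models import LogEntry

recent = LogEntry.objects.select_related("user").order_by("-action_time")[:20]
for entry in recent:
    print(entry.action_time, entry.user, entry.object_repr, entry.get_change_message())
```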

Management commands

Often a data pipeline has a set of reusable scripts that don't fit neatly into the website paradigm Django is built around. This is where management commands come in - you can easily define Python scripts that run against your data. These are useful for data management tasks like imports, exports or bulk cleaning processes.

I've started making use of django-click which I slightly prefer over the native django commands.

My one gripe with management commands is that a bit of work is needed to set up logging, transaction management and so on - but it is certainly possible.
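As a sketch, a typical import command with logging and a wrapping transaction might look like this (the command name, argument and import function are hypothetical):

```python
# management/commands/import_register.py - a hedged sketch of a data import command
import logging

from django.core.management.base import BaseCommand
from django.db import transaction

logger = logging.getLogger(__name__)


class Command(BaseCommand):
    help = "Import the latest register of charities"

    def add_arguments(self, parser):
        parser.add_argument("source_url")

    def handle(self, *args, **options):
        logger.info("Starting import from %s", options["source_url"])
        # Run the whole import in one transaction so a failed run leaves nothing half-imported
        with transaction.atomic():
            self.import_register(options["source_url"])
        logger.info("Import finished")

    def import_register(self, source_url):
        # Fetch, parse and load the data here
        ...
```

It then runs with python manage.py import_register followed by the source URL, which also makes it easy to call from a scheduler.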

What I add with add-ons

Although Django comes pretty fully featured out of the box, there are a few add-ons and other tools that I typically install at the same time. These add specific functionality that isn't present in standard Django, particularly features useful for working with data.

Django SQL Dashboard

Django SQL Dashboard is a great add-on created by Simon Willison, the creator of Datasette (and one of the co-creators of Django itself!).

The plugin basically lets you write custom SQL queries against your database, which are then displayed as webpages. These are great for making always-updated dashboards showing the latest state of your data, without needing to wrangle the Django models into the right shape for a custom view (creating aggregate queries with Django models can be a bit fiddly).

I use these extensively with findthatcharity.uk - for example, here's a dashboard showing how up-to-date the Charity register is.

I've actually created my own version of the plugin, which adds a bit more customisability, for example by allowing titles and descriptions for individual queries within a dashboard.

There's a bit of overhead to setting this up - you need to be using a PostgreSQL database, you need to create a read-only SQL user, and so on. And it does have potential security implications if not set up correctly.

Dokku for deployment

Obviously, if you're running Django partly because it serves web pages, you need to be able to deploy it to the web. I use Dokku for this - it's like running your own Heroku instance.

Dokku makes it easy to add things like a database (I generally use PostgreSQL) or even a search engine like Elasticsearch. I can easily test changes locally, then deploy with a simple git push.

I also use Dokku's scheduling feature to run scripts periodically - for example, a nightly data import.

Django Ninja API

This one is slightly rarer, but it can be useful to have an API for accessing the data. I really like Django Ninja for defining APIs. It fits nicely with the OpenAPI standard, which means you get nice documentation and testing out of the box. I generally add some sort of API-key-based authentication to make it easier to share with others.
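A minimal sketch of what that can look like with a header-based API key - the endpoint, key check and model here are hypothetical placeholders:

```python
# api.py - a sketch of a Django Ninja API protected by an API key header
from typing import Optional

from ninja import NinjaAPI
from ninja.security import APIKeyHeader

from .models import Charity


class ApiKey(APIKeyHeader):
    param_name = "X-API-Key"

    def authenticate(self, request, key):
        # Replace this hard-coded check with a lookup against stored keys
        return key if key == "a-secret-key" else None


api = NinjaAPI(auth=ApiKey())


@api.get("/charities")
def list_charities(request, q: Optional[str] = None):
    charities = Charity.objects.all()
    if q:
        charities = charities.filter(name__icontains=q)
    return [{"reg_number": c.reg_number, "name": c.name} for c in charities]
```

Wire api.urls into your urls.py and, with the default settings, the interactive documentation appears under the API's /docs path.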

What's missing / what are the issues

Although it's my tool of choice for data pipelines, Django isn't perfect for this purpose (which, after all, it wasn't specifically designed for). There are a few gaps and downsides to this approach:

No easy ways of getting data out

Often the output of a data pipeline is a bulk output that needs to go somewhere else - a spreadsheet download, a set of tables for a report, and so on. I usually end up producing some Django views to output or download these, but it can quickly get complex or inflexible.

One approach that I've had some success with is adding functionality to export (and import) from admin views. There are some add-ons that do this, or it's relatively straightforward to create your own using custom admin templates.
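As a sketch, a reusable CSV export action can be as small as this - attach it to any ModelAdmin via its actions list:

```python
# A sketch of a generic "export selected rows to CSV" admin action
import csv

from django.contrib import admin
from django.http import HttpResponse


@admin.action(description="Export selected rows to CSV")
def export_as_csv(modeladmin, request, queryset):
    field_names = [f.name for f in modeladmin.model._meta.fields]
    response = HttpResponse(content_type="text/csv")
    response["Content-Disposition"] = "attachment; filename=export.csv"
    writer = csv.writer(response)
    writer.writerow(field_names)
    for obj in queryset:
        writer.writerow([getattr(obj, name) for name in field_names])
    return response
```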

Overheads are quite high

This approach does mean quite high setup costs compared to something like a Python notebook. For a full deployment you need a server, a domain name and a database. You need to manage SSL certificates, users, downtime and so on, and you need command-line access to the server to run scripts. This is all doable, particularly with something like Dokku, but it does require a bit of effort.

Dealing with large datasets

Django, particularly in terms of data models, is designed around manually creating and editing individual instances of a model. In a lot of my work I'm dealing instead with bulk data imported from a source - sometimes running to hundreds of thousands or even millions of rows.

While this is all possible to handle, and you can bypass the ORM to a certain extent, it can be tricky when running bulk updates and imports. For example, a pattern I've not quite cracked is bulk upserts - where a row is either created or updated depending on whether it already exists. This is possible to do, but can be a bit fiddly.
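For what it's worth, one route on PostgreSQL is bulk_create with update_conflicts (Django 4.1+). A rough sketch, assuming the hypothetical Charity model from earlier with a unique reg_number field:

```python
# A sketch of a bulk upsert using bulk_create's update_conflicts (Django 4.1+, PostgreSQL)
from .models import Charity


def upsert_charities(records):
    # records: an iterable of dicts parsed from the source file
    objs = [Charity(reg_number=r["reg_number"], name=r["name"]) for r in records]
    Charity.objects.bulk_create(
        objs,
        batch_size=1000,
        update_conflicts=True,
        unique_fields=["reg_number"],
        update_fields=["name"],
    )
```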


I hope I've successfully shown how Django can work well as a tool for managing data pipelines. I have a cookiecutter template that forms the basis for a new Django project, which shows some of this approach in practice.