The Fastest Way to Load Data Into Your Django Project using PostgreSQL
tl;dr: Load data up to 77x faster with django-postgres-copy and an in-memory csv. Detailed results are at the end of the post.
When starting a new Django project, my first step is often to create a model and bulk load some existing data. As I've learned more about Django and databases, I've found a few ways to speed up the data loading process. In this post I'll walk through progressively more efficient ways of loading data, and at the end measure the performance of each method using fake data generated by the wonderful Faker library.
Setup: Some Fake Data
We start with a Django model composed of several different types of data:
class Thing(models.Model):
    char = models.CharField(max_length=100)
    text = models.TextField()
    integer = models.IntegerField()
    float = models.FloatField()
    boolean = models.BooleanField()
Using Faker, we're going to generate some fake data to populate our database of Things.
from faker import Faker

fake = Faker()

def fake_thing():
    return dict(
        char=fake.name(),
        text=fake.text(),
        integer=fake.pyint(),
        float=fake.pyfloat(),
        boolean=fake.pybool(),
    )

n_things = 100
things = [fake_thing() for _ in range(n_things)]
This data generation process is run once at the beginning of the script, so the same data is being inserted by each method for a fair comparison.
Method 1: .save() on everything
One of the first things I learned in Django is how to save an object to the database using .save(). With that initial knowledge, I used to save a bunch of things to the database with a loop, like this:
for thing in things:
    t = Thing(**thing)
    t.save()
This way totally works! But it's slow. The issue here is that Django runs in autocommit mode by default, so it completes a full database transaction every time an object is saved.
An analogy might help. If we were going to deposit $100 into the bank, autocommit is like forcing the bank teller to deposit our money dollar by dollar, printing a confirmation receipt after each $1 deposit and doing this 100 times. At a large enough scale, this behavior could be really helpful in preventing issues if thousands of people are making deposits of different amounts all at once. In our case we're the only user of the bank and we want to make one large deposit, so it's slowing us down.
Method 2: Using a Transaction
We want all of the objects to be saved at once, or not at all if there's an error, which means we want a single database transaction to occur. With Django, our code doesn't need to change much to use a transaction: we just wrap our same loop in the atomic() transaction context manager:
from django.db import transaction

with transaction.atomic():
    for thing in things:
        t = Thing(**thing)
        t.save()
Transaction control is a really powerful tool, so head to the Django docs to learn more.
Wrapping it in a transaction means that if there’s an error in saving any of the things in our list, none of them will be saved to the database.
Back to the analogy, this is like giving the teller instructions to deposit the money dollar by dollar, but to only give us a single receipt at the end, if the whole transaction succeeds. It's more efficient, but we still want a single $100 deposit, not 100 individual $1 deposits.
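As a side note, atomic() also works as a decorator, which is handy when a whole function should run inside one transaction. A minimal sketch (the load_things helper name is just for illustration):

from django.db import transaction

@transaction.atomic
def load_things(things):
    # Everything inside this function runs in a single transaction:
    # if any save() raises, none of the Things are committed.
    for thing in things:
        Thing(**thing).save()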
Method 3: bulk_create
Wrapping everything in a single transaction helps, but there's still more room for improvement. With Django models, bulk_create is an efficient way to create a bunch of objects all at once in a single query. To use it, your code will look something like this:
thing_objects = []
for thing in things:
    t = Thing(**thing)
    thing_objects.append(t)

Thing.objects.bulk_create(thing_objects)
The workflow here is slightly different: you don't call save on each object, but instead store all the objects you want to create in a list, then pass that list to the bulk_create method on the model manager.
There are some caveats to bulk_create to consider. The one I encounter most is that it doesn't call the save method on your models or send any signals. If you already have an application with existing data, or complex business logic that relies on custom save methods or signals, you'll want to think a little more about how your data gets created.
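To make that caveat concrete, here's a minimal sketch with a hypothetical AuditedThing model, whose custom save() is silently bypassed by bulk_create:

class AuditedThing(models.Model):
    char = models.CharField(max_length=100)
    saved_count = models.IntegerField(default=0)

    def save(self, *args, **kwargs):
        # Bookkeeping that runs on every .save() call.
        self.saved_count += 1
        super().save(*args, **kwargs)

# The .save() path runs the override, so saved_count becomes 1.
obj = AuditedThing(char="via save")
obj.save()

# The bulk_create path skips the override entirely: saved_count
# stays 0, and no pre_save/post_save signals are sent either.
AuditedThing.objects.bulk_create([AuditedThing(char="via bulk_create")])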
At the Bank of Django, bulk_create is probably the closest mental model to how I imagine a deposit working. We give the teller $100, and the teller adds $100 to our account and gives us a receipt indicating the full amount has been deposited all at once.
Method 4: PostgreSQL’s COPY command with django-postgres-copy
If you aren’t using a PostgreSQL database, bulk_create
is usually the best you can get. If you are using PostgreSQL, it has a COPY
command, which is described well in the docs:
COPY
moves data between PostgreSQL tables and standard file-system files.
Thankfully, some amazing people wrote django-postgres-copy to make using COPY with Django super easy. Using this library requires a bit of setup and some helper functions to get started, but the import speedup is worth it.
After installing, we'll need to add the CopyManager to our Thing model, which now looks like this:
from postgres_copy import CopyManager

class Thing(models.Model):
    char = models.CharField(max_length=100)
    text = models.TextField()
    integer = models.IntegerField()
    float = models.FloatField()
    boolean = models.BooleanField()

    objects = CopyManager()
This gives us a from_csv method on the model manager that loads data from a csv. But things isn't a csv, it's a list of dictionaries. I don't want to clog up my filesystem with a bunch of csv files just to import them, so we'll write a helper function that creates an in-memory csv instead.
from io import StringIO

import pandas as pd

def in_memory_csv(data):
    """Creates an in-memory csv.

    Assumes `data` is a list of dicts
    with native python types."""
    mem_csv = StringIO()
    pd.DataFrame(data).to_csv(mem_csv, index=False)
    mem_csv.seek(0)
    return mem_csv
The project I was working in already had pandas as a dependency, so I used it for ease. If you're thinking that using pandas to write a csv is overkill, it's also easy to write one with the built-in csv module, as shown below.
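Here's a pandas-free version of the same helper, a minimal sketch that assumes every dict in data shares the same keys:

import csv
from io import StringIO

def in_memory_csv(data):
    """Creates an in-memory csv using only the standard library.

    Assumes `data` is a non-empty list of dicts
    that all share the same keys."""
    mem_csv = StringIO()
    writer = csv.DictWriter(mem_csv, fieldnames=list(data[0]))
    writer.writeheader()
    writer.writerows(data)
    mem_csv.seek(0)
    return mem_csv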
Now that we have this helper function to create an in-memory csv of our things, let’s write the code to insert this data into the database.
from contextlib import closing

mem_csv = in_memory_csv(things)
with closing(mem_csv) as csv_io:
    Thing.objects.from_csv(csv_io)
Here we use the closing context manager, which closes our in-memory csv when we're done with it. If we don't close the object and call this multiple times, we could run out of memory.
This speedup does come at a cost: we're skipping some of the work (checking database constraints, updating indexes) that happens behind the scenes when we use save or bulk_create. The django-postgres-copy docs describe options that you can turn on and off when you import.
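One such option is the mapping argument to from_csv, which maps model field names to csv column headers. A hedged sketch, worth checking against the docs for your installed version; ours is an identity mapping, since in_memory_csv writes headers straight from the dict keys:

from contextlib import closing

mem_csv = in_memory_csv(things)
with closing(mem_csv) as csv_io:
    Thing.objects.from_csv(
        csv_io,
        # Maps model fields (keys) to csv column headers (values).
        mapping=dict(
            char="char",
            text="text",
            integer="integer",
            float="float",
            boolean="boolean",
        ),
    )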
Our bank analogy breaks down a bit here, but I'll give it a shot. Using COPY in this way is like avoiding the teller altogether and telling your bank: "I am depositing $100. Here's a copy of what my account will look like after I deposit my money. Make it look like this."
Performance Comparison
How much faster is using COPY via django-postgres-copy? Anywhere from 34x to 77x faster than save(), and 3.4x to 5.8x faster than bulk_create().
Detailed results are below.
THINGS: 1000
Method 01: 0.88s. Speedup: 1.00
Method 02: 0.56s. Speedup: 1.56
Method 03: 0.09s. Speedup: 10.03
Method 04: 0.03s. Speedup: 34.51
THINGS: 10000
Method 01: 8.56s. Speedup: 1.00
Method 02: 5.32s. Speedup: 1.61
Method 03: 0.66s. Speedup: 13.07
Method 04: 0.15s. Speedup: 58.11
THINGS: 50000
Method 01: 42.88s. Speedup: 1.00
Method 02: 27.09s. Speedup: 1.58
Method 03: 3.48s. Speedup: 12.31
Method 04: 0.65s. Speedup: 65.91
THINGS: 100000
Method 01: 97.13s. Speedup: 1.00
Method 02: 61.38s. Speedup: 1.58
Method 03: 7.21s. Speedup: 13.48
Method 04: 1.26s. Speedup: 77.15
THINGS: 500000
Method 01: 490.28s. Speedup: 1.00
Method 02: 293.63s. Speedup: 1.67
Method 03: 46.82s. Speedup: 10.47
Method 04: 8.03s. Speedup: 61.03
Testing Info
This was tested locally on my 2018 MacBook Pro, using Docker with the postgres:11.4 image for the database and the following requirements.txt for the Django app:
Django==2.2.5
django-postgres-copy==2.4.2
Faker==2.0.2
pandas==0.25.1
A gist of the management command used for the comparison can be found here.
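If you just want the overall shape of that command, here's a rough sketch of a timing harness; the method_01 through method_04 names stand in for the four loading approaches above, and the app layout is assumed:

import time

from django.core.management.base import BaseCommand

from myapp.models import Thing  # hypothetical app layout

class Command(BaseCommand):
    help = "Compare data loading methods."

    def handle(self, *args, **options):
        things = [fake_thing() for _ in range(n_things)]  # from the setup above
        methods = [
            ("Method 01", method_01),  # .save() in a loop
            ("Method 02", method_02),  # loop inside transaction.atomic()
            ("Method 03", method_03),  # bulk_create
            ("Method 04", method_04),  # COPY via django-postgres-copy
        ]
        baseline = None
        for name, method in methods:
            Thing.objects.all().delete()  # start each run from an empty table
            start = time.perf_counter()
            method(things)
            elapsed = time.perf_counter() - start
            baseline = baseline or elapsed
            self.stdout.write(
                f"{name}: {elapsed:.2f}s. Speedup: {baseline / elapsed:.2f}"
            )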