The Fastest Way to Load Data Into Your Django Project using PostgreSQL
tl;dr: Load data up to 77x faster with django-postgres-copy and an in-memory csv. Detailed results are at the end of the post.
When starting a new Django project, my first step is often to create a model and bulk load some existing data. As I've learned more about Django and databases, I've found a few ways to speed up the data loading process. In this post I'll walk through progressively more efficient ways of loading data, and at the end measure the performance of each method using fake data generated by the wonderful Faker library.
Setup: Some Fake Data
We start with a Django model composed of several different types of data:
class Thing(models.Model):
    char = models.CharField(max_length=100)
    text = models.TextField()
    integer = models.IntegerField()
    float = models.FloatField()
    boolean = models.BooleanField()
Using Faker, we're going to generate some fake data to populate our database of Things.
from faker import Faker

fake = Faker()

def fake_thing():
    return dict(
        char=fake.name(),
        text=fake.text(),
        integer=fake.pyint(),
        float=fake.pyfloat(),
        boolean=fake.pybool(),
    )

n_things = 100
things = [fake_thing() for _ in range(n_things)]
This data generation process is run once at the beginning of the script, so the same data is being inserted by each method for a fair comparison.
Method 1: .save() on everything
One of the first things I learned in Django is how to save an object to the database using .save(). With that initial knowledge, I used to save a bunch of things to the database with a loop, like this:
for thing in things:
    t = Thing(**thing)
    t.save()
This way totally works! But it's slow. The issue here is that Django runs in autocommit mode by default, so it completes a full database transaction every time an object is saved.
An analogy might help. If we were going to deposit $100 into the bank, autocommit is like forcing the bank teller to deposit our money dollar by dollar, printing a confirmation receipt after each $1 deposit and doing this 100 times. At a large enough scale, this behavior could be really helpful in preventing issues if thousands of people are making deposits of different amounts all at once. In our case we're the only user of the bank and we want to make one large deposit, so it's slowing us down.
Method 2: Using a Transaction
We want all of the objects to be saved at once, or not at all if there's an error, which means we want a single database transaction to occur. With Django, our code doesn't need to change much to use a transaction: we just wrap our same loop in the atomic() transaction context manager:
from django.db import transaction

with transaction.atomic():
    for thing in things:
        t = Thing(**thing)
        t.save()
Transaction control is a really powerful tool, so head to the Django docs to learn more.
Wrapping it in a transaction means that if there’s an error in saving any of the things in our list, none of them will be saved to the database.
Back to the analogy, this is like giving the teller instructions to deposit the money dollar by dollar, but to only give us a single receipt at the end, if the whole transaction succeeds. It's more efficient, but we still want a single $100 deposit, not 100 individual $1 deposits.
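As a side note, atomic() also works as a decorator, which is handy when a whole function should run inside one transaction. A minimal sketch (the load_things helper name is just for illustration):

from django.db import transaction

@transaction.atomic
def load_things(things):
    # Everything inside this function runs in a single transaction:
    # if any save() raises, none of the Things are committed.
    for thing in things:
        Thing(**thing).save()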
Method 3: bulk_create
Wrapping everything in a single transaction helps, but there's still more room for improvement. With Django models, bulk_create is an efficient way to create a bunch of objects all at once in a single query. To use it, your code will look something like this:
thing_objects = []
for thing in things:
    t = Thing(**thing)
    thing_objects.append(t)

Thing.objects.bulk_create(thing_objects)
The workflow here is slightly different: you don't call save on each object, but instead store all the objects you want to create in a list, then pass that list to the bulk_create method on the model manager.
There are some caveats to bulk_create to consider. The one I encounter most is that it doesn't call the save method on your models or send any signals. If you already have an application with existing data, or complex business logic that relies on custom save methods or signals, you'll want to think a little more about how your data gets created.
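To make that caveat concrete, here's a minimal sketch with a hypothetical AuditedThing model, whose custom save() is silently bypassed by bulk_create:

class AuditedThing(models.Model):
    char = models.CharField(max_length=100)
    saved_count = models.IntegerField(default=0)

    def save(self, *args, **kwargs):
        # Bookkeeping that runs on every .save() call.
        self.saved_count += 1
        super().save(*args, **kwargs)

# The .save() path runs the override, so saved_count becomes 1.
obj = AuditedThing(char="via save")
obj.save()

# The bulk_create path skips the override entirely: saved_count
# stays 0, and no pre_save/post_save signals are sent either.
AuditedThing.objects.bulk_create([AuditedThing(char="via bulk_create")])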
At the Bank of Django, bulk_create is probably the closest mental model to how I imagine a deposit working. We give the teller $100, and the teller adds $100 to our account and gives us a receipt indicating the full amount has been deposited all at once.
Method 4: PostgreSQL’s COPY command with django-postgres-copy
If you aren’t using a PostgreSQL database, bulk_create
is usually the best you can get. If you are using PostgreSQL, it has a COPY
command, which is described well in the docs:
COPY
moves data between PostgreSQL tables and standard file-system files.
Thankfully, some amazing people wrote django-postgres-copy to make using COPY with Django super easy. Using this library requires a bit of setup and some helper functions to get started, but the import speedup is worth it.
After installing, we'll need to add the CopyManager to our Thing model, which now looks like this:
from postgres_copy import CopyManager

class Thing(models.Model):
    char = models.CharField(max_length=100)
    text = models.TextField()
    integer = models.IntegerField()
    float = models.FloatField()
    boolean = models.BooleanField()

    objects = CopyManager()
This gives us a from_csv method on the model manager that loads data from a csv. But things isn't a csv, it's a list of dictionaries. I don't want to clog up my filesystem with a bunch of csv files just to import them, so we'll write a helper function that creates an in-memory csv instead.
from io import StringIO

import pandas as pd

def in_memory_csv(data):
    """Creates an in-memory csv.

    Assumes `data` is a list of dicts
    with native python types."""
    mem_csv = StringIO()
    pd.DataFrame(data).to_csv(mem_csv, index=False)
    mem_csv.seek(0)
    return mem_csv
The project I was working in already had pandas as a dependency, so I used it for ease. If you're thinking that using pandas to write a csv is overkill, it's also easy to write one with the built-in csv module, as shown below.
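Here's a pandas-free version of the same helper, a minimal sketch that assumes every dict in data shares the same keys:

import csv
from io import StringIO

def in_memory_csv(data):
    """Creates an in-memory csv using only the standard library.

    Assumes `data` is a non-empty list of dicts
    that all share the same keys."""
    mem_csv = StringIO()
    writer = csv.DictWriter(mem_csv, fieldnames=list(data[0]))
    writer.writeheader()
    writer.writerows(data)
    mem_csv.seek(0)
    return mem_csv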
Now that we have this helper function to create an in-memory csv of our things, let’s write the code to insert this data into the database.
from contextlib import closing

mem_csv = in_memory_csv(things)
with closing(mem_csv) as csv_io:
    Thing.objects.from_csv(csv_io)
Here we use the closing context manager, which closes our in-memory csv when we're done with it. If we don't close the object and call this multiple times, we could run out of memory.
This speedup does come at a cost: we're skipping some of the work (checking database constraints, updating indexes) that happens behind the scenes when we use save or bulk_create. The django-postgres-copy docs describe options that you can turn on and off when you import.
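One such option is the mapping argument to from_csv, which maps model field names to csv column headers. A hedged sketch, worth checking against the docs for your installed version; ours is an identity mapping, since in_memory_csv writes headers straight from the dict keys:

from contextlib import closing

mem_csv = in_memory_csv(things)
with closing(mem_csv) as csv_io:
    Thing.objects.from_csv(
        csv_io,
        # Maps model fields (keys) to csv column headers (values).
        mapping=dict(
            char="char",
            text="text",
            integer="integer",
            float="float",
            boolean="boolean",
        ),
    )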
Our bank analogy breaks down a bit here, but I'll give it a shot. Using COPY in this way is like avoiding the teller altogether and telling your bank: "I am depositing $100. Here's a copy of what my account will look like after I deposit my money. Make it look like this."
Performance Comparison
How much faster is using COPY via django-postgres-copy? Anywhere from 34x to 77x faster than save(), and 3.4x to 5.8x faster than bulk_create().
Detailed results are below.
THINGS: 1000
Method 01: 0.88s. Speedup: 1.00
Method 02: 0.56s. Speedup: 1.56
Method 03: 0.09s. Speedup: 10.03
Method 04: 0.03s. Speedup: 34.51
THINGS: 10000
Method 01: 8.56s. Speedup: 1.00
Method 02: 5.32s. Speedup: 1.61
Method 03: 0.66s. Speedup: 13.07
Method 04: 0.15s. Speedup: 58.11
THINGS: 50000
Method 01: 42.88s. Speedup: 1.00
Method 02: 27.09s. Speedup: 1.58
Method 03: 3.48s. Speedup: 12.31
Method 04: 0.65s. Speedup: 65.91
THINGS: 100000
Method 01: 97.13s. Speedup: 1.00
Method 02: 61.38s. Speedup: 1.58
Method 03: 7.21s. Speedup: 13.48
Method 04: 1.26s. Speedup: 77.15
THINGS: 500000
Method 01: 490.28s. Speedup: 1.00
Method 02: 293.63s. Speedup: 1.67
Method 03: 46.82s. Speedup: 10.47
Method 04: 8.03s. Speedup: 61.03
Testing Info
This was tested locally on my 2018 MacBook Pro, using Docker with the postgres:11.4 image for the database and the following requirements.txt for the Django app:
Django==2.2.5
django-postgres-copy==2.4.2
Faker==2.0.2
pandas==0.25.1
A gist of the management command used for the comparison can be found here.
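If you just want the overall shape of that command, here's a rough sketch of a timing harness; the method_01 through method_04 names stand in for the four loading approaches above, and the app layout is assumed:

import time

from django.core.management.base import BaseCommand

from myapp.models import Thing  # hypothetical app layout

class Command(BaseCommand):
    help = "Compare data loading methods."

    def handle(self, *args, **options):
        things = [fake_thing() for _ in range(n_things)]  # from the setup above
        methods = [
            ("Method 01", method_01),  # .save() in a loop
            ("Method 02", method_02),  # loop inside transaction.atomic()
            ("Method 03", method_03),  # bulk_create
            ("Method 04", method_04),  # COPY via django-postgres-copy
        ]
        baseline = None
        for name, method in methods:
            Thing.objects.all().delete()  # start each run from an empty table
            start = time.perf_counter()
            method(things)
            elapsed = time.perf_counter() - start
            baseline = baseline or elapsed
            self.stdout.write(
                f"{name}: {elapsed:.2f}s. Speedup: {baseline / elapsed:.2f}"
            )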