Concurrent setups are helpful for scalability and performance, but they can bring complexities for consistency. Not all tasks benefit from concurrency though. Running database migrations, for example, is a task where consistency and controlled execution are more important than concurrent execution, if concurrency is desired at all.

In a distributed setup, how can database migrations run safely, with as little chance of race conditions as possible? Migrations need to run uninterrupted, and they need to be guaranteed to run under a single operator at any given time. A locking mechanism could help with that. If the lock is an in-process data structure though, it can't protect against parallel processes; in such setups the lock needs to live in a shared network resource. To avoid adding yet another piece of infrastructure to maintain, the database itself is a good candidate for holding the lock.
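
To make the idea more concrete before getting into the full setup, here is a minimal sketch of holding such a lock in the database. It uses PostgreSQL advisory locks (PostgreSQL is the database in the concrete example below) with the psycopg2 driver; the connection string and the lock id 4242 are made up for illustration:

# hold a PostgreSQL advisory lock around work that must not run concurrently
import psycopg2

LOCK_ID = 4242  # arbitrary id the application picks for its lock

# Connection details are hypothetical; in practice they'd come from the app settings.
conn = psycopg2.connect("dbname=myapp user=myapp")
conn.autocommit = True

with conn.cursor() as cursor:
    # pg_advisory_lock blocks until no other session holds this id,
    # so only one process at a time gets past this line.
    cursor.execute("SELECT pg_advisory_lock(%s)", [LOCK_ID])
    try:
        print("lock acquired; safe to do the one-at-a-time work here")
    finally:
        # The lock is held by the session until released explicitly (or the session ends).
        cursor.execute("SELECT pg_advisory_unlock(%s)", [LOCK_ID])

conn.close()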

For a concrete example I'm going to use a Django web application deployed on a Kubernetes cluster, and PostgreSQL for the database.

Note: The examples here may not reflect the best approach and may leave out some details, but they should be enough to demonstrate the idea.

Consider a setup with multiple pods running the web app, where the replicas are managed by a Kubernetes deployment with a simple rollout process. To release the latest changes, the CI/CD pipeline builds the app image, pushes it to a registry, and then runs this command to start new pods running the latest version while gradually stopping the previous ones:

kubectl rollout restart deploy/myapp

The definition of the deployment to run the app may look like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: myapp
  name: myapp
spec:
  replicas: 2  # or more
  selector:
    matchLabels:
      app.kubernetes.io/name: myapp
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app.kubernetes.io/name: myapp
    spec:
      containers:
        - name: myapp
          command: ["/bin/sh"]
          args:
            - "-c"
            - poetry run gunicorn myapp.wsgi:application --bind 0.0.0.0:8000
          image: myapp:latest
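          # Always pull, so a rollout restart picks up the newly pushed :latest image.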
          imagePullPolicy: Always
          ports:
            - containerPort: 8000

Now the app server runs in the main container of the pods. Where should the database migrations run?

It makes sense to run the migrations right before the app server starts, in the same pod definition, so that each rollout runs the migrations and starts the app server in a single step. However, the main container isn't the best place for that:

containers:
  - name: myapp
    command: ["/bin/sh"]
    args:
      - "-c"
      # not a good idea
      - poetry run python manage.py migrate && poetry run gunicorn myapp.wsgi:application --bind 0.0.0.0:8000

The problem with this setup is that a liveness probe monitoring the web app could interrupt running migrations. The same container that runs the app, and should therefore be monitored by a liveness probe, is also running the database migrations, which may take a long time to complete depending on the migration operations and the data size, leading kubelet to restart the pod in the middle of a migration.

Init containers are a better place to run the database migrations, as they're guaranteed to run to completion before the main container starts, and they're not affected by liveness probes.

apiVersion: apps/v1
kind: Deployment
...
template:
  ...
  spec:
    containers:
      - name: myapp
        ...
    initContainers:
      - name: myapp-migrations
        image: myapp:latest
        command: ["/bin/sh"]
        args:
          - "-c"
          - poetry run python manage.py migrate

Now the database migrations run in a proper place in our deployment.

By default, Django runs a migration's DDL operations in a single transaction (if supported by the RDBMS), but that alone doesn't stop multiple processes from starting the migrations at the same time and stepping on each other. By acquiring a common lock before running the migrations and releasing it afterwards, it can be guaranteed that multiple processes won't run the migrations concurrently, keeping the migration process simpler and better controlled.

One solution is to wrap the Django migrate command with another management command that uses an advisory lock.

The lock id can be configured via the app settings and overridden by a command line argument. This supports multiple apps sharing the same database cluster: each app can use its own lock id, since there's no need for one app's migrations to block another's when each app has its own database/schema.

# file myapp/management/commands/locked_migrate.py
from django.db import connections
from django.conf import settings
from django.core.management.commands.migrate import Command as MigrateCommand


DEFAULT_LOCK_ID = getattr(settings, "MIGRATE_LOCK_ID", 1000)  # just a random number


class Command(MigrateCommand):
    help = "Run Django migrations safely, using a lock"

    def add_arguments(self, parser):
        MigrateCommand.add_arguments(self, parser)
        parser.add_argument(
            "--migrate-lock-id",
            default=DEFAULT_LOCK_ID,
            type=int,
            help="The id of the advisory lock to use",
        )

    def handle(self, *args, **options):
        database = options["database"]
        if not options["skip_checks"]:
            self.check(databases=[database])

        # Get the database we're operating from
        connection = connections[database]
        # Hook for backends needing any database preparation
        connection.prepare_database()

        lock_id = options["migrate_lock_id"]
        with connection.cursor() as cursor:
            try:
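                # pg_advisory_lock blocks until no other session holds this id,
                # so concurrent runs of this command wait here for their turn.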
                cursor.execute(f"SELECT pg_advisory_lock({lock_id})")
                MigrateCommand.handle(self, *args, **options)
            finally:
                cursor.execute(f"SELECT pg_advisory_unlock({lock_id})")

This command extends the default migrate command, supporting the same options with the addition of the lock id, and wraps the default handle method, which runs the migrations, in a locked block. Since pg_advisory_lock blocks until the lock becomes available, any other process that starts the command at the same time simply waits for the first run to finish, and then finds the migrations already applied.
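
As a usage sketch, the command can be invoked like any other management command, from the command line (as the init container below does) or programmatically; the lock id 4242 here is an arbitrary value overriding the MIGRATE_LOCK_ID setting:

# e.g. from a Django shell, a script, or a test
from django.core.management import call_command

# Takes all the options of the built-in migrate command,
# plus the --migrate-lock-id option added above.
call_command("locked_migrate", migrate_lock_id=4242)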

Running locked_migrate instead of the default migrate command in the init container of the pods makes sure that only one container runs the migrations for each rollout, even when multiple pods are starting up.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: myapp
  name: myapp
spec:
  replicas: 2  # or more
  selector:
    matchLabels:
      app.kubernetes.io/name: myapp
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app.kubernetes.io/name: myapp
    spec:
      containers:
        - name: myapp
          command: ["/bin/sh"]
          args:
            - "-c"
            - poetry run gunicorn myapp.wsgi:application --bind 0.0.0.0:8000
          image: myapp:latest
          imagePullPolicy: Always
          ports:
            - containerPort: 8000
      initContainers:
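        # Runs to completion before the app container starts; the lock in
        # locked_migrate makes sure only one pod applies the migrations at a time.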
        - name: myapp-migrations
          command: ["/bin/sh"]
          args:
            - "-c"
            - poetry run python manage.py locked_migrate
          image: myapp:latest
          imagePullPolicy: Always