Taming the beast that is the Django ORM

man fighting dragon which represents the django orm

The material this blog post was originally developed from was a bunch of slides used for a skill share presentation I gave at my workplace @ coreplan.io.

EDIT: I shared this on hacker news and got a lot of positive feedback. See the HN post for a lively discussion.

I have 3+ years of experience with Django, with it being the main framework that underpins the backend of CorePlan’s main SaaS product. It is a mature, batteries included framework that has been around for a while now. One particular powerful yet dangerous feature of Django is the ORM. This is a Django specific ORM which cannot be separated from the rest of the framework. The other major python ORM is SQLAlchemy which can be used with other python web frameworks, but is an independent tool.

Below are some of the things that I have learned about the Django ORM, how it compares to raw SQL and gotchas that you should be aware of when using it.

What is an ORM (Object Relational Mapper)?

Abstraction over SQL to interact with databases

Code -> SQL

Hole.objects.all()

⬇️

SELECT * FROM drilling_hole;

Why use an ORM? - Pros

Abstraction over SQL, no need to write raw SQL (plus and minus)
Portability - Can change out database engines easily !?
- Probably not true, often will rely on db specific features e.g. postgres jsonb, triggers, etc
Direct mapping from db to models
Automatic schema generation
- Migrations are automatically generated
Security
- abstracts away enough that sql injection is less likely

Why use an ORM? - Cons

Abstraction over SQL…
- Hides the underlying SQL
- Can be difficult to debug
- Lazy loading can cause N+1 queries without the developer realising
- Harder to onboard new developers if they haven’t used Django before
Performance
- Generated sql be slower than crafted SQL

Fundamentals

Models = Tables

# drilling/models.py

from django.db import models
class Hole(models.Model):
    name = models.TextField()

CREATE TABLE drilling_hole (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100)
);

Migrations

python manage.py makemigrations # generate migration files
python manage.py migrate # apply migrations

python manage.py makemigrations drilling --empty # generate empty file for data migration for the `drilling` app

https://docs.djangoproject.com/en/dev/topics/migrations/

Querying

ActiveRecord pattern - ala Ruby on Rails style
QuerySets (Hole.objects.all())
- lazy
- chainable
- cached when iterated over multiple times !?
  - I would not recommend relying on this because it hard to comprehend when it is cached and when it is not when you are reading code

holes_qs = Hole.objects.filter(name="cheese") # not evaluated yet
holes_qs = holes_qs.filter(depth__gt=100) # still not evaluated

list(holes_qs) # evaluated
list(holes_qs) # cached

holes_qs[2] # not cached
holes_qs.first() # not cached
holes_qs.get(id=1) # not cached

WHERE

WHERE clause ≈ filter()

holes_qs = Hole.objects.filter(name="cheese")

⬇️

SELECT * 
FROM drilling_hole;
WHERE drilling_hole.name = 'cheese';

WHERE across tables?

But how do you do a left/inner join? With the ORM it isn’t done declaratively, but implicitly

class Hole(models.Model):
  name = models.TextField()
  pad = models.ForeignKey(Pad, on_delete=models.CASCADE)

class Pad(models.Model):
  name = models.TextField()

holes_qs = Hole.objects.filter(pad__name="cheese board")

⬇️

SELECT * 
FROM drilling_hole;
INNER JOIN drilling_pad ON drilling_hole.pad_id = drilling_pad.id
WHERE drilling_pad.name = 'cheese board';

WHERE other conditionals

filter(name="cheese") -> filter(name__exact="cheese") -> WHERE name = 'cheese'
filter(name__iexact="cheese") -> WHERE name ILIKE 'cheese'
filter(name__contains="cheese") -> WHERE name LIKE '%cheese%'
filter(name__icontains="cheese") -> WHERE name ILIKE '%cheese%'
filter(name__in=["cheese", "board"]) -> WHERE name IN ('cheese', 'board')
filter(name__gt=100) -> WHERE name > 100 etc
filter(name__isnull=True) -> WHERE name IS NULL
- At least for postgres shouldn’t name = None, null != null

AS

annotate ≈ AS

holes_qs = Hole.objects.annotate(this_thang=F("pad__name"))
hole = holes_qs.first()
print(hole.this_thang)

⬇️

SELECT 
  * , 
  drilling_pad.name AS this_thang
FROM drilling_hole;
INNER JOIN "drilling_pad" ON ("drilling_hole"."pad_id" = "drilling_pad"."id")

Subqueries

class Project(models.Model):
  name = models.TextField()

class Pad(models.Model):
  name = models.TextField()

class Hole(models.Model):
  name = models.TextField()
  pad = models.ForeignKey(Pad, on_delete=models.CASCADE)
  project = models.ForeignKey(Project, on_delete=models.CASCADE)

# find pads that are on project_id=1
hole_subquery = Hole.objects.filter(project_id=1).values("pk")
pad_qs = Pad.objects.filter(hole__in=Subquery(hole_subquery))

⬇️

SELECT "drilling_pad"."id",
 "drilling_pad"."name"
FROM "drilling_pad"
 INNER JOIN "drilling_hole" ON ("drilling_pad"."id" = "drilling_hole"."pad_id")
WHERE "drilling_hole"."id" IN (
  SELECT U0."id"
  FROM "drilling_hole" U0
  WHERE U0."project_id" = 1
 )

Correlated Subqueries

Correlated subqueries are where the inner query depends on outer query

class Pad(models.Model):
  name = models.TextField()

class Hole(models.Model):
  name = models.TextField()
  pad = models.ForeignKey(Pad, on_delete=models.CASCADE)

# include the hole id of any hole that has a foreign key to the pad
hole_subquery = Hole.objects.filter(pad_id=OuterRef("pk")).values("pk")
pad_qs = Pad.objects.annotate(hole_id=Subquery(hole_subquery))

⬇️

SELECT "drilling_pad"."id",
 "drilling_pad"."name",
 (
  SELECT U0."id"
  FROM "drilling_hole" U0
  WHERE U0."pad_id" = ("drilling_pad"."id")
 ) AS "hole_id"
FROM "drilling_pad"

Performance improvements

Reduce N+1
- You typically want to reduce N+1 queries because they have communication overhead
- select_related
- prefetch_related
- You also might choose to use annotate() instead of select_related because select related pulls all the data for the associated table when you might only need one column. That associated might have a jsonb column which contains a lot of unnecessary data that you don’t need.

select_related

holes = Hole.objects.all()
for hole in holes:
  print(hole.pad.name) # N+1 queries

holes = Hole.objects.select_related("pad")

for hole in holes:
  print(hole.pad.name) # no extra query

Many-to-many — prefetch_related

You would use prefetch related when you are not pulling a direct foreign key such a many-to-many relationship like below.

width:400px


class Faculty(models.Model):
  name = models.TextField()

class Course(models.Model):
  name = models.TextField()
  faculty = models.ForeignKey(Faculty, on_delete=models.CASCADE)

class Student(models.Model):
  name = models.TextField()
  courses = models.ManyToManyField(Course, through="Enrolment")

class Enrolment(models.Model):
  course = models.ForeignKey(Course, on_delete=models.CASCADE)
  student = models.ForeignKey(Student, on_delete=models.CASCADE)
  grade = models.FloatField()

students = Student.objects.prefetch_related("courses")

for student in students:
  for course in student.courses.all():
    print(course.name) # no extra query
    print(course.faculty.name) # extra query

students = Student.objects.prefetch_related(
  Prefetch(
    "courses", 
    queryset=Course.objects.select_related("faculty")
  )
)

for student in students:
  for course in student.courses.all():
    print(course.name) # no extra query
    print(course.faculty.name) # no extra query

to_attr

to_attr can be used to make “filtered” relationships available on the instance.

class Enrolment(models.Model):
  course = models.ForeignKey(Course, on_delete=models.CASCADE)
  student = models.ForeignKey(Student, on_delete=models.CASCADE)
  grade = models.FloatField()

students = Student.objects.prefetch_related(
  Prefetch(
    "course", 
    queryset=Course.objects.filter(grade__gt=80.0).select_related("faculty"), to_attr="hd_courses"
  )
)

for student in students:
  for course in student.hd_courses.all():
    ...

Multiple instances when filtering across many-to-many

One gotcha is selecting across a many-to-many relationship can return multiple of the same instances.

data model showing many to many relationship

Student.objects.filter(courses__faculty__name="Science") # inner join returns duplicated rows
<QuerySet [<Student: Student object (1)>, <Student: Student object (1)>]>

Student.objects.filter(courses__faculty__name="Science").distinct()
<QuerySet [<Student: Student object (1)>]>

SELECT
  "testing_student"."id", 
  "testing_student"."name" 
FROM 
  "testing_student" 
INNER JOIN 
  "testing_enrolment" 
ON 
  ("testing_student"."id" = "testing_enrolment"."student_id") 
INNER JOIN 
  "testing_course" 
ON 
  ("testing_enrolment"."course_id" = "testing_course"."id") 
INNER JOIN 
  "testing_faculty" 
ON 
  ("testing_course"."faculty_id" = "testing_faculty"."id")
WHERE 
  "testing_faculty"."name" = 'Science'

Gotchas and other Funky stuff

Model instances when retrieved will try to populate all columns
- get() or first()
- for hole in Hole.objects.all()
This can make migrations hard, as older workers will be requesting columns that might have been removed or renamed which will cause errors
There are ways to do down-timeless migrations but are bit funky and multi step
Recommendation is to avoid deleting or renaming columns

class Hole(models.Model):
  name = models.TextField()
  pad = models.ForeignKey(Pad, on_delete=models.CASCADE)

class Pad(models.Model):
  name = models.TextField()

holes_qs = Hole.objects.annotate(this_thang=F("pad__name")).get()

⬇️

SELECT 
  drilling_hole.name, -- pulls all columns 
  drilling_hole.pad_id,
  drilling_pad.name AS this_thang
FROM drilling_hole;
WHERE drilling_pad.name = 'cheese board';
LIMIT 1;

Values

So how do you to only retrieve certain columns?

class Hole(models.Model):
  name = models.TextField()
  pad = models.ForeignKey(Pad, on_delete=models.CASCADE)

holes_qs = Hole.objects.values("name")

for hole in holes_qs:
  print(type(hole)) # dict
  # not `Hole` object, hence no class functions, no lazy loading e.g. can't access `hole.pad.name`

⬇️

SELECT 
  drilling_hole.name, -- only pulls name and maps it to a python dictionary object
FROM drilling_hole;

Less data sent down the wire, but no lazy loading and no class functions as the data is a python dictionary

Other options

only() and defer()
Will retrieve model instances, but won’t retrieve all fields
Values not declared when accessed on the model are lazy loaded
Would not recommend to be used regularly, very high chance of N+1

holes_qs = Hole.objects.only("pad_id")

for hole in holes_qs:
  print(hole.pad_id) # no extra query
  print(hole.name) # name will be lazy loaded, N+1 queries

How do you know what SQL is being generated?

print(queryset.query)
Django Debug Toolbar
Kolo

Updating rows

There are three typical ways to update a row in the database.

class Hole(models.Model):
  name = models.TextField()

instance = Hole.objects.create(name="cheese")

# save()
instance.name = "board"
instance.save()

# update()
Model.objects.filter(name="board").update(name="board2")

# bulk_update()
instance.name = "board3"
instances_to_update = [instance]
Model.objects.bulk_update(instances_to_update, ["name"])

Problems with updates

update() and bulk_update() do not trigger save() method on the model
built in django signals (publish/subscribe pattern), there are post_save and pre_save signals which can be triggered when calling save()
- update() and bulk_update() do not trigger those signals…

Pagination / order_by
- Not a Django ORM thing, but a Django ORM hides the implementation detail, which may lead to unexpected result
- Page pagination is default in DRF list views and implemented with LIMIT and OFFSET in SQL

?page_size=10&page=3

SELECT * 
FROM drilling_hole 
LIMIT 10 
OFFSET 20;

Anything wrong with this query?

There is no deterministic guarantee that the same 10 rows will be returned each time.
A plain SELECT in postgres (may be different in different dbs) provides no guarantee of order, unless ORDER BY is specified
It often appears to return in insertion/id order, but that is not guaranteed in postgres
Model Meta ordering may set a default order, but sometimes tht is ignored
For list views you should to provide a deterministic order_by
order_by(name) is not enough if name is not unique
- order_by(name, id) is required, because id is unique
This can been the the cause of some flaky tests issues where lists are returned seemingly in insertion order and asserted to return in id order

Thanks for reading! I hope this has been useful to you. There are definitely more particularities and gotchas to be aware of when using the Django ORM and Django in general but I think these are the most common ones.

David Hang

Taming the beast that is the Django ORM - An introduction

What is an ORM (Object Relational Mapper)?

Why use an ORM? - Pros

Why use an ORM? - Cons

Fundamentals

Migrations

Querying

WHERE

WHERE across tables?

WHERE other conditionals

AS

Subqueries

Correlated Subqueries

Performance improvements

select_related

Many-to-many — prefetch_related

to_attr

Multiple instances when filtering across many-to-many

Gotchas and other Funky stuff

Values

Other options

How do you know what SQL is being generated?

Updating rows

Problems with updates

Taming the beast that is the Django ORM - An introduction

What is an ORM (Object Relational Mapper)?

Why use an ORM? - Pros

Why use an ORM? - Cons

Fundamentals

Migrations

Querying

WHERE

WHERE across tables?

WHERE other conditionals

AS

Subqueries

Correlated Subqueries

Performance improvements

select_related

Many-to-many — prefetch_related

to_attr

Multiple instances when filtering across many-to-many

Gotchas and other Funky stuff

Values

Other options

How do you know what SQL is being generated?

Updating rows

Problems with updates

Other semi-related things (non deterministic ordering)