David Hang

Any sufficiently advanced technology is indistinguishable from magic - Arthur C. Clarke
Previous

Taming the beast that is the Django ORM - An introduction


man fighting dragon which represents the django orm

The material this blog post was originally developed from was a bunch of slides used for a skill share presentation I gave at my workplace @ coreplan.io.

I have 3+ years of experience with Django, with it being the main framework that underpins the backend of CorePlanā€™s main SaaS product. It is a mature, batteries included framework that has been around for a while now. One particular powerful yet dangerous feature of Django is the ORM. This is a Django specific ORM which cannot be separated from the rest of the framework. The other major python ORM is SQLAlchemy which can be used with other python web frameworks, but is an independent tool.

Below are some of the things that I have learned about the Django ORM, how it compares to raw SQL and gotchas that you should be aware of when using it.


What is an ORM (Object Relational Mapper)?

  • Abstraction over SQL to interact with databases

Code -> SQL

Hole.objects.all()

ā¬‡ļø

SELECT * FROM drilling_hole;

Why use an ORM? - Pros

  • Abstraction over SQL, no need to write raw SQL (plus and minus)
  • Portability - Can change out database engines easily !?
    • Probably not true, often will rely on db specific features e.g. postgres jsonb, triggers, etc
  • Direct mapping from db to models
  • Automatic schema generation
    • Migrations are automatically generated
  • Security
    • abstracts away enough that sql injection is less likely

Why use an ORM? - Cons

  • Abstraction over SQLā€¦
    • Hides the underlying SQL
    • Can be difficult to debug
    • Lazy loading can cause N+1 queries without the developer realising
    • Harder to onboard new developers if they havenā€™t used Django before
  • Performance
    • Generated sql be slower than crafted SQL

Fundamentals

  • Models = Tables
# drilling/models.py

from django.db import models
class Hole(models.Model):
    name = models.TextField()
CREATE TABLE drilling_hole (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100)
);

Migrations

python manage.py makemigrations # generate migration files
python manage.py migrate # apply migrations

python manage.py makemigrations drilling --empty # generate empty file for data migration for the `drilling` app

https://docs.djangoproject.com/en/dev/topics/migrations/


Querying

  • ActiveRecord pattern - ala Ruby on Rails style
  • QuerySets (Hole.objects.all())
    • lazy
    • chainable
    • cached when iterated over multiple times !?
      • I would not recommend relying on this because it hard to comprehend when it is cached and when it is not when you are reading code
holes_qs = Hole.objects.filter(name="cheese") # not evaluated yet
holes_qs = holes_qs.filter(depth__gt=100) # still not evaluated

list(holes_qs) # evaluated
list(holes_qs) # cached

holes_qs[2] # not cached
holes_qs.first() # not cached
holes_qs.get(id=1) # not cached

WHERE

  • WHERE clause ā‰ˆ filter()
holes_qs = Hole.objects.filter(name="cheese")

ā¬‡ļø

SELECT * 
FROM drilling_hole;
WHERE drilling_hole.name = 'cheese';

WHERE across tables?

  • But how do you do a left/inner join? With the ORM it isnā€™t done declaratively, but implicitly
class Hole(models.Model):
  name = models.TextField()
  pad = models.ForeignKey(Pad, on_delete=models.CASCADE)

class Pad(models.Model):
  name = models.TextField()

holes_qs = Hole.objects.filter(pad__name="cheese board")

ā¬‡ļø

SELECT * 
FROM drilling_hole;
INNER JOIN drilling_pad ON drilling_hole.pad_id = drilling_pad.id
WHERE drilling_pad.name = 'cheese board';

WHERE other conditionals

  • filter(name="cheese") -> filter(name__exact="cheese") -> WHERE name = 'cheese'
  • filter(name__iexact="cheese") -> WHERE name ILIKE 'cheese'
  • filter(name__contains="cheese") -> WHERE name LIKE '%cheese%'
  • filter(name__icontains="cheese") -> WHERE name ILIKE '%cheese%'
  • filter(name__in=["cheese", "board"]) -> WHERE name IN ('cheese', 'board')
  • filter(name__gt=100) -> WHERE name > 100 etc
  • filter(name__isnull=True) -> WHERE name IS NULL
    • At least for postgres shouldnā€™t name = None, null != null

AS

  • annotate ā‰ˆ AS
holes_qs = Hole.objects.annotate(this_thang=F("pad__name"))
hole = holes_qs.first()
print(hole.this_thang)

ā¬‡ļø

SELECT 
  * , 
  drilling_pad.name AS this_thang
FROM drilling_hole;
INNER JOIN "drilling_pad" ON ("drilling_hole"."pad_id" = "drilling_pad"."id")


Subqueries

class Project(models.Model):
  name = models.TextField()

class Pad(models.Model):
  name = models.TextField()

class Hole(models.Model):
  name = models.TextField()
  pad = models.ForeignKey(Pad, on_delete=models.CASCADE)
  project = models.ForeignKey(Project, on_delete=models.CASCADE)

# find pads that are on project_id=1
hole_subquery = Hole.objects.filter(project_id=1).values("pk")
pad_qs = Pad.objects.filter(hole__in=Subquery(hole_subquery))

ā¬‡ļø

SELECT "drilling_pad"."id",
 "drilling_pad"."name"
FROM "drilling_pad"
 INNER JOIN "drilling_hole" ON ("drilling_pad"."id" = "drilling_hole"."pad_id")
WHERE "drilling_hole"."id" IN (
  SELECT U0."id"
  FROM "drilling_hole" U0
  WHERE U0."project_id" = 1
 )

Correlated Subqueries

Correlated subqueries are where the inner query depends on outer query

class Pad(models.Model):
  name = models.TextField()

class Hole(models.Model):
  name = models.TextField()
  pad = models.ForeignKey(Pad, on_delete=models.CASCADE)

# include the hole id of any hole that has a foreign key to the pad
hole_subquery = Hole.objects.filter(pad_id=OuterRef("pk")).values("pk")
pad_qs = Pad.objects.annotate(hole_id=Subquery(hole_subquery))

ā¬‡ļø

SELECT "drilling_pad"."id",
 "drilling_pad"."name",
 (
  SELECT U0."id"
  FROM "drilling_hole" U0
  WHERE U0."pad_id" = ("drilling_pad"."id")
 ) AS "hole_id"
FROM "drilling_pad"

Performance improvements

  • Reduce N+1
    • You typically want to reduce N+1 queries because they have communication overhead
    • select_related
    • prefetch_related
    • You also might choose to use annotate() instead of select_related because select related pulls all the data for the associated table when you might only need one column. That associated might have a jsonb column which contains a lot of unnecessary data that you donā€™t need.

holes = Hole.objects.all()
for hole in holes:
  print(hole.pad.name) # N+1 queries

holes = Hole.objects.select_related("pad")

for hole in holes:
  print(hole.pad.name) # no extra query


You would use prefetch related when you are not pulling a direct foreign key such a many-to-many relationship like below.

width:400px


class Faculty(models.Model):
  name = models.TextField()

class Course(models.Model):
  name = models.TextField()
  faculty = models.ForeignKey(Faculty, on_delete=models.CASCADE)

class Student(models.Model):
  name = models.TextField()
  courses = models.ManyToManyField(Course, through="Enrolment")

class Enrolment(models.Model):
  course = models.ForeignKey(Course, on_delete=models.CASCADE)
  student = models.ForeignKey(Student, on_delete=models.CASCADE)
  grade = models.FloatField()

students = Student.objects.prefetch_related("courses")

for student in students:
  for course in student.courses.all():
    print(course.name) # no extra query
    print(course.faculty.name) # extra query

students = Student.objects.prefetch_related(
  Prefetch(
    "courses", 
    queryset=Course.objects.select_related("faculty")
  )
)

for student in students:
  for course in student.courses.all():
    print(course.name) # no extra query
    print(course.faculty.name) # no extra query

to_attr

to_attr can be used to make ā€œfilteredā€ relationships available on the instance.

class Enrolment(models.Model):
  course = models.ForeignKey(Course, on_delete=models.CASCADE)
  student = models.ForeignKey(Student, on_delete=models.CASCADE)
  grade = models.FloatField()

students = Student.objects.prefetch_related(
  Prefetch(
    "course", 
    queryset=Course.objects.filter(grade__gt=80.0).select_related("faculty"), to_attr="hd_courses"
  )
)

for student in students:
  for course in student.hd_courses.all():
    ...

Multiple instances when filtering across many-to-many

One gotcha is selecting across a many-to-many relationship can return multiple of the same instances.

data model showing many to many relationship

Student.objects.filter(courses__faculty__name="Science") # inner join returns duplicated rows
<QuerySet [<Student: Student object (1)>, <Student: Student object (1)>]>

Student.objects.filter(courses__faculty__name="Science").distinct()
<QuerySet [<Student: Student object (1)>]>

SELECT
  "testing_student"."id", 
  "testing_student"."name" 
FROM 
  "testing_student" 
INNER JOIN 
  "testing_enrolment" 
ON 
  ("testing_student"."id" = "testing_enrolment"."student_id") 
INNER JOIN 
  "testing_course" 
ON 
  ("testing_enrolment"."course_id" = "testing_course"."id") 
INNER JOIN 
  "testing_faculty" 
ON 
  ("testing_course"."faculty_id" = "testing_faculty"."id")
WHERE 
  "testing_faculty"."name" = 'Science'

Gotchas and other Funky stuff

  • Model instances when retrieved will try to populate all columns
    • get() or first()
    • for hole in Hole.objects.all()
  • This can make migrations hard, as older workers will be requesting columns that might have been removed or renamed which will cause errors
  • There are ways to do down-timeless migrations but are bit funky and multi step
  • Recommendation is to avoid deleting or renaming columns
class Hole(models.Model):
  name = models.TextField()
  pad = models.ForeignKey(Pad, on_delete=models.CASCADE)

class Pad(models.Model):
  name = models.TextField()

holes_qs = Hole.objects.annotate(this_thang=F("pad__name")).get()

ā¬‡ļø

SELECT 
  drilling_hole.name, -- pulls all columns 
  drilling_hole.pad_id,
  drilling_pad.name AS this_thang
FROM drilling_hole;
WHERE drilling_pad.name = 'cheese board';
LIMIT 1;

Values

  • So how do you to only retrieve certain columns?
class Hole(models.Model):
  name = models.TextField()
  pad = models.ForeignKey(Pad, on_delete=models.CASCADE)

holes_qs = Hole.objects.values("name")

for hole in holes_qs:
  print(type(hole)) # dict
  # not `Hole` object, hence no class functions, no lazy loading e.g. can't access `hole.pad.name`

ā¬‡ļø

SELECT 
  drilling_hole.name, -- only pulls name and maps it to a python dictionary object
FROM drilling_hole;
  • Less data sent down the wire, but no lazy loading and no class functions as the data is a python dictionary

Other options

  • only() and defer()
  • Will retrieve model instances, but wonā€™t retrieve all fields
  • Values not declared when accessed on the model are lazy loaded
  • Would not recommend to be used regularly, very high chance of N+1
holes_qs = Hole.objects.only("pad_id")

for hole in holes_qs:
  print(hole.pad_id) # no extra query
  print(hole.name) # name will be lazy loaded, N+1 queries

How do you know what SQL is being generated?


Updating rows

There are three typical ways to update a row in the database.

class Hole(models.Model):
  name = models.TextField()

instance = Hole.objects.create(name="cheese")

# save()
instance.name = "board"
instance.save()

# update()
Model.objects.filter(name="board").update(name="board2")

# bulk_update()
instance.name = "board3"
instances_to_update = [instance]
Model.objects.bulk_update(instances_to_update, ["name"])

Problems with updates

  • update() and bulk_update() do not trigger save() method on the model
  • built in django signals (publish/subscribe pattern), there are post_save and pre_save signals which can be triggered when calling save()
    • update() and bulk_update() do not trigger those signalsā€¦

  • Pagination / order_by
    • Not a Django ORM thing, but a Django ORM hides the implementation detail, which may lead to unexpected result
    • Page pagination is default in DRF list views and implemented with LIMIT and OFFSET in SQL

?page_size=10&page=3

SELECT * 
FROM drilling_hole 
LIMIT 10 
OFFSET 20;

Anything wrong with this query?

  • There is no deterministic guarantee that the same 10 rows will be returned each time.
  • A plain SELECT in postgres (may be different in different dbs) provides no guarantee of order, unless ORDER BY is specified
  • It often appears to return in insertion/id order, but that is not guaranteed in postgres
  • Model Meta ordering may set a default order, but sometimes tht is ignored
  • For list views you should to provide a deterministic order_by
  • order_by(name) is not enough if name is not unique
    • order_by(name, id) is required, because id is unique
  • This can been the the cause of some flaky tests issues where lists are returned seemingly in insertion order and asserted to return in id order

Thanks for reading! I hope this has been useful to you. There are definitely more particularities and gotchas to be aware of when using the Django ORM and Django in general but I think these are the most common ones.

Previous
Previous