In Wagtail, pages exist within a hierarchical system, where each page has a single parent and potentially several children. The method of storing the page’s hierarchical ‘location’ in the database is slightly atypical. Instead of having a one-to-many relationship joining parents to their children, each page has a path, depth, and numchild field (I’ll explain these fields in a moment). While Wagtail offers some helper methods for querying between pages, they don’t easily facilitate more complicated querying beyond ‘show me all children for this page’.
How page hierarchy is tracked
As I mentioned in the introduction, Wagtail uses 3 fields to track a page’s position within the hierarchy:
- path
- depth
- numchild
Path
path is a numerical representation of a page’s position within the tree, where each ‘layer’ contributes a 4-digit number identifying a single page at that level. To illustrate this, I’ve prepared a diagram:
├── 0001
│ ├── 00010001
│ ├── 00010002
│ │ └── 000100020001
│ └── 00010003
├── 0002
│ ├── 00020001
│ └── 00020002
└── 0003
As you can see, page 0001 has 3 children: 00010001, 00010002, 00010003. Each new layer takes the numerical value of its parent and then adds a new 4-digit number to the end, representing the number of this child at the current tier. This is demonstrated at the third level, where we have a page 000100020001. You can see that the first 8 digits directly correspond to its ancestors.
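To make the structure concrete, here is a minimal sketch in plain Python (no database involved) that slices a path into its ancestors. The names `STEP_LENGTH`, `ancestor_paths` and `parent_path` are made up for this illustration and are not part of Wagtail; the sketch assumes every layer adds exactly 4 characters, as in the diagram.

```python
STEP_LENGTH = 4  # assumption: each layer adds 4 characters to the path


def ancestor_paths(path: str) -> list[str]:
    """Return the paths of all ancestors, from the top level down to the parent."""
    return [path[:end] for end in range(STEP_LENGTH, len(path), STEP_LENGTH)]


def parent_path(path: str) -> str:
    """Return the parent's path (an empty string for a top-level page)."""
    return path[:-STEP_LENGTH]


print(ancestor_paths("000100020001"))  # ['0001', '00010002']
print(parent_path("000100020001"))     # 00010002
```

Because an ancestor's path is always a prefix of its descendants' paths, these relationships can be worked out with string slicing alone.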
Depth & numchild
depth and numchild are a bit simpler. depth is simply how many ‘layers’ deep we are within the tree: the top level is 1, any children of that layer are 2, and so on.
numchild is the number of children a given page has.
Here’s a more detailed version of the above example, where each layer shows the path, depth, numchild. For example, 0001, 1, 3 means the path is 0001, the depth is 1, and the numchild is 3:
├── 0001, 1, 3
│ ├── 00010001, 2, 0
│ ├── 00010002, 2, 1
│ │ └── 000100020001, 3, 0
│ └── 00010003, 2, 0
├── 0002, 1, 2
│ ├── 00020001, 2, 0
│ └── 00020002, 2, 0
└── 0003, 1, 0
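As a rough cross-check of the diagram, the sketch below derives depth and numchild from the paths alone. In the database these are stored as their own fields rather than computed like this; the helper names `depth_of` and `numchild_of` are hypothetical, and the 4-character step is taken from the example above.

```python
STEP_LENGTH = 4  # assumption: each layer adds 4 characters to the path

# The paths from the diagram above
paths = [
    "0001", "00010001", "00010002", "000100020001", "00010003",
    "0002", "00020001", "00020002",
    "0003",
]


def depth_of(path: str) -> int:
    """Depth is the number of 4-character steps in the path."""
    return len(path) // STEP_LENGTH


def numchild_of(path: str, all_paths: list[str]) -> int:
    """Count the paths whose parent path (own step removed) equals `path`."""
    return sum(1 for other in all_paths if other[:-STEP_LENGTH] == path)


for path in paths:
    print(path, depth_of(path), numchild_of(path, paths))
```

Each printed line should match the corresponding row of the diagram, for example `0001 1 3` and `000100020001 3 0`.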
Using these fields for advanced page querying
Now that we understand the way in which pages are stored and tracked relative to each other, we can apply this to writing more complicated and efficient queries. Here is an example:
import operator
from functools import reduce

from dateutil.relativedelta import relativedelta
from django.db.models import Q
from django.utils.timezone import now
from wagtail.models import Page, PageQuerySet


class MyCustomQuerySet(PageQuerySet):
    def has_old_parent(self):
        # Get all the paths belonging to the current queryset
        page_paths = self.values_list("path", flat=True)

        # Create a set of parent paths by removing the last 4 characters
        # (the page's own step in the path)
        parent_paths = {path[:-4] for path in page_paths}

        # Get parent pages older than 12 months
        parent_pages = (
            Page.objects.filter(path__in=parent_paths)
            .filter(first_published_at__lt=now() - relativedelta(months=12))
            .values("path", "depth")  # Get only the path and depth of the parent pages
        )

        # Assemble Q objects for each parent page to filter the current queryset
        q_objects = [
            Q(path__startswith=parent_page["path"], depth=parent_page["depth"] + 1)
            for parent_page in parent_pages
        ]

        # If no parent pages are found, return an empty queryset
        if not q_objects:
            return self.none()

        # Filter the current queryset based on the parent pages
        return self.filter(reduce(operator.or_, q_objects))
I’ve added comments to provide context at each step. The outline of the process is:
- Get a list of all the paths for the pages in the current queryset.
- Remove the last 4 digits from the current page paths to assemble a set of paths for the page parents. A set is used instead of a list to remove duplicates automatically.
- Run a fresh query for Pages with matching paths, then apply any extra filters you want. The results are stripped down to only use the path and depth values to optimise performance.
- Assemble a list of Q objects, which ensures any results belong to pages with paths that begin with the same path as the parent, but are one layer deeper than the parent.
- Apply the full q_objects list in one combined query.
While this logic may feel a little complicated at first, it’s rather simple once you’re more comfortable working with this structure. This allows us to get all pages at a given layer and perform highly complex queries, all while only adding a couple of extra queries to the page overall. Other approaches tend to use poorly optimised logic, such as looping over a list of pages and running a new query for each one. This introduces an n+1 issue, whereas the solution above needs only 2 extra queries, no matter how many pages are involved.
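To see what the path and depth matching does without touching a database, here is a small in-memory imitation of the same steps. It is only a sketch: the `pages` dictionary, its dates and the plain-function `has_old_parent` are invented for illustration, and a 365-day cutoff stands in for the 12-month `relativedelta`.

```python
from datetime import datetime, timedelta

STEP_LENGTH = 4  # assumption: each layer adds 4 characters to the path

today = datetime(2024, 6, 1)
cutoff = today - timedelta(days=365)  # rough stand-in for "12 months ago"

# Hypothetical pages: path -> first_published_at
pages = {
    "0001": datetime(2022, 1, 1),      # old parent
    "00010001": datetime(2024, 5, 1),
    "00010002": datetime(2024, 5, 2),
    "0002": datetime(2024, 4, 1),      # recent parent
    "00020001": datetime(2024, 5, 3),
}


def has_old_parent(candidate_paths):
    # Steps 1 and 2: collect the distinct parent paths
    parent_paths = {path[:-STEP_LENGTH] for path in candidate_paths}
    # Step 3: keep only the parents published before the cutoff
    old_parents = {p for p in parent_paths if p in pages and pages[p] < cutoff}
    # Steps 4 and 5: keep candidates that start with an old parent's path
    # and sit exactly one layer below it
    return [
        path
        for path in candidate_paths
        if any(
            path.startswith(parent)
            and len(path) // STEP_LENGTH == len(parent) // STEP_LENGTH + 1
            for parent in old_parents
        )
    ]


print(has_old_parent(["00010001", "00010002", "00020001"]))
# ['00010001', '00010002']
```

Only the children of 0001 come back, because 0002 was published within the last year. In the real queryset the same comparison is done by the database through the combined Q objects rather than by a Python loop.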
Potential drawbacks
Aside from the slight complexity from a code perspective, the main potential drawback revolves around how this approach scales with a large dataset. Since we’re combining a list of Q objects with no upper limit, we could end up with very large queries. This could be circumvented by implementing batching, breaking up a single large query into a couple of more manageable ones. In my experience, this hasn’t been necessary.
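If batching ever did become necessary, one possible shape for it is sketched below. This is an assumption about how it could be done, not something taken from a real project: the helpers `batched` and `combine_in_batches` are hypothetical, and plain sets stand in for Q objects so the example runs without Django (both support the `|` operator used by `operator.or_`).

```python
import operator
from functools import reduce
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def combine_in_batches(conditions, size):
    """OR the conditions together in groups of `size`, one result per batch."""
    return [reduce(operator.or_, batch) for batch in batched(conditions, size)]


# Sets stand in for Q objects here
print(combine_in_batches([{1}, {2}, {3}, {4}, {5}], size=2))
# [{1, 2}, {3, 4}, {5}]
```

With real Q objects, each combined value would be passed to its own filter() call, and the results of the batches would then need to be merged, for example by collecting primary keys. That trades one very large query for several smaller ones.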