-
Notifications
You must be signed in to change notification settings - Fork 18
Description
(I'm creating this ticket from the initial discussion that happened in a PR, along with a couple of minor thoughts I've had since then.)
Requirements
We need to quickly and efficiently do hierarchy traversals for things like outline display. This is a very common need in courses.
- The solution must provide fast reads for all descendants of a node or all ancestors of a node.
- The solution must be performant for these reads, even when there are 10K nodes (maybe 20 ms?)
- Write speed is less critical. We should still be able to update a course with 10K nodes in time for a synchronous request.
High level considerations
- It makes sense to add this in Learning Core itself, since it's of great interest to plugin writers and it can operate on containers directly.
- The APIs should be more QuerySet/model based, i.e. there's something like a
ContainerPathmodel that can be returned, so that other APIs can extend that queryset to add moreselect_relatedcalls to it. - It's okay to assume that a certain type of thing is always at a particular level of the hierarchy, e.g. a Component is always at the bottom layer, and a Unit is always one layer above that.
- We should not assume that a given layer always maps to a particular container type. A plugin may one day implement a custom container type that holds Components but is not a Unit. This should still work. We will likely also introduce Files and Uploads as a type that sits at the bottom layer.
- We can probably get away with only tracking the expanded form for the current Draft and Published versions.
Modeling
A containers app (or publishing for now) might have a model like:
class PublishedContainerPath(models.Model):
# level_0 is the lowest/leaf layer, e.g. the Component.
# This is a DAG, so we can have many duplicates of level_0_entity, and the
# only UNIQUE restriction would be on the entire row, i.e. exact duplicates
# of everything would be a race condition. Even having the same entity at
# multiple versions might be allowable for pinned child references in the
# future.
#
# It feels unintuitive to model the hierarchy from bottom-up like this, but
# as Braden pointed out, this would let us avoid the issue of having Units
# show up in multiple columns, since a Unit would always be at level_1 in
# this type of model.
level_0_entity = models.ForeignKey(models.PublishableEntity)
level_0_version = models.ForeignKey(models.PublishableEntityVersion, null=True)
level_0_order = models.PositiveSmallInteger(null=True) # or PositiveInteger?
# Continue this pattern for at least Component -> Unit -> Subsection ->
# Section -> OutlineRoot (0-4), but probably at least to 6 to give a little
# headroom?
level_1_entity = models.ForeignKey(models.PublishableEntity)
level_1_version = models.ForeignKey(models.PublishableEntityVersion, null=True)
level_1_order = models.PositiveSmallInteger(null=True)
# highest level doesn't have an ordering (because it's never a peer child
# of anything)
level_6_entity = models.ForeignKey(models.PublishableEntity)
level_6_version = models.ForeignKey(models.PublishableEntityVersion, null=True)
# This is going to be very index heavy, but hopefully the table itself
# doesn't grow too large if we constrain it to only show published and draft.
# We can do incremental updates as drafts are updated, i.e. we should never
# be rewriting a whole course's hierarchy when a single subsection changes.
# ForeignKeys mean implicit indexes already, so we can do searches for
# individual containers easily enough. We'd need to add indexes for
# efficient ordered traversal, which in this model would be a composite
# index with ordering, e.g. (level_6_entity, level_5_order, level_5_entity,
# etc.)This model isn't fully baked, but I think the basic idea is workable. Something flat that represents hierarchy, is easily filterable, and where callers can efficiently tack joins onto the queryset to add whatever data they think is relevant. I think that we can potentially do some really good caching once I can land the publishing/draft dependency stuff, but that the caching would mostly happen at the edges (e.g. content_libraries), while the inner pieces like containers would continue to pass QuerySets to make things more extensible.
Model Considerations and Tradeoffs
Limitations
- There is a fixed hierarchy limit, so content of arbitrary depth cannot be represented.
- The representation makes reads fast, but it does waste a fair amount of space because sibling nodes are very redundantly encoded. This makes it inappropriate for historical data.
How to handle containers that may exist at different levels of the hierarchy?
There are containers like OutlineRoot which may hold children at different levels of the hierarchy. I think that there are a couple of options:
- Treat these containers as special, existing in its entirely separate table.
- Reserve the top level for these containers that aren't fixed to a particular level of the hierarchy. That would make it so that some of these rows would be sparse, e.g.
(component_1, unit_1, subsection_1, section_1, None, None, outline_1)
Do we need separate tables for Draft and Published hierarchies, or can we unify them?
There's a lot of commonality there, but a unified table is going to need a boolean field in every index. I still feel like they're different models from the point of view of their relationships with other things, but maybe that just really means they're two proxy models on top of the same underlying one. I'm currently leaning towards making separate concrete tables that are derived from the same abstract model, to centralize the code but still give separate tables to index, join to, and receive signals from.