
Remove "descendant_count" to optimize large taxonomies#517

Open
bradenmacdonald wants to merge 2 commits into main from braden/optimize-taxonomy

Conversation

Contributor

@bradenmacdonald bradenmacdonald commented Mar 26, 2026

Hmm, I guess when testing my previous PR #511 I didn't test it on a sufficiently large taxonomy.

Once I tested with Lightcast Open Skills Taxonomy.csv (4,268 skill tags in a 3-level hierarchy), I realized that it was slower for very large taxonomies, going from ~150ms to ~14s :/

The slowdown comes from computing the descendant_count field accurately.

Technical Details

I thought that this `lineage__startswith` query would be performant, because we have an index on the `lineage` column and it's a `startswith` query:
        # Count all descendants at any depth using depth + lineage prefix.
        # depth__gt correctly excludes self; lineage prefix matches all descendants.
        descendants_sq = (
            self.tag_set.filter(depth__gt=models.OuterRef("depth"), lineage__startswith=models.OuterRef("lineage"))
            .order_by()
            .annotate(count=models.Func(F("id"), function="Count"))
        )
        qs = qs.annotate(descendant_count=models.Subquery(descendants_sq.values("count")))  # type: ignore[no-redef]

But SQL doesn't natively have a "starts with" operator, so this gets converted to a LIKE "...%" expression. The problem is that because this is a correlated subquery, the LIKE pattern is built at runtime from the OuterRef("lineage") of the outer query, and we run into a fundamental MySQL optimizer limitation:

for a correlated subquery, MySQL plans the inner query once before executing it. Since CONCAT(outer_ref, '%') is a runtime value, MySQL can't compute range bounds at plan time and won't apply a range scan — even with the perfect index.

With no range scan available, MySQL re-scans the tag table for every outer row, making the query O(n²).

Instead of finding a way to optimize descendant_count in this query (I don't think there is a good way, other than the hard-coded approach we had before), I think it's better to remove descendant_count completely. It's expensive to compute in the database, but almost trivial to compute afterward in Python once you've run the same query. Also, we aren't using it anywhere (I checked frontend-platform and the Authoring MFE), so I don't think we should compute or return it when we don't have a use case. The existing child_count is far more important for things like pagination, and it works fine.
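For context, here is a rough sketch of what "compute it afterward in Python" could look like, assuming each returned row carries its materialized-path `lineage` string. The dict rows and the "\t" separator are illustrative assumptions, not the project's actual schema:

```python
from collections import Counter

def compute_descendant_counts(rows):
    """Count descendants for each tag from its materialized-path lineage.

    `rows` is an already-evaluated query result; each row needs a
    `lineage` string. (The "\t" separator is an assumption here; use
    whatever separator the real lineage column uses.)
    """
    counts = Counter()
    for row in rows:
        parts = row["lineage"].split("\t")
        # Every proper prefix of this row's lineage is an ancestor,
        # so this row contributes 1 to each ancestor's count.
        for depth in range(1, len(parts)):
            counts["\t".join(parts[:depth])] += 1
    return {row["lineage"]: counts[row["lineage"]] for row in rows}

rows = [
    {"lineage": "Skills"},
    {"lineage": "Skills\tIT"},
    {"lineage": "Skills\tIT\tPython"},
    {"lineage": "Skills\tManagement"},
]
print(compute_descendant_counts(rows))
```

This is a single O(n × depth) pass over results you already fetched, versus the O(n²) correlated subquery in the database.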

Q: Why don't I just compute descendant_count in python within the get_filtered_tags_deep code?

A: Because doing so requires evaluating the queryset, and I want this low-level API to return a queryset that can then be paginated or filtered further. That is much more performant than pre-evaluating the entire query for the whole taxonomy.
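To illustrate why returning an unevaluated queryset matters, here is a toy stand-in (not Django's actual QuerySet, just a sketch of its lazy behavior) showing how slicing before iteration keeps the expensive work deferred:

```python
class LazyQuery:
    """Toy stand-in for a Django QuerySet: nothing runs until iteration."""

    def __init__(self, rows, limit=None):
        self._rows = rows
        self._limit = limit
        self.executed = False

    def __getitem__(self, sl):
        # Slicing returns a new lazy query with a LIMIT, like QuerySet slicing.
        return LazyQuery(self._rows, limit=sl.stop)

    def __iter__(self):
        # The "database query" only runs here, and only fetches `limit` rows.
        self.executed = True
        rows = self._rows if self._limit is None else self._rows[: self._limit]
        return iter(rows)

qs = LazyQuery(list(range(4268)))  # e.g. one row per Lightcast skill tag
page = qs[:10]                     # still lazy: no rows fetched yet
first_page = list(page)            # "query" executes now, fetching only 10 rows
print(len(first_page))
```

Pre-evaluating inside get_filtered_tags_deep would be the equivalent of always iterating `qs` in full before the caller gets a chance to slice it.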

Testing Instructions

Import the linked taxonomy above and test its performance before and after this change.

Private ref: MNG-4914

@openedx-webhooks

Thanks for the pull request, @bradenmacdonald!

This repository is currently maintained by @axim-engineering.

Once you've gone through the following steps feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

  • If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
    • This process (including the steps you'll need to take) is documented here.
  • If it doesn't, simply proceed with the next step.
🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

  • Dependencies

    This PR must be merged before / after / at the same time as ...

  • Blockers

    This PR is waiting for OEP-1234 to be accepted.

  • Timeline information

    This PR must be merged by XX date because ...

  • Partner information

    This is for a course on edx.org.

  • Supporting documentation
  • Relevant Open edX discussion forum threads
🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

  • The size and impact of the changes that it introduces
  • The need for product review
  • Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

@openedx-webhooks openedx-webhooks added open-source-contribution PR author is not from Axim or 2U core contributor PR author is a Core Contributor (who may or may not have write access to this repo). labels Mar 26, 2026
@github-project-automation github-project-automation bot moved this to Needs Triage in Contributions Mar 26, 2026
It can be relatively easily calculated in python using the result data if needed, now that the API otherwise supports unlimited depth.
@bradenmacdonald bradenmacdonald force-pushed the braden/optimize-taxonomy branch from cb3f179 to 20a574a Compare March 26, 2026 21:34
