Remove "descendant_count" to optimize large taxonomies#517
Remove "descendant_count" to optimize large taxonomies#517bradenmacdonald wants to merge 2 commits intomainfrom
Conversation
It can be calculated relatively easily in Python using the result data if needed, now that the API otherwise supports unlimited depth.
Force-pushed from cb3f179 to 20a574a
Hmm, I guess when testing my previous PR #511 I didn't test it on a sufficiently large taxonomy.
Once I tested with Lightcast Open Skills Taxonomy.csv (4,268 skill tags in a 3-level hierarchy), I realized that it was slower for very large taxonomies, going from ~150ms to ~14s :/
The reason is the cost of accurately computing the `descendant_count` field.

Technical Details
I thought that this `lineage__startswith` query would be performant, because we have an index on the `lineage` column and it's a `startswith` query. But SQL doesn't natively have a "starts with" operator, so this gets converted to a `LIKE "...%"` expression. The problem is that because this is a subquery, our `LIKE` pattern is a variable, the `OuterRef("lineage")` from the outer query, rather than a constant prefix, so we run into a fundamental MySQL optimizer limitation: the index on `lineage` can't be used for a `LIKE` whose pattern isn't a constant. This makes the query O(n²).
Instead of finding a way to optimize `descendant_count` in this query (I don't think there is a good way, other than the hard-coded approach we had before), I think it's better to just remove `descendant_count` completely. It's expensive to compute in the database, but it's almost trivial to compute afterward in Python once you've run the same query. Also, we aren't using it at all (I checked frontend-platform and the Authoring MFE), so I don't think we should compute it or return it when we don't have a use case. The existing `child_count` is far more important for things like pagination, and it works fine.

Q: Why don't I just compute `descendant_count` in Python within the `get_filtered_tags_deep` code?

A: Because doing so requires evaluating the queryset, and I want this low-level API to return a queryset that can then be paginated or filtered further. That way is much more performant than pre-evaluating the entire query for the whole taxonomy.
Testing Instructions
Import the linked taxonomy above and test its performance before and after this change.
Private ref: MNG-4914