glossary: generated-only glossary source with reproducibility checks #656

Open
PLeVasseur wants to merge 4 commits into rust-lang:main from PLeVasseur:glossary-single-source-phase1

Conversation

@PLeVasseur
Contributor

@PLeVasseur PLeVasseur commented Feb 5, 2026

Summary

  • Rebased and restacked the glossary single-source migration onto pinned upstream main (fb8a46795eda1f1db5e3232002fd94a270bfbffd) as four logical commits.
  • 06da2fd - feat(glossary): add single-source glossary tooling core
    • Adds glossary-entry/glossary-include infrastructure, glossary generation tooling, and build integration.
  • 85a60b6 - docs(glossary): migrate glossary content to single-source entries
    • Migrates glossary definitions into chapter-local single-source entries across the spec.
  • 6716db3 - ci(glossary): add transitional html parity verification
    • Adds CI parity verification to compare rendered HTML output during the transition.
  • 8e93f355 - refactor(glossary): switch to generated-only glossary source and reproducibility checks
    • Removes committed autogenerated glossary source and keeps generated-only workflow with reproducibility checks.
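A minimal sketch of what a reproducibility check for a generated-only glossary could look like. It is not the tooling added in this PR: `generate_glossary` and its data are hypothetical stand-ins, and the only idea shown is "run the generator more than once and require identical output".

```python
import hashlib
from typing import Callable


def digest(text: str) -> str:
    """Return a SHA-256 hex digest of the generated glossary text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_reproducible(generate: Callable[[], str], runs: int = 2) -> bool:
    """Run the generator several times and report whether every run agrees."""
    digests = {digest(generate()) for _ in range(runs)}
    return len(digests) == 1


def generate_glossary() -> str:
    """Hypothetical generator; sorting keeps output independent of input order."""
    terms = {"value": "fls_b", "abi": "fls_a"}  # made-up term -> anchor pairs
    return "\n".join(f"{name}: {anchor}" for name, anchor in sorted(terms.items()))
```

Sorting the entries before rendering is one simple way to keep the output stable regardless of the order in which chapters are read.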

Closes #655

Reference alignment

  • No Rust Reference semantic changes; this PR is glossary architecture/tooling/verification work.
  • Rust 2021 scope remains unchanged.

Testing

@kirtchev-adacore
Contributor

(adding this bit from our meeting)

I think that the terms should be defined within the body of the FLS (:dt:s), while the Glossary should simply contain references to these terms (:t:s). The main reason for this is to allow the reader to review the co-located semantics of a term.

Currently all references to terms lead to the Glossary, and often there is no convenient way to go from the Glossary to the semantics of that term.

@tshepang
Member

tshepang commented Feb 9, 2026

Currently all references to terms lead to the Glossary, and often there is no convenient way to go from the Glossary to the semantics of that term.

we can make clicking on the term in the glossary point to where the term is defined (which is the opposite direction to what you describe)

@PLeVasseur
Contributor Author

Yes, this is possible!

However, this first step would not do that, as there are sometimes multiple definitions (:dt:) within the same entry in the glossary. If we swapped those canonical definitions to the chapter, this'd be odd in the glossary.

Feels like we do this switch to the directive, then we move on to those refinements like splitting apart glossary entries and then making the glossary entries link into the chapters as we talked about in the meeting.

@tshepang
Copy link
Member

tshepang commented Feb 10, 2026

here are steps I think we should take, to ease review and have less changes

  • ensure main text has all glossary entries
  • ensure the duplicate definitions that are in both glossary and main text have matching text
    • if text does not match, choose one that is more accurate
  • ensure :dt entries are in main text and that there are none in glossary
    • this will mean clicking on the term in a glossary will take one to the main text
  • remove the "See SomeTerm" paragraphs in the glossary, because above step makes them redundant
  • generate the glossary from reading all :dt entries from main text, and ensure it matches the old glossary

these could be made separate PRs because each earlier step is valuable enough on its own, without proceeding to later steps
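The last step in the list above (generating the glossary from the `:dt:` entries in the main text) could be sketched roughly as follows. This assumes the roles are written in standard rST form, e.g. ``:dt:`ABI clobber```; the function names and the one-line-per-term output format are hypothetical, not what the FLS tooling does.

```python
import re

# Assumes standard rST role syntax with backticks, e.g. :dt:`ABI clobber`.
DT_ROLE = re.compile(r":dt:`([^`]+)`")


def collect_defined_terms(chapters: dict[str, str]) -> dict[str, str]:
    """Map each defined term to the chapter file that defines it."""
    found: dict[str, str] = {}
    for path, text in chapters.items():
        for term in DT_ROLE.findall(text):
            found.setdefault(term, path)  # first definition wins
    return found


def render_glossary(terms: dict[str, str]) -> str:
    """Emit one glossary line per term, referencing it with :t:."""
    lines = [f":t:`{term}`" for term in sorted(terms, key=str.lower)]
    return "\n".join(lines)
```

Because the glossary side only ever emits `:t:`, clicking a glossary term would resolve to the `:dt:` in the main text, which is the direction described in the list.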

@kirtchev-adacore
Contributor

I like Tshepang's strategy!

@kirtchev-adacore
Contributor

there are sometimes multiple definitions (:dt:) within the same entry in the glossary

This is basically a bug.

@PLeVasseur
Contributor Author

PLeVasseur commented Feb 10, 2026

Yeah a phased approach could definitely make this easier to review. I'd like to ask for your ear for a moment again on the "why" of this purely mechanical transformation for the moment.

It depends on how the problem is tackled, I suppose. I think having a mechanical transformation that arrives at the same rendered text has some merit and would then let us carefully perform the steps we agreed are in-scope (same definitions in both glossary and main text, linking into main text with it having :dt:, and so on as you write below)

I favored this approach because in one jump we go from "two scattered sources of truth" to "two co-located sources of truth" to allow for further followup transformations to get us what we'd like.

I think it'd even make sense to strip out the two different :glossary: and :chapter: bits of the glossary directive and have a single canonical definition that gets put in both the chapter and the glossary.

The current PR aims purely for the mechanical transformation to arrive at the same rendered text. We can and should shape how the directive works to serve us as we would proceed through the steps you outline below.


Some questions!

here are steps I think we should take, to ease review and have less changes

* ensure main text has all glossary entries

Does this effectively mean "all glossary entries which exist today in the glossary are moved to have text in some chapter"?

For example the definition of the C programming language would have some text in a chapter, maybe FFI.

* ensure the duplicate definitions that are in both glossary and main text have matching text

Makes sense. Does require splitting and separating out different structures, where say in the main text there's a longer paragraph that's mostly like the glossary text or the main text then says something which leads into some bullet points. 1

  * if text does not match, choose one that is more accurate

More accurate is going to be interesting. Probably a case-by-case thing we'll need to audit. As I wrote above in the main text it seems like this means longer paragraphs or lead ins to bullet points with more detail (i.e. more paragraphs).

* ensure `:dt` entries are in main text and that there are none in glossary
  * this will mean clicking on the term in a glossary will take one to the main text

Agreed, yep. That'll let us jump from some word in glossary into a meaningful section of the FLS main body text.

* remove the "See SomeTerm" paragraphs in the glossary, because above step makes them redundant

Could you elaborate a bit more on this? Because not always do these link to :t: terms. These seem to often link to :s: syntax terms as well. Generally the glossary entries have some patterns, but they are not held to tightly and can vary. 2

* generate the glossary from reading all `:dt` entries from main text, and ensure it matches the old glossary

Right-o. That's eventually where we'd like to be.

these could be made separate PRs because each earlier step is valuable enough on its own, without proceeding to later steps.

I hear you that this is a large change! Definitely it is. On the other hand, it is purely mechanical in nature and arrives at exactly the document as-is today to allow for the above refinements.


Given all that -- what do we think? Purely mechanical transform of this PR seems to get us a nice way to then perform the further improvements we both agree on which can be more subtle, while allowing for a simple method to gather both sides up and compare them in a single place.

The mechanical transform as the first step will put the pieces in place to allow for preventing regressions (e.g. check for and fail :glossary: entries with :dt:, add check to ensure that :glossary: and :chapter: entries are identical and warn if not until we have them all made same and can then remove the :glossary: entirely)

Footnotes

  1. Thing that I'm finding is that this may not be straightforward. Worth it for me to find some examples to share.

  2. Because the glossary is currently not strictly regulated on what each heading contains, there are irregularities. Worth looking for examples.
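The two guardrails named in this comment (fail on `:dt:` inside `:glossary:` text, warn when `:glossary:` and `:chapter:` text differ) could look roughly like the sketch below. `GlossaryEntry` is a hypothetical in-memory stand-in for one `.. glossary-entry::` block, and treating `:dt:` and `:t:` as equivalent when comparing the two texts is an assumption of this sketch, not something the PR specifies.

```python
from dataclasses import dataclass


@dataclass
class GlossaryEntry:
    """Hypothetical in-memory form of one glossary-entry directive."""

    term: str
    glossary: str  # text destined for the glossary
    chapter: str   # text destined for the chapter


def _comparable(text: str) -> list[str]:
    """Ignore whitespace and the :dt:/:t: distinction when comparing texts."""
    return text.replace(":dt:", ":t:").split()


def check_entry(entry: GlossaryEntry) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for a single entry."""
    errors: list[str] = []
    warnings: list[str] = []
    if ":dt:" in entry.glossary:
        errors.append(f"{entry.term}: :dt: found in :glossary: text")
    if _comparable(entry.glossary) != _comparable(entry.chapter):
        warnings.append(f"{entry.term}: :glossary: and :chapter: text differ")
    return errors, warnings
```

Keeping the mismatch a warning rather than an error matches the transitional idea in the comment: warn until every pair is identical, then drop the `:glossary:` text entirely.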

@tshepang
Member

tshepang commented Feb 10, 2026

here are steps I think we should take, to ease review and have less changes

* ensure main text has all glossary entries

Does this effectively mean "all glossary entries which exist today in the glossary are moved to have text in some chapter"?

yes

For example the definition of the C programming language would have some text in a chapter, maybe FFI.

yes, near where it's first mentioned

  * if text does not match, choose one that is more accurate

More accurate is going to be interesting. Probably a case-by-case thing we'll need to audit. As I wrote above in the main text it seems like this means longer paragraphs or lead ins to bullet points with more detail (i.e. more paragraphs).

yeah, we can decide together what is more accurate (or more complete, or worded better)

* remove the "See SomeTerm" paragraphs in the glossary, because above step makes them redundant

Could you elaborate a bit more on this? Because not always do these link to :t: terms. These seem to often link to :s: syntax terms as well. Generally the glossary entries have some patterns, but they are not held to tightly and can vary. 2

Those glossary links target places close to where the term is introduced, and these terms tend to closely match the :s: roles (like :dt:ABI Clobber glossary entry having a link to :s:AbiClobber).
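As a rough illustration of "closely match", a term and a syntax name can be compared after stripping case, spaces, and punctuation. This is only a sketch of the naming pattern in the `ABI Clobber` / `AbiClobber` example; the helper names are hypothetical and real entries may not all follow the pattern.

```python
import re


def normalized(name: str) -> str:
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def term_matches_syntax(term: str, syntax: str) -> bool:
    """Report whether a glossary term and a syntax name agree once normalized."""
    return normalized(term) == normalized(syntax)
```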

these could be made separate PRs because each earlier step is valuable enough on its own, without proceeding to later steps.

I hear you that this is a large change! Definitely it is. On the other hand, it is purely mechanical in nature and arrives at exactly the document as-is today to allow for the above refinements.

There are going to be parts where we must decide what to do, like choosing which of the definitions to go with.

Given all that -- what do we think? Purely mechanical transform of this PR seems to get us a nice way to then perform the further improvements we both agree on which can be more subtle, while allowing for a simple method to gather both sides up and compare them in a single place.

I really would want to avoid what we have currently in this pr, where there are drastic changes to content (specifically adding extra rST roles in the main text, and lots of other things, like name changes and lots of code), only to later remove it.

@PLeVasseur
Contributor Author

I really would want to avoid what we have currently in this pr, where there are drastic changes to content (specifically adding extra rST roles in the main text, and lots of other things, like name changes and lots of code), only to later remove it.

I hear you that it's a lot of changes in the source. 🫠

I do think that having .. glossary-entry:: as a directive does allow for cleaner guardrails to prevent regressions back to how things were prior. I added a small postscript above that I'll copy-paste down here:

The mechanical transform as the first step will put the pieces in place to allow for preventing regressions (e.g. check for and fail :glossary: entries with :dt:, add check to ensure that :glossary: and :chapter: entries are identical and warn if not until we have them all made same and can then remove the :glossary: entirely)

I think without the .. glossary-entry:: directive as a first step it's possible we accidentally introduce regressions.

An argument could be made to write a different part of the Sphinx extension to check for these various things, but if we'd like to eventually generate the contents of the glossary from the main-body text I think the purely mechanical transform sounds reasonable.


Honestly, even if this PR, in its current form, is deemed too much, I'd probably still end up using this branch in order to work through the ordered list of priorities you've got. I think it could end up being a bit more brittle at each stage without the .. glossary-entry:: directive though.

Thoughts?

@tshepang
Member

tshepang commented Feb 10, 2026

The code in this pr can be used to help generate changes that fit my strategy, and those steps can even be done in one pr (where each step in my plan is a commit). I only mentioned separate pr idea so we can benefit from each step without delay.

@tshepang
Member

tshepang commented Feb 10, 2026

We can avoid regressions by reviewing each step (whether a separate pr or a separate commit). For example, the step that unifies the definitions between main text and glossary will be clearly visible in the diff; we do not have to have a glossary directive to do that, which we'll later have to remove anyway.

@PLeVasseur
Contributor Author

We can avoid regressions by reviewing each step (whether a separate pr or a separate commit). For example, the step that unifies the definitions between main text and glossary will be clearly visible in the diff; we do not have to have a glossary directive to do that, which we'll later have to remove anyway.

Perhaps miscommunication, but a glossary directive would allow for this to happen more cleanly, I think, and avoid regressions. Not now during this migration, but in the future, when it may no longer be me or you that's maintaining the FLS.

If we don't use a glossary directive, then it seems like this would involve needing other ways of preventing regressions, like some more involved part of the Sphinx extension that can check for the kinds of properties we care about (:dt: only in chapter text, entries in glossary are same as entries in chapter text), without the benefit of a clear glossary directive to benefit from.


To be clear -- if we wanna go the route of the multiple PRs or commits, while I wouldn't pick that route, it's okay. I do think though that eventually we'll have either the glossary directive or some other extra part of the Sphinx extension to check for the properties we care about.

I'll give some thought on how to chunk this up as it seems to be the way that @tshepang and @kirtchev-adacore are more comfortable. 🖖

@tshepang
Member

tshepang commented Feb 11, 2026

I don't understand why my approach would risk regressions, since glossary.rst will get automatically regenerated, meaning it will always have all the :dt: entries from main text (as :t:). If someone does modify it without knowing that it's automatically generated, they will see from the diff, when their changes are overwritten. We can leave a warning at the top of the document, as well as mention in the README.rst/CONTRIBUTING.rst that the file is automatically generated.

@kirtchev-adacore
Contributor

I finally caught up with the discussion. It might be worth spelling out exactly what we want the final product to be, and how to arrive there without dropping term definitions on the floor.

  1. The Glossary should be automatically generated.
  2. All term definitions (:dt:s) should be in the FLS text.
  3. All existing term definitions in the Glossary should be converted to term references (:t:s). (The Glossary points back into the FLS text)
  4. All existing term references in the FLS text (continue to) point to the term definitions in the FLS text. (The FLS text is always the source of truth for terms)
  5. The IDs of all existing paragraphs in the Glossary should be stable during the transition. (No new IDs, no lost IDs)
  6. The automation should not delete term definitions. (No loss of terms, no orphaned term references)
  7. The automation should not introduce new term definitions without term references. (No unused terms)
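Requirements 6 and 7 reduce to set comparisons over defined and referenced terms. The sketch below assumes backtick role syntax (``:dt:`term```, ``:t:`term```) and exact string matching; it ignores any normalization the real roles may apply (plurals, case), and the function names are hypothetical.

```python
import re

# Assumes standard rST role syntax with backticks.
DT = re.compile(r":dt:`([^`]+)`")
T = re.compile(r":t:`([^`]+)`")


def audit_terms(text: str) -> dict[str, set[str]]:
    """Find references without a definition and definitions without a reference."""
    defined = set(DT.findall(text))
    referenced = set(T.findall(text))
    return {
        "orphaned_references": referenced - defined,
        "unused_definitions": defined - referenced,
    }


def lost_terms(before: str, after: str) -> set[str]:
    """Terms defined before the migration that are no longer defined after it."""
    return set(DT.findall(before)) - set(DT.findall(after))
```

All three sets being empty would correspond to "no loss of terms, no orphaned term references, no unused terms".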

Some points stemming from the requirements:

If term definitions are going to be used to generate the term references in the Glossary, we need to somehow associate the ID of the corresponding Glossary paragraph with the term definition. This can be done using Pete's .. glossary-entry:: directive, or if it is possible, the :dt: annotation itself. (I am no Sphinx wiz, so I have no idea whether :dt:fls_Id: is even possible)

Given that the automation will basically invert :dt:s and :t:, the proper linking from Glossary term reference and FLS term references to FLS term definitions "should just work" out of the box.

Given that we do not want any regressions (no loss of terms, no orphaned term reference, no unused terms), a diff between the original Glossary and the generated Glossary should differ only on :dt:s vs :t: usage. (Reviewing that diff would suck, but once we have it, we know that the automation "works")
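The "differ only on `:dt:` vs `:t:`" property can be checked mechanically instead of by reading the diff: normalize both roles to one spelling, then require an empty diff. A minimal sketch using the standard library's `difflib`; the function names are hypothetical.

```python
import difflib


def normalize_roles(text: str) -> str:
    """Treat term definitions and term references as the same token."""
    return text.replace(":dt:", ":t:")


def unexpected_diff(original: str, generated: str) -> list[str]:
    """Return unified-diff lines that remain after role normalization."""
    before = normalize_roles(original).splitlines()
    after = normalize_roles(generated).splitlines()
    return list(
        difflib.unified_diff(before, after, "original", "generated", lineterm="")
    )
```

An empty result means the generated glossary differs from the original only in role usage; any remaining lines point at text that changed.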

@tshepang
Member

Didn't think of glossary ids, but am not sure we should preserve them or even have them around... they are linkable by urls already. Is there a need for them to be more stable than that? I don't want us to constrain ourselves without a strong need.

@PLeVasseur
Contributor Author

I don't understand why my approach would risk regressions, since glossary.rst will get automatically regenerated, meaning it will always have all the :dt: entries from main text (as :t:). If someone does modify it without knowing that it's automatically generated, they will see from the diff, when their changes are overwritten. We can leave a warning at the top of the document, as well as mention in the README.rst/CONTRIBUTING.rst that the file is automatically generated.

Ah, right. So I was working under the assumption that if we get partway through the phases you outlined, but not all the way to having the glossary generated by the directive, then the glossary and chapter text could still drift apart over time.

The upside of doing the purely mechanical transformation right away is that we can know the "source of truth" for both the glossary and the chapter are colocated to allow for further refinement through the phases we agree are important.

Have you any thoughts on this "purely mechanical transform" first, based on the above? Sorry if I was unclear 😅

@PLeVasseur
Contributor Author

I finally caught up with the discussion. It might be worth spelling out exactly what we want the final product to be, and how to arrive there without dropping term definitions on the floor.

1. <snip>
7. <snip>

I agree with these requirements 🖖

Some points stemming from the requirements:

If term definitions are going to be used to generate the term references in the Glossary, we need to somehow associate the ID of the corresponding Glossary paragraph with the term definition. This can be done using Pete's .. glossary-entry:: directive, or if it is possible, the :dt: annotation itself. (I am no Sphinx wiz, so I have no idea whether :dt:fls_Id: is even possible)

Yeah, this was part of the glossary directive to enable this.

Given that the automation will basically invert :dt:s and :t:, the proper linking from Glossary term reference and FLS term references to FLS term definitions "should just work" out of the box.

Agreed! But also -- "should just work" <= famous last words 🤣

Given that we do not want any regressions (no loss of terms, no orphaned term reference, no unused terms), a diff between the original Glossary and the generated Glossary should differ only on :dt:s vs :t: usage. (Reviewing that diff would suck, but once we have it, we know that the automation "works")

Agreed. We can also automate this check with a script.

@PLeVasseur
Contributor Author

Didn't think of glossary ids, but am not sure we should preserve them or even have them around... they are linkable by urls already. Is there a need for them to be more stable than that? I don't want us to constrain ourselves without a strong need.

Might just be that we're coming at the problem from different angles.

I'm suggesting being very ambitious with this PR's changes, but ultimately arriving at the same rendered representation so that we can then take more incremental steps to improve it along the phases we agree on.

We could consider the kind of change you're proposing here as one of those incremental steps.

@kirtchev-adacore
Contributor

Is there a need for them to be more stable than that? I don't want us to constrain ourselves without a strong need.

I do not think that there was a specific need for IDs in the Glossary.

@tshepang
Member

Have you any thoughts on this "purely mechanical transform" first, based on the above?

If the purely mechanical approach does not result in temporary steps 1 recorded in git history, that would be fine, but it looks like that cannot be avoided 2. This can be done semi-mechanically of course, like the part of your script that ensures all terms in glossary are in the main text can be refined to achieve my proposed approach.

Perhaps my approach means more work for the developer (and am happy to do it, if you want), but it is less work for the reviewer, and has a more clean git history.

Footnotes

  1. they only exist to serve later steps, for they would later then be removed

  2. for one, we will have to choose which of duplicate definitions to go with

@traviscross traviscross self-assigned this Feb 12, 2026
@traviscross
Contributor

Talking with @PLeVasseur about this PR. I'll take review on it. I'm asking Pete to remove the autogenerated file; I'd prefer not having two sources of truth in the repo; we can ensure reproducibility in other ways. Regarding the commit history, I want to see this squashed down significantly. It'd probably be OK to have 2-5 logical commits, e.g. one to add the tooling, one to do the big rearrangement, etc.

The key to trusting the big rearrangement is verifying that the before and after results are the same. Pete's doing this. I'll want him to reverify this on the final series of commits. That'll be enough for me to build confidence here.
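One way to verify that the before and after results are the same is to hash every rendered HTML file in two build directories and report any file that is missing, added, or changed. This is a sketch under assumptions, not the PR's actual parity check: the directory layout and function names are hypothetical, and a byte-for-byte comparison would flag any build-specific content the renderer embeds.

```python
import hashlib
from pathlib import Path


def tree_digests(root: Path) -> dict[str, str]:
    """Map each HTML file's path (relative to root) to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*.html"))
    }


def parity_report(before: Path, after: Path) -> list[str]:
    """List every difference between two rendered trees; empty means parity."""
    old, new = tree_digests(before), tree_digests(after)
    report = [f"missing: {name}" for name in sorted(old.keys() - new.keys())]
    report += [f"added: {name}" for name in sorted(new.keys() - old.keys())]
    report += [
        f"changed: {name}"
        for name in sorted(old.keys() & new.keys())
        if old[name] != new[name]
    ]
    return report
```

Running this against a build of the base commit and a build of the final commit in the series would give a single pass/fail answer for the re-verification step.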

@PLeVasseur
Contributor Author

Hey @tshepang 👋 could you expand a bit more on this? Want to understand where you're coming from before we make any decisions.

Perhaps my approach means more work for the developer (and am happy to do it, if you want), but it is less work for the reviewer, and has a more clean git history.

I'd like to know about the "more work for the developer" and "less work for the reviewer" part here -- for which parts? Perhaps moving of the terms into the chapters? Do you foresee this being a more gradual migration with a number of terms per PR?

Could you also share what you mean by "clean git history" here? Is there a concern from an assessor's standpoint regarding the git history containing work which gets modified over time related to scripts / tools / directives?

@tshepang
Member

Perhaps my approach means more work for the developer (and am happy to do it, if you want), but it is less work for the reviewer, and has a more clean git history.

I'd like to know about the "more work for the developer" and "less work for the reviewer" part here -- for which parts? Perhaps moving of the terms into the chapters? Do you foresee this being a more gradual migration with a number of terms per PR?

I see each of the main bullet points in my plan being just 1 commit. That is, not a gradual migration.

Could you also share what you mean by "clean git history" here? Is there a concern from an assessor's standpoint regarding the git history containing work which gets modified over time related to scripts / tools / directives?

This is for the benefit of those who look at history to understand what changed and why. The assessor does not at all care about the git history... they will only care about the changes between releases of a toolchain (like Ferrocene 26.02 vs Ferrocene 26.05).

@PLeVasseur PLeVasseur force-pushed the glossary-single-source-phase1 branch from 00ad891 to 8e93f35 February 13, 2026 22:30
@rustbot
Collaborator

rustbot commented Feb 13, 2026

This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

@PLeVasseur PLeVasseur changed the title from "glossary: single-source glossary entries" to "glossary: generated-only glossary source with reproducibility checks" Feb 13, 2026


Development

Successfully merging this pull request may close these issues.

[Change]: Have a single source of truth for glossary and chapter entries of terms

5 participants