glossary: generated-only glossary source with reproducibility checks#656
glossary: generated-only glossary source with reproducibility checks#656PLeVasseur wants to merge 4 commits intorust-lang:mainfrom
Conversation
|
(adding this bit from our meeting) I think that the terms should be defined within the body of the FLS ( Currently all references to terms lead to the Glossary, and often there is no convenient way to go from the Glossary to the semantics of that term. |
we can make clicking on the term in the glossary point to where the term is defined (which is the opposite direction to what you describe) |
|
Yes, this is possible! However, this first step would not do that, as there are sometimes multiple definitions ( Feels like we do this switch to the directive, then we move on to those refinements like splitting apart glossary entries and then making the glossary entries link into the chapters as we talked about in the meeting. |
|
here are steps I think we should take, to ease review and have less changes
these could be made separate PRs because each earlier step is valuable enough on its own, without proceeding to later steps |
|
I like Tshepang's strategy! |
This is basically a bug. |
|
Yeah a phased approach could definitely make this easier to review. I'd like to ask for your ear for a moment again on the "why" of this purely mechanical transformation for the moment. It depends on how the problem is tackled, I suppose. I think having a mechanical transformation that arrives at the same rendered text has some merit and would then let us carefully perform the steps we agreed are in-scope (same definitions in both glossary and main text, linking into main text with it having :dt:, and so on as you write below) I favored this approach because in one jump we go from "two scattered sources of truth" to "two co-located sources of truth" to allow for further followup transformations to get us what we'd like. I think it'd even make sense to strip out the two different The current PR aims purely for the mechanical transformation to arrive at the same rendered text. We can and should shape how the directive works to serve us as we would proceed through the steps you outline below. Some questions!
Does this effectively mean "all glossary entries which exist today in the glossary are moved to have text in some chapter"? For example the definition of the C programming language would have some text in a chapter, maybe FFI.
Makes sense. Does require splitting and separating out different structures, where say in the main text there's a longer paragraph that's mostly like the glossary text or the main text then says something which leads into some bullet points. 1
More accurate is going to be interesting. Probably a case-by-case thing we'll need to audit. As I wrote above in the main text it seems like this means longer paragraphs or lead ins to bullet points with more detail (i.e. more paragraphs).
Agreed, yep. That'll let us jump from some word in glossary into a meaningful section of the FLS main body text.
Could you elaborate a bit more on this? Because not always do these link to :t: terms. These seem to often link to :s: syntax terms as well. Generally the glossary entries have some patterns, but they are not held to tightly and can vary. 2
Right-o. That's eventually where we'd like to be.
I hear you that this is a large change! Definitely it is. On the other hand, it is purely mechanical in nature and arrives at exactly the document as-is today to allow for the above refinements. Given all that -- what do we think? Purely mechanical transform of this PR seems to get us a nice way to then perform the further improvements we both agree on which can be more subtle, while allowing for a simple method to gather both sides up and compare them in a single place. The mechanical transform as the first step will put the pieces in place to allow for preventing regressions (e.g. check for and fail Footnotes |
yes
yes, near where it's first mentioned
yeah, we can decide together what is more accurate (or more complete, or worded better)
Those glossary links target places close to where the term is introduced, and these terms tend to closely match the
There are going to parts we where we must decide what to do, like choosing which of the definitions to go with.
I really would want to avoid what we have currently in this pr, where there are drastic changes to content (specifically adding extra rST roles in the main text, and lots of other things, like name changes and lots of code), only to later remove it. |
I hear you that it's a lot of changes in the source. 🫠 I do think that having
I think without the An argument could be made to write a different part of the Sphinx extension to check for these various things, but if we'd like to eventually generate the contents of the glossary from the main-body text I think the purely mechanical transform sounds reasonable. Honestly, even if this PR, in its current form, is deemed too much, I'd probably still end up using this branch in order to work through the ordered list of priorities you've got. I think it could end up being a bit more brittle at each stage without the Thoughts? |
|
The code in this pr can be used to help generate changes that fit my strategy, and those steps can even be done in one pr (where each step in my plan is a commit). I only mentioned separate pr idea so we can benefit from each step without delay. |
|
We can avoid regressions by reviewing each step (whether a separate pr or a separate commit). For example, the step that unifies the definitions between main text and glossary will be clearly visible in the diff; we do not have to have a glossary directive to do that, which we'll later have to remove anyway. |
Perhaps miscommunication, but a glossary directive would allow for this to happen more cleanly, I think, and avoid regressions. Not now during this migration, but in the future, when it may no longer be me or you that's maintaining the FLS. If we don't use a glossary directive, then it seems like this would involve needing other ways of preventing regressions, like some more involved part of the Sphinx extension that can check for the kinds of properties we care about (:dt: only in chapter text, entries in glossary are same as entries in chapter text), without the benefit of a clear glossary directive to benefit from. To be clear -- if we wanna go the route of the multiple PRs or commits, while I wouldn't pick that route, it's okay. I do think though that eventually we'll have either the glossary directive or some other extra part of the Sphinx extension to check for the properties we care about. I'll give some thought on how to chunk this up as it seems to be the way that @tshepang and @kirtchev-adacore are more comfortable. 🖖 |
|
I don't understand why my approach would risk regressions, since glossary.rst will get automatically regenerated, meaning it will always have all the |
|
I finally caught up with the discussion. It might be worth spelling out exactly what we want the final product to be, and how to arrive there without dropping term definitions on the floor.
Some points stemming from the requirements: If term definitions are going to be used to generate the term references in the Glossary, we need to somehow associate the ID of the corresponding Glossary paragraph with the term definition. This can be done using Pete's Given that the automation will basically invert Given that we do not want any regressions (no loss of terms, no orphaned term reference, no unused terms), a diff between the original Glossary and the generated Glossary should differ only on |
|
Didn't think of glossary ids, but am not sure we should preserve them or even have them around... they are linkable by urls already. Is there a need for them to be more stable than that? I don't want us to constrain ourselves without a strong need. |
Ah, right. So I was working under the assumption that if we get partway done through the phases you outlined, but not all the way to having the glossary generated by the directive, then we could end up with still drifting over time. The upside of doing the purely mechanical transformation right away is that we can know the "source of truth" for both the glossary and the chapter are colocated to allow for further refinement through the phases we agree are important. Have you any thoughts on this "purely mechanical transform" first, based on the above? Sorry if I was unclear 😅 |
I agree with these requirements 🖖
Yeah, this was part of the glossary directive to enable this.
Agreed! But also -- "should just work" <= famous last words 🤣
Agreed. We can also automate this check with a script. |
Might just be that we're coming at the problem from different angles. I'm suggesting being very ambitious with this PR's changes, but ultimately arriving at the same rendered representation so that we can then take more incremental steps to improve it along the phases we agree on. We could consider the kind of change you're proposing here as one of those incremental steps. |
I do not think that there was a specific need for IDs in the Glossary. |
If the purely mechanical apporach does not result in temporary steps1 recorded in git history, that would be fine, but it looks like than cannot be avoided2. This can be done semi-mechanically of course, like the part of your script that ensures all terms in glossary are in the main text can be refined to achieve my proposed approach. Perhaps my approach means more work for the developer (and am happy to do it, if you want), but it is less work for the reviewer, and has a more clean git history. Footnotes |
|
Talking with @PLeVasseur about this PR. I'll take review on it. I'm asking Pete to remove the autogenerated file; I'd prefer not having two sources of truth in the repo; we can ensure reproducibility in other ways. Regarding the commit history, I want to see this squashed down significantly. It'd probably be OK to have 2-5 logical commits, e.g. one to add the tooling, one to do the big rearrangement, etc. The key to trusting the big rearrangement is verifying that the before and after results are the same. Pete's doing this. I'll want him to reverify this on the final series of commits. That'll be enough for me to build confidence here. |
|
Hey @tshepang 👋 could you expand a bit more on this? Want to understand where you're coming from before we make any decisions.
I'd like to know about the "more work for the developer" and "less work for the reviewer" part here -- for which parts? Perhaps moving of the terms into the chapters? Do you foresee this being a more gradual migration with a number of terms per PR? Could you also share what you mean by "clean git history" here? Is there a concern from an assessor's standpoint regarding the git history containing work which gets modified over time related to scripts / tools / directives? |
I see each of the main bullet points in my plan being just 1 commit. That is, not a gradual migration.
This is for the benefit of those who look at history to understand what changed and why. The assessor does not at all care about the git history... they will only care about the changes between releases of a toolchain (like Ferrocene 26.02 vs Ferrocene 26.05). |
00ad891 to
8e93f35
Compare
|
This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed. Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers. |
Summary
main(fb8a46795eda1f1db5e3232002fd94a270bfbffd) as four logical commits.06da2fd-feat(glossary): add single-source glossary tooling coreglossary-entry/glossary-includeinfrastructure, glossary generation tooling, and build integration.85a60b6-docs(glossary): migrate glossary content to single-source entries6716db3-ci(glossary): add transitional html parity verification8e93f355-refactor(glossary): switch to generated-only glossary source and reproducibility checksCloses #655
Reference alignment
Testing
8e93f35547fcc123ec47ba723ed13129bc19600f:mainvs commit3: pr656-diff-upstream-main-vs-commit3.txtmainvs commit4: pr656-diff-upstream-main-vs-commit4.txt.txtfor attachment compatibility): pr656-glossary-placement-scores.jsonl.txt