Localize linktrail regex following wikimedia by daxida · Pull Request #415 · tatuylonen/wikitextprocessor

daxida · 2026-03-06T10:43:09Z

Adds localization for Greek (because it was mentioned in the original issue), and Russian (because a test in wiktextract was failing with the default [a-z]).

With these changes, two Japanese tests fail (in wiktextract; everything else passes here), namely:

tests.test_ja_example.TestJaExample.test_one_line_ruby
test_ja_example.TestJaExample.test_one_line_ref_tag

They fail because bold_text_offsets does not match. I went and double checked the ruby one, and I'm under the impression that the expected offsets were wrong, so this "fixes" the test. Reasoning:

        self.assertEqual(
            sense.examples[0],
            Example(
                text="尤も僕は気の毒にも度大島を泣かせては、泣虫泣虫とからかひしものなり。",
                bold_text_offsets=[(19, 21), (21, 23), (21, 24)],
                ref="（芥川龍之介『学校友だち』）〔1925年〕",
                ruby=[("尤", "もっと"), ("度", "たびたび")],
            ),
        )

but we obtain bold_text_offsets=[(19, 21), (21, 23)].

When debug printing:

        for a, b in ex.bold_text_offsets:
            print(ex.text[a:b])

we have:

泣虫
泣虫
泣虫と

The last one makes no sense to me:
https://ja.wiktionary.org/wiki/%E3%81%AA%E3%81%8D%E3%82%80%E3%81%97

と is not bolded in the wiki, and it seems a linktrail issue from:

So I claim that bold_text_offsets=[(19, 21), (21, 23)] should have been the expected result.
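The offsets can be checked without running wiktextract at all. The sketch below only slices the example text from the test with both offset lists; the `slices` helper and the variable names are made up for illustration.

```python
# Example text copied from the test's expected Example(text=...).
text = "尤も僕は気の毒にも度大島を泣かせては、泣虫泣虫とからかひしものなり。"

# Offsets the test used to expect, and offsets obtained with this PR.
old_expected = [(19, 21), (21, 23), (21, 24)]
obtained = [(19, 21), (21, 23)]


def slices(s, offsets):
    """Return the substring covered by each (start, end) offset pair."""
    return [s[a:b] for a, b in offsets]


old_slices = slices(text, old_expected)
new_slices = slices(text, obtained)

print(old_slices)  # ['泣虫', '泣虫', '泣虫と']
print(new_slices)  # ['泣虫', '泣虫']
```

The third pair in the old expectation is the only one that extends past 泣虫 into the following と.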

The other test is the same.

This should not be merged as is, because it will create problems in other extractors that might rely on different behavior.

Have I missed other cases?

kristian-clausal · 2026-03-06T10:54:55Z

I just wrestled with these same tests, and I agree with the changes to them.

I just spent the whole morning working on the semicolon issue in wiktextract...
And I couldn't figure out why these tests failed. I even changed back to the master
branch in wikitextprocessor, so it wasn't there. But it was, because I had accidentally
committed to master before creating a new branch, and I forgot about it...

As for how this should be handled, I think this is actually something that should be in
the edition config files. We can have something in src/wikitextprocessor/data/[langcode]/
that is then used when instantiating Wtp(), so we have a wxr.wtp.linktrail_re available
everywhere.

By default, we can use the current \w+ as the basic regex, because in languages where
it is relevant you will not get constructions like [[englishtext]]абвгд, except by accident.
This issue is mainly a problem for languages that don't use spaces at all, and those we can have
use [a-z]+, because linktrailing will almost certainly be irrelevant there.

If we come across an actual issue with some edition, we can then implement specific regexes when
they come along. Now that we already have the Greek and Russian ones, we could implement
them just as well (but I bet \w+ would work perfectly fine for the reasons stated earlier.)
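To make the point about languages without spaces concrete, here is a minimal sketch using plain Python `re`. It is not wikitextprocessor's code and not a reproduction of the failing test (whose wikitext is not shown in this thread); `split_linktrail` is a hypothetical helper that takes the text immediately following a closing `]]`.

```python
import re


def split_linktrail(pattern: str, text_after_link: str) -> tuple[str, str]:
    """Split the text following a link into (trail, remainder).

    The trail is whatever `pattern` matches at the very start of the text.
    """
    m = re.match(pattern, text_after_link)
    trail = m.group(0) if m else ""
    return trail, text_after_link[len(trail):]


# Japanese has no spaces, so \w+ keeps matching until the punctuation mark.
after_link = "とからかひしものなり。"

greedy = split_linktrail(r"\w+", after_link)
ascii_only = split_linktrail(r"[a-z]+", after_link)

print(greedy)      # ('とからかひしものなり', '。')
print(ascii_only)  # ('', 'とからかひしものなり。')
```

With `[a-z]+` nothing is absorbed into the link, which is the behavior argued for above in editions where link trails are not really a feature of the language.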

daxida · 2026-03-06T11:07:09Z

> As for how this should be handled, I think this is actually something that should be in
> the edition config files. We can have something in src/wikitextprocessor/data/[langcode]/
> that is then used when instantiating Wtp(), so we have a wxr.wtp.linktrail_re available
> everywhere.

Sure, that seems cleaner, even though I don't think it's very noisy either to have them in a function as above.

> By default, we can use the current \w+ as the basic regex, because in languages where
> it is relevant you will not get constructions like [[englishtext]]абвгд, except by accident.
> This issue is mainly a problem for languages that don't use spaces at all, and those we can have
> use [a-z]+, because linktrailing will almost certainly be irrelevant there.

The \w+ logic was robust enough to not have needed revisiting up until this point (the two above tests still are wrong, but one has to look into bold_text_offsets, and I frankly never had before now).

Maybe it's tricky because \w allows for uppercase, but again, personally, if I don't see it in a test, I'd rather not talk about things I may not understand.

As stated above, the Greek regex may be of no use, and the default of [a-z] may just work as well for that case. The Russian one will fail with [a-z] though (but pass with \w).
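The differences between the candidate patterns can be sketched with plain Python `re`. The `linktrail` helper is hypothetical, and `[a-zа-яё]+` is only an illustrative lowercase-Cyrillic class, not the exact regex from this PR or from MediaWiki.

```python
import re


def linktrail(pattern: str, text_after_link: str) -> str:
    """Return the part of the text that would be absorbed into the link."""
    m = re.match(pattern, text_after_link)
    return m.group(0) if m else ""


# A Cyrillic suffix directly after a link, followed by a space.
ru_ascii = linktrail(r"[a-z]+", "ами и")          # ASCII-only default
ru_word = linktrail(r"\w+", "ами и")              # Unicode word characters
ru_local = linktrail(r"[a-zа-яё]+", "ами и")      # illustrative localized class

# \w also accepts uppercase letters, unlike the lowercase-only classes.
upper_word = linktrail(r"\w+", "ABC def")
upper_ascii = linktrail(r"[a-z]+", "ABC def")

print(repr(ru_ascii), repr(ru_word), repr(ru_local))  # '' 'ами' 'ами'
print(repr(upper_word), repr(upper_ascii))            # 'ABC' ''
```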

xxyzz · 2026-03-06T11:07:46Z

IMO the regex could also be in the wiktextract package config files like "src/wiktextract/data/en/config.json" and passed to the "Wtp" class.

kristian-clausal · 2026-03-06T11:14:10Z

I'll try to figure out something next week.

Having it in wiktextract data or wikitextprocessor data is valid either way. Putting it in wiktextract data makes it something the user should handle, which makes sense in case there are differences between different editions in the same language.

xxyzz · 2026-03-06T11:24:54Z

These two ja edition tests should be changed and your fix is correct. calculate_bold_offsets() also checks whether a link points to the same page; on MediaWiki, such links are rendered in bold.

kristian-clausal · 2026-03-09T10:54:28Z

tatuylonen/wiktextract#1607 replaces this.

Changes on wikitextprocessor: linktrailing_re is now an attribute in Wtp.
wiktextract: Enable config.json field linktrailing_regex_pattern which may replace default \w+ in Wtp.
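As a rough sketch of the config-driven approach: the field name `linktrailing_regex_pattern` and the `\w+` default come from the note above, but the loader function, the constant name, and the way the result would be handed to `Wtp` are assumptions made for illustration, not the actual wiktextract code.

```python
import json
import re

# Default taken from the note above; the constant name is hypothetical.
DEFAULT_LINKTRAIL_PATTERN = r"\w+"


def compile_linktrail_re(config_text: str) -> re.Pattern:
    """Compile the link-trail regex from the text of a config.json file.

    Falls back to the default when the field is absent.
    """
    config = json.loads(config_text)
    pattern = config.get("linktrailing_regex_pattern", DEFAULT_LINKTRAIL_PATTERN)
    return re.compile(pattern)


# An edition that opts out of Unicode link trails (illustrative value).
custom_re = compile_linktrail_re('{"linktrailing_regex_pattern": "[a-z]+"}')
# An edition whose config does not set the field.
default_re = compile_linktrail_re("{}")

print(custom_re.pattern)   # [a-z]+
print(default_re.pattern)  # \w+
```

Keeping the pattern in per-edition configuration means an edition only needs an entry when the default turns out to be wrong for it.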

Localize linktrail regex following wikimedia

9417515

kristian-clausal closed this Mar 9, 2026

daxida wants to merge 1 commit into tatuylonen:linktrailing from daxida:linktrailing
