
Localize linktrail regex following wikimedia#415

Closed
daxida wants to merge 1 commit into tatuylonen:linktrailing from daxida:linktrailing

@daxida
Contributor

daxida commented Mar 6, 2026

Adds localization for Greek (because it was mentioned in the original issue), and Russian (because a test in wiktextract was failing with the default [a-z]).

With these changes, two Japanese tests fail in wiktextract (everything passes here), namely:

tests.test_ja_example.TestJaExample.test_one_line_ruby
test_ja_example.TestJaExample.test_one_line_ref_tag

They fail because bold_text_offsets does not match. I double-checked the ruby one, and I believe the expected offsets were wrong, so this change effectively "fixes" the test. Reasoning: the test expects

        self.assertEqual(
            sense.examples[0],
            Example(
                text="尤も僕は気の毒にも度大島を泣かせては、泣虫泣虫とからかひしものなり。",
                bold_text_offsets=[(19, 21), (21, 23), (21, 24)],
                ref="(芥川龍之介『学校友だち』)〔1925年〕",
                ruby=[("尤", "もっと"), ("度", "たびたび")],
            ),
        )

but we obtain bold_text_offsets=[(19, 21), (21, 23)].

When debug printing:

        for a, b in ex.bold_text_offsets:
            print(ex.text[a:b])

we have:

泣虫
泣虫
泣虫と

The last one makes no sense to me:
https://ja.wiktionary.org/wiki/%E3%81%AA%E3%81%8D%E3%82%80%E3%81%97

image

と is not bolded in the wiki, and it seems a linktrail issue from:

image

So I claim that bold_text_offsets=[(19, 21), (21, 23)] should have been the expected result.

The other test is the same.
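
To make the offsets concrete, here is a small sketch that slices the test text with each of the previously expected spans, then shows why a `\w+` linktrail would swallow a following kana while `[a-z]+` would not. The fragment `trail_source` is a hypothetical stand-in for text directly after a closing `]]`, not the actual page source.

```python
import re

text = "尤も僕は気の毒にも度大島を泣かせては、泣虫泣虫とからかひしものなり。"

# The offsets the test used to expect; the third span includes the particle.
old_expected = [(19, 21), (21, 23), (21, 24)]
slices = [text[a:b] for a, b in old_expected]
print(slices)

# Hypothetical text that immediately follows a link's closing "]]".
trail_source = "と、"

# \w is Unicode-aware for str patterns in Python, so it matches hiragana.
unicode_trail = re.match(r"\w+", trail_source)
# [a-z]+ only matches ASCII lowercase, so no trail is attached.
ascii_trail = re.match(r"[a-z]+", trail_source)

print(unicode_trail.group(0) if unicode_trail else None)
print(ascii_trail)
```

With `\w+` the particle is glued to the link text, which would explain a span that is one character too long.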

This should not be merged as is, because it will create problems in other extractors that might rely on different behavior.

Have I missed other cases?

@kristian-clausal
Collaborator

kristian-clausal commented Mar 6, 2026

I just wrestled with these same tests, and I agree with the changes to them.

I just spent the whole morning working on the semicolon issue in wiktextract...
And I couldn't figure out why these tests failed. I even switched back to the master
branch in wikitextprocessor, so the cause shouldn't have been there. But it was, because I had
accidentally committed to master before creating a new branch, and I forgot about it...

As for how this should be handled, I think this is actually something that should be in
the edition config files. We can have something in src/wikitextprocessor/data/[langcode]/
that is then used when instantiating Wtp(), so we have a wxr.wtp.linktrail_re available
everywhere.

By default, we can use the current \w+ as the basic regex, because in languages where
it is relevant you will not get constructions like [[englishtext]]абвгд, except by accident.
This issue is mainly a problem for languages that don't use spaces at all, and those we can have
use [a-z]+, because linktrailing will almost certainly be irrelevant there.

If we come across an actual issue with some edition, we can then implement specific regexes when
they come along. Now that we already have the Greek and Russian ones, we could implement
them just as well (but I bet \w+ would work perfectly fine for the reasons stated earlier.)
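
A minimal sketch of that idea, assuming a per-edition lookup with `\w+` as the fallback. The names `DEFAULT_LINKTRAIL`, `LINKTRAIL_OVERRIDES` and `linktrail_re_for` are hypothetical and only illustrate the shape; they are not existing wikitextprocessor APIs.

```python
import re

# Default suggested above: any run of Unicode word characters.
DEFAULT_LINKTRAIL = r"\w+"

# Hypothetical overrides for editions written without spaces,
# where a Unicode-wide trail would swallow the following text.
LINKTRAIL_OVERRIDES = {
    "ja": r"[a-z]+",
}


def linktrail_re_for(lang_code: str) -> re.Pattern:
    """Return the compiled linktrail regex for an edition code."""
    pattern = LINKTRAIL_OVERRIDES.get(lang_code, DEFAULT_LINKTRAIL)
    return re.compile(pattern)


ru_match = linktrail_re_for("ru").match("абвгд")
ja_match = linktrail_re_for("ja").match("と")
print(ru_match.group(0) if ru_match else None)
print(ja_match)
```

Keeping the table small means a new edition only needs an entry once a concrete problem shows up.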

@daxida
Contributor Author

daxida commented Mar 6, 2026

> As for how this should be handled, I think this is actually something that should be in
> the edition config files. We can have something in src/wikitextprocessor/data/[langcode]/
> that is then used when instantiating Wtp(), so we have a wxr.wtp.linktrail_re available
> everywhere.

Sure, that seems cleaner, even though I don't think it's very noisy either to have them in a function as above.

> By default, we can use the current \w+ as the basic regex, because in languages where
> it is relevant you will not get constructions like [[englishtext]]абвгд, except by accident.
> This issue is mainly a problem for languages that don't use spaces at all, and those we can have
> use [a-z]+, because linktrailing will almost certainly be irrelevant there.

The \w+ logic was robust enough that it had not needed revisiting until now (the two tests above are still wrong, but noticing that requires looking into bold_text_offsets, which I frankly never had before).

Maybe it's tricky because \w also allows uppercase, but personally, if I don't see it in a test, I'd rather not talk about things I may not understand.

As stated above, the Greek regex may be of no use, and the default of [a-z] may just work as well for that case. The Russian one will fail with [a-z] though (but pass with \w).
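
A quick way to compare the candidates side by side. The Russian and Greek character classes here are illustrative assumptions written for this sketch, not the exact patterns from the commit, and `trail` is a hypothetical helper.

```python
import re

ascii_trail = re.compile(r"[a-z]+")
word_trail = re.compile(r"\w+")
# Assumed localized classes: ASCII lowercase plus the lowercase local alphabet.
russian_trail = re.compile(r"[a-zа-яё]+")
greek_trail = re.compile(r"[a-zα-ωάέήίόύώϊϋΐΰ]+")


def trail(pattern: re.Pattern, text: str) -> str:
    """Return the linktrail matched at the start of text, or '' if none."""
    m = pattern.match(text)
    return m.group(0) if m else ""


# Text that would directly follow a closing "]]".
print(repr(trail(ascii_trail, "ами ")))    # Cyrillic is outside [a-z]
print(repr(trail(word_trail, "ами ")))     # \w covers Cyrillic
print(repr(trail(russian_trail, "ами ")))  # localized class covers it too
print(repr(trail(greek_trail, "ς και")))   # final sigma is inside α-ω
print(repr(trail(word_trail, "Abc")))      # \w also accepts uppercase
print(repr(trail(russian_trail, "Abc")))   # lowercase-only class does not
```

The last two lines show the uppercase difference mentioned above: a lowercase-only class stops at a capital letter, while `\w+` does not.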

@xxyzz
Collaborator

xxyzz commented Mar 6, 2026

IMO the regex could also be in the wiktextract package config files like "src/wiktextract/data/en/config.json" and passed to the "Wtp" class.

@kristian-clausal
Collaborator

I'll try to figure out something next week.

Having it in wiktextract data or wikitextprocessor data is valid either way. Putting it in wiktextract data makes it something the user should handle, which makes sense in case there are differences between different editions in the same language.

@xxyzz
Collaborator

xxyzz commented Mar 6, 2026

These two ja edition tests should be changed, and your fix is correct. calculate_bold_offsets() also checks whether a link points to the same page, because MediaWiki renders such self-links as bold text.
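
As a rough illustration of that rule (not the actual calculate_bold_offsets() implementation), the sketch below computes offsets over already-flattened segments and treats a link whose target equals the page title as bold. `Segment`, `bold_offsets` and the sample data are all hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Segment(NamedTuple):
    text: str
    bold: bool = False
    link_target: Optional[str] = None


def bold_offsets(segments: List[Segment], page_title: str) -> List[Tuple[int, int]]:
    """Return (start, end) spans for bold text and for links to the page itself."""
    offsets = []
    pos = 0
    for seg in segments:
        end = pos + len(seg.text)
        # A self-link is rendered bold by MediaWiki, so count it as well.
        if seg.bold or seg.link_target == page_title:
            offsets.append((pos, end))
        pos = end
    return offsets


# Hypothetical segments: one bold span, one self-link, then plain text.
segments = [
    Segment("は、"),
    Segment("泣虫", bold=True),
    Segment("泣虫", link_target="なきむし"),
    Segment("と"),
]
result = bold_offsets(segments, page_title="なきむし")
print(result)
```

If the linktrail is attached to the self-link before this step, the trailing character ends up inside the link's span, which is how an extra one-character-longer offset could appear.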

@kristian-clausal
Collaborator

tatuylonen/wiktextract#1607 replaces this.

Changes on wikitextprocessor: linktrailing_re is now an attribute in Wtp.
wiktextract: Enable config.json field linktrailing_regex_pattern which may replace default \w+ in Wtp.
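
A minimal sketch of how such a config field could be read, assuming the field name above and the `\w+` default. The helper `linktrail_from_config` is hypothetical, and how the compiled pattern actually reaches Wtp is not shown here.

```python
import json
import re

DEFAULT_LINKTRAIL_PATTERN = r"\w+"


def linktrail_from_config(config_text: str) -> re.Pattern:
    """Compile the linktrail regex from a config.json string, with a default."""
    config = json.loads(config_text)
    pattern = config.get("linktrailing_regex_pattern", DEFAULT_LINKTRAIL_PATTERN)
    return re.compile(pattern)


# An edition that overrides the default, and one that does not.
ja_re = linktrail_from_config('{"linktrailing_regex_pattern": "[a-z]+"}')
default_re = linktrail_from_config("{}")

print(ja_re.match("と"))
print(default_re.match("абв").group(0))
```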
