Localize linktrail regex following wikimedia#415
daxida wants to merge 1 commit into tatuylonen:linktrailing
|
I just wrestled with these same tests, and I agree with the changes to them. I just spent the whole morning working on the semicolon issue in wiktextract... As for how this should be handled, I think this is actually something that should be in By default, we can use the current If we come across an actual issue with some edition, we can then implement specific regexes when |
Sure, that seems cleaner, even though I don't think it's very noisy either to have them in a function as above.
The Maybe it's tricky because \w allows for uppercase, but again, personally, if I don't see it in a test, I'd rather not talk about things I may not understand. As stated above, the Greek regex may be of no use, and the default of [a-z] may work just as well for that case. The Russian one will fail with [a-z] though (but pass with \w). |
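To make the [a-z] versus \w difference concrete, here is a minimal sketch using Python's re module. The pattern names, the trail helper, and the sample strings are made up for illustration and are not taken from wikitextprocessor.

```python
import re

# Two candidate linktrail patterns mentioned in this thread (names are hypothetical).
ASCII_TRAIL = re.compile(r"[a-z]+")
WORD_TRAIL = re.compile(r"\w+")  # for str patterns, \w is Unicode-aware in Python 3


def trail(pattern, following):
    """Return the prefix of `following` that the pattern would absorb into a link."""
    m = pattern.match(following)
    return m.group(0) if m else ""


# `following` stands for the text right after a closing "]]".
print(repr(trail(ASCII_TRAIL, "s and more")))  # 's'
print(repr(trail(ASCII_TRAIL, "ами далее")))   # ''   (Cyrillic is outside [a-z])
print(repr(trail(WORD_TRAIL, "ами далее")))    # 'ами'
print(repr(trail(WORD_TRAIL, "Abc def")))      # 'Abc' (\w also accepts uppercase)
```

The last line shows the uppercase concern: \w is broader than a lowercase-only class, so it can absorb text that a stricter per-language pattern would leave alone.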
|
IMO the regex could also be in the wiktextract package config files like "src/wiktextract/data/en/config.json" and passed to the "Wtp" class. |
|
I'll try to figure out something next week. Having it in wiktextract data or wikitextprocessor data is valid either way; putting it in wiktextract data makes it something the user should handle, which makes sense in case there are differences between different editions in the same language. |
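A rough sketch of the "default plus per-edition override" idea, assuming the override lives in a JSON config. The key name linktrail_regex, the load_linktrail helper, and the Russian pattern are all hypothetical; nothing here reflects an existing wiktextract config key or a Wtp parameter.

```python
import json
import re

# Default used when an edition does not override the trail pattern.
DEFAULT_LINKTRAIL = r"[a-z]+"


def load_linktrail(config_text: str) -> re.Pattern:
    """Compile the linktrail pattern from a JSON config string.

    "linktrail_regex" is a made-up key used only for this sketch.
    """
    config = json.loads(config_text)
    return re.compile(config.get("linktrail_regex", DEFAULT_LINKTRAIL))


# An edition with no override falls back to the default.
en = load_linktrail("{}")
# An illustrative (not authoritative) override for a Cyrillic edition.
ru = load_linktrail('{"linktrail_regex": "[a-zа-яё]+"}')

print(en.match("ами") is None)            # True
print(ru.match("ами далее").group(0))     # ами
```

The compiled pattern could then be handed to whatever consumes it, which keeps the edition-specific choice in data rather than in code.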
|
These two ja edition tests should be changed and your fix is correct. |
|
tatuylonen/wiktextract#1607 replaces this. Changes on wikitextprocessor: |
Adds localization for Greek (because it was mentioned in the original issue), and Russian (because a test in wiktextract was failing with the default [a-z]).
With these changes, two Japanese tests fail (in wiktextract; everything else passes here), namely:
tests.test_ja_example.TestJaExample.test_one_line_ruby
test_ja_example.TestJaExample.test_one_line_ref_tag
They fail because bold_text_offsets does not match. I double-checked the ruby one, and I'm under the impression that the expected offsets were wrong, so this "fixes" the test. Reasoning:
but we obtain bold_text_offsets=[(19, 21), (21, 23)].
When debug printing:
we have:
The last one makes no sense to me:
https://ja.wiktionary.org/wiki/%E3%81%AA%E3%81%8D%E3%82%80%E3%81%97
と is not bolded in the wiki, and it seems to be a linktrail issue from:
So I claim that bold_text_offsets=[(19, 21), (21, 23)] should have been the expected result. The other test is the same.
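For illustration only, this is how a broad trail pattern could stretch a span by one character in Japanese text. The linked_span helper and the sample strings are hypothetical and are not the actual test data; the sketch only assumes that a trail is whatever the pattern matches right after the link.

```python
import re


def linked_span(label, following, trail_pattern, start=0):
    """Return (start, end) offsets of a link's text, including any absorbed trail."""
    m = re.match(trail_pattern, following)
    trail = m.group(0) if m else ""
    return (start, start + len(label) + len(trail))


# Made-up fragment: a two-character label followed by the particle と and a comma.
print(linked_span("単語", "と、", r"\w+"))     # (0, 3): と counts as a word character
print(linked_span("単語", "と、", r"[a-z]+"))  # (0, 2): と is left outside the link
```

Under that assumption, a pattern as broad as \w would pull と into the span, while [a-z] leaves the span at the label itself.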
Have I missed other cases?