remove Kconv.toutf8 conversion #16

Open
pessi-v wants to merge 1 commit into hirakiuc:master from pessi-v:master

pessi-v commented Jul 4, 2024

In lib/ogpr/fetcher/html_fetcher.rb:20, the fetched meta tag content is forced to UTF-8 using the stdlib Kconv. This conversion seems unnecessary, and it also introduces many wrongly converted characters. In my use case, many accented Latin letters are converted to Chinese characters. This also seems to happen with some punctuation.
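
A rough illustration of how this kind of corruption can arise. This is an assumption about the mechanism, not something confirmed in this thread: Kconv guesses the input encoding, and it is aimed at Japanese encodings. If UTF-8 bytes are taken to be EUC-JP, the two bytes of an accented letter are read as a single CJK character. The sketch below forces that misreading by hand with plain String methods; it does not call Kconv itself.

```ruby
# Illustration only: the same two bytes read under two different encodings.
# "\u00E9" is "é"; its UTF-8 encoding is the two bytes C3 A9.
utf8_bytes = "\u00E9".b # binary (ASCII-8BIT) copy of those bytes

# Correct reading: label the bytes as UTF-8.
as_utf8 = utf8_bytes.dup.force_encoding(Encoding::UTF_8)

# Wrong reading: label the bytes as EUC-JP, then transcode to UTF-8.
as_euc_jp = utf8_bytes.dup.force_encoding(Encoding::EUC_JP).encode(Encoding::UTF_8)

puts as_utf8   # the original accented letter
puts as_euc_jp # one CJK character instead of the accented letter
```

Whether Kconv actually picks EUC-JP for a given page depends on the full byte content, so this only shows why a wrong guess produces CJK characters rather than an error.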

pessi-v (Author) commented Jul 4, 2024

@hirakiuc

hirakiuc (Owner) commented Jul 5, 2024

Thanks for your report, and this PR. 😃

But first, I don't recommend using this gem in production. 🙏🏼
This library was implemented several years ago and has not been well maintained for a long time.

acceptable_content!(head.headers[:content_type])

res = send_request(:get, @uri, headers)
Kconv.toutf8(res.to_str)

Summary: in my opinion, this behavior (converting string encodings in this gem) should be configurable for various use cases, rather than simply removing this line.

Please read the following for details. 🙏🏼


First, let's check how the OGP spec defines a String value.
https://ogp.me/#string 👀

As the official docs say, a String value is "a sequence of Unicode characters" (Unicode, but not specifically UTF-8).
So I think this gem should follow the String value spec as far as possible.

Based on this, and just for my personal use,
I decided to convert the web content (meta tags) to UTF-8.
(I think this bad decision of mine is the root cause of the encoding issues in this gem. 😢)

However, as you know, web content (here, meta tag values in HTML files) can be in various encodings.
After merging your PR, users of this library would have to handle OGP string encodings themselves, without any additional information (such as which encoding each website used).

For this reason, I don't think that simply removing the encoding conversion, as this PR does, is the best approach. 🤔

So, as I wrote at the top of this comment,
my opinion is that this behavior (converting string encodings in this gem) should be configurable for various use cases.
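
A minimal sketch of what a configurable conversion could look like, assuming the response's Content-Type header is available (the diff context above already reads `head.headers[:content_type]`). The names `charset_from`, `normalize_body`, `convert:` and `fallback:` are hypothetical and not part of this gem's API. The idea is to trust the declared charset instead of guessing, and to let callers switch the conversion off.

```ruby
# Hypothetical helpers; not part of ogpr's API.

# Extract the charset parameter from a Content-Type header value, if any.
def charset_from(content_type)
  return nil if content_type.nil?

  match = content_type.match(/charset\s*=\s*"?([^";\s]+)"?/i)
  match && match[1]
end

# Convert a response body to UTF-8 using the declared charset.
# With convert: false the body is returned untouched.
def normalize_body(body, content_type, convert: true, fallback: "UTF-8")
  return body unless convert

  name = charset_from(content_type) || fallback
  source = begin
    Encoding.find(name)
  rescue ArgumentError
    # Unknown charset name in the header: use the fallback encoding.
    Encoding.find(fallback)
  end

  # Label the bytes with the source encoding, transcode to UTF-8,
  # then replace any byte sequences that are still invalid.
  body.dup.force_encoding(source)
      .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
      .scrub
end

latin1 = "caf\xE9".b # "café" encoded as ISO-8859-1
puts normalize_body(latin1, "text/html; charset=ISO-8859-1")
```

This ignores a charset declared in an HTML `<meta>` tag, which a fuller version would probably also need to consider when the header carries no charset.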

For now, I have simply opened a GitHub issue for this encoding problem: #17.
