Unicode URL slug gotcha #247

@toothbrush

I ran into an interesting bug today (not actually a bug in Frog, but a gotcha I thought was worth mentioning here). I have a post whose title contains an accented character, for example malgré. There's a bit of code

https://github.com/greghendershott/frog/blob/master/frog/paths.rkt#L308-L311

that normalises slugs. It's quite permissive in what it allows (anything that passes char-alphabetic? is what I care about), which didn't seem to be a problem. Finally, it uses string-normalize-nfd to normalise the string, which decomposes é into a plain e followed by the combining acute accent (U+0301).

The issue arose when I used an online mailing list service to send out an email with a link to my post. My browser pretty-prints the URL as http://me.com/2019/02/malgré.html, which is not incorrect, but when I pasted that into the mailing list service, my subscribers got a 404. What had happened is that Frog turns the link into

.../malgre%CC%81.html

whereas percent-encoding the UTF-8 bytes of the precomposed é (U+00E9) gives this, which is what Mailchimp generated from my .../malgré.html input in the body of my newsletter:

.../malgr%C3%A9.html
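The two encodings above can be reproduced outside Frog. The following is a minimal Python sketch (not Frog's Racket code) that normalises the same title both ways and percent-encodes the result; the variable names are made up for illustration.

```python
import unicodedata
from urllib.parse import quote

# "malgré" written with the precomposed é (U+00E9).
title = "malgr\u00e9"

# NFD splits é into "e" + U+0301 (combining acute accent);
# NFC keeps (or recomposes to) the single code point U+00E9.
nfd = unicodedata.normalize("NFD", title)
nfc = unicodedata.normalize("NFC", title)

# quote() percent-encodes the UTF-8 bytes of each string.
nfd_url = quote(nfd)
nfc_url = quote(nfc)

print(nfd_url)  # malgre%CC%81
print(nfc_url)  # malgr%C3%A9
```

Both strings render identically as malgré, but they are different byte sequences, so a server that compares file names byte for byte treats them as two different files.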

Of course, my web host says those two filenames aren't the same. The answer is probably that I should use a sane browser (Chrome seems to copy correctly; I think my troubles arose from using Safari), but I only felt safe after patching the relevant snippet to read something like the following:

   (for/list ([c (in-string (string-normalize-nfc s))])
     (cond [(regexp-match? #rx"^[a-zA-Z0-9]$" (~a c)) c]
           [else #\-]))

This is probably frightfully hacky, and results in less pretty URLs like .../malgr.html, but for now I figured I could live with that. Feel free to close if this is dumb or irrelevant, but at least it's here for posterity. Thanks!
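One possible way to keep ASCII-only slugs without losing the base letter is to decompose first and then drop the combining marks, so that malgré becomes malgre rather than malgr. The Python sketch below illustrates the idea only; `ascii_slug` is a hypothetical helper, not part of Frog, and it does not collapse or trim runs of dashes.

```python
import unicodedata


def ascii_slug(title: str) -> str:
    """Hypothetical helper: fold accented letters to ASCII, dash the rest."""
    # NFD separates base letters from their combining marks.
    decomposed = unicodedata.normalize("NFD", title)
    out = []
    for c in decomposed:
        if unicodedata.combining(c):
            continue  # drop accents such as U+0301
        if c.isascii() and c.isalnum():
            out.append(c)  # keep plain ASCII letters and digits
        else:
            out.append("-")  # everything else becomes a dash
    return "".join(out)


print(ascii_slug("malgr\u00e9"))   # malgre
print(ascii_slug("hello world"))   # hello-world
```

Letters that have no decomposition (ø or ß, for instance) still turn into dashes under this approach, so it only helps for characters that are a base letter plus an accent.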
