Working with Chinese-character URLs in the UK

         

Arturo99

8:15 am on Jul 23, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



Hi, I need help or suggestions on how to use Chinese URLs in the UK.
If I visit a Chinese site from the UK whose URL contains Chinese characters and copy and paste the URL, I get a series
of Roman characters, not the original Chinese characters.
This makes it very difficult.
What can I use to copy Chinese characters and paste them in Chinese?
I need to be able to do this to optimise the page.

thanks
Art

Dimitri

2:01 pm on Jul 23, 2022 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



Where are you "pasting" ?

If you copy a URL from the browser address bar and paste it into Notepad, for example, this should work.

If you are pasting the URL into the HTML code of a page, that is a different story: the page needs to be declared as UTF-8, which covers multi-byte characters, if I'm not mistaken. (Your HTML/text editor needs to be configured to support UTF-8 too.)

If you are storing it in a database, it is the same: the table needs to use a multi-byte UTF-8 character set.

Arturo99

5:25 pm on Jul 23, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



If I paste my URL, example.com/chinese characters, into Notepad or Word, I get Roman characters like
example.com/B1%E5%9D%BD%E4%BC%A
...not the Chinese characters, which I need for future reference and for optimizing.

Do you mean I should declare my Notepad or Word doc as UTF-8?

not2easy

5:42 pm on Jul 23, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Declaring UTF-8 as part of the document won't change the software's ability to use UTF-8. You may find settings within Word and Notepad to set it to use UTF-8. The html page should be UTF-8 to be able to use UTF-8 in browsers. Notepad and Word both use default Windows (1252) Character sets. That is why you see entity characters. You should look for the settings to use UTF-8 for these tasks. It is simple to do with Notepad++ (free). Windows Notepad can also be changed but I haven't used it for so long I do not recall how/where I changed it.

Dimitri

6:11 pm on Jul 23, 2022 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



I mentioned Notepad because I had just tried it before answering. If I copy a URL with Chinese characters from the web browser's address bar into Notepad, it works: it keeps the Chinese characters. (Default Notepad, with Windows 10.)

So the question now is: where are you "copying" from?

A URL from an <a> tag should be "URL encoded" for backward compatibility, for example.

[en.wikipedia.org...]
[en.wikipedia.org...]

lucy24

6:47 pm on Jul 23, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What can I use to copy Chinese characters and paste them in Chinese?
In this day and age you should be able to use anything--yes, even on Windows--but you have to tell the text editor (or whatever you're using) what encoding to use. Any self-respecting text editor should also have a way to reinterpret text, so if you find yourself staring at
Å“
it can be changed to
œ

(Thanks to forums limitations, I had to pick something that exists in Windows-Latin-1 but not vanilla ISO-Latin-1.)
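
For what it's worth, that kind of reinterpretation can be sketched in a few lines of JavaScript. This is only an illustration, assuming the text was UTF-8 that got misread as Windows-1252; the names repairMojibake and charToCp1252Byte are made up for this example and are not part of any editor or library.

```javascript
// Code points that Windows-1252 assigns to bytes 0x80..0x9F
// (unassigned bytes map to the matching C1 control, as browsers do).
const CP1252_HIGH = [
  0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
  0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008d, 0x017d, 0x008f,
  0x0090, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
  0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x009d, 0x017e, 0x0178,
];

// Hypothetical helper: map one character back to the byte it came from.
function charToCp1252Byte(ch) {
  const cp = ch.codePointAt(0);
  const idx = CP1252_HIGH.indexOf(cp);
  if (idx !== -1) return 0x80 + idx;
  if (cp <= 0x7f || (cp >= 0xa0 && cp <= 0xff)) return cp;
  throw new RangeError("Not a Windows-1252 character: " + ch);
}

// Hypothetical helper: recover the original bytes, then read them as UTF-8.
function repairMojibake(text) {
  const bytes = Uint8Array.from(Array.from(text, (ch) => charToCp1252Byte(ch)));
  // fatal: true makes the decoder throw instead of silently inserting U+FFFD
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

// "\u00c5\u201c" is the two-character garble shown above
console.log(repairMojibake("\u00c5\u201c"));
```

If the input was never mis-decoded UTF-8 in the first place, the function throws rather than guessing, which seems the safer behaviour for a one-off repair script.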

TorontoBoy

8:45 pm on Jul 23, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



I use Chinese daily and don't have any issues with Chinese URLs. Download the Chinese character set (Simplified Chinese) and the characters will display properly. Chinese URLs all follow worldwide conventions (English letters), so there should be no issues. There are some Chinese sites that ban foreigners from access, but that is a geolocation issue, not a site or URL rendering issue.

TorontoBoy

8:49 pm on Jul 23, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



&#36825;&#26159; &#28857;&#20013;&#25991; &#65292;&#30475;&#30475;&#21543; Here is some Chinese, take a look.

Haha, the webmasterworld web server does not render Chinese!

phranque

9:45 pm on Jul 23, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



the webmasterworld web server does not render Chinese!

the forum software translates your ampersands to &amp; for "reasons".
(try "view source")

TorontoBoy

9:58 pm on Jul 23, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



Here's my Chinese from view source: <br> &amp;#36825;&amp;#26159; &amp;#28857;&amp;#20013;&amp;#25991; &amp;#65292;&amp;#30475;&amp;#30475;&amp;#21543;

and I have the full Chinese and Japanese character set.
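
The strings in this exchange are decimal numeric character references, where each number is a Unicode code point. As a rough sketch (decodeNumericEntities is a name invented here, and it only handles the decimal form plus the doubled-up &amp; seen in view source), they can be turned back into characters like this:

```javascript
// Hypothetical helper: decode decimal references such as &#36825;
// and also undo one level of &amp; escaping first.
function decodeNumericEntities(text) {
  return text
    .replace(/&amp;/g, "&")
    .replace(/&#(\d+);/g, (match, digits) => {
      const codePoint = Number(digits);
      // Leave anything outside the Unicode range untouched
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    });
}

console.log(decodeNumericEntities("&#36825;&#26159; &#28857;&#20013;&#25991;"));
console.log(decodeNumericEntities("&amp;#36825;&amp;#26159;"));
```

Hexadecimal references and named entities are deliberately left out to keep the sketch short.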

lucy24

4:39 am on Jul 24, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Download the Chinese character set (simplified Chinese) and the characters will display properly
The issue here is not with the ability to display characters. If your device is correctly reading UTF-8 but you happen not to have any Chinese font installed, you will instead see some type of generic I-can’t-display-that for each individual character. That's different from the string of  and à and so on that you get if the Chinese has been interpreted as Latin-1 (or analogous gibberish if your device defaults to a different one-byte encoding).

the webmasterworld web server does not render Chinese!
Or anything else outside the Windows-Latin-1 range. But if you have a text editor that does HTML preview, just paste it in and Preview.

:: further side trip to G### Translate ::

“This is some Chinese, let's see”. Gosh. A remarkably idiomatic translation, for G### ;)

Arturo99

12:19 pm on Aug 2, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



Thanks to all for the advice and ideas to try out.

Arturo99

4:58 pm on Aug 23, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



Update on implementing the above suggestions:

I am trying to do internal linking from URL to URL on my Chinese site in the UK.
I have downloaded a Simplified Chinese font, Source Han Serif.
The live pages are UTF-8.
But when I try to grab the Chinese URL
(which has an English domain name followed by Chinese characters),
the URL changes to the Roman characters and I cannot find pages to link to.

So I can see the URL in Chrome with English and Chinese characters combined,
but I cannot copy and paste it for internal linking.

Any suggestions on this?
Is it actually possible?
Arturo

not2easy

5:26 pm on Aug 23, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



It sounds like the browser is converting the characters. Downloading the Chinese font is step one; you may also need to install that font in the editor that you are using and check that Chinese is enabled in the browser.

Can you create a list of the URLs that use combined characters and translate that list (English -->Chinese)?

Do you have a Chinese language sitemap for those pages?

lucy24

6:38 pm on Aug 23, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



the URL changes to the Roman characters
Please paste in a few examples of what you mean. I seriously doubt your browser, or even your server, is brilliant enough to transliterate from Chinese to Roman*, so it still has to be a question of reinterpreting from one encoding to another.


* Understatement for humorous effect, since it is not possible to “transliterate” to or from a non-phonetic script.

Arturo99

8:27 pm on Aug 23, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



If you browse the Chinese site, the URL you see in the browser is
1. example.com/chinese-characters

But if you cut and paste this into Word or Excel, it becomes
2. example.com/%e7%be%8e%e7%89%88%e8%b6%85%e5%a3%b0%e5%88%80

If you enter this generated string 2 back into Chrome, you get the original URL with Chinese characters, string 1.
So you cannot grab the Chinese URL, string 1, and you cannot do internal linking with it.

Site is UTF-8 in header.

not2easy

9:59 pm on Aug 23, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Word and Excel are not using plain text. If you paste your copied URL into a plain text (UTF-8) text editor, you will have the Chinese character URL.

lucy24

11:08 pm on Aug 23, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Oh, that kind of encoding. Yes, everything that travels across the internet gets percent-encoded somewhere along the line, but it normally sorts itself out unaided.

:: detour to script-changing program on my HD, followed by further detour to G### Translate ::

US version of ultrasonic scalpel
OK then.

It should work, though. I tried pasting the percent-encoded version into a browser, replacing “example.com” with one of my site names (so I can see in logs what got requested). The visible URL ended up in Chinese characters, while logs show the expected percent version.

Did you at some time establish that it doesn't work on your site, or are you just assuming it won’t work because it looks alarming in a text editor?
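
To illustrate why both forms should reach the same page, here is a small sketch using the standard URL class available in browsers and Node. The https:// scheme and the samePage helper are assumptions added for the example; the path is string 2 from the post above.

```javascript
// The Chinese form, as shown in the address bar
const typed = new URL("https://example.com/美版超声刀");
// The percent-encoded form, as it appears after pasting
const pasted = new URL(
  "https://example.com/%e7%be%8e%e7%89%88%e8%b6%85%e5%a3%b0%e5%88%80"
);

// Hypothetical helper: compare two URLs by origin and decoded path.
function samePage(a, b) {
  return (
    a.origin === b.origin &&
    decodeURIComponent(a.pathname) === decodeURIComponent(b.pathname)
  );
}

// The parser percent-encodes the Chinese path, so this is what gets requested
console.log(typed.pathname);
console.log(samePage(typed, pasted));
```

Comparing decoded paths sidesteps the upper-case versus lower-case hex difference between the two forms.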

Arturo99

9:07 am on Aug 24, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



not2easy,
what are you calling a plain text editor?
Notepad does not do it,
EditPlus does not do it.
In all editors, string 2 remains as string 2.
No Chinese characters ever appear, even though I have Chinese fonts installed.

lucy24,
I cannot do anything with string 2. No matter what editor I use, it remains in string 2 format.
I am trying to set up internal redirects with a plugin, but the plugin simply does not recognise string 2 and says it cannot be found on the site.

lucy24

5:01 pm on Aug 24, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



No matter what editor I use, it remains in string 2 format.
Then you'll need to route it via a bit of ajax or php. Or simply javascript, if the site already requires scripting to function; otherwise you may lose some users. In javascript--which is what I use for my local decoding function--the command is
decodeURIComponent(argument)
Someone hereabouts will know the equivalent command in php or whatever language you end up using.
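
A minimal sketch of that decoding step, using string 2 from earlier in the thread. The wrapper name decodePath is invented here; the try/catch is an added precaution because decodeURIComponent throws a URIError on a truncated sequence, such as the cut-off example posted near the top of the thread. (In PHP, rawurldecode() is the usual counterpart.)

```javascript
// Hypothetical wrapper: decode if possible, otherwise return the input as-is.
function decodePath(encoded) {
  try {
    return decodeURIComponent(encoded);
  } catch (err) {
    if (err instanceof URIError) return encoded; // malformed or truncated escape
    throw err;
  }
}

const string2 = "%e7%be%8e%e7%89%88%e8%b6%85%e5%a3%b0%e5%88%80";
const chinese = decodePath(string2);
// Re-encoding gives the same bytes back, with upper-case hex digits
const roundTrip = encodeURIComponent(chinese);

console.log(chinese);
console.log(roundTrip.toLowerCase() === string2);
```

Pasting the decoded result into a UTF-8 text editor, or into whatever field the redirect plugin expects, may be enough to get the Chinese form of the path.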

:: poring over code from 2012 ::

/(%[cd][\da-f]%[89ab][\da-f]|%e[\da-f]%[89ab][\da-f]%[89ab][\da-f]|%f[\da-f]%[89ab][\da-f]%[89ab][\da-f]%[89ab][\da-f])/i
Whew. I think that's meant to cover all possible permutations of the percent-encoding of multi-byte characters. It's a heck of a lot easier to read in RegEx engines that have the shorthand \h = [a-f0-9] but hardly any of them do.
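
One way a pattern like this might be used, sketched here as an assumption rather than a reconstruction of the 2012 code: decode only the multi-byte sequences and leave single-byte escapes such as %20 or %2F alone. The g flag and the decodeMultibyteOnly name are additions for the example.

```javascript
// The pattern above, with the g flag added so every sequence is replaced
const MULTIBYTE =
  /(%[cd][\da-f]%[89ab][\da-f]|%e[\da-f]%[89ab][\da-f]%[89ab][\da-f]|%f[\da-f]%[89ab][\da-f]%[89ab][\da-f]%[89ab][\da-f])/gi;

// Hypothetical helper: turn multi-byte escapes into characters, keep the rest.
function decodeMultibyteOnly(text) {
  return text.replace(MULTIBYTE, (sequence) => {
    try {
      return decodeURIComponent(sequence);
    } catch (err) {
      // The pattern is looser than UTF-8 itself (it accepts %c0%80, for
      // instance), so anything the decoder rejects is left untouched.
      return sequence;
    }
  });
}

console.log(
  decodeMultibyteOnly("example.com/%e7%be%8e%e7%89%88%e8%b6%85%e5%a3%b0%e5%88%80%20%2F")
);
```

Leaving %2F and similar escapes encoded keeps the path structure intact, which a plain decodeURIComponent over the whole string would not.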