How to recognize Chinese characters in a string？

lehaofeng — Mon, 03 Jul 2023 08:11:36 GMT

Hi, I have a problem with a regex.

For example:

regex("零件2305wu","[\u4e00-\u9fa5]+")

I want to extract "零件", but the output is "2305".

Thanks!

Re: How to recognize Chinese characters in a string？

lehaofeng — Mon, 03 Jul 2023 08:59:44 GMT

I seem to have found a solution:

regex("零件2305wu","[^\x00-\xff]+")

Re: How to recognize Chinese characters in a string？

Craige_Hales — Mon, 03 Jul 2023 14:01:57 GMT

Nice! I was going to suggest something similar:

regex("零件2305wu","[^\x01-\x7f]+")

The characters from x00 to x7F are ASCII; x00 might make a note in the log, so maybe start at x01. Characters from x80 to xFF are similar to ASCII and you might want to leave them out as well. Outside that range is non-ASCII Unicode.

The [ square brackets ] make a character set, and the leading ^ means not in this set. The minus means a range. The + means one or more.
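
To make the pieces above concrete, here is a small JSL sketch using the string from this thread. The note about why the original \u4e00-\u9fa5 pattern returned "2305" is an assumption, not something confirmed in the thread: if the regex engine does not understand \u escapes and reads \u as a plain u, the set contains the range 0-u, which covers the digits and the letters up to u but stops before w.

```jsl
s = "零件2305wu";

// Original attempt from the question; the thread reports that it returns "2305".
original = Regex( s, "[\u4e00-\u9fa5]+" );

// Assumption: if \u is read as a literal u, the set above is equivalent to this one,
// whose range 0-u includes the digits 2, 3, 0, 5 but not the letter w.
literalU = Regex( s, "[u4e00-u9fa5]+" );

// Negated set: one or more characters outside \x01-\x7f (ASCII without \x00).
nonAscii = Regex( s, "[^\x01-\x7f]+" );

Show( original, literalU, nonAscii );
```

If the assumption holds, the first two results should be identical, while the third is the "零件" the question asked for.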

Re: How to recognize Chinese characters in a string？

lala — Fri, 03 May 2024 10:12:25 GMT

tx=regex("零件2305wu","([一-﨩]{0,})");

This seems to work, but is it reasonable?

Thanks Experts!

Re: How to recognize Chinese characters in a string？

Craige_Hales — Fri, 03 May 2024 13:31:35 GMT

一 (4E00, https://www.google.com/search?q=unicode+%E4%B8%80) is smaller than 﨩 (FA29, https://www.google.com/search?q=unicode+%EF%A8%A9), so it should be a valid range. It spans about 11/16 of the 16-bit Unicode code points.

{0,} means zero or more, just like *.

It appears to work, keeping the first two characters and rejecting the last six.
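
One practical difference between {0,} (or *) and + may show up when the string does not start with a Chinese character. The sketch below assumes standard regex semantics, where a pattern that allows zero repetitions can match an empty string at the very first position. The sample string "2305零件" is made up for illustration, and the results have not been verified in JMP.

```jsl
s = "2305零件";  // hypothetical input: digits first, Chinese characters last

// {0,} allows zero characters, so the match can succeed with nothing
// at the start of the string, before the search ever reaches 零件.
zeroOrMore = Regex( s, "([一-﨩]{0,})" );

// + requires at least one character from the set, so the search
// has to move forward until it finds one.
oneOrMore = Regex( s, "([一-﨩]+)" );

Show( zeroOrMore, oneOrMore );
```

If the two results differ, + is probably the safer choice when the position of the Chinese text is not known in advance.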