The standard alphanumeric regular expression [a-zA-Z0-9]+ only matches unaccented ASCII letters and digits, so it is not suitable for validating multilingual Unicode strings.
In a multilingual software environment, you’ve set up your Unicode database table columns and set your interface to accept the UTF-8 character set, but you still want to validate your users’ input. That is to say, you still want ‘alphanumeric’ validation, but you need to accept accented Latin characters and even Chinese (including the Traditional characters used in Taiwan) and similar scripts, while rejecting special characters like !”£$%^&. You can of course modify the combinations below to suit your needs.
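To make the problem concrete, here is a minimal sketch of how the ASCII-only pattern behaves on non-English input. The class and method names (AsciiOnlyDemo, isAsciiAlnum) are made up for illustration, and the test strings are written with Unicode escapes so the result does not depend on the source file encoding.

```java
final class AsciiOnlyDemo {
    // The ASCII-only pattern discussed above.
    static final String ASCII_ALNUM = "[a-zA-Z0-9]+";

    // String.matches() only succeeds if the whole input matches the pattern.
    static boolean isAsciiAlnum(String s) {
        return s.matches(ASCII_ALNUM);
    }

    public static void main(String[] args) {
        System.out.println(isAsciiAlnum("Cafe123"));      // true: plain ASCII
        System.out.println(isAsciiAlnum("Caf\u00e9"));    // false: U+00E9 (e with acute) is rejected
        System.out.println(isAsciiAlnum("\u4e2d\u6587")); // false: Chinese characters are rejected
    }
}
```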
Here is a regular expression snippet, with comments, that will help you validate the input.
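/**
* Valid characters for a localised string.
*
* Numbers
* [\u0030-\u0039]+
* English alpha characters
* [\u0041-\u005a\u0061-\u007a]+
* Hyphen and underscore
* [\u002d\u005f]+
* Double quote, apostrophe, acute accent
* [\\u0022\u0027\u00b4]+
* Whitespace
* [\\s]+
* Latin-1 Supplement - Letters only, with the symbols below U+00C0 left out
* Suitable for German, Italian, Portuguese, Spanish.
* Note: the range still includes the signs U+00D7 and U+00F7.
* [\u00c0-\u00ff]+
* Latin Extended-A - Extra characters missing from Latin-1
* Suitable for Dutch as per http://en.wikipedia.org/wiki/ISO/IEC_8859-1
* Covers U+0131 (dotless i) and U+0132 (capital IJ); lowercase ij (U+0133) is not included.
* [\u0131-\u0132]+
* Latin Extended-A - Extra characters missing from Latin-1
* Suitable for French as per http://en.wikipedia.org/wiki/ISO/IEC_8859-1
* Covers U+0152 (capital OE) and U+0178 (capital Y with diaeresis); lowercase oe (U+0153) is not included.
* [\u0152\u0178]+
* Chinese characters
* The range runs from the CJK Unified Ideographs (U+4E00) through the
* Hangul syllables (U+D7A3), so it also accepts Korean and the blocks in between.
* [\u4e00-\ud7a3]+
*/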
private static final String VALID_CHARS = "[\u0030-\u0039\u0041-\u005a\u0061-\u007a\u002d\u005f\\s\\u0022\u0027\u00b4\u00c0-\u00ff\u0131-\u0132\u0152\u0178\u4e00-\ud7a3]+";
In Java you might use it something like this:
private boolean isValidChars(String inputValue) {
// null is treated as valid (nothing to check); matches() must match the whole string
return inputValue == null || inputValue.matches(VALID_CHARS);
}
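As a rough usage sketch, the pattern and method above can be dropped into a small class and tried against a few inputs. The class name LocalisedInputDemo and the sample strings are assumptions made for this example. The method is made static here only so that main can call it directly.

```java
final class LocalisedInputDemo {
    // Same character class as VALID_CHARS in the article.
    static final String VALID_CHARS = "[\u0030-\u0039\u0041-\u005a\u0061-\u007a\u002d\u005f\\s\\u0022\u0027\u00b4\u00c0-\u00ff\u0131-\u0132\u0152\u0178\u4e00-\ud7a3]+";

    // null is treated as valid; an empty string is not, because "+" needs at least one character.
    static boolean isValidChars(String inputValue) {
        return inputValue == null || inputValue.matches(VALID_CHARS);
    }

    public static void main(String[] args) {
        System.out.println(isValidChars("Jos\u00e9-Mar\u00eda"));   // true: accented Latin letters and a hyphen
        System.out.println(isValidChars("\u4e2d\u6587 abc_1"));     // true: Chinese, space, ASCII, underscore
        System.out.println(isValidChars("hello!"));                 // false: "!" is not in the class
        System.out.println(isValidChars("\u00a35"));                // false: the pound sign U+00A3 is not in the class
        System.out.println(isValidChars(""));                       // false: empty input
        System.out.println(isValidChars(null));                     // true: null is let through
    }
}
```

Whether null and empty input should pass is an application decision. The sketch simply keeps the behaviour of the method shown above.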
In JavaScript the same pattern can be used like this:
function isValidChars(inputValue) {
// Anchored with ^ and $ so that every character must be in the class, not just one of them
var VALID_CHARS = /^[\u0030-\u0039\u0041-\u005a\u0061-\u007a\u002d\u005f\u0022\u0027\u00b4\s\u00c0-\u00ff\u0131-\u0132\u0152\u0178\u4e00-\ud7a3]+$/;
return VALID_CHARS.test(inputValue);
}
June 22nd, 2009 at 1:26 pm
Adrian
Thank you for this wonderful post. It really helped. I really appreciate this.
Peace
Chari
September 8th, 2009 at 10:52 pm
Hey, you forgot Greek:
[\u0391-\u03a9\u03b1-\u03c9]+
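A quick sketch of what the suggested Greek ranges accept on their own. The class name GreekRangeDemo and the sample words are assumptions for illustration. Note that the ranges cover the basic capital and small letters only, so letters with a tonos (accent), such as U+03AE, fall outside them.

```java
final class GreekRangeDemo {
    // The Greek ranges from the comment: basic capital and small letters.
    static final String GREEK = "[\u0391-\u03a9\u03b1-\u03c9]+";

    static boolean isBasicGreek(String s) {
        return s.matches(GREEK);
    }

    public static void main(String[] args) {
        // "Athina" spelled without the accent: every letter is inside the ranges.
        System.out.println(isBasicGreek("\u0391\u03b8\u03b7\u03bd\u03b1")); // true
        // Same word with U+03AE (eta with tonos), which is outside both ranges.
        System.out.println(isBasicGreek("\u0391\u03b8\u03ae\u03bd\u03b1")); // false
    }
}
```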
September 9th, 2009 at 9:48 pm
It’s a lot simpler than that!
[^\\p{L}\\d]
Look in the Pattern javadoc under “Unicode support”
September 9th, 2009 at 9:49 pm
It’s a lot simpler than that!
[\\p{L}\\d]
Look in the Pattern javadoc under “Unicode support”
(removed the “^” in the previous comment)
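For reference, a minimal sketch of the property-based approach in Java. The class and method names are made up for this example, and the trailing "+" and the \p{Nd} variant are additions not present in the comment. By default \d in java.util.regex matches only the ASCII digits 0-9, whereas \p{Nd} matches decimal digits from any script.

```java
final class UnicodePropertyDemo {
    // Any Unicode letter plus ASCII digits, one or more characters.
    static final String LETTERS_AND_ASCII_DIGITS = "[\\p{L}\\d]+";
    // Variant (an assumption, not from the comment): also accept non-ASCII decimal digits.
    static final String LETTERS_AND_ANY_DIGITS = "[\\p{L}\\p{Nd}]+";

    static boolean isLettersAndAsciiDigits(String s) {
        return s.matches(LETTERS_AND_ASCII_DIGITS);
    }

    static boolean isLettersAndAnyDigits(String s) {
        return s.matches(LETTERS_AND_ANY_DIGITS);
    }

    public static void main(String[] args) {
        System.out.println(isLettersAndAsciiDigits("Jos\u00e9"));                       // true
        System.out.println(isLettersAndAsciiDigits("\u4e2d\u6587123"));                 // true
        System.out.println(isLettersAndAsciiDigits("\u0391\u03b8\u03ae\u03bd\u03b1")); // true: accented Greek is a letter too
        System.out.println(isLettersAndAsciiDigits("hello!"));                          // false
        System.out.println(isLettersAndAsciiDigits("a b"));                             // false: whitespace is not in the class
        System.out.println(isLettersAndAsciiDigits("\u0663"));                          // false: Arabic-Indic digit three
        System.out.println(isLettersAndAnyDigits("\u0663"));                            // true
    }
}
```

Unlike the hand-written ranges in the article, this class does not accept whitespace, hyphens, underscores or quotes, so those would need to be added back if they should stay valid.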
January 1st, 2010 at 5:35 pm
Great post, it helped me so much.