Tokenize and detokenize Unicode blocks
Unicode in CADP for Java
Apart from ASCII characters, CADP for Java tokenizes and detokenizes Unicode characters in the range 0000-FFFF. Each Unicode character is assigned a number, called a code point. CADP for Java tokenizes and detokenizes input values from one Unicode block at a time. Characters in the input that fall outside the selected block are retained.
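To make the idea of code points and blocks concrete, the following minimal sketch uses only the standard JDK class java.lang.Character.UnicodeBlock to check which characters of an input belong to a selected block. It does not call CADP for Java. The class name UnicodeBlockSketch and its methods are hypothetical, and the mask shows block membership only, not the product's actual tokenization rules (for example, it does not model the handling of numeric or alphanumeric values).

```java
import java.lang.Character.UnicodeBlock;

// Hypothetical illustration class; not part of CADP for Java.
class UnicodeBlockSketch {

    /** Returns true if the code point belongs to the selected Unicode block. */
    static boolean inBlock(int codePoint, UnicodeBlock selected) {
        // UnicodeBlock.of returns null for code points outside any defined block.
        return selected.equals(UnicodeBlock.of(codePoint));
    }

    /**
     * Builds a mask with one character per input code point:
     * 'T' if the code point is in the selected block, '-' if it is outside it.
     */
    static String blockMask(String input, UnicodeBlock selected) {
        StringBuilder mask = new StringBuilder();
        input.codePoints().forEach(cp ->
                mask.append(inBlock(cp, selected) ? 'T' : '-'));
        return mask.toString();
    }

    public static void main(String[] args) {
        // Three Greek letters (U+03B1..U+03B3) followed by one Cyrillic letter (U+0416).
        String input = "\u03B1\u03B2\u03B3\u0416";
        System.out.println(blockMask(input, UnicodeBlock.GREEK)); // prints "TTT-"
    }
}
```

With the Greek block selected, only the first three characters fall inside the block; the Cyrillic character is outside it, which corresponds to the case where a character is retained as is.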
While tokenizing and detokenizing a Unicode block, CADP for Java gives priority to the setUnicode parameter of AlgoSpec. If setUnicode is not specified, CADP for Java checks the unicode.properties file. Tokenization and detokenization of Unicode blocks are supported with all token formats.
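The precedence described above can be sketched as a simple fallback. This is an illustration of the lookup order only, not CADP for Java code: the class UnicodeSettingPrecedence, the method resolve, and the placeholder values are all hypothetical, and how the product actually reads setUnicode or unicode.properties is not shown here.

```java
import java.util.Optional;

// Hypothetical illustration class; not part of CADP for Java.
class UnicodeSettingPrecedence {

    /**
     * Picks the Unicode setting to use.
     * A value supplied through AlgoSpec (setUnicode) wins; otherwise the value
     * from unicode.properties is used; otherwise the result is empty.
     */
    static Optional<String> resolve(Optional<String> fromAlgoSpec,
                                    Optional<String> fromPropertiesFile) {
        return fromAlgoSpec.isPresent() ? fromAlgoSpec : fromPropertiesFile;
    }

    public static void main(String[] args) {
        // Placeholder values stand in for whatever each source would supply.
        Optional<String> algoSpecValue = Optional.of("value-from-setUnicode");
        Optional<String> fileValue = Optional.of("value-from-properties-file");

        System.out.println(resolve(algoSpecValue, fileValue).get());
        System.out.println(resolve(Optional.empty(), fileValue).get());
        System.out.println(resolve(Optional.empty(), Optional.empty()).isPresent());
    }
}
```

An empty result corresponds to the case where Unicode is not enabled by either method.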
Important notes
Numeric values are always tokenized.
Undefined Unicode blocks will be retained in the generated tokens.
If Unicode is not enabled using either of the methods described later in this article, only alphanumeric values will be tokenized and detokenized.
If plaintext values contain Unicode characters, the luhnCheck and nonIdempotentTokens parameters are not applicable.
How to tokenize/detokenize Unicode blocks
Tokenization and detokenization of Unicode blocks can be achieved using: