Tokenize and detokenize Unicode blocks using AlgoSpec

CADP for Java can tokenize and detokenize Unicode input characters when the setUnicode parameter is set in AlgoSpec. The following Unicode blocks can be tokenized and detokenized:


KATAKANA	HIRAGANA
CJK_UNIFIED_IDEOGRAPHS	HANGUL
HEBREW	ARABIC
THAI	RUSSIAN_CYRILLIC
GREEK	JOYO_KANJI

A Unicode block may contain undefined ranges. An undefined range is a set of non-human-readable characters. Undefined ranges are preserved in the generated tokens. The following table lists the scope and undefined ranges of each Unicode block.

#	Unicode Block	Scope	Undefined Ranges
1	KATAKANA	30a0-30ff	nil
2	HIRAGANA	3040-309f	3040-3040, 3097-3098
3	CJK_UNIFIED_IDEOGRAPHS	4e00-9fff	9fc7-9fff
4	HEBREW	05d0-05ea	nil
5	ARABIC	0600-06ff	0600-061d, 064b-065f, 066a-066d, 0670-0670, 06d6-06ed
6	HANGUL	ac00-d7a3	nil
7	THAI	0e01-0e5b	0e01-0e5b
8	RUSSIAN_CYRILLIC	0410-04ff	nil
9	GREEK	0370-03ff	0370-0390, 03a2-03a2, 03cf-03ff
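
The scope and undefined ranges in the table can be read as inclusive code point intervals. The following is a minimal sketch, not part of CADP, that checks whether a code point falls inside a block's scope and outside its undefined ranges. The class name UnicodeBlockRanges and its methods are hypothetical and exist only for illustration; the range values are copied from the HIRAGANA and GREEK rows above. How CADP handles these ranges internally is not described here.

```java
public class UnicodeBlockRanges {
    // Inclusive code point ranges copied from the table above.
    // These arrays are illustrative only and are not CADP constants.
    static final int[] HIRAGANA_SCOPE = {0x3040, 0x309f};
    static final int[][] HIRAGANA_UNDEFINED = {{0x3040, 0x3040}, {0x3097, 0x3098}};

    static final int[] GREEK_SCOPE = {0x0370, 0x03ff};
    static final int[][] GREEK_UNDEFINED = {{0x0370, 0x0390}, {0x03a2, 0x03a2}, {0x03cf, 0x03ff}};

    // Returns true if cp lies within the inclusive range {start, end}.
    static boolean inRange(int cp, int[] range) {
        return cp >= range[0] && cp <= range[1];
    }

    // Returns true if cp falls in any of the listed undefined ranges.
    static boolean isUndefined(int cp, int[][] undefinedRanges) {
        for (int[] range : undefinedRanges) {
            if (inRange(cp, range)) {
                return true;
            }
        }
        return false;
    }

    // Returns true if cp is inside the block scope and not in an undefined range.
    static boolean isDefinedInBlock(int cp, int[] scope, int[][] undefinedRanges) {
        return inRange(cp, scope) && !isUndefined(cp, undefinedRanges);
    }

    public static void main(String[] args) {
        // U+3042 (HIRAGANA LETTER A) is in scope and defined.
        System.out.println(isDefinedInBlock(0x3042, HIRAGANA_SCOPE, HIRAGANA_UNDEFINED)); // true
        // U+3097 is in scope but listed as undefined.
        System.out.println(isDefinedInBlock(0x3097, HIRAGANA_SCOPE, HIRAGANA_UNDEFINED)); // false
        // U+03B1 (GREEK SMALL LETTER ALPHA) is in scope and defined.
        System.out.println(isDefinedInBlock(0x03b1, GREEK_SCOPE, GREEK_UNDEFINED)); // true
        // U+03A2 is listed as undefined.
        System.out.println(isDefinedInBlock(0x03a2, GREEK_SCOPE, GREEK_UNDEFINED)); // false
    }
}
```

A check like this can help predict which input characters the table marks as undefined and therefore expects to be preserved in the generated token.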

For more information about JOYO_KANJI, see:

http://x0213.org/joyo-kanji-code/index.en.html

Sample

The following sample demonstrates how to use the setUnicode parameter of AlgoSpec to tokenize and detokenize Unicode input characters.

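// Create an AlgoSpec and select the Unicode block to tokenize and detokenize.
AlgoSpec algospec = new AlgoSpec();
algospec.setVersion(1);
algospec.setUnicode(algospec.UNICODE_xxx); // xxx is a placeholder for the block name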

Here, xxx is the name of the Unicode block to tokenize and detokenize.
