Tokenize and detokenize Unicode blocks using AlgoSpec
CADP for Java can tokenize and detokenize Unicode input characters when the setUnicode
parameter is set in AlgoSpec. The following Unicode blocks can be tokenized and detokenized:
KATAKANA | HIRAGANA |
CJK_UNIFIED_IDEOGRAPHS | HANGUL |
HEBREW | ARABIC |
THAI | RUSSIAN_CYRILLIC |
GREEK | JOYO_KANJI |
A Unicode block may contain undefined ranges. An undefined range is a set of non-human-readable characters. Undefined ranges are preserved in the generated tokens. The following table lists the scope and the undefined ranges of each Unicode block.
# | Unicode Block | Scope | Undefined Ranges |
---|---|---|---|
1 | KATAKANA | 30a0-30ff | nil |
2 | HIRAGANA | 3040-309f | 3040-3040, 3097-3098 |
3 | CJK_UNIFIED_IDEOGRAPHS | 4e00-9fff | 9fc7-9fff |
4 | HEBREW | 05d0-05ea | nil |
5 | ARABIC | 0600-06ff | 0600-061d, 064b-065f, 066a-066d, 0670-0670, 06d6-06ed |
6 | HANGUL | ac00-d7a3 | nil |
7 | THAI | 0e01-0e5b | 0e01-0e5b |
8 | RUSSIAN_CYRILLIC | 0410-04ff | nil |
9 | GREEK | 0370-03ff | 0370-0390, 03a2-03a2, 03cf-03ff |
For more information about JOYO_KANJI, see:
http://x0213.org/joyo-kanji-code/index.en.html
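The scope and undefined ranges listed in the table can be read as a simple lookup: a code point inside a block's scope is a candidate for tokenization unless it falls in one of that block's undefined ranges, in which case it is preserved. The following is a minimal, illustrative sketch of that lookup in plain Java. It does not use the CADP API and is not how CADP implements the check. The class name UnicodeBlockRanges, the Kind enum, and the classify method are hypothetical names introduced here, and only a few rows of the table are included.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of CADP: classifies code points
// against the scope and undefined ranges taken from the table.
public class UnicodeBlockRanges {

    enum Kind { TOKENIZABLE, UNDEFINED_PRESERVED, OUT_OF_SCOPE }

    // For each block: the first pair is the scope, the remaining pairs
    // are the undefined ranges. All bounds are inclusive.
    private static final Map<String, int[][]> BLOCKS = new HashMap<>();
    static {
        BLOCKS.put("KATAKANA", new int[][] {{0x30a0, 0x30ff}});
        BLOCKS.put("HIRAGANA", new int[][] {
            {0x3040, 0x309f}, {0x3040, 0x3040}, {0x3097, 0x3098}});
        BLOCKS.put("HEBREW", new int[][] {{0x05d0, 0x05ea}});
        BLOCKS.put("GREEK", new int[][] {
            {0x0370, 0x03ff}, {0x0370, 0x0390}, {0x03a2, 0x03a2}, {0x03cf, 0x03ff}});
    }

    static Kind classify(String block, int codePoint) {
        int[][] ranges = BLOCKS.get(block);
        if (ranges == null) {
            throw new IllegalArgumentException("Unknown block: " + block);
        }
        int[] scope = ranges[0];
        if (codePoint < scope[0] || codePoint > scope[1]) {
            return Kind.OUT_OF_SCOPE;
        }
        // Inside the scope: check the undefined ranges, if any.
        for (int i = 1; i < ranges.length; i++) {
            if (codePoint >= ranges[i][0] && codePoint <= ranges[i][1]) {
                return Kind.UNDEFINED_PRESERVED;
            }
        }
        return Kind.TOKENIZABLE;
    }

    public static void main(String[] args) {
        // U+3042 is in the HIRAGANA scope, U+3097 is in an undefined range,
        // and 'A' (U+0041) is outside the HIRAGANA scope.
        String input = "\u3042\u3097A";
        input.codePoints().forEach(cp ->
            System.out.printf("U+%04X -> %s%n", cp, classify("HIRAGANA", cp)));
    }
}
```

Running main prints TOKENIZABLE for U+3042, UNDEFINED_PRESERVED for U+3097, and OUT_OF_SCOPE for U+0041. How CADP handles characters outside the selected block's scope is not covered by this sketch.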
Sample
The following sample demonstrates how to use the setUnicode parameter of AlgoSpec to tokenize and detokenize Unicode input characters.
AlgoSpec algospec = new AlgoSpec();
algospec.setVersion(1);
algospec.setUnicode(algospec.UNICODE_xxx);
Here, xxx is the Unicode block.