DDC GLASS Reference
This document serves as a guide on the syntax, operators and rules to observe when writing GLASS Technology™ (GLASS) expressions using the Expression Editor. This is not a comprehensive GLASS guide and only covers basic GLASS operators. Please contact Thales Technical Support if you have more detailed questions about GLASS.
BYTE LEVEL OPERATION
The scanning engine evaluates scan data in octets. This means the engine has the ability to look for any byte within the data stream that passes through the scanning engine.
GLASS SYNTAX
You have to follow several basic rules when defining GLASS expressions:
An expression is a combination of operators and values which is terminated by a new line.
For readability, a single expression can be split across multiple lines by ending a line with a backslash \ character. The example below forms a single expression:
WORD "Foo" THEN \
RANGE "0-9" TIMES 4
Operators and values are separated by one or more blank spaces.
Operators are keywords that describe what actions to perform.
Values are literals or integers.
Blank lines in the Expression Editor are ignored by the compiler.
A comment is anything that follows a hash # character; the compiler ignores it. Comments can start at the beginning or in the middle of a line.
CHARACTER ENCODING
You must write GLASS expressions in ASCII or UTF-8 notation only. The engine operates at byte level, so any expression that is UTF-8 encoded can be matched to the corresponding octets if they are present within the input data stream to the scanning engine.
Note
Writing custom GLASS expressions using anything other than ASCII or UTF-8 encoding will yield unexpected results.
For example, the word "world" is written as "Мир" (/mir/) in Russian. Example 1 and Example 2 are two ways to define the GLASS expression to search for the phrase "Hello, Мир!" using the WORD operator.
Example 1
UTF-8 encoded expression for "Hello, Мир!".
WORD 'Hello, Мир!'
Example 2
ASCII encoded expression for "Hello, Мир!" specifying a UTF-8 encoded octet sequence.
WORD 'Hello, \xd0\x9c\xd0\xb8\xd1\x80!'
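The octet sequence in Example 2 can be derived outside the Expression Editor. The sketch below uses Python purely as a calculator (Python is not part of GLASS); it encodes the Cyrillic word as UTF-8 and prints each byte in the \xHH form.

```python
# Illustrative only: derive the \xHH escapes used in Example 2.
word = "Мир"
utf8_bytes = word.encode("utf-8")

# Format each octet as a two-digit hexadecimal escape.
escaped = "".join(f"\\x{b:02x}" for b in utf8_bytes)
print(escaped)  # \xd0\x9c\xd0\xb8\xd1\x80

# Both spellings of the phrase describe the same octet sequence.
same_octets = "Hello, Мир!".encode("utf-8") == b"Hello, \xd0\x9c\xd0\xb8\xd1\x80!"
print(same_octets)  # True
```

Each Cyrillic letter occupies two octets in UTF-8, which is why the three-letter word expands to six escapes.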
LITERALS AND INTEGERS
You can define values in GLASS expressions. These values can be in the form of literals or integers.
LITERALS
Description
Literals are defined as a string of characters surrounded by matching single quotes ' ' or double quotes " ". These characters must be in ASCII or UTF-8 encoding only.
"Search for this pattern"
'Look for this pattern too'
Certain literal characters have a special meaning when preceded by a backslash \ character:
Escaped Literal | Escaped Literal Meaning | ASCII Code |
---|---|---|
\t | Horizontal tab | 0x09 |
\n | New line | 0x0A |
\v | Vertical tab | 0x0B |
\f | Form feed | 0x0C |
\r | Carriage return | 0x0D |
\" | Literal double quote character | 0x22 |
\' | Literal single quote character | 0x27 |
\\ | Literal backslash character | 0x5C |
\xHH | The two characters HH following \x are interpreted as the hexadecimal value of a single byte | - |
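Python bytes literals happen to accept the same escape spellings, so the byte values in the table can be confirmed with a short script. This is only an analogy for checking the codes; it makes no claim about how the GLASS compiler parses literals.

```python
# Illustrative only: map each escape spelling to the byte it denotes in Python,
# then print the byte value for comparison with the table above.
escapes = {
    r"\t": b"\t",
    r"\n": b"\n",
    r"\v": b"\v",
    r"\f": b"\f",
    r"\r": b"\r",
    r"\"": b"\"",
    r"\'": b"\'",
    r"\\": b"\\",
    r"\x41": b"\x41",  # \xHH form: 0x41 is the letter A
}

for spelling, value in escapes.items():
    print(f"{spelling:5} -> 0x{value[0]:02X}")
```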
Example 3
WORD "First Phrase\nSecond Phrase\n"
For Example 3, the engine searches for a single pattern consisting of the strings "First Phrase" and "Second Phrase" separated by a new line, followed by a new line at the end of "Second Phrase".
First Phrase
Second Phrase
INTEGERS
Integers are defined as a string of ASCII digits in the inclusive range of 0-9. You can use the underscore _ character to separate the digits for readability. For example, using the underscore _ character as a thousands separator makes GLASS expressions easier to read.
Example 4
12345
12_345
1_2_3_4_5
1234_5_
In Example 4, the integers from line 1 to line 4 are all equivalent. The GLASS engine will process all 4 representations as 12345.
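The underscore rule can be modelled in a few lines. The following Python sketch is an assumption-laden illustration: `parse_glass_integer` is a hypothetical helper, not a GLASS API, and it simply discards every underscore before converting the remaining digits.

```python
def parse_glass_integer(text: str) -> int:
    """Hypothetical helper: ignore '_' separators, then read the ASCII digits."""
    digits = text.replace("_", "")
    if not digits or not all(c in "0123456789" for c in digits):
        raise ValueError(f"not an integer of ASCII digits: {text!r}")
    return int(digits)


# The four spellings from Example 4 all normalise to the same value.
for spelling in ["12345", "12_345", "1_2_3_4_5", "1234_5_"]:
    print(spelling, "->", parse_glass_integer(spelling))
```

The underscores are stripped by hand because Python's own `int()` rejects a trailing underscore such as `1234_5_`, whereas Example 4 treats that spelling as valid.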
Certain operators in GLASS require positive or negative integers to be specified explicitly. By default, integers are always positive unless a sign is provided.
To explicitly express a positive or negative integer, prepend:
The plus + sign (ASCII 0x2B) for positive integers.
The minus - sign (ASCII 0x2D) for negative integers.
For more information, see Operators.
OPERATORS
Operators are functions that can be used in GLASS expressions to instruct the engine to perform a specific action. All operators are left associative and case insensitive.
Tip
For readability, it is recommended to use uppercase letters to specify operators within GLASS expressions.
GLASS operators can be grouped by function:
Primary pattern generators
Secondary pattern generators
Pattern modifiers
WORD
Description
Search for a specific pattern as defined by the <literal>. If the pattern is found, the location will be returned as a match.
Matches can happen anywhere in a stream of bytes and are not limited to the traditional word boundaries only.
Syntax
WORD [NOCASE] <literal>
Tip
Literals are case sensitive. You can use the NOCASE keyword to instruct the engine to be case insensitive when searching for matching patterns.
Example 5
The expression below searches for the string "Foo".
WORD "Foo"
Based on Example 5, all the following lines will be marked as match locations:
Foo
FooBar
BazFoo
BazFooBar
Example 6
The expression below searches for the string "HELLO world".
WORD NOCASE "HELLO world"
Based on Example 6, the following lines will be marked as match locations:
hello world
HELLO WORLD
HeLlO wOrLd
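The behaviour described for WORD resembles a plain byte-substring search. The Python sketch below approximates Examples 5 and 6 under that assumption; `word_matches` is a hypothetical helper, and whether the real engine reports overlapping occurrences is not confirmed by this document.

```python
def word_matches(data: bytes, literal: bytes, nocase: bool = False) -> list:
    """Return the start offset of every occurrence of literal in data."""
    if nocase:
        # bytes.lower() folds ASCII letters only, which suits a byte-level model.
        data, literal = data.lower(), literal.lower()
    offsets = []
    start = data.find(literal)
    while start != -1:
        offsets.append(start)
        start = data.find(literal, start + 1)
    return offsets


# Example 5: the literal is found regardless of word boundaries.
for line in [b"Foo", b"FooBar", b"BazFoo", b"BazFooBar"]:
    print(line, word_matches(line, b"Foo"))

# Example 6: NOCASE ignores the case of ASCII letters.
print(word_matches(b"HeLlO wOrLd", b"HELLO world", nocase=True))
```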
RANGE
Description
Search for one or more specific characters defined by the <literal>. If the character is found, the location will be returned as a match.
There are several rules when defining literals that you can use with the RANGE operator:
Using the hyphen - between two characters instructs the GLASS engine to include all values between both characters. For example,
RANGE "0-9"
RANGE "a-z"
Line 1 matches the following characters: 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. Line 2 matches all lowercase characters in the alphabet from a to z.
Characters can be defined using hexadecimal values. For example,
RANGE "\x41-\x5A"
0x41 and 0x5A are the hexadecimal representations of the uppercase ASCII characters A and Z respectively. Therefore, line 1 matches all uppercase characters in the alphabet from A to Z.
The caret ^ symbol at the start of a literal instructs the GLASS engine to match all characters that are not defined in the RANGE. For example,
RANGE "^0-9"
Line 1 matches all characters except the ASCII digits from 0 to 9.
Literals are case sensitive. The NOCASE keyword can be used to instruct the engine to be case insensitive when searching for matching characters.
RANGE "aBc"
RANGE NOCASE "abc"
Line 1 matches only the lowercase characters a and c, as well as the uppercase character B. Line 2 matches the characters a, A, b, B, c and C.
Syntax
RANGE [NOCASE] <literal>
Example 7
RANGE "a-zA-Z"
RANGE NOCASE "a-z"
Both line 1 and line 2 match all lowercase and uppercase characters in the ASCII alphabet set.
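The RANGE rules (hyphen ranges, \xHH values, a leading caret, NOCASE) closely resemble a regular-expression character class. The Python sketch below relies on that resemblance as an assumption; `range_pattern` is a hypothetical helper that only handles simple literals like the ones shown in this section.

```python
import re


def range_pattern(literal: bytes, nocase: bool = False):
    """Hypothetical helper: treat a RANGE literal as a regex character class."""
    flags = re.IGNORECASE if nocase else 0
    return re.compile(b"[" + literal + b"]", flags)


explicit = range_pattern(rb"a-zA-Z")             # RANGE "a-zA-Z"
either_case = range_pattern(rb"a-z", nocase=True)  # RANGE NOCASE "a-z"
upper_hex = range_pattern(rb"\x41-\x5A")         # RANGE "\x41-\x5A"
non_digit = range_pattern(rb"^0-9")              # RANGE "^0-9"

sample = b"aZ9!"
print(explicit.findall(sample))     # the two letters
print(either_case.findall(sample))  # same result as the explicit form
print(upper_hex.findall(sample))    # uppercase letters only
print(non_digit.findall(sample))    # everything except the digit
```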
Keywords
There are several predefined keywords representing common character sets that can be used to replace the <literal> value. Keywords are case insensitive.
Keyword | Description | Literal Characters |
---|---|---|
SPACE | Matches any ASCII whitespace characters like blank space, horizontal tab, new line, vertical tab, form feed and carriage return | "\t\n\v\f\r" |
BYTE | Matches any byte in the range 0x00 to 0xFF | "\x00-\xFF" |
ALNUM | Matches any ASCII alphanumeric character | "a-zA-Z0-9" |
LETTER | Matches any ASCII alphabet character | "a-zA-Z" |
DIGIT | Matches any ASCII numeral | "0-9" |
PRINTABLE | Matches any printable ASCII character including horizontal and vertical whitespace | "a-zA-Z0-9\r\n\v\f\t!\"#$%&'()*+,-./:;<=>?@[\]^_`{ |
PRINTABLENONALPHA | Matches any printable ASCII characters excluding alphabet characters and including horizontal and vertical whitespace | "0-9\r\n\v\f\t!\"#$%&'()*+,-./:;<=>?@[\]^_`{ |
PRINTABLENONALNUM | Matches any printable ASCII characters excluding alphanumeric characters and including horizontal and vertical whitespace | "\r\n\v\f\t!\"#$%&'()*+,-./:;<=>?@[\]^_`{ |
GRAPHIC | Matches any ASCII character that is not whitespace or a control character | "a-zA-Z0-9!\"#$%&'()*+,-./:;<=>?@[\]^_`{ |
SAMELINE | Matches any printable ASCII character including horizontal whitespace but excluding vertical whitespace | "a-zA-Z0-9\r\t!\"#$%&'()*+,-./:;<=>?@[\]^_`{ |
NONALNUM | Matches any character that is not an ASCII alphanumeric character | "^a-zA-Z0-9" |
NONALPHA | Matches any character that is not an ASCII alphabet character | "^a-zA-Z" |
NONDIGIT | Matches any character that is not an ASCII numeral | "^0-9" |
LINE | Matches any new line or carriage return character | "\r\n" |
Example 8
RANGE "a-zA-Z"
RANGE LETTER
Line 1 and line 2 are equivalent; both match all lowercase and uppercase characters in the ASCII alphabet set.
Example 9
RANGE "^0-9"
RANGE NONDIGIT
Line 1 and line 2 are equivalent; both match any character that is not an ASCII numeral.
Example 10
RANGE PRINTABLE
Line 1 matches any printable ASCII character.
TIMES
Description
Repeat the preceding expression N times. N can also be specified as a range.
Syntax
WORD <literal> TIMES <integer>[-<integer>]
RANGE <literal> TIMES <integer>[-<integer>]
If only one integer is defined, the preceding expression must repeat exactly that number of times.
If two integers are defined, the first is the lower limit and the second is the upper limit of repetitions. Each substring found is reported as a separate match.
Example 11
WORD "abc" TIMES 2
Example 11 matches any string where the pattern "abc" is repeated exactly two (N=2) times, such as "123abcabc456". Example 11 is interpreted the same way as:
WORD "abcabc"
Example 12
RANGE DIGIT TIMES 12
Example 12 matches any string consisting of twelve (N=12) consecutive ASCII numerals, such as "123456789012".
Example 13
RANGE DIGIT TIMES 16-18
Example 13 matches any string of 16, 17 or 18 consecutive digits, such as:
abc1234567812345678
012345678012345678xyz
Example 13 produces four matches:
abc1234567812345678
012345678012345678xyz
012345678012345678xyz
012345678012345678xyz
THEN
Description
Use THEN to combine two or more expressions that must be matched consecutively.
Syntax
<expression> THEN <expression>
Example 14
WORD "HELLO" THEN RANGE "!.?" THEN WORD " I'm here."
Example 14 matches any string that contains "HELLO" followed immediately by the "!", "." or "?" character, and then followed by the phrase " I'm here.", as below:
HELLO! I'm here.
HELLO. I'm here.
HELLO? I'm here.
Example 15
RANGE "0-9" TIMES 4 THEN WORD "abc"
Example 15 matches any string that contains four consecutive ASCII numerals followed immediately by the pattern "abc", such as:
1111abc
9876abc
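THEN behaves like concatenation: each part must follow the previous one with nothing in between. The Python sketch below approximates Examples 14 and 15 with regular expressions over bytes. The regex translation is an assumption made for illustration, not the engine's implementation.

```python
import re

# WORD "HELLO" THEN RANGE "!.?" THEN WORD " I'm here."
example_14 = re.compile(re.escape(b"HELLO") + b"[!.?]" + re.escape(b" I'm here."))

# RANGE "0-9" TIMES 4 THEN WORD "abc"
example_15 = re.compile(b"[0-9]{4}" + re.escape(b"abc"))

print(bool(example_14.search(b"HELLO? I'm here.")))  # consecutive parts: match
print(bool(example_14.search(b"HELLO, I'm here.")))  # comma is not in the RANGE
print(bool(example_15.search(b"9876abc")))           # four digits, then "abc"
print(bool(example_15.search(b"9876 abc")))          # the space breaks adjacency
```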
OR
Description
Use OR to combine two expressions that can be matched on either side of the OR operator.
Syntax
<expression> OR <expression>
Example 16
WORD "personal details" OR WORD "personal information"
Example 16 matches any string that contains either "personal details" or "personal information".
For example, the underlined sections in line 1 and line 2 will be marked as match locations:
This file contains my personal details.
Search for any folder containing personal information.
Example 17
RANGE "0-9" TIMES 4 OR WORD "abcd"
Example 17 matches any string that contains four consecutive ASCII numerals or the pattern "abcd".
BOUND
Description
Use the BOUND operator to set specific rules or delimiters on how a pattern must match to the left, right or both sides of the pattern to be marked as a valid match.
The boundary for search patterns can be a specific character or range of characters. You can also use the BOUND operator to check if a pattern occurs at the beginning or end of a file to be marked as a match.
The pattern "abc" must be preceded by a colon :.
WORD "abc" BOUND LEFT ":"
The pattern "abc" must be surrounded at both ends by only non-alphanumeric characters.
WORD "abc" BOUND NONALNUM
The pattern "abc" must occur at the beginning of a file (BOF) stream.
WORD "abc" BOUND BOF
Syntax
<pattern / expression> BOUND [LEFT|RIGHT] <range of characters>
<pattern / expression> BOUND BOF|EOF
The boundary for search patterns can be set up using various keywords:
Keyword | Description |
---|---|
<pattern/expression> BOUND <range of characters> | Match the same <range of characters> on both sides, surrounding the <pattern / expression>. |
<pattern/expression> BOUND LEFT <range of characters> | Match a <range of characters> on the LEFT side, just before the <pattern / expression>. |
<pattern/expression> BOUND RIGHT <range of characters> | Match a <range of characters> on the RIGHT side, just after the <pattern / expression>. |
<pattern/expression> BOUND LEFT <range of characters> BOUND RIGHT <range of characters> | Match a <range of characters> on both sides, surrounding the <pattern / expression>. |
<pattern/expression> BOUND BOF | Match a <pattern / expression> that is found at the start of a file. |
<pattern/expression> BOUND EOF | Match a <pattern / expression> that is found at the end of a file. |
Note
For BOUND, BOUND LEFT, and BOUND RIGHT operators, you can specify how many bytes before and/or after the <pattern/expression> are searched for the <range of characters>. For example: WORD "abc" BOUND NONALNUM WITHIN 32 BYTES.
Example 18
WORD "End of internet." BOUND EOF
Example 18 instructs the engine to check that the pattern "End of internet." appears at the end of a stream to be considered a match.
Example 19
RANGE DIGIT TIMES 4 BOUND NONDIGIT
Example 19 instructs the engine to search for a sequence of four consecutive ASCII numerals that are bounded by non-digit characters on either side of the four-digit sequence.
Based on Example 19, the sections in line 1 and line 2 below will be marked as match locations as the four-digit sequences are bounded by whitespace, brackets and comma characters.
1234 5678 A1234
1111,2222{3333}[4444]
123456
Line 3 contains three sets of four-digit sequences: "1234", "2345" and "3456". However, these will not be marked as match locations as they do not fulfil the BOUND conditions.
Example 20
RANGE DIGIT TIMES 4
If the BOUND operator from Example 19 is removed as shown in Example 20, line 3 that contains the string "123456" would now be marked with three matches: "1234", "2345" and "3456".
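The difference between Examples 19 and 20 can be reproduced with regular expressions. In the Python sketch below, the unbounded pattern reports a four-digit window at every start position, and the bounded pattern uses lookarounds to reject digits on either side. This is an approximation with one labelled assumption: the start and end of the input are treated as satisfying the boundary, which this document does not confirm for BOUND NONDIGIT.

```python
import re

# Example 20: RANGE DIGIT TIMES 4 (overlapping windows via a lookahead group).
unbounded = re.compile(rb"(?=([0-9]{4}))")

# Example 19: RANGE DIGIT TIMES 4 BOUND NONDIGIT.
# Assumption: start/end of input counts as a valid boundary.
bounded = re.compile(rb"(?<![0-9])[0-9]{4}(?![0-9])")

lines = [b"1234 5678 A1234", b"1111,2222{3333}[4444]", b"123456"]
for line in lines:
    print(line, bounded.findall(line), unbounded.findall(line))
```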
PARENTHESIS
Description
Parentheses ( ) are operators that combine a number of expressions into a single logical statement or alter the precedence of operations. You can also use parentheses to clearly show the precedence of operations in complicated expressions.
Expressions contained within parentheses are evaluated first.
Syntax
( <expression> )
Example 21
WORD "Folder" OR WORD "File" THEN RANGE DIGIT
WORD "Folder" OR (WORD "File" THEN RANGE DIGIT)
In Example 21, both expressions on line 1 and line 2 are equivalent. The expressions match any string containing the pattern "Folder", or any string containing "File" followed immediately by a single digit from 0 to 9.
The parentheses in line 2 do not change any operation precedence; they are only used to explicitly show how the expression is parsed by the GLASS engine.
Based on Example 21, the underlined sections in line 1 and line 2 will be marked as match locations:
Folder 1 contains sensitive data.
Personal details found in File9.
Example 22
(WORD "Folder" OR WORD "File") THEN RANGE "0-9"
Example 22 uses parentheses to change how the expression is parsed by the GLASS engine. The expression now matches the pattern "Folder" or "File", followed immediately by a single digit from 0 to 9.
Based on Example 22, the underlined sections in line 1 and line 2 will be marked as match locations:
Folder1 contains sensitive data.
Personal details found in File9.
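The grouping in Examples 21 and 22 mirrors how alternation and grouping work in regular expressions, so the effect of the parentheses can be tried out in Python. The regex equivalents below are an assumed approximation of the GLASS expressions, not output from the engine.

```python
import re

# Example 21: WORD "Folder" OR (WORD "File" THEN RANGE DIGIT)
example_21 = re.compile(rb"Folder|File[0-9]")

# Example 22: (WORD "Folder" OR WORD "File") THEN RANGE "0-9"
example_22 = re.compile(rb"(?:Folder|File)[0-9]")

for line in [
    b"Folder 1 contains sensitive data.",
    b"Folder1 contains sensitive data.",
    b"Personal details found in File9.",
]:
    first = example_21.search(line)
    second = example_22.search(line)
    print(line, first and first.group(), second and second.group())
```

Without the grouping, "Folder" matches on its own; with it, a digit must follow either word immediately, so "Folder 1" (with a space) no longer qualifies.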
MAP
Description
A MAP defines a list of words for future reference.
Syntax
MAP [NOCASE] 'MAP_NAME' 'ITEM_1' [, 'ITEM_2', ..., 'ITEM_N']
Example 23
MAP 'VISA_KEYWORDS' \
'Visa', 'Visa Card Number', 'Visa Number', 'Visa card', 'Visa CC'
Example 23 defines a list of keywords containing Visa, Visa Card Number, Visa Number, Visa card, and Visa CC.
GROUP
Description
A GROUP refers to a list previously created with MAP.
Syntax
GROUP [NOCASE] 'MAP_NAME'
Example 24
MAP 'VISA_KEYWORDS' \
'Visa', 'Visa Card Number', 'Visa Number', 'Visa card', 'Visa CC'
GROUP NOCASE 'VISA_KEYWORDS' \
THEN RANGE DIGIT TIMES 16 BOUND NONALNUM
Example 24 reuses the list from the previous example. By combining the GROUP and THEN operators, it requires a 16-digit number to follow immediately after any keyword from the MAP.
SAMPLE GLASS EXPRESSIONS
In this section, we will take a look at some real-world GLASS expressions to help you get started with writing your custom infotypes.
SEARCH FOR SEVEN DIGIT CLUB MEMBERSHIP ID
Requirements
Search for a club membership ID, which consists of seven consecutive digits. No restrictions on the first and last digit of the membership ID.
Seven consecutive digits must not be contained within a string of alphanumeric characters. No other restrictions on pattern boundaries.
Solution
Step 1
RANGE DIGIT TIMES 7
Step 1 starts with the most basic requirement of the club membership ID, which is a string of seven consecutive ASCII numerals.
Step 2
Based on the second requirement, the seven consecutive digits must not be contained within any string of alphanumeric characters. This means the boundaries on each side of the membership ID can be any character except ASCII alphabets and numerals.
RANGE DIGIT TIMES 7 BOUND NONALNUM
In Step 2 the BOUND operator is used to reduce false matches by specifying the boundaries on each side of the seven-digit club membership ID.
Example 25
Membership ID: 1012345
ID123456789
Name,Sherlock Holmes,ID,2023456,Email,sherlock@example.com
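Based on the expression from Step 2, only the underlined sections in line 1 and line 3 will be marked as match locations. The digits in line 2 are not matched because they form a nine-digit run preceded by the letters "ID".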
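The Step 2 expression can be approximated in Python to experiment with sample data before writing the GLASS version. The regex below is an assumed equivalent: lookarounds stand in for BOUND NONALNUM, and the start and end of each input are treated as valid boundaries, which is a simplification.

```python
import re

# Approximation of: RANGE DIGIT TIMES 7 BOUND NONALNUM
member_id = re.compile(rb"(?<![A-Za-z0-9])[0-9]{7}(?![A-Za-z0-9])")

lines = [
    b"Membership ID: 1012345",
    b"ID123456789",
    b"Name,Sherlock Holmes,ID,2023456,Email,sherlock@example.com",
]
for number, line in enumerate(lines, start=1):
    print(number, member_id.findall(line))
```

Under this approximation, the run in `ID123456789` is rejected because every seven-digit window inside it touches a letter or another digit.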
SEARCH FOR COMPANY EMAIL ADDRESSES
Requirements
Search for company email addresses with the format <mailbox>@example.com.
The maximum length of a mailbox name is limited to 64 ASCII characters.
Valid email addresses can only start with ASCII alphabets but may contain a combination of alphabets and numerals in the mailbox name.
Email addresses should be bounded only by non-alphanumeric characters.
Solution
Step 1
WORD "@example.com"
Step 1 starts with the most straightforward expression to match the domain "example.com".
Step 2
RANGE ALNUM TIMES 1-64 THEN WORD "@example.com"
In Step 2, the ALNUM keyword and TIMES operator are used to limit the range of allowed characters along with the maximum length of the mailbox name.
Step 3
(RANGE LETTER) THEN (RANGE ALNUM TIMES 1-63) THEN \
(WORD "@example.com")
Based on the third requirement, mailbox names can only start with ASCII alphabets. RANGE LETTER limits the first character of the mailbox name to an ASCII alphabet, followed by up to 63 alphanumeric characters, and ending with "@example.com".
Tip
The parentheses ( ) are not compulsory but are added for readability.
Step 4
((RANGE LETTER) THEN (RANGE ALNUM TIMES 1-63) THEN \
(WORD "@example.com")) BOUND NONALNUM
Step 4 uses the BOUND operator to reduce false matches by specifying the boundaries on each side of the company email addresses.
The outermost parentheses ( ) are used to apply the BOUND operator to all the expressions within that set of parentheses. Without them, the BOUND operator would only apply to the WORD that preceded it.
Example 26
Employee1,employee1@example.com,Marketing
Email: employee1@example.com
employee1@example.comemployee2@example.com
123@employee.com
Based on the expression from Step 4, only the underlined sections in line 1 and line 2 will be marked as match locations.
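For quick experiments, the Step 4 expression can be approximated with a Python regular expression. This is a sketch under assumptions: lookarounds stand in for BOUND NONALNUM, input edges count as boundaries, and only lines 1, 2 and 4 of Example 26 are exercised, since a regex cannot confirm how the engine treats the back-to-back addresses in line 3.

```python
import re

company_email = re.compile(
    rb"(?<![A-Za-z0-9])"           # BOUND NONALNUM, left side
    rb"[A-Za-z][A-Za-z0-9]{1,63}"  # RANGE LETTER THEN RANGE ALNUM TIMES 1-63
    rb"@example\.com"              # WORD "@example.com"
    rb"(?![A-Za-z0-9])"            # BOUND NONALNUM, right side
)

lines = [
    b"Employee1,employee1@example.com,Marketing",
    b"Email: employee1@example.com",
    b"123@employee.com",
]
for line in lines:
    print(line, company_email.findall(line))
```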