Tokenizer
public interface Tokenizer
This interface defines basic character sequence tokenizing capabilities.
It can serve as the underpinnings of simple parsers.
The methods of this interface fall into three categories:
- methods to configure the tokenizer, such as {@link #skipSpaces} and
{@link #tokenizeWords}.
- methods to read a token: {@link #next}, {@link #nextChar}, and
{@link #scan(char,boolean,boolean,boolean)}.
- methods to query the current token, such as {@link #tokenType},
{@link #tokenText} and {@link #tokenKeyword}.
In its default state, a Tokenizer performs no tokenization at all:
{@link #next} returns each input character as an individual token.
You must call one or more configuration methods to specify the type of
tokenization to be performed. Note that the configuration methods all
return the Tokenizer object so that repeated method calls can be chained.
For example:
Tokenizer t;  // refers to some Tokenizer implementation
t.skipSpaces(true).tokenizeNumbers(true).tokenizeWords(true).quotes("'#","'\n");
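Chaining works because every configuration method returns the object it was called on. The following is a minimal sketch of that pattern only, using a hypothetical `ConfigSketch` class that is not part of `je3.classes`. The boolean parameters follow the signatures in the method summary, and `describe()` is an illustrative helper.

```java
// Hypothetical stand-in for a Tokenizer implementation. Only the
// configuration side is sketched; no tokenizing is performed.
final class ConfigSketch {
    private boolean skipSpaces, tokenizeNumbers, tokenizeWords;

    // Each configuration method records its flag and returns this,
    // which is what allows the calls to be chained.
    ConfigSketch skipSpaces(boolean skip) { this.skipSpaces = skip; return this; }
    ConfigSketch tokenizeNumbers(boolean tokenize) { this.tokenizeNumbers = tokenize; return this; }
    ConfigSketch tokenizeWords(boolean tokenize) { this.tokenizeWords = tokenize; return this; }

    // Illustrative helper (an assumption, not an interface method).
    String describe() {
        return "skipSpaces=" + skipSpaces
            + " numbers=" + tokenizeNumbers
            + " words=" + tokenizeWords;
    }

    public static void main(String[] args) {
        ConfigSketch t = new ConfigSketch()
            .skipSpaces(true).tokenizeNumbers(true).tokenizeWords(false);
        System.out.println(t.describe());
    }
}
```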
One particularly important configuration method is
{@link #maximumTokenLength},
which specifies the maximum token length in the input. A
Tokenizer implementation must ensure that it can handle tokens at least
this long, typically by allocating a buffer at least that long.
The constant fields of this interface are token type constants.
Note that their values are all negative. Non-negative token types
always represent Unicode characters.
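Because of this sign convention, a caller can separate single-character tokens from the named token types with one comparison. The sketch below illustrates the idea. The class `TokenTypeSketch` and the numeric values it assigns to the constants are assumptions chosen for illustration, since the text above only states that the constants are negative.

```java
// Hypothetical helper showing how a caller might interpret a token type.
final class TokenTypeSketch {
    // Assumed values: the interface only guarantees that they are negative.
    static final int EOF = -1, NUMBER = -3, WORD = -4;

    static String describe(int type) {
        // Non-negative types are the character itself.
        if (type >= 0) return "character '" + (char) type + "'";
        switch (type) {
            case EOF:    return "EOF";
            case NUMBER: return "NUMBER";
            case WORD:   return "WORD";
            default:     return "other token type " + type;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe('+'));
        System.out.println(describe(WORD));
    }
}
```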
A tokenizer may be in one of three states:
- Before any tokens have been read. In this state, {@link #tokenType}
always returns {@link #BOF}, and {@link #tokenLine} always returns 0.
{@link #maximumTokenLength} and {@link #trackPosition} may only be called
in this state.
- During tokenization. In this state, {@link #next}, {@link #nextChar},
and {@link #scan(char,boolean,boolean,boolean)} are being called to tokenize
input characters, but none of these methods has yet returned {@link #EOF}.
Configuration methods other than those listed above may be called from this
state to dynamically change tokenizing behavior.
- End-of-file. Once one of the tokenizing methods has returned EOF,
the tokenizer has reached the end of its input. Any subsequent calls to
the tokenizing methods or to {@link #tokenType} will return EOF. Most
methods may still be called from this state, although it is not useful
to do so.
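The three states can be tracked with nothing more than the current token type. The following is a rough sketch under stated assumptions, not the je3 implementation: `LifecycleSketch` is a hypothetical class, the constant values are invented, only `nextChar` is modeled, and throwing `IllegalStateException` for a late `maximumTokenLength` call is one possible policy that the text above does not prescribe.

```java
// Hypothetical sketch of the three tokenizer states.
final class LifecycleSketch {
    // Assumed values: the interface only guarantees that they are negative.
    static final int EOF = -1, BOF = -7;

    private final String input;
    private int pos = 0;
    private int tokenType = BOF;      // state 1: nothing has been read yet
    private int maxTokenLength = 16;  // arbitrary default for this sketch

    LifecycleSketch(String input) { this.input = input; }

    // Assumption: calls made after tokenizing has begun are rejected.
    LifecycleSketch maximumTokenLength(int size) {
        if (tokenType != BOF)
            throw new IllegalStateException("tokenizing has already begun");
        maxTokenLength = size;
        return this;
    }

    int capacity() { return maxTokenLength; }

    int nextChar() {
        if (pos < input.length()) tokenType = input.charAt(pos++); // state 2
        else tokenType = EOF;                                      // state 3, permanently
        return tokenType;
    }

    int tokenType() { return tokenType; }

    public static void main(String[] args) {
        LifecycleSketch t = new LifecycleSketch("ab").maximumTokenLength(32);
        while (t.nextChar() != EOF) System.out.println((char) t.tokenType());
        System.out.println(t.nextChar() == EOF); // still EOF on later calls
    }
}
```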
Fields Summary

| Modifier and Type | Field | Description |
|---|---|---|
| public static final int | EOF | End-of-file. Returned when there are no more characters to tokenize. |
| public static final int | SPACE | The token is a run of whitespace. @see #tokenizeSpaces() |
| public static final int | NUMBER | The token is a run of digits. @see #tokenizeNumbers() |
| public static final int | WORD | The token is a run of word characters. @see #tokenizeWords() |
| public static final int | KEYWORD | The token is a keyword. @see #keywords() |
| public static final int | TEXT | The token is arbitrary text returned by {@link #scan(char,boolean,boolean,boolean)}. |
| public static final int | BOF | Beginning-of-file. This is the value returned by {@link #tokenType} when it is called before tokenization begins. |
| public static final int | OVERFLOW | Special return value for {@link #scan(char,boolean,boolean,boolean)}. |

Methods Summary |
---|
| public je3.classes.Tokenizer | keywords(java.lang.String[] keywords) | Specify keywords to receive special recognition.
If a {@link #WORD} token matches one of these keywords, then the token
type will be set to {@link #KEYWORD}, and {@link #tokenKeyword} will
return the index of the keyword in the specified array.
| public je3.classes.Tokenizer | maximumTokenLength(int size) | Specify the maximum token length that the Tokenizer is required to
accommodate. If presented with an input token longer than the specified
size, a Tokenizer's behavior is undefined. Implementations must typically
allocate an internal buffer at least this large, but may use a smaller
buffer if they know that the total length of the input is smaller.
Implementations should document their default value, and are encouraged
to define constructors that take the token length as an argument.
| public int | next() | Make the next token of input the current token, and return its type.
Implementations must tokenize input using the following algorithm, and
must perform each step in the order listed.
- If there are no more input characters, set the current token to
{@link #EOF} and return that value.
- If configured to skip or tokenize spaces, and the current character
is whitespace, coalesce any subsequent whitespace characters into a
token. If spaces are being skipped, start tokenizing a new token;
otherwise, make the spaces the current token and return {@link #SPACE}.
See {@link #skipSpaces}, {@link #tokenizeSpaces}, and
{@link Character#isWhitespace}.
- If configured to tokenize numbers and the current character is a
digit, coalesce all adjacent digits into a single token, make it the
current token, and return {@link #NUMBER}. See {@link #tokenizeNumbers}
and {@link Character#isDigit}.
- If configured to tokenize words, and the current character is a
word character, coalesce all adjacent word characters into a single
token, and make it the current token. If the word matches a registered
keyword, determine the keyword index and return {@link #KEYWORD}.
Otherwise return {@link #WORD}. Determine whether a character is a
word character using the registered {@link WordRecognizer}, if any,
or with {@link Character#isJavaIdentifierStart} and
{@link Character#isJavaIdentifierPart}. See also
{@link #tokenizeWords} and {@link #wordRecognizer}.
- If configured to tokenize quotes or other delimited tokens, and the
current character appears in the string of opening delimiters, then
scan until the character at the same position in the string of closing
delimiters is encountered, or until there is no more input or the
maximum token size is reached. Coalesce the characters between (but
not including) the delimiters into a single token, set the token type
to the opening delimiter, and return this character.
See {@link #quotes}.
- If none of the steps above has returned a token, then make the
current character the current token, and return the current character.
| public int | nextChar() | Make the next character of input the current token, and return it.
| public je3.classes.Tokenizer | quotes(java.lang.String openquotes, java.lang.String closequotes) | Specify pairs of token delimiters. If the tokenizer encounters
any character in openquotes, then it will scan until it
encounters the corresponding character in closequotes.
When such a token is tokenized, {@link #tokenType} returns the character
from openquotes that was recognized and {@link #tokenText}
returns the characters between, but not including, the delimiters.
Note that no escape characters are recognized. Quote tokenization occurs
after other types of tokenization, so openquotes should not
include whitespace, number, or word characters if spaces, numbers, or
words are being tokenized.
Quote tokenization is also useful for tokens other than quoted strings.
For example, to recognize single-quoted strings and single-line
comments, you might call this method like this:
quotes("'#", "'\n");
| public int | scan(char delimiter, boolean extendCurrentToken, boolean includeDelimiter, boolean skipDelimiter) | Scan until the first occurrence of the specified delimiter character.
Because a token scanned in this way may contain arbitrary characters,
the current token type is set to {@link #TEXT}.
| public int | scan(java.lang.String delimiter, boolean matchAll, boolean extendCurrentToken, boolean includeDelimiter, boolean skipDelimiter) | This method is just like {@link #scan(char,boolean,boolean,boolean)} except
that it uses a String delimiter, possibly containing more than one
character.
| public je3.classes.Tokenizer | skipSpaces(boolean skip) | Specify whether to skip spaces or return them.
| public int | tokenColumn() | Get the column number of the current token.
| public int | tokenKeyword() | Get the index of the tokenized keyword.
| public int | tokenLine() | Get the line number of the current token.
| public java.lang.String | tokenText() | Get the text of the current token.
| public int | tokenType() | Get the type of the current token. Valid token types are the token
type constants (all negative values) defined by this interface, and all
Unicode characters. Positive return values typically represent
punctuation characters or other single characters that were not
tokenized. But see {@link #quotes} for an exception.
| public je3.classes.Tokenizer | tokenizeNumbers(boolean tokenize) | Specify whether adjacent digit characters should be coalesced into
a single token. The default is false.
| public je3.classes.Tokenizer | tokenizeSpaces(boolean tokenize) | Specify whether adjacent whitespace characters should be coalesced
into a single SPACE token. This has no effect if spaces are being
skipped. The default is false.
| public je3.classes.Tokenizer | tokenizeWords(boolean tokenize) | Specify whether adjacent word characters should be coalesced into
a single token. The default is false. Word characters are defined by
a {@link WordRecognizer}.
| public je3.classes.Tokenizer | trackPosition(boolean track) | Specify whether the tokenizer should keep track of the line number
and column number for each returned token. The default is false.
If set to true, then tokenLine() and tokenColumn() return the
line and column numbers of the current token.
| public je3.classes.Tokenizer | wordRecognizer(je3.classes.Tokenizer$WordRecognizer wordRecognizer) | Specify a {@link Tokenizer.WordRecognizer} to define what constitutes a
word. If set to null (the default), words are defined by
{@link Character#isJavaIdentifierStart} and
{@link Character#isJavaIdentifierPart}.
This has no effect if word tokenizing has not been enabled.
|
|
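The ordered steps listed for next() can be sketched over an in-memory string. This is an illustrative reduction, not the je3 implementation. `NextSketch`, its constructor, and the `tokenize` helper are hypothetical, and the constant values are assumed. Spaces are always skipped, numbers and words are always tokenized, words use the default Character tests, and the maximum token length is not enforced.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified rendering of the next() algorithm.
final class NextSketch {
    // Assumed values: the interface only guarantees that they are negative.
    static final int EOF = -1, NUMBER = -3, WORD = -4, KEYWORD = -5;

    private final String in, openQuotes, closeQuotes;
    private final String[] keywords;
    private int pos = 0;
    private String text = "";
    private int keyword = -1;

    NextSketch(String in, String[] keywords, String openQuotes, String closeQuotes) {
        this.in = in; this.keywords = keywords;
        this.openQuotes = openQuotes; this.closeQuotes = closeQuotes;
    }

    int next() {
        keyword = -1;
        // Steps 1 and 2: spaces are always skipped here, so skipping first and
        // then testing for end of input is equivalent to restarting at step 1.
        while (pos < in.length() && Character.isWhitespace(in.charAt(pos))) pos++;
        if (pos >= in.length()) { text = ""; return EOF; }
        char c = in.charAt(pos);
        int start = pos;
        // Step 3: coalesce adjacent digits.
        if (Character.isDigit(c)) {
            while (pos < in.length() && Character.isDigit(in.charAt(pos))) pos++;
            text = in.substring(start, pos);
            return NUMBER;
        }
        // Step 4: coalesce word characters, then check for a keyword.
        if (Character.isJavaIdentifierStart(c)) {
            pos++;
            while (pos < in.length() && Character.isJavaIdentifierPart(in.charAt(pos))) pos++;
            text = in.substring(start, pos);
            for (int i = 0; i < keywords.length; i++)
                if (keywords[i].equals(text)) { keyword = i; return KEYWORD; }
            return WORD;
        }
        // Step 5: delimited token; the token type is the opening delimiter.
        int q = openQuotes.indexOf(c);
        if (q >= 0) {
            int end = in.indexOf(closeQuotes.charAt(q), pos + 1);
            if (end < 0) end = in.length();        // unterminated: stop at end of input
            text = in.substring(pos + 1, end);     // delimiters are excluded
            pos = Math.min(end + 1, in.length());
            return c;
        }
        // Step 6: the character is its own token.
        pos++;
        text = String.valueOf(c);
        return c;
    }

    // Illustrative helper: returns "type:text" for every token in the input.
    static List<String> tokenize(String input) {
        NextSketch t = new NextSketch(input, new String[] {"if"}, "'#", "'\n");
        List<String> out = new ArrayList<>();
        for (int type = t.next(); type != EOF; type = t.next())
            out.add(type + ":" + t.text);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("if x1 = 42 'a b' # note"));
    }
}
```

In this sketch the `=` token is reported with type 61 and the quoted string with type 39, which are the character codes of `=` and `'`, matching the rule that non-negative token types are the characters themselves.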