CJKTokenizer
public final class CJKTokenizer extends Tokenizer
CJKTokenizer was modified from StopTokenizer, which does a decent job for
most European languages. It tokenizes double-byte characters differently:
a token is returned for every two characters, and adjacent tokens overlap by one character.
Example: "java C1C2C3C4" is segmented into "java" "C1C2" "C2C3" "C3C4".
Zero-length tokens ("") also need to be filtered out.
For digits: digits, '+' and '#' are tokenized as letters.
For more information on Asian-language (Chinese, Japanese, Korean) text segmentation,
please search Google. |
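The overlapping two-character segmentation described above can be sketched without Lucene roughly as follows. This is a simplified illustration, not the class's actual implementation: the names `BigramSketch`, `segment`, `isAsciiWordChar` and `isCjkLetter` are hypothetical, and the sketch assumes no offsets, no maximum word length, and no fullwidth conversion. It also drops the trailing single character instead of emitting a zero-length token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BigramSketch {
    /** Hypothetical helper: ASCII letters, digits, '_', '+' and '#' form a word. */
    static boolean isAsciiWordChar(char c) {
        return c < 128 && (Character.isLetterOrDigit(c) || c == '_' || c == '+' || c == '#');
    }

    /** Hypothetical helper: any non-ASCII letter is treated as a double-byte character. */
    static boolean isCjkLetter(char c) {
        return c >= 128 && Character.isLetter(c);
    }

    /** Splits text into lower-cased ASCII words and overlapping non-ASCII bigrams. */
    static List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (isAsciiWordChar(c)) {
                // Consume the whole ASCII run as one token.
                int start = i;
                while (i < text.length() && isAsciiWordChar(text.charAt(i))) {
                    i++;
                }
                tokens.add(text.substring(start, i).toLowerCase(Locale.ROOT));
            } else if (isCjkLetter(c)) {
                // Consume the non-ASCII run, then emit overlapping pairs.
                int start = i;
                while (i < text.length() && isCjkLetter(text.charAt(i))) {
                    i++;
                }
                if (i - start == 1) {
                    tokens.add(text.substring(start, i)); // lone character
                } else {
                    for (int j = start; j + 1 < i; j++) {
                        tokens.add(text.substring(j, j + 2));
                    }
                }
            } else {
                i++; // separators are skipped
            }
        }
        return tokens;
    }
}
```

Under these assumptions, `BigramSketch.segment("java 中文搜索")` yields `java`, `中文`, `文搜`, `搜索`, which mirrors the "java" "C1C2" "C2C3" "C3C4" example.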
Fields Summary |
---|
private static final int | MAX_WORD_LEN: max word length |
private static final int | IO_BUFFER_SIZE: buffer size |
private int | offset: word offset, used to imply which character(in ) is parsed |
private int | bufferIndex: the index used only for ioBuffer |
private int | dataLen: data length |
private final char[] | buffer: character buffer, stores the characters used to compose the returned Token |
private final char[] | ioBuffer: I/O buffer, used to store the content of the input (one of the members of Tokenizer) |
private String | tokenType: word type: single=>ASCII, double=>non-ASCII, word=>default |
private boolean | preIsTokened: tag: the previous character is a cached double-byte character. "C1C2C3C4" ----(set the C1 isTokened) C1C2 "C2C3C4" ----(set the C2 isTokened) C1C2 C2C3 "C3C4" ----(set the C3 isTokened) "C1C2 C2C3 C3C4" |
Constructors Summary |
---|
public CJKTokenizer(Reader in)
Construct a token stream processing the given input.
//~ Constructors -----------------------------------------------------------
input = in;
|
Methods Summary |
---|
public final org.apache.lucene.analysis.Token | next(): Returns the next token in the stream, or null at EOS.
See http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.UnicodeBlock.html
for detail.
/** how many characters have been stored in buffer */
int length = 0;
/** the position used to create Token */
int start = offset;
while (true) {
/** current character */
char c;
/** Unicode block of the current character */
Character.UnicodeBlock ub;
offset++;
if (bufferIndex >= dataLen) {
dataLen = input.read(ioBuffer);
bufferIndex = 0;
}
if (dataLen == -1) {
if (length > 0) {
if (preIsTokened == true) {
length = 0;
preIsTokened = false;
}
break;
} else {
return null;
}
} else {
//get current character
c = ioBuffer[bufferIndex++];
//get the UnicodeBlock of the current character
ub = Character.UnicodeBlock.of(c);
}
//if the current character is ASCII or a halfwidth/fullwidth form
if ((ub == Character.UnicodeBlock.BASIC_LATIN)
|| (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS)
) {
if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS) {
/** convert HALFWIDTH_AND_FULLWIDTH_FORMS to BASIC_LATIN */
int i = (int) c;
i = i - 65248;
c = (char) i;
}
// if the current character is a letter, a digit, or '_' '+' '#'
if (Character.isLetterOrDigit(c)
|| ((c == '_') || (c == '+') || (c == '#'))
) {
if (length == 0) {
// "javaC1C2C3C4linux": the current character starts
// a new ASCII token
start = offset - 1;
} else if ("double".equals(tokenType)) {
// "javaC1C2C3C4linux": the previous character is non-ASCII
// and the current character is ASCII, so push the current
// character back and end the buffered token here
offset--;
bufferIndex--;
tokenType = "single";
if (preIsTokened == true) {
// there is only one non-ASCII has been stored
length = 0;
preIsTokened = false;
break;
} else {
break;
}
}
// store the LowerCase(c) in the buffer
buffer[length++] = Character.toLowerCase(c);
tokenType = "single";
// break the procedure if buffer overflowed!
if (length == MAX_WORD_LEN) {
break;
}
} else if (length > 0) {
if (preIsTokened == true) {
length = 0;
preIsTokened = false;
} else {
break;
}
}
} else {
// non-ASCII letter, eg."C1C2C3C4"
if (Character.isLetter(c)) {
if (length == 0) {
start = offset - 1;
buffer[length++] = c;
tokenType = "double";
} else {
if ("single".equals(tokenType)) {
offset--;
bufferIndex--;
//return the previous ASCII characters
break;
} else {
buffer[length++] = c;
tokenType = "double";
if (length == 2) {
offset--;
bufferIndex--;
preIsTokened = true;
break;
}
}
}
} else if (length > 0) {
if (preIsTokened == true) {
// empty the buffer
length = 0;
preIsTokened = false;
} else {
break;
}
}
}
}
return new Token(new String(buffer, 0, length), start, start + length,
tokenType
);
|
|
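The `next()` listing converts HALFWIDTH_AND_FULLWIDTH_FORMS characters to BASIC_LATIN by subtracting 65248 (0xFEE0). A minimal standalone sketch of that step is shown here. The names `FullwidthSketch`, `toBasicLatin` and `normalize` are hypothetical, and, as an assumption that differs from the listing, the sketch only shifts the fullwidth ASCII variants U+FF01 to U+FF5E, since other characters in that Unicode block (for example halfwidth katakana) have no BASIC_LATIN counterpart at that distance.

```java
public class FullwidthSketch {
    /** Distance between a fullwidth ASCII variant and its BASIC_LATIN form (0xFEE0). */
    static final int FULLWIDTH_OFFSET = 65248;

    /** Maps a fullwidth ASCII variant to BASIC_LATIN; leaves other characters unchanged. */
    static char toBasicLatin(char c) {
        // Assumption: restrict the shift to U+FF01..U+FF5E rather than the whole block.
        if (c >= '\uFF01' && c <= '\uFF5E') {
            return (char) (c - FULLWIDTH_OFFSET);
        }
        return c;
    }

    /** Applies toBasicLatin to every character of the input. */
    static String normalize(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(toBasicLatin(s.charAt(i)));
        }
        return sb.toString();
    }
}
```

For example, `FullwidthSketch.normalize("\uFF2A\uFF41\uFF56\uFF41")` returns `"Java"`, after which the characters pass the same letter-or-digit check as ordinary ASCII input.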