Class KoreanNumberFilter
- All Implemented Interfaces:
Closeable, AutoCloseable, Unwrappable<TokenStream>
A TokenFilter that normalizes Korean numbers to regular Arabic decimal numbers in half-width characters.
Korean numbers are often written using a combination of Hangul and Arabic numerals with various kinds of punctuation. For example, 3.2천 means 3200. This filter performs this kind of normalization, so a search for 3200 matches 3.2천 in text. The normalized numbers can also be used for other purposes, such as building range facets.
Note that this filter uses a token composition scheme and relies on punctuation tokens being present in the token stream. Make sure your KoreanTokenizer has discardPunctuation set to false. If punctuation characters such as ． (U+FF0E FULLWIDTH FULL STOP) are removed from the token stream, this filter sees the input tokens 3 and 2천 and outputs 3 and 2000 instead of 3200, which is likely not the intended result. To remove punctuation characters that are not part of normalized numbers from your index, add a StopFilter with the punctuation you wish to remove after KoreanNumberFilter in your analyzer chain.
Below are some examples of normalizations this filter supports. The input is untokenized text and the result is the single term attribute emitted for the input.
- 영영칠 becomes 7
- 일영영영 becomes 1000
- 삼천2백2십삼 becomes 3223
- 조육백만오천일 becomes 1000006005001
- 3.2천 becomes 3200
- 1.2만345.67 becomes 12345.67
- 4,647.100 becomes 4647.1
- 15,7 becomes 157 (be aware of this weakness)
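The normalizations listed above can be reproduced with a small recursive-descent parser. What follows is a minimal standalone sketch that uses only java.math.BigDecimal; it is not the filter's actual implementation. The class KoreanNumberSketch and all of its members are hypothetical names. As simplifying assumptions, it works on a whole string rather than a token stream and accepts only half-width digits, '.' and ','.

```java
import java.math.BigDecimal;

// Hypothetical sketch, not part of Lucene.
class KoreanNumberSketch {
    private static final String DIGITS = "영일이삼사오육칠팔구"; // values 0 to 9
    private static final String MEDIUM = "십백천";             // 10^1, 10^2, 10^3
    private static final String LARGE = "만억조경";            // 10^4, 10^8, 10^12, 10^16

    private final String text;
    private int pos = 0; // parsed-to marker

    private KoreanNumberSketch(String text) {
        this.text = text;
    }

    /** Returns the normalized number, or the input unchanged if it does not parse. */
    static String normalize(String input) {
        KoreanNumberSketch parser = new KoreanNumberSketch(input);
        BigDecimal sum = null;
        BigDecimal pair;
        // A number is a sum of large pairs: (medium number) x (large factor).
        while ((pair = combine(parser.mediumNumber(), parser.factor(LARGE, 4, 4))) != null) {
            sum = (sum == null) ? pair : sum.add(pair);
        }
        if (sum == null || parser.pos < input.length()) {
            return input; // no-op on anything that is not entirely a number
        }
        return sum.stripTrailingZeros().toPlainString();
    }

    // A medium number is a sum of medium pairs: (basic number) x (medium factor).
    private BigDecimal mediumNumber() {
        BigDecimal sum = null;
        BigDecimal pair;
        while ((pair = combine(basicNumber(), factor(MEDIUM, 1, 1))) != null) {
            sum = (sum == null) ? pair : sum.add(pair);
        }
        return sum;
    }

    // A basic number is a run of Arabic or Hangul digits; ',' is skipped, one '.' is kept.
    private BigDecimal basicNumber() {
        StringBuilder digits = new StringBuilder();
        int start = pos;
        int digitCount = 0;
        boolean seenPoint = false;
        for (; pos < text.length(); pos++) {
            char c = text.charAt(pos);
            if (c >= '0' && c <= '9') {
                digits.append(c);
                digitCount++;
            } else if (DIGITS.indexOf(c) >= 0) {
                digits.append(DIGITS.indexOf(c));
                digitCount++;
            } else if (c == '.' && !seenPoint) {
                digits.append('.');
                seenPoint = true;
            } else if (c != ',') {
                break;
            }
        }
        if (digitCount == 0) {
            pos = start; // nothing usable was read, so rewind
            return null;
        }
        return new BigDecimal(digits.toString());
    }

    // Reads one factor symbol, if present, and returns 10^(firstExponent + index * step).
    private BigDecimal factor(String symbols, int firstExponent, int step) {
        int index = (pos < text.length()) ? symbols.indexOf(text.charAt(pos)) : -1;
        if (index < 0) {
            return null;
        }
        pos++;
        return BigDecimal.TEN.pow(firstExponent + index * step);
    }

    // Multiplies number by factor; a missing part is treated as absent, not as zero.
    private static BigDecimal combine(BigDecimal number, BigDecimal factor) {
        if (number == null) {
            return factor; // bare factor such as 천, or null when both parts are missing
        }
        return (factor == null) ? number : number.multiply(factor);
    }
}
```

Calling KoreanNumberSketch.normalize("1.2만345.67") yields 12345.67 in this sketch. Because thousand separators are skipped without checking digit grouping, 15,7 comes out as 157, which mirrors the weakness noted in the list.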
Tokens preceded by a token with PositionIncrementAttribute of zero are left untouched and emitted as-is.
This filter does not use any part-of-speech information for its normalization. The motivation is to also support n-grammed token streams in the future.
This filter may in some cases normalize tokens that are not numbers in their context. For example, 전중경일 is a name and means Tanaka Kyōichi, but 경일 (Kyōichi) out of context can, strictly speaking, also represent the number 10000000000000001. This filter respects the KeywordAttribute, which can be used to prevent specific normalizations from happening.
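The value quoted for 경일 follows from reading 경 as the factor 10^16 and 일 as 1. The short sketch below checks that arithmetic with BigDecimal, the type the parse methods on this page return, and shows why a double could not hold the result exactly. The class and method names are hypothetical and are not part of the filter.

```java
import java.math.BigDecimal;

// Hypothetical sketch, not part of Lucene.
class LargeHangulValueSketch {
    // Value of a large factor 10^exponent followed by a small basic number.
    static String factorPlusUnits(int exponent, int units) {
        return BigDecimal.TEN.pow(exponent).add(BigDecimal.valueOf(units)).toPlainString();
    }

    // True when adding 1 to 10^16 is lost in double arithmetic.
    static boolean doubleLosesTheUnit() {
        return 1e16 + 1 == 1e16;
    }

    public static void main(String[] args) {
        System.out.println(factorPlusUnits(16, 1));  // prints 10000000000000001
        System.out.println(doubleLosesTheUnit());    // prints true
    }
}
```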
-
Nested Class Summary
Nested Classes
- static class: Buffer that holds a Korean number string and a position index used as a parsed-to marker
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
Field Summary
Fields (modifier and type, then field name)
- private boolean exhausted
- private static char[] exponents
- private int fallThroughTokens
- private final KeywordAttribute keywordAttr
- private static char NO_NUMERAL
- private StringBuilder numeral
- private static char[] numerals
- private final OffsetAttribute offsetAttr
- private final PositionIncrementAttribute posIncrAttr
- private final PositionLengthAttribute posLengthAttr
- private AttributeSource.State state
- private final CharTermAttribute termAttr
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
Constructor Summary
Constructors -
Method Summary
Methods (modifier and type, method: description)
- private int arabicNumeralValue(char c): Returns the numeric value for the specified Arabic numeral character.
- private int HangulNumeralValue(char c): Returns the value for the provided Hangul numeral.
- final boolean incrementToken(): Consumers (i.e., IndexWriter) use this method to advance the stream to the next token.
- boolean isArabicNumeral(char c): Arabic numeral predicate.
- private boolean isDecimalPoint(char c): Decimal point predicate.
- private boolean isFullWidthArabicNumeral(char c): Arabic full-width numeral predicate.
- private boolean isHalfWidthArabicNumeral(char c): Arabic half-width numeral predicate.
- private boolean isHangulNumeral(char c): Hangul numeral predicate that tests if the provided character is one of 영, 일, 이, 삼, 사, 오, 육, 칠, 팔, or 구.
- boolean isNumeral(char c): Numeral predicate.
- boolean isNumeral(String input): Numeral predicate.
- boolean isNumeralPunctuation(char c): Numeral punctuation predicate.
- boolean isNumeralPunctuation(String input): Numeral punctuation predicate.
- private boolean isThousandSeparator(char c): Thousand separator predicate.
- normalizeNumber(String number): Normalizes a Korean number.
- private BigDecimal parseBasicNumber: Parses a basic number, which is a sequence of Arabic numerals or a sequence of 0-9 Hangul numerals (영 to 구).
- parseLargeHangulNumeral: Parses large Hangul numerals (ten thousands or larger).
- private BigDecimal parseLargePair: Parses a pair of large numbers, i.e., the large Hangul factor is 10,000 (만) or larger.
- parseMediumHangulNumeral: Parses medium Hangul numerals (tens, hundreds or thousands).
- private BigDecimal parseMediumNumber: Parses a "medium sized" number, typically less than 10,000 (만), but might be larger due to a larger factor from parseBasicNumber.
- private BigDecimal parseMediumPair: Parses a pair of "medium sized" numbers, i.e., the large Hangul factor is at most 1,000 (천).
- private BigDecimal parseNumber: Parses a Korean number.
- void reset(): This method is called by a consumer before it begins consumption using TokenStream.incrementToken().
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end, unwrap
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Field Details
-
termAttr
-
offsetAttr
-
keywordAttr
-
posIncrAttr
-
posLengthAttr
-
NO_NUMERAL
private static char NO_NUMERAL
-
numerals
private static char[] numerals
-
exponents
private static char[] exponents
-
state
-
numeral
-
fallThroughTokens
private int fallThroughTokens
-
exhausted
private boolean exhausted
-
-
Constructor Details
-
KoreanNumberFilter
-
-
Method Details
-
incrementToken
Description copied from class: TokenStream
Consumers (i.e., IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate AttributeImpls with the attributes of the next token.
The producer must make no assumptions about the attributes after the method has returned: the caller may arbitrarily change them. If the producer needs to preserve the state for subsequent calls, it can use AttributeSource.captureState() to create a copy of the current attribute state.
This method is called for every token of a document, so an efficient implementation is crucial for good performance. To avoid calls to AttributeSource.addAttribute(Class) and AttributeSource.getAttribute(Class), references to all AttributeImpls that this stream uses should be retrieved during instantiation.
To ensure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in TokenStream.incrementToken().
- Specified by:
incrementToken in class TokenStream
- Returns:
false for end of stream; true otherwise
- Throws:
IOException
-
reset
Description copied from class: TokenFilter
This method is called by a consumer before it begins consumption using TokenStream.incrementToken().
Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh.
If you override this method, always call super.reset(), otherwise some internal state will not be correctly reset (e.g., Tokenizer will throw IllegalStateException on further usage).
NOTE: The default implementation chains the call to the input TokenStream, so be sure to call super.reset() when overriding this method.
- Overrides:
reset in class TokenFilter
- Throws:
IOException
-
normalizeNumber
Normalizes a Korean number.
- Parameters:
number - number to normalize
- Returns:
normalized number, or the number to normalize on error (no-op)
-
parseNumber
Parses a Korean number.
- Parameters:
buffer - buffer to parse
- Returns:
parsed number, or null on error or end of input
-
parseLargePair
Parses a pair of large numbers, i.e., the large Hangul factor is 10,000 (만) or larger.
- Parameters:
buffer - buffer to parse
- Returns:
parsed pair, or null on error or end of input
-
parseMediumNumber
Parses a "medium sized" number, typically less than 10,000 (만), but might be larger due to a larger factor from parseBasicNumber.
- Parameters:
buffer - buffer to parse
- Returns:
parsed number, or null on error or end of input
-
parseMediumPair
Parses a pair of "medium sized" numbers, i.e., the large Hangul factor is at most 1,000 (천).
- Parameters:
buffer - buffer to parse
- Returns:
parsed pair, or null on error or end of input
-
parseBasicNumber
Parses a basic number, which is a sequence of Arabic numerals or a sequence of 0-9 Hangul numerals (영 to 구).
- Parameters:
buffer - buffer to parse
- Returns:
parsed number, or null on error or end of input
-
parseLargeHangulNumeral
Parses large Hangul numerals (ten thousands or larger).
- Parameters:
buffer - buffer to parse
- Returns:
parsed number, or null on error or end of input
-
parseMediumHangulNumeral
Parses medium Hangul numerals (tens, hundreds or thousands).
- Parameters:
buffer - buffer to parse
- Returns:
parsed number, or null on error
-
isNumeral
Numeral predicate.
- Parameters:
input - string to test
- Returns:
true if and only if input is a numeral
-
isNumeral
public boolean isNumeral(char c)
Numeral predicate.
- Parameters:
c - character to test
- Returns:
true if and only if c is a numeral
-
isNumeralPunctuation
Numeral punctuation predicate.
- Parameters:
input - string to test
- Returns:
true if and only if input is a numeral punctuation string
-
isNumeralPunctuation
public boolean isNumeralPunctuation(char c)
Numeral punctuation predicate.
- Parameters:
c - character to test
- Returns:
true if and only if c is a numeral punctuation character
-
isArabicNumeral
public boolean isArabicNumeral(char c)
Arabic numeral predicate. Both half-width and full-width characters are supported.
- Parameters:
c - character to test
- Returns:
true if and only if c is an Arabic numeral
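To illustrate what supporting both widths can look like, here is a minimal sketch of the three Arabic numeral predicates and the value lookup. It assumes half-width digits are '0' to '9' and full-width digits are U+FF10 to U+FF19. The class ArabicNumeralSketch is a hypothetical name, and the filter's own private methods may be written differently.

```java
// Hypothetical sketch, not part of Lucene.
class ArabicNumeralSketch {
    static boolean isHalfWidth(char c) {
        return c >= '0' && c <= '9';
    }

    // Full-width digits occupy U+FF10 (zero) through U+FF19 (nine).
    static boolean isFullWidth(char c) {
        return c >= '\uFF10' && c <= '\uFF19';
    }

    static boolean isArabicNumeral(char c) {
        return isHalfWidth(c) || isFullWidth(c);
    }

    // Only meaningful when isArabicNumeral(c) is true.
    static int value(char c) {
        return isFullWidth(c) ? c - '\uFF10' : c - '0';
    }
}
```

With these definitions, both '7' and the full-width digit U+FF17 map to the value 7.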
-
isHalfWidthArabicNumeral
private boolean isHalfWidthArabicNumeral(char c)
Arabic half-width numeral predicate.
- Parameters:
c - character to test
- Returns:
true if and only if c is a half-width Arabic numeral
-
isFullWidthArabicNumeral
private boolean isFullWidthArabicNumeral(char c)
Arabic full-width numeral predicate.
- Parameters:
c - character to test
- Returns:
true if and only if c is a full-width Arabic numeral
-
arabicNumeralValue
private int arabicNumeralValue(char c)
Returns the numeric value for the specified Arabic numeral character. Behavior is undefined if a non-Arabic numeral is provided.
- Parameters:
c - Arabic numeral character
- Returns:
numeral value
-
isHangulNumeral
private boolean isHangulNumeral(char c)
Hangul numeral predicate that tests if the provided character is one of 영, 일, 이, 삼, 사, 오, 육, 칠, 팔, or 구. Hangul characters for larger numbers give a false value.
- Parameters:
c - character to test
- Returns:
true if and only if the character is one of 영, 일, 이, 삼, 사, 오, 육, 칠, 팔, or 구 (0 to 9)
-
HangulNumeralValue
private int HangulNumeralValue(char c)
Returns the value for the provided Hangul numeral. Only the characters for which isHangulNumeral returns true are supported; behavior is undefined for other characters.
- Parameters:
c - Hangul numeral character
- Returns:
numeral value
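The relationship between the Hangul numeral predicate and its value lookup can be sketched with a single lookup string, where the index of each character is its value. This is an illustration only: HangulNumeralSketch is a hypothetical class, and throwing on unsupported characters is a choice made here, whereas the documented behavior for such input is undefined.

```java
// Hypothetical sketch, not part of Lucene.
class HangulNumeralSketch {
    // Index in this string equals the numeric value, 0 to 9.
    private static final String NUMERALS = "영일이삼사오육칠팔구";

    static boolean isHangulNumeral(char c) {
        return NUMERALS.indexOf(c) >= 0;
    }

    static int value(char c) {
        int v = NUMERALS.indexOf(c);
        if (v < 0) {
            // Sketch-only choice; the documented behavior here is undefined.
            throw new IllegalArgumentException("not a Hangul numeral: " + c);
        }
        return v;
    }
}
```

Characters for larger numbers, such as 천 (1,000), are rejected by isHangulNumeral in this sketch, matching the description above.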
-
isDecimalPoint
private boolean isDecimalPoint(char c)
Decimal point predicate.
- Parameters:
c - character to test
- Returns:
true if and only if c is a decimal point
-
isThousandSeparator
private boolean isThousandSeparator(char c)
Thousand separator predicate.
- Parameters:
c - character to test
- Returns:
true if and only if c is a thousand separator
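As a rough illustration of how the two punctuation predicates can feed number normalization, the sketch below accepts the half-width and full-width forms of each character and uses them to canonicalize a string of half-width digits. The accepted characters ('.' and U+FF0E, ',' and U+FF0C) are an assumption based on the class description, and NumeralPunctuationSketch and canonical are hypothetical names.

```java
import java.math.BigDecimal;

// Hypothetical sketch, not part of Lucene.
class NumeralPunctuationSketch {
    // Assumed set: FULL STOP and U+FF0E FULLWIDTH FULL STOP.
    static boolean isDecimalPoint(char c) {
        return c == '.' || c == '\uFF0E';
    }

    // Assumed set: COMMA and U+FF0C FULLWIDTH COMMA.
    static boolean isThousandSeparator(char c) {
        return c == ',' || c == '\uFF0C';
    }

    // Drops thousand separators, folds the decimal point to '.', trims trailing zeros.
    // Throws NumberFormatException if the remaining text is not a plain decimal number.
    static String canonical(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (isThousandSeparator(c)) {
                continue; // dropped without checking digit grouping
            }
            sb.append(isDecimalPoint(c) ? '.' : c);
        }
        return new BigDecimal(sb.toString()).stripTrailingZeros().toPlainString();
    }
}
```

Because separators are dropped without validating that they sit between groups of three digits, canonical("15,7") gives 157 here.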
-