文章目錄
在有些应用中,需要针对应用的特征编写Analyzer,这里以Lucene5.0为例。在许多中文搜索应用,往往需要对文本进行分词,而用单字分词不能满足条件,所以需要使用其它分词,而MMSEG是其中一种。
从网上找到了chenbl写的mmseg4j,学会如何使用mmseg4j后,开始编写Analyzer。查看Analysis包的介绍后,发现主要是实现一个Tokenizer,然后在Analyzer中调用即可。于是编写了如下MMSegAnalyzer,
1 2 3 4 5 6 7 8 9
| public class MMSegAnalyzer extends Analyzer { public MMSegAnalyzer() { } @Override protected TokenStreamComponents createComponents(String fieldName) { return new TokenStreamComponents(new MMSegTokenizer()); } }
|
之后编写MMSegTokenizer,
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
| public class MMSegTokenizer extends Tokenizer { private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class); private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class); Dictionary dic; Seg seg; MMSeg mmSeg;
public MMSegTokenizer() { dic = Dictionary.getInstance(); seg = new ComplexSeg(dic); mmSeg = new MMSeg(input, seg); }
@Override public boolean incrementToken() throws IOException { clearAttributes(); Word word = null; while((word = mmSeg.next())!=null) { termAtt.copyBuffer(word.getSen(), word.getWordOffset(), word.getLength()); offsetAtt.setOffset(word.getStartOffset(), word.getEndOffset()); return true; } return false; } @Override public void close() throws IOException { super.close(); }
@Override public void reset() throws IOException { super.reset(); } }
|
其中
1 2
| private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class); private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
|
这两个属性是用来设置Token的内容和文本的偏移位置。
然后使用《Lucene in Action2》第四章中提到的AnalyzerDemo.java来进行测试,发现抛出异常java.lang.IllegalStateException: TokenStream contract violation,
查看TokenStream类后,知道reset函数是在incrementToken函数之前调用,主要是完成一些初始化工作。猜测是MMSeg有一些初始化工作没有完成,然后查看MMSeg类,发现有个reset函数,正是完成一些初始化工作。
于是修改修改MMSegTokenizer的reset函数,如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
| public class MMSegTokenizer extends Tokenizer { private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class); private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class); Dictionary dic; Seg seg; MMSeg mmSeg;
public MMSegTokenizer() { dic = Dictionary.getInstance(); seg = new ComplexSeg(dic); mmSeg = new MMSeg(input, seg); }
@Override public boolean incrementToken() throws IOException { clearAttributes(); Word word = null; while((word = mmSeg.next())!=null) { termAtt.copyBuffer(word.getSen(), word.getWordOffset(), word.getLength()); offsetAtt.setOffset(word.getStartOffset(), word.getEndOffset()); return true; } return false; } @Override public void close() throws IOException { super.close(); }
@Override public void reset() throws IOException { super.reset(); mmSeg.reset(input); } }
|
MMSegAnalyzer可以进行分词了。之后看mmseg4j的实现,才发现要实现一个高效的MMSEG分词并不是一件容易的事。