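Since I wrote "Importing data from MySQL into Elasticsearch", several readers have written in, mostly asking how to do incremental imports with elasticsearch-jdbc, so here are the concrete steps. The example imports data from WordPress.
Preparing Elasticsearch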
curl -XPUT 'localhost:9200/article?pretty'
curl -XPUT 'localhost:9200/article/_mappings/blog' -d '@mapping.json'
Contents of mapping.json:
{
  "_all": {
    "enabled": true,
    "analyzer": "ik_max_word_syno",
    "search_analyzer": "ik_smart"
  },
  "properties": {
    "id": {
      "type": "string",
      "index": "not_analyzed",
      "include_in_all": false
    },
    "title": {
      "type": "string",
      "analyzer": "ik_max_word_syno",
      "search_analyzer": "ik_smart",
      "boost": 2
    },
    "content": {
      "type": "string",
      "analyzer": "ik_max_word_syno",
      "search_analyzer": "ik_smart"
    }
  }
}
For the analyzer configuration, see "Adding synonyms to ik in Elasticsearch".
- View the mapping
curl -XGET 'localhost:9200/article/_mapping'
Configuring elasticsearch-jdbc
Go to the bin directory of elasticsearch-jdbc and open mysql-blog.sh. Its contents are as follows:
#!/bin/sh

DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
bin=${DIR}/../bin
lib=${DIR}/../lib

echo '
{
    "type" : "jdbc",
    "jdbc" : {
        "url" : "jdbc:mysql://localhost:3306/blog",
        "statefile" : "statefile.json",
        "schedule" : "0 0-59 0-23 ? * *",
        "user" : "blog",
        "password" : "12345678",
        "sql" : [{
            "statement": "select id as _id, id, post_title as title, post_content as content from wp_posts where post_status = ? and post_modified > ? ",
            "parameter": ["publish", "$metrics.lastexecutionstart"]}
        ],
        "index" : "article",
        "type" : "blog",
        "metrics": {
            "enabled" : true
        },
        "elasticsearch" : {
            "cluster" : "elasticsearch",
            "host" : "localhost",
            "port" : 9300
        }
    }
}
' | java \
    -cp "${lib}/*" \
    -Dlog4j.configurationFile=${bin}/log4j2.xml \
    org.xbib.tools.Runner \
    org.xbib.tools.JDBCImporter
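The two settings to focus on here are statefile and schedule.
statefile is indispensable for incremental imports. Only when statefile is configured does elasticsearch-jdbc know where to store the time of the last fetch, and only then can it index incrementally.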
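To make the role of statefile concrete, here is a minimal shell sketch of the same idea done by hand: read the time of the previous run from a file, query only the rows modified since then, and record the start of the current run for next time. This only illustrates the principle and is not how elasticsearch-jdbc is implemented; STATE_FILE, read_last_run, build_query and save_run_start are hypothetical names, and the state file here is a plain one-line text file rather than the JSON that the tool writes.

```shell
#!/bin/sh
# Illustration only: a hand-rolled "remember the last run" loop.
# STATE_FILE and all function names are hypothetical, not part of elasticsearch-jdbc.

STATE_FILE="${STATE_FILE:-${TMPDIR:-/tmp}/es_jdbc_last_run.txt}"

# Print the timestamp saved by the previous run, or the epoch on the first run.
read_last_run() {
    if [ -f "$STATE_FILE" ]; then
        cat "$STATE_FILE"
    else
        echo "1970-01-01 00:00:00"
    fi
}

# Build the incremental query: only published posts modified after the given time.
build_query() {
    printf "select id as _id, id, post_title as title, post_content as content from wp_posts where post_status = 'publish' and post_modified > '%s'\n" "$1"
}

# Save the start time of this run so the next run continues from here.
save_run_start() {
    printf '%s\n' "$1" > "$STATE_FILE"
}

# One import cycle: note the start time, print the query, persist the start time.
run_start="$(date '+%Y-%m-%d %H:%M:%S')"
build_query "$(read_last_run)"
save_run_start "$run_start"
```

Without the saved timestamp, every run would fall back to the epoch and select all published posts again, which is why the setting cannot be left out.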
schedule works much like crontab: it runs the incremental import at fixed times. For the exact syntax, see the documentation or crontab.
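The schedule value used in the script has six fields rather than the five of a classic crontab line. The sketch below labels them, assuming the Quartz-style layout (seconds first), which the ? in the fourth position suggests. describe_schedule is a hypothetical helper written for this post, not part of elasticsearch-jdbc.

```shell
#!/bin/sh
# Label the first six fields of a schedule expression.
# Field names assume the Quartz cron layout; describe_schedule is a hypothetical helper.

describe_schedule() (
    # Runs in a subshell so the option change below does not leak to the caller.
    # Turn off globbing so "*" and "?" are not expanded to file names.
    set -f
    # Split the expression on whitespace into positional parameters.
    set -- $1
    [ "$#" -ge 6 ] || { echo "expected at least 6 fields, got $#" >&2; exit 1; }
    printf 'seconds=%s\nminutes=%s\nhours=%s\nday-of-month=%s\nmonth=%s\nday-of-week=%s\n' \
        "$1" "$2" "$3" "$4" "$5" "$6"
)

describe_schedule "0 0-59 0-23 ? * *"
```

Read this way, "0 0-59 0-23 ? * *" fires at second 0 of every minute of every hour, so the incremental import runs once a minute.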
Checking the results
http://localhost:9200/article/blog/_search?q=test
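The same check can be run from the command line instead of the browser. The sketch below builds the search URL shown above and passes it to curl. ES_HOST, search_url and search_blog are hypothetical names used only here, and the search term is inserted as-is, so it must already be URL-safe.

```shell
#!/bin/sh
# ES_HOST, search_url and search_blog are hypothetical names for this sketch.
ES_HOST="${ES_HOST:-localhost:9200}"

# Build the URI-search URL for the article/blog type.
# The term is not URL-encoded, so avoid spaces and reserved characters.
search_url() {
    printf 'http://%s/article/blog/_search?q=%s&pretty\n' "$ES_HOST" "$1"
}

# Run the search; this needs a reachable Elasticsearch node.
search_blog() {
    curl -s -XGET "$(search_url "$1")"
}

# Print the URL that would be requested for the term "test".
search_url test
```

Calling search_blog test after the importer has run at least once should return the imported posts that match "test".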
While configuring elasticsearch-jdbc, checking the logs is important. The log files are in the logs directory under bin. To see more output, edit log4j2.xml and set the log level to debug.