Most articles on this topic were written in 2015, and few (if any) in 2017. Much seems to have changed since then.
Here are the main steps.
- Download and install HBase 0.98.8-hadoop2, and start it.
- Download the Nutch 2.3.1 source and build it with the HBase store.
- Download Solr 6.6.0, create a core named nutch using the schema file provided in the Nutch installation (some modifications are needed, see below), and start it.
Besides the edits suggested by the Stack Overflow thread in Reference 1, the “defaultSearchField” and “defaultOperator” lines in schema.xml must also be deleted (these are no longer supported in Lucene 6.6.0).
- Start a crawl from some seed URLs using the following command:
bin/crawl /opt/apache-nutch-2.3.1/urls/seed.txt TestCrawl http://localhost:8983/solr/nutch 2
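The schema.xml cleanup mentioned in the Solr step can be scripted. The following is only a minimal sketch: the function name `strip_legacy_schema_settings` is made up for illustration, and it assumes each of the two settings sits on its own line, as in a typical schema.xml. It writes a cleaned copy rather than editing the file in place, so the original stays available for comparison.

```shell
# Hypothetical helper: copy a schema.xml while dropping the two legacy settings.
# Usage: strip_legacy_schema_settings <input schema.xml> <output schema.xml>
strip_legacy_schema_settings() {
  src="$1"
  dst="$2"
  # Assumption: each setting occupies a single line of its own.
  # First expression removes the <defaultSearchField> element,
  # second removes the line that carries the defaultOperator attribute.
  sed -e '/<defaultSearchField>/d' -e '/defaultOperator=/d' "$src" > "$dst"
}
```

For example, `strip_legacy_schema_settings schema.xml schema.cleaned.xml` leaves schema.xml untouched and produces a copy that can be reviewed before it goes into the core's conf directory. The other edits from Reference 1 still have to be applied separately.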
Note: The Nutch 2.x Tutorial (Reference 2) is also a bit outdated. It ends with an invalid bare “nutch readdb” command, without ever running a crawl. Examples of valid commands are nutch readdb -crawlId TestCrawl -url http://nutch.apache.org/ and nutch readdb -crawlId TestCrawl -stats.
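To check a finished crawl, the two readdb forms can be wrapped in small shell functions. This is a sketch under assumptions: the function names are hypothetical, `NUTCH_HOME` is assumed to point at the directory that contains bin/nutch (its location depends on how you built Nutch), and the option is spelled `-crawlId` as far as I can tell from the Nutch 2.x usage text.

```shell
# Hypothetical wrappers around "nutch readdb".
# Assumption: NUTCH_HOME is the directory that contains bin/nutch.

# Print overall statistics for one crawl ID.
nutch_readdb_stats() {
  "${NUTCH_HOME:?set NUTCH_HOME first}/bin/nutch" readdb -crawlId "$1" -stats
}

# Print the stored record for a single URL within one crawl ID.
nutch_readdb_url() {
  "${NUTCH_HOME:?set NUTCH_HOME first}/bin/nutch" readdb -crawlId "$1" -url "$2"
}
```

With the crawl ID from the crawl command, `nutch_readdb_stats TestCrawl` and `nutch_readdb_url TestCrawl http://nutch.apache.org/` run the same two commands as the examples in the note.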
References:
- https://stackoverflow.com/questions/38525848/solr-6-and-nutch-2-3-1-integration
- https://wiki.apache.org/nutch/Nutch2Tutorial