I need to crawl two websites and index them into Elasticsearch as two different indexes or types. I am using Nutch 1.15 with Elasticsearch 5.3.3.

How can I crawl two different sites and index them separately in Elasticsearch with Nutch? Can this be achieved with a single instance of Nutch?

1 Answer

At the moment there is nothing in Nutch to do document routing. For instance, if you use the index-jexl-filter plugin, the filtering is done before the document is sent to the index writers. You can configure multiple index writers (two, in this case), and every document will then be sent to both of them. These writers can write to different indexes or document types, but all documents will end up in both.

That being said, if you find a way to do the filtering on the ES side, you could configure those two index writers, send the documents to both, and then filter in ES at ingestion time (perhaps with something like a script in ES that prevents a document from being ingested if it doesn't match a certain requirement). Off the top of my head, though, I cannot point to anything specific that does this right now.

Alternatively, you can clone the Elasticsearch indexer plugin and customise it so that the type is extracted from the document itself.

EDIT

Thanks to @sebastian-nagel for pointing this out.

I totally missed the JexlExchange (https://nutch.apache.org/apidocs/apidocs-1.15/org/apache/nutch/exchange/jexl/JexlExchange.html), an exchange that does exactly what you want. With it, it is possible to do document routing at indexing time using a JEXL expression.
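As a rough sketch of what that routing could look like, assuming the layout described on the Exchanges wiki page: the fragment below would go inside the existing `<exchanges>` root of `conf/exchanges.xml`. The writer ids (`indexer_elastic_site_a`, `indexer_elastic_site_b`) and the host names are hypothetical placeholders; the ids would have to match two Elasticsearch writers that you define yourself in `conf/index-writers.xml`, each pointing at its own index. I am also assuming that the `host` field is present on the document (it is normally added by the index-basic plugin) and that the exchange-jexl plugin is enabled in `plugin.includes`. Please check all of this against the files shipped with your Nutch version.

```xml
<!-- Hypothetical fragment: place inside the <exchanges> root of conf/exchanges.xml -->

<!-- Documents from the first site go only to the first Elasticsearch writer -->
<exchange id="exchange_site_a" class="org.apache.nutch.exchange.jexl.JexlExchange">
  <writers>
    <!-- placeholder id, must match a writer in conf/index-writers.xml -->
    <writer id="indexer_elastic_site_a" />
  </writers>
  <params>
    <!-- JEXL expression evaluated per document; the host name is a placeholder -->
    <param name="expr" value="doc.getFieldValue('host')=='site-a.example'" />
  </params>
</exchange>

<!-- Documents from the second site go only to the second Elasticsearch writer -->
<exchange id="exchange_site_b" class="org.apache.nutch.exchange.jexl.JexlExchange">
  <writers>
    <writer id="indexer_elastic_site_b" />
  </writers>
  <params>
    <param name="expr" value="doc.getFieldValue('host')=='site-b.example'" />
  </params>
</exchange>
```

With something like this in place, a single Nutch instance can crawl both sites in one crawl and the split happens only at indexing time. How documents that match neither expression are handled is not covered by this sketch.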

2 Comments

Nutch 1.15 adds the possibility to route documents, so it should be possible to route documents by host to two ES indexes; see wiki.apache.org/nutch/Exchanges and NUTCH-2412.
Ups, missed the JexlExchange totally 😅. I thought
