Elastic search in Umbraco 8
Hey guys,
Is there any package or code sample for implementing Elasticsearch for content and PDFs?
Regards, Dhanesh :)
Only for content, see https://our.umbraco.com/packages/website-utilities/novicellexamineelasticsearch/. However, you could tap into the media save events and then inject the PDF items into the index.
Regards
Ismail
Hey Ismail, yes, but I mean this one: https://www.elastic.co/elasticsearch/
Yeah, you create an account and an index on that. Then you use the Elastic package to index the content, and in the config you point it to that index.
Oh, for this we can use this package
https://our.umbraco.com/packages/website-utilities/novicellexamineelasticsearch/
for content, right?
Yeah, so you don't have to write all the code to get the Umbraco content into Elastic. You also don't have to write the client to query it.
Basically we have Examine, which is the search and indexing layer in Umbraco. It wraps Lucene.Net, and it is extensible.
The package link I sent you is an Examine provider for Elasticsearch, so instead of the underlying engine being Lucene it is now Elastic (although Elastic is itself powered by Lucene). There is also an Azure Search provider, ExamineX, although that one is paid.
The Elastic one only does content and media. For media, however, it only indexes stub information like filename, size and extension, not the actual content of the file. So in theory you can use Examine events to test whether the item being indexed is a media item; if it is, test whether it is a PDF, and if so extract the text with the PDF library of your choice and inject the extracted content into the index. That way you get the actual PDF content.
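To make the flow concrete, here is a rough, language-neutral sketch in Python of the check-then-inject idea. It is not Examine's real API (that is C#), and every name in it (`enrich_media_item`, the `category`, `extension`, `path` and `fileTextContent` keys, `fake_extractor`) is made up for illustration. The extractor is passed in so that any PDF library could sit behind it.

```python
from typing import Callable, Dict


def enrich_media_item(
    item: Dict[str, str],
    extract_text: Callable[[str], str],
) -> Dict[str, str]:
    """Return the item to index, adding extracted text when it is a PDF media item.

    All field names here are hypothetical; a real indexing event would expose
    its own way to read the item type and add fields.
    """
    # Step 1: only media items are of interest; content passes through untouched.
    if item.get("category") != "media":
        return item
    # Step 2: only PDFs are handled (case-insensitive extension check).
    if item.get("extension", "").lower() != "pdf":
        return item
    # Step 3: extract the text and inject it as an extra field on a copy.
    enriched = dict(item)
    enriched["fileTextContent"] = extract_text(item["path"])
    return enriched


def fake_extractor(path: str) -> str:
    # Stand-in for a real PDF library call; assumed for the example only.
    return f"text extracted from {path}"


if __name__ == "__main__":
    pdf = {"category": "media", "extension": "PDF", "path": "/media/1/guide.pdf"}
    page = {"category": "content", "nodeName": "Home"}
    print(enrich_media_item(pdf, fake_extractor))
    print(enrich_media_item(page, fake_extractor))
```

Because the extractor is just a function argument, swapping one PDF library for another would not touch the check itself.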
Regards
Ismail
Great man! Thanks for the explanation.
I have done PDF extraction before, see https://our.umbraco.com/packages/website-utilities/cogumbracoexaminemediaindexer/. You could look at the code for this and the libs I used, and then use that.
I did create a composition for the Examine PDF indexer, which uses iTextSharp, and I swapped out the iTextSharp engine for Apache Tika. Apache Tika can extract most file formats. It's a bit on the heavy side, as it's written in Java and uses IKVM, but it works really well. See https://www.nuget.org/packages/TikaOnDotNet/
@Ismail Mayat PDFsharp does the same job for PDF files, right? Is it good to use PDFsharp instead of Apache Tika if I only need to extract data from PDF files?
Great! 🤘🏻🤘🏻