Hey Harm,
By default HTML is stripped by Lucene. However, in the SearchResult Fields collection you'll find a __raw_<propertyAlias> field which contains the original HTML version.
I recently discovered this myself in my own journey into Lucene/Examine. :)
I discovered this too, but the problem is that I do a fuzzy search on the contentText, and when the words are concatenated some results don't show up. It's strange that the words are concatenated when there is only a <br /> between them.
I guess you could add an event to alleviate the situation. I'm not suggesting this is the best solution mind you. There might be a better solution as I'm fairly new to Lucene.
First of all, subscribe to the GatheringNodeData event, which allows you to edit data before it's indexed.
var nameOfYourIndexer = "MyCustomExternalIndexer";
ExamineManager.Instance.IndexProviderCollection[nameOfYourIndexer].GatheringNodeData += ExamineEvents_GatheringNodeData;
Then, in the event handler:
// Needs: using System.Linq; using System.Text.RegularExpressions; using Examine;
void ExamineEvents_GatheringNodeData(object sender, IndexingNodeDataEventArgs e) {
    GenerateSearchableHtmlContent(e);
}
private void GenerateSearchableHtmlContent(IndexingNodeDataEventArgs e) {
    var node = e.Node;
    var htmlContentPropertyAlias = "yourPropertyAlias";
    var htmlContentFromXmlNode = node.Descendants(htmlContentPropertyAlias).FirstOrDefault();
    if (htmlContentFromXmlNode != null && !string.IsNullOrEmpty(htmlContentFromXmlNode.Value)) {
        // Replace closing tags and <br>, <br/>, <br /> with a space so adjacent words stay separated.
        var htmlWithSpacesForBreakingTags = Regex.Replace(htmlContentFromXmlNode.Value, @"(</[a-z][a-z0-9]*\s*>|<br\s*/?>)", " ", RegexOptions.IgnoreCase);
        var strippedHtml = umbraco.library.StripHtml(htmlWithSpacesForBreakingTags);
        // Collapse runs of whitespace so the search field doesn't end up with too much of it.
        var searchableText = Regex.Replace(strippedHtml, @"\s+", " ").Trim();
        e.Fields.Add("searchableSanitizedHtmlField", searchableText);
    }
}
I hope this helped. Note: I've not tested the code at all...
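Since the handler above is untested, here is a minimal standalone sketch of just the text clean-up step, so it can be tried outside Umbraco. The class name HtmlSanitizerSketch and the method ToSearchableText are made up for illustration, and the plain `<[^>]*>` regex is only a stand-in for umbraco.library.StripHtml, not what that method actually does internally.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Umbraco or Examine.
public static class HtmlSanitizerSketch
{
    // Closing tags plus the <br>, <br/> and <br /> variants.
    private static readonly Regex BreakingTags =
        new Regex(@"</[a-z][a-z0-9]*\s*>|<br\s*/?>", RegexOptions.IgnoreCase);

    // Assumption: a simple stand-in for umbraco.library.StripHtml.
    private static readonly Regex AnyTag = new Regex(@"<[^>]*>");

    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string ToSearchableText(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // 1. Turn word-separating tags into spaces before stripping.
        var spaced = BreakingTags.Replace(html, " ");
        // 2. Drop whatever tags are left (opening tags, attributes).
        var stripped = AnyTag.Replace(spaced, string.Empty);
        // 3. Collapse whitespace runs and trim the ends.
        return Whitespace.Replace(stripped, " ").Trim();
    }
}
```

With this, HtmlSanitizerSketch.ToSearchableText("test<br />test2") gives "test test2" rather than "testtest2". One trade-off to be aware of: inline closing tags such as </b> also become spaces, so markup in the middle of a word would split that word in the index.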
Examine - lucene index problem
Hello,
I have a problem with how Lucene indexes Umbraco content.
The following HTML is indexed incorrectly:
test<br />test2 becomes testtest2 in the search results, because the <br /> is stripped.
Is it possible to replace the <br /> with \n instead of stripping it?
Best regards,
Harm Holtackers
Hey Harm,
By default HTML is stripped by Lucene. However, in the SearchResult Fields collection you'll find a __raw_<propertyAlias> field which contains the original HTML version.
I recently discovered this myself in my own journey into Lucene/Examine. :)
I hope this answers your question.
Thanks,
Jamie
Thanks Jamie,
I discovered this too, but the problem is that I do a fuzzy search on the contentText, and when the words are concatenated some results don't show up.
It's strange that the words are concatenated when there is only a <br /> between them.
Best regards,
Harm Holtackers
I guess you could add an event to alleviate the situation. I'm not suggesting this is the best solution mind you. There might be a better solution as I'm fairly new to Lucene.
First of all, subscribe to the GatheringNodeData event, which allows you to edit data before it's indexed.
Then in the indexer event handler.
I hope this helped. Note: I've not tested the code at all...
Thanks,
Jamie