examine lucene index problem - API Questions

Harm Holtackers 2 posts 22 karma points

Jun 13, 2014 @ 14:28


Examine - lucene index problem

Hello,

I have a problem with how Lucene indexes my Umbraco content.

The following HTML is indexed incorrectly:

test<br />test2 becomes testtest2 in the search results, because the <br /> is stripped without leaving a separator.

Is it possible to replace the <br /> with \n instead of stripping it?

Best regards,

Harm Holtackers

Jamie Pollock 174 posts 853 karma points c-trib

Jun 13, 2014 @ 16:42


Hey Harm,
By default, HTML is stripped when the content is indexed. However, the SearchResult Fields collection also contains a __raw_<propertyAlias> entry that holds the original HTML.

I recently discovered this myself on my own journey into Lucene/Examine. :)

I hope this answers your question.

Thanks,
Jamie

Harm Holtackers 2 posts 22 karma points

Jun 13, 2014 @ 16:54


Thanks Jamie,

I discovered this too, but the problem is that I do a fuzzy search on contentText, and when the words are concatenated some results don't show up.
It's strange that the words are concatenated when there is only a <br /> between them.

Best regards,

Harm Holtackers

Jamie Pollock 174 posts 853 karma points c-trib

Jun 13, 2014 @ 17:13

I guess you could add an event handler to work around this. I'm not suggesting this is the best solution, mind you. There might be a better one, as I'm fairly new to Lucene.

First, subscribe to the indexer's GatheringNodeData event, which lets you edit the data before it is indexed.

var nameOfYourIndexer = "MyCustomExternalIndexer";
ExamineManager.Instance.IndexProviderCollection[nameOfYourIndexer].GatheringNodeData += ExamineEvents_GatheringNodeData;

Then, in the event handler:

// Requires: using System.Linq; using System.Text.RegularExpressions; using Examine;
void ExamineEvents_GatheringNodeData(object sender, IndexingNodeDataEventArgs e) {
    GenerateSearchableHtmlContent(e);
}

private void GenerateSearchableHtmlContent(IndexingNodeDataEventArgs e) {
    var node = e.Node;
    var htmlContentPropertyAlias = "yourPropertyAlias";

    var htmlContentFromXmlNode = node.Descendants(htmlContentPropertyAlias).FirstOrDefault();
    if (htmlContentFromXmlNode != null && string.IsNullOrEmpty(htmlContentFromXmlNode.Value) == false) {
        // Matches closing tags (</p>, </h1>, ...) and line breaks written as <br>, <br/> or <br />.
        var contentWhereTheClosingTagAndLinebreakTagsAreRemovedAndReplacedWithAnAdditionalSpace = Regex.Replace(htmlContentFromXmlNode.Value, @"</[a-z][a-z0-9]*\s*>|<br\s*/?>", " ", RegexOptions.IgnoreCase);

        var strippedHtml = umbraco.library.StripHtml(contentWhereTheClosingTagAndLinebreakTagsAreRemovedAndReplacedWithAnAdditionalSpace);

        var htmlTrimmedForWhitespaceToEnsureNotTooMuchWhitespaceIsLeftInTheResultingSearchField = Regex.Replace(strippedHtml, @"\s+", " ").Trim();

        // Extra field holding the cleaned text; point the search at this field.
        e.Fields.Add("searchableSanitizedHtmlField", htmlTrimmedForWhitespaceToEnsureNotTooMuchWhitespaceIsLeftInTheResultingSearchField);
    }
}
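
If you want to check the replacement logic outside of Umbraco first, here is a minimal standalone sketch of the same three steps. The class name SearchTextSanitizer and its Sanitize method are made up for this example, and the simple tag-stripping regex is only a rough stand-in for umbraco.library.StripHtml, so treat it as an approximation rather than what the indexer actually does.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Examine or Umbraco.
public static class SearchTextSanitizer
{
    // Closing tags (</p>, </h1>, ...) and line breaks (<br>, <br/>, <br />).
    private static readonly Regex BreakingTags =
        new Regex(@"</[a-z][a-z0-9]*\s*>|<br\s*/?>", RegexOptions.IgnoreCase);

    // Any remaining tag. Assumption: a crude stand-in for umbraco.library.StripHtml.
    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    // Runs of whitespace, collapsed to a single space below.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string Sanitize(string html)
    {
        // 1. Put a space where a closing tag or line break separated two words.
        var spaced = BreakingTags.Replace(html, " ");
        // 2. Drop whatever markup is left.
        var stripped = AnyTag.Replace(spaced, string.Empty);
        // 3. Collapse the whitespace and trim the ends.
        return Whitespace.Replace(stripped, " ").Trim();
    }
}

public static class Program
{
    public static void Main()
    {
        // The example from the question: the words now stay separate.
        Console.WriteLine(SearchTextSanitizer.Sanitize("test<br />test2"));           // test test2
        Console.WriteLine(SearchTextSanitizer.Sanitize("<p>first</p><p>second</p>")); // first second
    }
}
```

With the words kept apart like this, a fuzzy search against the sanitized field should be able to match "test" and "test2" as separate terms again.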
