
  • Harm Holtackers 2 posts 22 karma points
    Jun 13, 2014 @ 14:28

    Examine - lucene index problem

    Hello,

     

    I have a problem with how Lucene indexes Umbraco content.

    The following HTML is indexed incorrectly:

    test<br />test2 becomes testtest2 in the search results, because the <br /> is stripped.

    Is it possible to replace the <br /> with \n instead of stripping it?

     

    Best regards,

    Harm Holtackers

  • Jamie Pollock 174 posts 853 karma points c-trib
    Jun 13, 2014 @ 16:42

    Hey Harm,
    By default, HTML is stripped by Lucene. However, in the SearchResult Fields collection you'll find a __Raw_<propertyAlias> version, which contains the original HTML.

    I recently discovered this myself in my own journey into Lucene/Examine. :)

    I hope this answers your question.

    Thanks,
    Jamie

  • Harm Holtackers 2 posts 22 karma points
    Jun 13, 2014 @ 16:54

    Thanks Jamie,

    I discovered this as well, but the problem is that I do a fuzzy search on contentText, and when the words are concatenated some results don't show up.
    It's strange that the words are concatenated when only a <br /> is between them.

    Best regards,

    Harm Holtackers

  • Jamie Pollock 174 posts 853 karma points c-trib
    Jun 13, 2014 @ 17:13

    I guess you could add an event handler to alleviate the situation. I'm not suggesting this is the best solution, mind you. There might be a better one, as I'm fairly new to Lucene.

    First, subscribe to the GatheringNodeData event, which allows you to edit data before it's indexed.

    var nameOfYourIndexer = "MyCustomExternalIndexer";
    ExamineManager.Instance.IndexProviderCollection[nameOfYourIndexer].GatheringNodeData += ExamineEvents_GatheringNodeData;
    

    Then, in the event handler:

    // Requires: using System.Linq; using System.Text.RegularExpressions; using Examine;

    void ExamineEvents_GatheringNodeData(object sender, IndexingNodeDataEventArgs e) {
        GenerateSearchableHtmlContent(e);
    }
    
    private void GenerateSearchableHtmlContent(IndexingNodeDataEventArgs e) {
        var node = e.Node;
        var htmlContentPropertyAlias = "yourPropertyAlias";
    
        var htmlContentFromXmlNode = node.Descendants(htmlContentPropertyAlias).FirstOrDefault();
        if (htmlContentFromXmlNode != null && string.IsNullOrEmpty(htmlContentFromXmlNode.Value) == false) {
            // Replace closing tags and <br>, <br/>, <br /> with a space so neighbouring words are not joined.
            var contentWithTagsReplacedBySpaces = Regex.Replace(htmlContentFromXmlNode.Value, @"(</[a-z][a-z0-9]*>|<br\s*/?>)", " ", RegexOptions.IgnoreCase);
    
            // Strip the remaining tags.
            var strippedHtml = umbraco.library.StripHtml(contentWithTagsReplacedBySpaces);
    
            // Collapse runs of whitespace so the search field does not contain excess spaces.
            var searchableText = Regex.Replace(strippedHtml, @"\s+", " ").Trim();
    
            e.Fields.Add("searchableSanitizedHtmlField", searchableText);
        }
    }
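
    The tag-replacement idea in the handler above can also be tried on its own, outside Umbraco. This is only a rough sketch: HtmlSanitizerSketch and ToSearchableText are made-up names (not part of Umbraco or Examine), and it replaces every tag with a space instead of calling umbraco.library.StripHtml, so the result may differ slightly from what the indexer produces.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    // Hypothetical helper for illustration only; not an Umbraco or Examine API.
    public static class HtmlSanitizerSketch
    {
        public static string ToSearchableText(string html)
        {
            if (string.IsNullOrEmpty(html))
            {
                return string.Empty;
            }

            // Replace every tag (opening, closing, or self-closing) with a space
            // so that words separated only by markup are not concatenated.
            var withoutTags = Regex.Replace(html, @"<[^>]+>", " ");

            // Collapse runs of whitespace into a single space and trim the ends.
            return Regex.Replace(withoutTags, @"\s+", " ").Trim();
        }
    }
    ```

    With this, ToSearchableText("test<br />test2") gives "test test2", so a fuzzy search still sees two separate words.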
    

    I hope this helped. Note: I've not tested the code at all...

    Thanks,
    Jamie
