  • Jeroen Breuer 4908 posts 12265 karma points MVP 5x admin c-trib
    Feb 24, 2011 @ 16:51
    Jeroen Breuer
    0

    Examine returns wrong markup

    Hello,

    I'm using Examine to search for content and display the results. However, the results returned from Examine don't have the markup I expected.

    Here is the html in the RTE:

    <p>This is a test message.</p>
    <p>Link <a href="/{localLink:1528}" title="upgrade test">to</a> other page.</p>
    <p>How <a href="http://www.nu.nl">about</a> this?</p>

    Here is the result I get back from examine:

    \nThis is a test message.\n\nLink&nbsp;to other page.\n\nHow&nbsp;about this?\

    I would like to get back the actual HTML. Is this possible?

    Jeroen

  • Morten Christensen 596 posts 2773 karma points admin hq c-trib
    Feb 24, 2011 @ 16:57
    Morten Christensen
    0

    Examine strips HTML tags when indexing, so that's why you are not getting the markup back as you expected.

    It's in the UmbracoContentService class if you look in the source ;)

    - Morten

  • Jeroen Breuer 4908 posts 12265 karma points MVP 5x admin c-trib
    Feb 24, 2011 @ 16:58
    Jeroen Breuer
    0

    Hmm, that is a problem. So the only way to get back the actual HTML is to modify the source code?

    Jeroen

  • Morten Christensen 596 posts 2773 karma points admin hq c-trib
    Feb 24, 2011 @ 17:14
    Morten Christensen
    0

    The UmbracoContentService is called from OnGatheringNodeData, so you could write your own Umbraco indexer which, like the standard implementation (UmbracoContentIndexer), derives from BaseUmbracoIndexer. Or you could extend the standard indexer and simply override OnGatheringNodeData so that it doesn't call the striphtml method.

    I recently created a custom indexer for a project, which derives from BaseUmbracoIndexer, and it was pretty straightforward. Just remember to update the config files to use your custom indexer ;)

    - Morten

  • Jeroen Breuer 4908 posts 12265 karma points MVP 5x admin c-trib
    Feb 24, 2011 @ 17:24
    Jeroen Breuer
    0

    OK, it seems like I need to dive into Examine. I've never done anything like what you've just described. Is there some documentation about how to create or override the Umbraco indexer? Why is the striphtml method there in the first place? It should at least be optional...

    Jeroen

  • Morten Christensen 596 posts 2773 karma points admin hq c-trib
    Feb 24, 2011 @ 18:08
    Morten Christensen
    1

    I think something like this will do the trick. Please note I haven't tried it myself, so it's just a qualified guess:

    using System.Collections.Generic;
    using Examine;
    using UmbracoExamine;

    public class ExamineNodeIndexer : UmbracoContentIndexer
    {
        #region Overrides of UmbracoContentIndexer

        protected override IEnumerable<string> SupportedTypes
        {
            get
            {
                return new string[] { IndexTypes.Content };
            }
        }

        protected override void OnGatheringNodeData(IndexingNodeDataEventArgs e)
        {
            base.OnGatheringNodeData(e);

            // Ensure the special path and node type alias fields are added to the dictionary to be saved to file
            var path = e.Node.Attribute("path").Value;
            if (!e.Fields.ContainsKey(IndexPathFieldName))
                e.Fields.Add(IndexPathFieldName, path);

            // This needs to support both schemas, so get the nodeTypeAlias if it exists, otherwise the element name
            var nodeTypeAlias = e.Node.Attribute("nodeTypeAlias") == null ? e.Node.Name.LocalName : e.Node.Attribute("nodeTypeAlias").Value;
            if (!e.Fields.ContainsKey(NodeTypeAliasFieldName))
                e.Fields.Add(NodeTypeAliasFieldName, nodeTypeAlias);
        }

        #endregion
    }

    Change the type in ExamineSettings.config to the namespace-qualified class name and assembly of the custom implementation and try it out.
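
    As a rough illustration of that config change, here is a minimal sketch. It assumes the class above lives in a hypothetical namespace MyProject.Search inside an assembly called MyProject, and that the existing provider entry is named ContentIndexer; all three names are placeholders, so swap in your own:

    ```xml
    <Examine>
      <ExamineIndexProviders>
        <providers>
          <!-- Hypothetical names: "ContentIndexer", "MyProject.Search" and "MyProject" are placeholders. -->
          <!-- Only the type attribute changes; keep the other attributes your existing entry already has. -->
          <add name="ContentIndexer"
               type="MyProject.Search.ExamineNodeIndexer, MyProject" />
        </providers>
      </ExamineIndexProviders>
    </Examine>
    ```

    The type value follows the usual .NET "Namespace.ClassName, AssemblyName" form.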

    - Morten

  • Morten Christensen 596 posts 2773 karma points admin hq c-trib
    Feb 24, 2011 @ 18:09
    Morten Christensen
    0

    Oops, delete this line, otherwise you'll just end up with the stripped HTML again:

    base.OnGatheringNodeData(e);

    - Morten

  • James Telfer 65 posts 165 karma points
    Feb 25, 2011 @ 00:58
    James Telfer
    1

    I'd suggest a different approach, with respect.

    Instead of implementing the indexer, just attach yourself to the events already exposed by the current indexer implementation.

    So in the ApplicationBase:

    var index = ExamineManager.Instance.IndexProviderCollection[WEBCONTENT_INDEX] as UmbracoContentIndexer;
    index.GatheringNodeData += WebsiteContent_GatheringNodeData;

    Then later:

    private void WebsiteContent_GatheringNodeData(object sender, IndexingNodeDataEventArgs e)
    {
        try
        {
            var node = e.Node;
            // operate on node, set values in e.Fields
        }
        catch (Exception ex)
        {
            Log.Error(ex, "Failed to index items for node {0}", e.NodeId);
        }
    }

    In the IndexSet configuration, add the extra fields that you include here. For example, you might add a contentBodyRaw field as the HTML version of the contentBody field. When you set the e.Fields["contentBodyRaw"] in the GatheringNodeData event above, this content will then be indexed as you add it.
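
    A minimal sketch of what that might look like in ExamineIndex.config, assuming a hypothetical index set called WebContentIndexSet and the contentBody/contentBodyRaw field names from the example above (the set name and index path are placeholders, and attribute fields are left out for brevity):

    ```xml
    <ExamineLuceneIndexSets>
      <!-- Hypothetical set name and path; use the ones from your own configuration. -->
      <IndexSet SetName="WebContentIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/WebContent/">
        <IndexUserFields>
          <!-- The normal, HTML-stripped field that you search against -->
          <add Name="contentBody" />
          <!-- The extra field filled in from the GatheringNodeData handler -->
          <add Name="contentBodyRaw" />
        </IndexUserFields>
      </IndexSet>
    </ExamineLuceneIndexSets>
    ```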

    There are ways to make this additional field 'store only' too, which might be interesting to you.

    Shannon and Slace have some excellent blog posts on FarmCode/LINQ2Fail that will help you with this.

  • Shannon Deminick 1526 posts 5272 karma points MVP 3x
    Feb 25, 2011 @ 01:39
    Shannon Deminick
    1

    You should let Examine strip the HTML, because that is what gets indexed and you don't want to be searching against HTML markup.

    If you want to store data in the index just to retrieve it from a search result, you should use the DocumentWriting event, which allows you to directly manipulate how the data gets stored in Lucene. An example of this is here: http://farmcode.org/post/2010/08/23/Text-casing-and-Examine.aspx

     

  • Jeroen Breuer 4908 posts 12265 karma points MVP 5x admin c-trib
    Feb 25, 2011 @ 10:44
    Jeroen Breuer
    0

    I've got a solution: both Shannon's and James's methods work. I think James's method is the best one. Here is the code for both methods:

    Using GatheringNodeData:

    var indexer = (LuceneIndexer)ExamineManager.Instance.IndexProviderCollection["BlogIndexer"];
    indexer.GatheringNodeData += new EventHandler<IndexingNodeDataEventArgs>(indexer_GatheringNodeData);
    
    protected void indexer_GatheringNodeData(object sender, IndexingNodeDataEventArgs e)
    {
        var node = e.Node;
        XElement elementBodyText = node.Descendants("bodyText").FirstOrDefault();
        if (elementBodyText != null)
        {
            e.Fields.Add("__bodyText1", elementBodyText.Value);
        }
    }
    

    Using DocumentWriting:

    var indexer = (LuceneIndexer)ExamineManager.Instance.IndexProviderCollection["BlogIndexer"];
    indexer.DocumentWriting += new EventHandler<DocumentWritingEventArgs>(indexer_DocumentWriting);
    
    protected void indexer_DocumentWriting(object sender, DocumentWritingEventArgs e)
    {
        var luceneDocument = e.Document;
        var umbracoDocument = new Document(Convert.ToInt32(e.Fields["__NodeId"]));
    
        Property p = umbracoDocument.getProperty("bodyText");
        if (p != null)
        {
            luceneDocument.Add(new Lucene.Net.Documents.Field("__bodyText2", p.Value.ToString(), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.NOT_ANALYZED));
        }
    }
    

    In the first example I don't need to load the Umbraco Document, which is faster. So I guess in my situation I should use the GatheringNodeData event, or is there a reason why I shouldn't?

    Jeroen

  • James Telfer 65 posts 165 karma points
    Feb 25, 2011 @ 13:25
    James Telfer
    0

    Slace's tweet was a pointer to the following line in the blog Shannon referred to:

    //also, we're telling Lucene to just put this data in, nothing more
    doc.Add(new Field("__bodyContent", content, Field.Store.YES, Field.Index.NOT_ANALYZED));

    NOT_ANALYZED means the value is not run through an analyzer, so it is indexed as a single term rather than being tokenized. This would be slightly more efficient on the Lucene side but, as you point out, getting the Document is less efficient on the Umbraco side.

    However, since you won't include this field in the criteria for your search it shouldn't matter, and it will be available to you in the results. It _does_ matter if you were looking to replace the contents of a field rather than add an extra.

    All that said, Shannon and Slace know a dirty great lot more about Examine/Lucene than I do, so I'd be happy to defer to their wisdom.

  • Jeroen Breuer 4908 posts 12265 karma points MVP 5x admin c-trib
    Feb 28, 2011 @ 14:19
    Jeroen Breuer
    0

    Hmm, in my case I want to work with published data only, so using the Umbraco Document object isn't recommended in this situation. I think I'll keep using the GatheringNodeData method :).

    Jeroen

  • Jeroen Breuer 4908 posts 12265 karma points MVP 5x admin c-trib
    Mar 14, 2011 @ 11:08
    Jeroen Breuer
    0

    I tried a new method to get the latest published data, but it currently doesn't work:

    var indexer = (LuceneIndexer)ExamineManager.Instance.IndexProviderCollection["ContentIndexer"];
    indexer.DocumentWriting += new EventHandler<DocumentWritingEventArgs>(Indexer_DocumentWriting);
    
    protected void Indexer_DocumentWriting(object sender, DocumentWritingEventArgs e)
    {
        var luceneDocument = e.Document;
        var node = new Node(e.NodeId);
    
        IProperty p = node.GetProperty("searchDescription");
        if (p != null)
        {
            luceneDocument.Add(new Lucene.Net.Documents.Field("searchDescriptionHtml", p.Value, Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.NOT_ANALYZED));
        }
    }
    

    On var node = new Node(e.NodeId); it throws an exception:

    System.NullReferenceException was unhandled by user code
      Message=Object reference not set to an instance of an object.
      Source=umbraco
      StackTrace:
           at umbraco.presentation.UmbracoContext.get_Current()
           at umbraco.library.GetXmlNodeById(String id)
           at umbraco.NodeFactory.Node..ctor(Int32 NodeId)
           at VanGoghBrabant.BLL.Default.VanGoghBrabantApplicationBase.Indexer_DocumentWriting(Object sender, DocumentWritingEventArgs e) in C:\SVN\vangoghbrabant\VanGoghBrabant.Extension\VanGoghBrabant.BLL\Default\VanGoghBrabantApplicationBase.cs:line 69
           at Examine.LuceneEngine.Providers.LuceneIndexer.OnDocumentWriting(DocumentWritingEventArgs docArgs)
           at Examine.LuceneEngine.Providers.LuceneIndexer.AddDocument(Dictionary`2 fields, IndexWriter writer, Int32 nodeId, String type)
           at Examine.LuceneEngine.Providers.LuceneIndexer.ProcessAddQueueItem(FileInfo x, IndexWriter writer)
           at Examine.LuceneEngine.Providers.LuceneIndexer.ForceProcessQueueItems()
      InnerException: 
    
    It seems the UmbracoContext is null. I've created a workitem for this: http://umbraco.codeplex.com/workitem/30161. Please vote for it.
    Jeroen
  • Shannon Deminick 1526 posts 5272 karma points MVP 3x
    Mar 14, 2011 @ 12:22
    Shannon Deminick
    1

    I've updated your CodePlex bug. This isn't a bug: the indexer runs in a separate thread, not the web thread, and therefore there is no UmbracoContext.

  • Jeroen Breuer 4908 posts 12265 karma points MVP 5x admin c-trib
    Mar 14, 2011 @ 12:51
    Jeroen Breuer
    0

    Ok I understand, but does this mean there is no way to get the published content in that event?

    Jeroen

  • Shannon Deminick 1526 posts 5272 karma points MVP 3x
    Mar 14, 2011 @ 12:56
    Shannon Deminick
    0

    What are you using UmbracoContext for?

  • Jeroen Breuer 4908 posts 12265 karma points MVP 5x admin c-trib
    Mar 14, 2011 @ 13:01
    Jeroen Breuer
    0

    If I try var node = new Node(e.NodeId); I get an error that UmbracoContext is null. All I want to do is get the latest published data.

    Jeroen

  • Shannon Deminick 1526 posts 5272 karma points MVP 3x
    Mar 14, 2011 @ 13:22
    Shannon Deminick
    0

    You might be using the wrong event for what you are trying to do. GatheringNodeData fires during the http request so you can put data into the dictionary then. The DocumentWriting event is used to put the data from the Dictionary into the index.

  • Jeroen Breuer 4908 posts 12265 karma points MVP 5x admin c-trib
    Mar 14, 2011 @ 13:45
  • Shannon Deminick 1526 posts 5272 karma points MVP 3x
    Mar 14, 2011 @ 14:09
    Shannon Deminick
    0

    You can do a bit of both. Here's the difference between the 2 events:

    • GatheringNodeData : gets the information from the data source (i.e. Umbraco) and stores this data to be indexed
    • DocumentWriting : puts the gathered data into the index
    So, you can gather whatever data you like in GatheringNodeData and change the way it will be indexed in DocumentWriting. You have full access to the underlying Lucene document in DocumentWriting, so you can do whatever you want to it (i.e. change what's in it already, delete what's in it, add to it, etc.).
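
    Putting the two events together, a minimal, untested sketch could look like the following. It reuses the "ContentIndexer" provider name and the bodyText property from earlier in this thread; the class name RawHtmlIndexEvents and the field name bodyTextRaw are made up for illustration. It also assumes that Examine has already copied the gathered fields into the Lucene document by the time DocumentWriting fires, which is why the existing copy of the field is removed before the unanalyzed one is added:

    ```csharp
    using System.Linq;
    using System.Xml.Linq;
    using Examine;
    using Examine.LuceneEngine;
    using Examine.LuceneEngine.Providers;
    using umbraco.BusinessLogic;

    // Hypothetical class name; it hooks both events once at application start-up.
    public class RawHtmlIndexEvents : ApplicationBase
    {
        // Assumed field name for illustration, not a field Examine defines itself.
        private const string RawField = "bodyTextRaw";

        public RawHtmlIndexEvents()
        {
            // "ContentIndexer" is the provider name used earlier in this thread.
            var indexer = (LuceneIndexer)ExamineManager.Instance.IndexProviderCollection["ContentIndexer"];
            indexer.GatheringNodeData += Indexer_GatheringNodeData;
            indexer.DocumentWriting += Indexer_DocumentWriting;
        }

        // Step 1: copy the unstripped HTML from the node XML into the field dictionary.
        private void Indexer_GatheringNodeData(object sender, IndexingNodeDataEventArgs e)
        {
            XElement body = e.Node.Descendants("bodyText").FirstOrDefault();
            if (body != null)
            {
                e.Fields[RawField] = body.Value;
            }
        }

        // Step 2: change how the gathered value is written to the Lucene document.
        private void Indexer_DocumentWriting(object sender, DocumentWritingEventArgs e)
        {
            string raw;
            if (e.Fields.TryGetValue(RawField, out raw))
            {
                // Drop the copy that was added with the default settings (assumption, see above).
                e.Document.RemoveField(RawField);

                // Store the value so it comes back with search results, without running it through an analyzer.
                e.Document.Add(new Lucene.Net.Documents.Field(
                    RawField,
                    raw,
                    Lucene.Net.Documents.Field.Store.YES,
                    Lucene.Net.Documents.Field.Index.NOT_ANALYZED));
            }
        }
    }
    ```

    Because the HTML is read from e.Node during GatheringNodeData, the DocumentWriting handler never needs a Node, a Document or the UmbracoContext.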
