Examine returns wrong markup

Jeroen Breuer 4909 posts 12266 karma points MVP 5x admin c-trib

Feb 24, 2011 @ 16:51

Hello,

I'm using Examine to search for content and display the found content. However, the results returned from Examine don't have the markup I expected.

Here is the html in the RTE:

<p>This is a test message.</p>
<p>Link <a href="/{localLink:1528}" title="upgrade test">to</a> other page.</p>
<p>How <a href="http://www.nu.nl">about</a> this?</p>

Here is the result I get back from examine:

\nThis is a test message.\n\nLink&nbsp;to other page.\n\nHow&nbsp;about this?\

I would like to get back the actual html. Is this possible?

Jeroen

Morten Christensen 596 posts 2773 karma points admin hq c-trib

Feb 24, 2011 @ 16:57

Examine strips HTML tags upon indexing, so that's why you are not getting the markup back as you expected.

It's in the UmbracoContentService class if you look in the source ;)
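
To see why the indexed value comes back without tags, here is a minimal sketch of tag stripping. It is a naive regex written for illustration, not Examine's actual implementation, and the class and method names are hypothetical:

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlStripDemo
{
    // Naive tag stripper for illustration only; this is NOT Examine's own code.
    public static string StripTags(string html)
    {
        // Remove anything that looks like <...>, leaving only the text nodes.
        return Regex.Replace(html, "<[^>]*>", string.Empty);
    }

    public static void Main()
    {
        string rte = "<p>Link <a href=\"/{localLink:1528}\" title=\"upgrade test\">to</a> other page.</p>";
        Console.WriteLine(StripTags(rte)); // Link to other page.
    }
}
```

The link target and title are lost along with the tags, which is why the search result cannot be turned back into the original markup.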

- Morten

Jeroen Breuer 4909 posts 12266 karma points MVP 5x admin c-trib

Feb 24, 2011 @ 16:58

Hmm that is a problem. So the only way to get back the actual html is to modify the source code?

Jeroen

Morten Christensen 596 posts 2773 karma points admin hq c-trib

Feb 24, 2011 @ 17:14

The UmbracoContentService is called from OnGatheringNodeData, so you could write your own indexer, which like the standard implementation (UmbracoContentIndexer) implements BaseUmbracoIndexer. Or you could extend the UmbracoContentIndexer and simply override OnGatheringNodeData, so it doesn't call the striphtml method.

I have recently created a custom indexer for a project, which implements BaseUmbracoIndexer, and it was pretty straightforward. Just remember to update the config files to use your custom indexer ;)

- Morten

Jeroen Breuer 4909 posts 12266 karma points MVP 5x admin c-trib

Feb 24, 2011 @ 17:24

OK, seems like I need to dive into Examine. I've never done anything you've just described. Is there some documentation about how to create or override the UmbracoIndexer? Why is the striphtml method there in the first place? It should at least be optional to use...

Jeroen

Morten Christensen 596 posts 2773 karma points admin hq c-trib

Feb 24, 2011 @ 18:08

I think something like this will do the trick. Please note I haven't tried it myself, so it's just a qualified guess:

using System.Collections.Generic;
using Examine;
using UmbracoExamine;

public class ExamineNodeIndexer : UmbracoContentIndexer
{
    #region Overrides of UmbracoContentIndexer

    // Only index content nodes (not media)
    protected override IEnumerable<string> SupportedTypes
    {
        get { return new string[] { IndexTypes.Content }; }
    }

    protected override void OnGatheringNodeData(IndexingNodeDataEventArgs e)
    {
        base.OnGatheringNodeData(e);

        //ensure the special path and node type alias fields are added to the dictionary to be saved to file
        var path = e.Node.Attribute("path").Value;
        if (!e.Fields.ContainsKey(IndexPathFieldName))
            e.Fields.Add(IndexPathFieldName, path);

        //this needs to support both schemas so get the nodeTypeAlias if it exists, otherwise the name
        var aliasAttribute = e.Node.Attribute("nodeTypeAlias");
        var nodeTypeAlias = aliasAttribute == null ? e.Node.Name.LocalName : aliasAttribute.Value;
        if (!e.Fields.ContainsKey(NodeTypeAliasFieldName))
            e.Fields.Add(NodeTypeAliasFieldName, nodeTypeAlias);
    }

    #endregion
}

Change the type in ExamineSettings.config to the namespace and assembly of the custom implementation and try it out.

- Morten

Morten Christensen 596 posts 2773 karma points admin hq c-trib

Feb 24, 2011 @ 18:09

Oops, delete this line, otherwise you'll just end up with the stripped HTML again:

base.OnGatheringNodeData(e);

- Morten

James Telfer 65 posts 165 karma points

Feb 25, 2011 @ 00:58

I'd suggest a different approach, with respect.

Instead of implementing the indexer, just attach yourself to the events already exposed by the current indexer implementation.

So in the ApplicationBase:

var index = ExamineManager.Instance.IndexProviderCollection[WEBCONTENT_INDEX] as UmbracoContentIndexer;
index.GatheringNodeData += WebsiteContent_GatheringNodeData;

Then later:

private void WebsiteContent_GatheringNodeData(object sender, IndexingNodeDataEventArgs e)
{
    try
    {
        var node = e.Node;
        // operate on node, set values in e.Fields
    }
    catch (Exception ex)
    {
        Log.Error(ex, "Failed to index items for node {0}", e.NodeId);
    }
}

In the IndexSet configuration, add the extra fields that you include here. For example, you might add a contentBodyRaw field as the HTML version of the contentBody field. When you set the e.Fields["contentBodyRaw"] in the GatheringNodeData event above, this content will then be indexed as you add it.

There are ways to make this additional field 'store only' too, which might be interesting to you.

Shannon and Slace have some excellent blog posts on FarmCode/LINQ2Fail that will help you with this.

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Feb 25, 2011 @ 01:39

You should let Examine strip the HTML, because that is what gets indexed and you don't want to be searching against HTML markup.

If you want to store data in the index just to retrieve it based on a search result, you should use the DocumentWriting event, which allows you to directly manipulate how the data gets stored in Lucene. An example of this is here: http://farmcode.org/post/2010/08/23/Text-casing-and-Examine.aspx

Jeroen Breuer 4909 posts 12266 karma points MVP 5x admin c-trib

Feb 25, 2011 @ 10:44

I've got the solution; both Shannon's and James's methods work. I think James's method is the best one. Here is the code for both methods:

Using GatheringNodeData:

var indexer = (LuceneIndexer)ExamineManager.Instance.IndexProviderCollection["BlogIndexer"];
indexer.GatheringNodeData += new EventHandler<IndexingNodeDataEventArgs>(indexer_GatheringNodeData);

protected void indexer_GatheringNodeData(object sender, IndexingNodeDataEventArgs e)
{
    var node = e.Node;
    XElement elementBodyText = node.Descendants("bodyText").FirstOrDefault();
    if (elementBodyText != null)
    {
        e.Fields.Add("__bodyText1", elementBodyText.Value);
    }
}

Using DocumentWriting:

var indexer = (LuceneIndexer)ExamineManager.Instance.IndexProviderCollection["BlogIndexer"];
indexer.DocumentWriting += new EventHandler<DocumentWritingEventArgs>(indexer_DocumentWriting);

protected void indexer_DocumentWriting(object sender, DocumentWritingEventArgs e)
{
    var luceneDocument = e.Document;
    var umbracoDocument = new Document(Convert.ToInt32(e.Fields["__NodeId"]));

    Property p = umbracoDocument.getProperty("bodyText");
    if (p != null)
    {
        luceneDocument.Add(new Lucene.Net.Documents.Field("__bodyText2", p.Value.ToString(), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.NOT_ANALYZED));
    }
}

In the first example I don't need to load the Umbraco Document, which is faster. So I guess in my situation I should use the GatheringNodeData event, or is there a reason why I shouldn't?

Jeroen

James Telfer 65 posts 165 karma points

Feb 25, 2011 @ 13:25

Slace's tweet was a pointer to the following line in the blog Shannon referred to:

//also, we're telling Lucene to just put this data in, nothing more
doc.Add(new Field("__bodyContent", content, Field.Store.YES, Field.Index.NOT_ANALYZED));

NOT_ANALYZED means the value is indexed as-is, as a single term, without being run through the analyzer. This would be slightly more efficient on the Lucene side but, as you point out, getting the Document is less efficient on the Umbraco side.

However, since you won't include this field in the criteria for your search it shouldn't matter, and it will be available to you in the results. It _does_ matter if you were looking to replace the contents of a field rather than add an extra.
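
For reading the extra field back out of the results, something like the following sketch should work. The searcher name "BlogSearcher" is an assumption (use whatever ExamineSettings.config defines), "__bodyText1" is the field name from Jeroen's GatheringNodeData example, and the class and method names are made up for illustration. I haven't run this against a live index:

```csharp
using System;
using Examine;

public static class RawMarkupSearch
{
    // "BlogSearcher" is an assumed searcher name; "__bodyText1" is the extra
    // field added in the GatheringNodeData handler earlier in this thread.
    public static void PrintRawBodies(string term)
    {
        var searcher = ExamineManager.Instance.SearchProviderCollection["BlogSearcher"];

        // Search the normal (stripped) fields, then read the raw HTML from the extra field.
        foreach (SearchResult result in searcher.Search(term, true))
        {
            string rawHtml;
            if (result.Fields.TryGetValue("__bodyText1", out rawHtml))
            {
                Console.WriteLine("{0}: {1}", result.Id, rawHtml);
            }
        }
    }
}
```

The search itself still runs against the stripped text; the raw field is only read from each hit.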

All that said, Shannon and Slace know a dirty great lot more about Examine/Lucene than I do, so I'd be happy to defer to their wisdom.

Jeroen Breuer 4909 posts 12266 karma points MVP 5x admin c-trib

Feb 28, 2011 @ 14:19

Hmm in my case I want to work with only published data so using the umbraco Document object isn't recommended in this situation. Think I'll keep using the GatheringNodeData method :).

Jeroen

Jeroen Breuer 4909 posts 12266 karma points MVP 5x admin c-trib

Mar 14, 2011 @ 11:08

I tried a new method to get the latest published data, but it currently doesn't work:

var indexer = (LuceneIndexer)ExamineManager.Instance.IndexProviderCollection["ContentIndexer"];
indexer.DocumentWriting += new EventHandler<DocumentWritingEventArgs>(Indexer_DocumentWriting);

protected void Indexer_DocumentWriting(object sender, DocumentWritingEventArgs e)
{
    var luceneDocument = e.Document;
    var node = new Node(e.NodeId);

    IProperty p = node.GetProperty("searchDescription");
    if (p != null)
    {
        luceneDocument.Add(new Lucene.Net.Documents.Field("searchDescriptionHtml", p.Value, Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.NOT_ANALYZED));
    }
}

On var node = new Node(e.NodeId); it throws an exception:

System.NullReferenceException was unhandled by user code
  Message=Object reference not set to an instance of an object.
  Source=umbraco
  StackTrace:
       at umbraco.presentation.UmbracoContext.get_Current()
       at umbraco.library.GetXmlNodeById(String id)
       at umbraco.NodeFactory.Node..ctor(Int32 NodeId)
       at VanGoghBrabant.BLL.Default.VanGoghBrabantApplicationBase.Indexer_DocumentWriting(Object sender, DocumentWritingEventArgs e) in C:\SVN\vangoghbrabant\VanGoghBrabant.Extension\VanGoghBrabant.BLL\Default\VanGoghBrabantApplicationBase.cs:line 69
       at Examine.LuceneEngine.Providers.LuceneIndexer.OnDocumentWriting(DocumentWritingEventArgs docArgs)
       at Examine.LuceneEngine.Providers.LuceneIndexer.AddDocument(Dictionary`2 fields, IndexWriter writer, Int32 nodeId, String type)
       at Examine.LuceneEngine.Providers.LuceneIndexer.ProcessAddQueueItem(FileInfo x, IndexWriter writer)
       at Examine.LuceneEngine.Providers.LuceneIndexer.ForceProcessQueueItems()
  InnerException:

It seems the UmbracoContext is null. I've created a workitem for this: http://umbraco.codeplex.com/workitem/30161. Please vote for it.

Jeroen

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Mar 14, 2011 @ 12:22

I've updated your CodePlex bug. This isn't a bug: the indexer runs in a separate thread, not the web thread, and therefore there is no UmbracoContext.
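
As a rough analogy only (this uses a hypothetical thread-bound field, not Umbraco's or ASP.NET's real context classes), the following sketch shows how a value that is tied to one thread is simply absent on a worker thread:

```csharp
using System;
using System.Threading;

public static class ThreadContextDemo
{
    // Hypothetical stand-in for a per-request context; not an Umbraco type.
    [ThreadStatic]
    private static string currentContext;

    // Returns what a freshly started worker thread sees when it reads the value.
    public static string ReadFromWorkerThread()
    {
        string seen = "unset";
        var worker = new Thread(() => { seen = currentContext ?? "null"; });
        worker.Start();
        worker.Join();
        return seen;
    }

    public static void Main()
    {
        currentContext = "request context";        // set on the "web" thread
        Console.WriteLine(currentContext);         // request context
        Console.WriteLine(ReadFromWorkerThread()); // null
    }
}
```

Code that runs on the indexing thread therefore has to get its data from what the event arguments provide rather than from anything bound to the web request.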

Jeroen Breuer 4909 posts 12266 karma points MVP 5x admin c-trib

Mar 14, 2011 @ 12:51

Ok I understand, but does this mean there is no way to get the published content in that event?

Jeroen

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Mar 14, 2011 @ 12:56

What are you using UmbracoContext for?

Jeroen Breuer 4909 posts 12266 karma points MVP 5x admin c-trib

Mar 14, 2011 @ 13:01

If I try var node = new Node(e.NodeId); I get an error that UmbracoContext is null. All I want to do is get the latest published data.

Jeroen

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Mar 14, 2011 @ 13:22

You might be using the wrong event for what you are trying to do. GatheringNodeData fires during the HTTP request, so you can put data into the dictionary then. The DocumentWriting event is used to put the data from the dictionary into the index.

Jeroen Breuer 4909 posts 12266 karma points MVP 5x admin c-trib

Mar 14, 2011 @ 13:45

I know that's also a solution. See this post: http://our.umbraco.org/forum/developers/api-questions/17737-Examine-returns-wrong-markup?p=0#comment66758.

But James said that might not be the best solution: http://our.umbraco.org/forum/developers/api-questions/17737-Examine-returns-wrong-markup?p=0#comment66783.

Jeroen

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Mar 14, 2011 @ 14:09

You can do a bit of both. Here's the difference between the 2 events:

GatheringNodeData : gets the information from the data source (i.e. Umbraco) and stores this data to be indexed
DocumentWriting : puts the gathered data into the index

So, you can gather whatever data you like and change the way it will be indexed in DocumentWriting. You have full access to the underlying Lucene document in DocumentWriting, so you can do whatever you want to it (i.e. change what's in it already, delete what's in it, add to it, etc.).
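
A sketch of doing "a bit of both", put together from the snippets earlier in this thread: gather the raw markup in GatheringNodeData, then decide how it is stored in DocumentWriting. The field name "bodyTextRaw" and the class name are assumptions, "BlogIndexer" and "bodyText" come from Jeroen's earlier code, and this is untested:

```csharp
using System.Linq;
using Examine;
using Examine.LuceneEngine;
using Examine.LuceneEngine.Providers;
using Lucene.Net.Documents;

public class RawMarkupIndexing
{
    // Assumed field name for the unstripped markup; pick any name not already in use.
    private const string RawFieldName = "bodyTextRaw";

    public void Attach()
    {
        var indexer = (LuceneIndexer)ExamineManager.Instance.IndexProviderCollection["BlogIndexer"];
        indexer.GatheringNodeData += OnGatheringNodeData;
        indexer.DocumentWriting += OnDocumentWriting;
    }

    // Step 1: read the raw markup from the node XML and put it in the dictionary.
    private void OnGatheringNodeData(object sender, IndexingNodeDataEventArgs e)
    {
        var bodyText = e.Node.Descendants("bodyText").FirstOrDefault();
        if (bodyText != null)
        {
            e.Fields[RawFieldName] = bodyText.Value;
        }
    }

    // Step 2: take the gathered value and control how Lucene stores it.
    private void OnDocumentWriting(object sender, DocumentWritingEventArgs e)
    {
        string rawHtml;
        if (!e.Fields.TryGetValue(RawFieldName, out rawHtml))
        {
            return;
        }

        // Drop the analyzed copy, if one was added, and store the value untouched.
        e.Document.RemoveFields(RawFieldName);
        e.Document.Add(new Field(RawFieldName, rawHtml, Field.Store.YES, Field.Index.NOT_ANALYZED));
    }
}
```

Because the DocumentWriting handler only reads from e.Fields, it needs neither the Umbraco Document nor the UmbracoContext.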
