I'm using Examine to search for content and display what it finds. However, the results returned from Examine don't have the markup I expected.
Here is the HTML in the RTE:
<p>This is a test message.</p>
<p>Link <a href="/{localLink:1528}" title="upgrade test">to</a> other page.</p>
<p>How <a href="http://www.nu.nl">about</a> this?</p>
Here is the result I get back from Examine:
\nThis is a test message.\n\nLink to other page.\n\nHow about this?\
I would like to get back the actual HTML. Is this possible?
The UmbracoContentService is called from OnGatheringNodeData, so you could write your own UmbracoIndexer which, like the standard implementation (UmbracoContentIndexer), implements BaseUmbracoIndexer. Or you could extend the UmbracoIndexer and simply override OnGatheringNodeData so that it doesn't call the striphtml method.
I recently created a custom indexer for a project, which implements BaseUmbracoIndexer, and it was pretty straightforward. Just remember to update the config files to use your custom indexer ;)
OK, it seems like I need to dive into Examine. I've never done anything like what you've just described. Is there some documentation about how to create or override the UmbracoIndexer? Why is the striphtml method there in the first place? It should at least be optional to use...
//ensure the special path and node type alias fields are added to the dictionary to be saved to file
var path = e.Node.Attribute("path").Value;
if (!e.Fields.ContainsKey(IndexPathFieldName))
    e.Fields.Add(IndexPathFieldName, path);

//this needs to support both schemas so get the nodeTypeAlias if it exists, otherwise the name
var nodeTypeAlias = e.Node.Attribute("nodeTypeAlias") == null
    ? e.Node.Name.LocalName
    : e.Node.Attribute("nodeTypeAlias").Value;
if (!e.Fields.ContainsKey(NodeTypeAliasFieldName))
    e.Fields.Add(NodeTypeAliasFieldName, nodeTypeAlias);
}
#endregion
}
Change the type in ExamineSettings.config to the namespace and assembly of the custom implementation and try it out.
Instead of implementing the indexer, just attach yourself to the events already exposed by the current indexer implementation.
So in the ApplicationBase:
var index = ExamineManager.Instance.IndexProviderCollection[WEBCONTENT_INDEX] as UmbracoContentIndexer;
index.GatheringNodeData += WebsiteContent_GatheringNodeData;
Then later:
private void WebsiteContent_GatheringNodeData(object sender, IndexingNodeDataEventArgs e)
{
    try
    {
        var node = e.Node;
        // operate on node, set values in e.Fields
    }
    catch (Exception ex)
    {
        Log.Error(ex, "Failed to index items for node {0}", e.NodeId);
    }
}
In the IndexSet configuration, add the extra fields that you include here. For example, you might add a contentBodyRaw field as the HTML version of the contentBody field. When you set the e.Fields["contentBodyRaw"] in the GatheringNodeData event above, this content will then be indexed as you add it.
There are ways to make this additional field 'store only' too, which might be interesting to you.
Shannon and Slace have some excellent blog posts on FarmCode/LINQ2Fail that will help you with this.
You should let Examine strip the HTML, because that is what gets indexed and you don't want to be searching against HTML markup.
If you want to store data in the index just to retrieve it from a search result, you should use the DocumentWriting event, which allows you to directly manipulate how the data gets stored in Lucene. An example of this is here: http://farmcode.org/post/2010/08/23/Text-casing-and-Examine.aspx
I've got the solution. Both Shannon's and James's methods work, but I think James's method is the better one. Here is the code for both methods:
Using GatheringNodeData:
var indexer = (LuceneIndexer)ExamineManager.Instance.IndexProviderCollection["BlogIndexer"];
indexer.GatheringNodeData += new EventHandler<IndexingNodeDataEventArgs>(indexer_GatheringNodeData);

protected void indexer_GatheringNodeData(object sender, IndexingNodeDataEventArgs e)
{
    var node = e.Node;
    XElement elementBodyText = node.Descendants("bodyText").FirstOrDefault();
    if (elementBodyText != null)
    {
        e.Fields.Add("__bodyText1", elementBodyText.Value);
    }
}
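To see why this handler gets the markup back, here is a minimal standalone sketch that needs no Examine or Umbraco assemblies. It assumes the node XML stores the rich text inside a CDATA section and that e.Fields behaves like a plain Dictionary<string, string>. The class name, method name and sample XML are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class RawHtmlFieldSketch
{
    // Copies the raw value of the first <bodyText> descendant into the field dictionary.
    public static void AddRawBodyText(XElement node, IDictionary<string, string> fields)
    {
        XElement body = node.Descendants("bodyText").FirstOrDefault();
        if (body != null)
        {
            // Indexer assignment overwrites an existing key instead of throwing like Add does.
            fields["__bodyText1"] = body.Value;
        }
    }

    static void Main()
    {
        // Hypothetical node XML; XElement.Value returns the CDATA content unchanged.
        var node = XElement.Parse(
            "<BlogPost id=\"1528\"><bodyText><![CDATA[<p>This is a test message.</p>]]></bodyText></BlogPost>");
        var fields = new Dictionary<string, string>();

        AddRawBodyText(node, fields);

        Console.WriteLine(fields["__bodyText1"]); // prints: <p>This is a test message.</p>
    }
}
```

Using the indexer rather than Add is a small defensive choice: if the handler ever ends up attached twice, or the key already exists, the value is overwritten rather than causing an exception.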
Using DocumentWriting:
var indexer = (LuceneIndexer)ExamineManager.Instance.IndexProviderCollection["BlogIndexer"];
indexer.DocumentWriting += new EventHandler<DocumentWritingEventArgs>(indexer_DocumentWriting);

protected void indexer_DocumentWriting(object sender, DocumentWritingEventArgs e)
{
    var luceneDocument = e.Document;
    var umbracoDocument = new Document(Convert.ToInt32(e.Fields["__NodeId"]));
    Property p = umbracoDocument.getProperty("bodyText");
    if (p != null)
    {
        luceneDocument.Add(new Lucene.Net.Documents.Field(
            "__bodyText2",
            p.Value.ToString(),
            Lucene.Net.Documents.Field.Store.YES,
            Lucene.Net.Documents.Field.Index.NOT_ANALYZED));
    }
}
In the first example I don't need to load the Umbraco Document, which is faster. So I guess in my situation I should use the GatheringNodeData event, or is there a reason why I shouldn't?
Slace's tweet was a pointer to the following line in the blog Shannon referred to:
//also, we're telling Lucene to just put this data in, nothing more
doc.Add(new Field("__bodyContent", content, Field.Store.YES, Field.Index.NOT_ANALYZED));
NOT_ANALYZED means the value is indexed as a single term without being run through an analyzer. This would be slightly more efficient on the Lucene side but, as you point out, getting the Document is less efficient on the Umbraco side.
However, since you won't include this field in the criteria for your search it shouldn't matter, and it will be available to you in the results. It _does_ matter if you were looking to replace the contents of a field rather than add an extra.
All that said, Shannon and Slace know a dirty great lot more about Examine/Lucene than I do, so I'd be happy to defer to their wisdom.
Hmm in my case I want to work with only published data so using the umbraco Document object isn't recommended in this situation. Think I'll keep using the GatheringNodeData method :).
I tried a new method to get the latest published data, but it currently doesn't work:
var indexer = (LuceneIndexer)ExamineManager.Instance.IndexProviderCollection["ContentIndexer"];
indexer.DocumentWriting += new EventHandler<DocumentWritingEventArgs>(Indexer_DocumentWriting);

protected void Indexer_DocumentWriting(object sender, DocumentWritingEventArgs e)
{
    var luceneDocument = e.Document;
    var node = new Node(e.NodeId);
    IProperty p = node.GetProperty("searchDescription");
    if (p != null)
    {
        luceneDocument.Add(new Lucene.Net.Documents.Field(
            "searchDescriptionHtml",
            p.Value,
            Lucene.Net.Documents.Field.Store.YES,
            Lucene.Net.Documents.Field.Index.NOT_ANALYZED));
    }
}
On var node = new Node(e.NodeId); it throws an exception:
System.NullReferenceException was unhandled by user code
Message=Object reference not set to an instance of an object.
Source=umbraco
StackTrace:
at umbraco.presentation.UmbracoContext.get_Current()
at umbraco.library.GetXmlNodeById(String id)
at umbraco.NodeFactory.Node..ctor(Int32 NodeId)
at VanGoghBrabant.BLL.Default.VanGoghBrabantApplicationBase.Indexer_DocumentWriting(Object sender, DocumentWritingEventArgs e) in C:\SVN\vangoghbrabant\VanGoghBrabant.Extension\VanGoghBrabant.BLL\Default\VanGoghBrabantApplicationBase.cs:line 69
at Examine.LuceneEngine.Providers.LuceneIndexer.OnDocumentWriting(DocumentWritingEventArgs docArgs)
at Examine.LuceneEngine.Providers.LuceneIndexer.AddDocument(Dictionary`2 fields, IndexWriter writer, Int32 nodeId, String type)
at Examine.LuceneEngine.Providers.LuceneIndexer.ProcessAddQueueItem(FileInfo x, IndexWriter writer)
at Examine.LuceneEngine.Providers.LuceneIndexer.ForceProcessQueueItems()
InnerException:
You might be using the wrong event for what you are trying to do. GatheringNodeData fires during the HTTP request, so you can put data into the dictionary then. The DocumentWriting event is used to put the data from the dictionary into the index.
You can do a bit of both. Here's the difference between the 2 events:
GatheringNodeData : gets the information from the data source (i.e. Umbraco) and stores this data to be indexed
DocumentWriting : puts the gathered data into the index
So, you can gather whatever data you like and change the way it will be indexed in DocumentWriting. You have full access to the underlying Lucene document in DocumentWriting, so you can do whatever you want to it (i.e. change what's in it already, delete what's in it, add to it, etc.).
Examine returns wrong markup
Hello,
I'm using Examine to search for content and display what it finds. However, the results returned from Examine don't have the markup I expected.
Here is the HTML in the RTE:
Here is the result I get back from Examine:
I would like to get back the actual HTML. Is this possible?
Jeroen
Examine strips HTML tags upon indexing, so that's why you are not getting the markup back as you expected.
It's in the UmbracoContentService class if you look in the source ;)
- Morten
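As a rough illustration of what tag stripping does to the rich text (this is not Examine's actual implementation, which lives in UmbracoContentService; the class name and the regular expression here are assumptions chosen for the example):

```csharp
using System;
using System.Text.RegularExpressions;

static class StripTagsSketch
{
    // Removes anything that looks like an HTML tag and keeps the text between tags.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<[^>]*>", string.Empty);
    }

    static void Main()
    {
        string html = "<p>How <a href=\"http://www.nu.nl\">about</a> this?</p>";
        Console.WriteLine(StripTags(html)); // prints: How about this?
    }
}
```

The link target is lost along with the tags, which matches the plain-text result shown in the question.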
Hmm that is a problem. So the only way to get back the actual html is to modify the source code?
Jeroen
The UmbracoContentService is called from OnGatheringNodeData, so you could write your own UmbracoIndexer which, like the standard implementation (UmbracoContentIndexer), implements BaseUmbracoIndexer. Or you could extend the UmbracoIndexer and simply override OnGatheringNodeData so that it doesn't call the striphtml method.
I recently created a custom indexer for a project, which implements BaseUmbracoIndexer, and it was pretty straightforward. Just remember to update the config files to use your custom indexer ;)
- Morten
OK, it seems like I need to dive into Examine. I've never done anything like what you've just described. Is there some documentation about how to create or override the UmbracoIndexer? Why is the striphtml method there in the first place? It should at least be optional to use...
Jeroen
I think something like this will do the trick. Please note I haven't tried it myself, so it's just a qualified guess:
Change the type in ExamineSettings.config to the namespace and assembly of the custom implementation and try it out.
- Morten
Oops, delete this line, otherwise you'll just end up with the stripped HTML again:
- Morten
I'd suggest a different approach, with respect.
Instead of implementing the indexer, just attach yourself to the events already exposed by the current indexer implementation.
So in the ApplicationBase:
Then later:
In the IndexSet configuration, add the extra fields that you include here. For example, you might add a contentBodyRaw field as the HTML version of the contentBody field. When you set the e.Fields["contentBodyRaw"] in the GatheringNodeData event above, this content will then be indexed as you add it.
There are ways to make this additional field 'store only' too, which might be interesting to you.
Shannon and Slace have some excellent blog posts on FarmCode/LINQ2Fail that will help you with this.
You should let Examine strip the HTML, because that is what gets indexed and you don't want to be searching against HTML markup.
If you want to store data in the index just to retrieve it from a search result, you should use the DocumentWriting event, which allows you to directly manipulate how the data gets stored in Lucene. An example of this is here: http://farmcode.org/post/2010/08/23/Text-casing-and-Examine.aspx
I've got the solution. Both Shannon's and James's methods work, but I think James's method is the better one. Here is the code for both methods:
Using GatheringNodeData:
Using DocumentWriting:
In the first example I don't need to load the Umbraco Document, which is faster. So I guess in my situation I should use the GatheringNodeData event, or is there a reason why I shouldn't?
Jeroen
Slace's tweet was a pointer to the following line in the blog Shannon referred to:
NOT_ANALYZED means the value is indexed as a single term without being run through an analyzer. This would be slightly more efficient on the Lucene side but, as you point out, getting the Document is less efficient on the Umbraco side.
However, since you won't include this field in the criteria for your search it shouldn't matter, and it will be available to you in the results. It _does_ matter if you were looking to replace the contents of a field rather than add an extra.
All that said, Shannon and Slace know a dirty great lot more about Examine/Lucene than I do, so I'd be happy to defer to their wisdom.
Hmm in my case I want to work with only published data so using the umbraco Document object isn't recommended in this situation. Think I'll keep using the GatheringNodeData method :).
Jeroen
I tried a new method to get the latest published data, but it currently doesn't work:
On var node = new Node(e.NodeId); it throws an exception:
I've updated your CodePlex bug. This isn't a bug: the indexer runs in a separate thread, not the web thread, and therefore there is no UmbracoContext.
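The effect can be reproduced without Umbraco. The sketch below uses a [ThreadStatic] field as a hypothetical stand-in for a per-request context. UmbracoContext.Current is not literally implemented this way, but it is similar in that a freshly started background thread does not see the value. All names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

static class AmbientContextSketch
{
    // Hypothetical stand-in for a per-request context such as UmbracoContext.Current.
    [ThreadStatic]
    public static string Current;

    // Reads the ambient context from a freshly started background thread.
    public static string ReadOnWorkerThread()
    {
        string seen = null;
        var worker = new Thread(() => { seen = Current; });
        worker.Start();
        worker.Join();
        return seen;
    }

    static void Main()
    {
        Current = "request context";

        // "Gathering" phase: runs on the thread that has the context, so capture the data here.
        var fields = new Dictionary<string, string>();
        fields["searchDescriptionHtml"] = "<p>captured while the context was available</p>";

        // "Writing" phase: the worker sees no context, but the dictionary travels with the work item.
        Console.WriteLine(ReadOnWorkerThread() == null); // prints: True
        Console.WriteLine(fields["searchDescriptionHtml"]);
    }
}
```

This is why anything that depends on the request context is best read while the data is being gathered and passed along in the field dictionary, rather than looked up again when the document is written.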
Ok I understand, but does this mean there is no way to get the published content in that event?
Jeroen
What are you using UmbracoContext for?
If I try var node = new Node(e.NodeId); I get an error that UmbracoContext is null. All I want to do is get the latest published data.
Jeroen
You might be using the wrong event for what you are trying to do. GatheringNodeData fires during the HTTP request, so you can put data into the dictionary then. The DocumentWriting event is used to put the data from the dictionary into the index.
I know that's also a solution. See this post: http://our.umbraco.org/forum/developers/api-questions/17737-Examine-returns-wrong-markup?p=0#comment66758.
But James said that might not be the best solution: http://our.umbraco.org/forum/developers/api-questions/17737-Examine-returns-wrong-markup?p=0#comment66783.
Jeroen
You can do a bit of both. Here's the difference between the 2 events: