Thanks once again Ismail, I got it working. For any other .NET hacks/beginners out there, here's the class file that works for me.
MyPublicIndexer = the default index
MyPublicPDFIndexer = the supplemental PDF indexer, defined in ExamineSettings.config & ExamineIndex.config
Note that I'm adding the PDF file contents to the main index as a 'content' index type so it will show up with my main lucene search.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Xml;
using System.Xml.Linq;
using umbraco.BusinessLogic;
using Examine;
using UmbracoExamine;
using umbraco.presentation.nodeFactory;
using System.Text;
using umbraco.cms.businesslogic;
using umbraco.cms.businesslogic.web;
namespace My_Controls
{
    public class ExamineEvents : ApplicationBase
    {
        public ExamineEvents()
        {
            // Add event handler for 'NodeIndexing' on 'MyPublicPDFIndexer'
            // Handler calls the class method 'ExamineEvents_NodeIndexing' below
            ExamineManager.Instance.IndexProviderCollection["MyPublicPDFIndexer"].NodeIndexing
                += new EventHandler<IndexingNodeEventArgs>(ExamineEvents_NodeIndexing);

            // simple example of how to write to the debug log
            Log.Add(LogTypes.Debug, 0, "in ExamineEvents Constructor");
        }

        // helper method to fetch node as XElement
        private XElement GetMediaItem(int nodeId)
        {
            var nodes = umbraco.library.GetMedia(nodeId, false);
            return XElement.Parse(nodes.Current.OuterXml);
        }

        /// <summary>
        /// Event handler fired when Examine is indexing a node
        /// </summary>
        void ExamineEvents_NodeIndexing(object sender, IndexingNodeEventArgs e)
        {
            // "I am here" logging. Don't laugh; it helped me :)
            Log.Add(LogTypes.Debug, 0, "in ExamineEvents_NodeIndexing");

            // try to get the indexed PDF content; FileTextContent is where the UmbracoExamine.PDF indexer puts it by default
            string pdfContent;
            e.Fields.TryGetValue("FileTextContent", out pdfContent);

            // If we found some content, add it to the main content index in a field called "contents"
            // (TryGetValue leaves pdfContent null when the field is missing, so check for null as well as empty)
            if (!string.IsNullOrEmpty(pdfContent))
            {
                XElement mediaXml = GetMediaItem(e.NodeId);
                mediaXml.Add(new XElement("contents", pdfContent));
                ExamineManager.Instance.IndexProviderCollection["MyPublicIndexer"].ReIndexNode(mediaXml, IndexTypes.Content);
            }
        }
    }
}
I don't think it's Examine that's the issue; it's ApplicationEventHandler. Just to confirm that, can you wire up a document publish event, put some logging code in it, and see if that fires? I have seen this before in v6, and I changed from ApplicationEventHandler to IApplicationEventHandler even though that is older.
I tried that, but as there was an error I assumed Umbraco 7 didn't support it:
Compiler Error Message: CS0535: 'Umbraco.Extensions.EventHandlers.RegisterEvents' does not implement interface member 'Umbraco.Core.IApplicationEventHandler.OnApplicationInitialized(Umbraco.Core.UmbracoApplicationBase, Umbraco.Core.ApplicationContext)'
Source Error:
Line 18: namespace Umbraco.Extensions.EventHandlers
Line 19: {
Line 20: public class RegisterEvents : IApplicationEventHandler
Line 21: {
Line 22:
Examine pdf index item inject into content index
Shannon,
I am trying as per your suggestion in pdf indexing topic:
"You could do it a bit 'dodgy' and just listen to the indexed event of the PDF indexer and add the results to your Content Indexer using ReIndexNode
This means that you'll have PDF data in two indexes... but it would be very little code to write."
This is what i have so far:
and the delegate is:
What do I do at this point to get the PDF content and add it to the content index? The IndexedNodeEventArgs does not have a fields dictionary for me to get the PDF content from. Also, the ReIndexNode method that will be called on the content index takes a LINQ to XML document; where do I get that from?
Do I need to cast sender to something?
Regards
Ismail
In answer to my own question here is how I hacked this:
I changed the event I handle from NodeIndexed to NodeIndexing; in that method I have
and that injects the extracted PDF content nicely. I also have a method for IndexDeleted, so when a media item is deleted it is removed from the content index as well. The one downside is that if I rebuild the content index, I need to make sure I also rebuild the PDF index to synchronise the two.
Regards
Ismail
Ismail, what file does that code live in? Is it a class in an external project that's built and then copied over to the Umbraco installation?
EDIT: I think I answered my own question - looking at Shannon's demo code from CG 2010 he has a separate class file, ExamineEvents.cs, that looks for the various Examine events and acts on them.
So WoiIndexer is your content index, and calling ReIndexNode on that with the xml you grabbed and modified from the PDF index created a new 'document' in that index with the PDF content?
Andrew,
The code lives in its own class/project and, as you have rightly deduced, goes into the Umbraco bin. It makes use of Examine events on the different indexes. In this example WoiIndexer is my content index and the PDF index holds the PDF content, which I inject into the content index.
There are 2 things to note with this. The first is that if you rebuild the content index, you then have to rebuild the PDF index or else the PDF content will be missing. The second is that you have 2 lots of data; however, so long as you don't have shed loads of PDFs it's not really that big an issue. There are potentially 3 other ways round this problem:
1. Create your own indexer. You will probably need your own config so that you can tell it which indexes to mix; however, you still have the duplicate data issue.
2. Create your own searcher, which is quite a bit of coding.
3. Do 2 searches, one on content and the other on PDF, and concat the results. However, ranking by score will not work: each index's result set will be ranked on its own, not as a collective.
This way, though not ideal, has been fairly straightforward to implement. If you need more info just Skype me on ismail_mayat.
Regards
Ismail
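To make the ranking caveat in option 3 concrete, here is a rough sketch in plain C#. It does not call Examine at all: the `Hit` class and `TwoIndexSearch` helper are hypothetical stand-ins for whatever the two searchers return. It only shows what "concat" means here: each result set keeps its own order and the two are joined end to end.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a search hit; this is NOT an Examine type.
public class Hit
{
    public int NodeId { get; set; }
    public float Score { get; set; }
    public string Source { get; set; }
}

public static class TwoIndexSearch
{
    // Joins the two result sets: all content hits first, then all PDF hits.
    // Each set is ordered by its own scores; scores are never compared across sets.
    public static List<Hit> Concat(IEnumerable<Hit> contentHits, IEnumerable<Hit> pdfHits)
    {
        return contentHits.OrderByDescending(h => h.Score)
            .Concat(pdfHits.OrderByDescending(h => h.Score))
            .ToList();
    }

    public static void Main()
    {
        // Made-up hits, as if returned by a content search and a PDF search
        var content = new[] { new Hit { NodeId = 1, Score = 0.4f, Source = "content" } };
        var pdf = new[] { new Hit { NodeId = 2, Score = 0.9f, Source = "pdf" } };

        foreach (var h in Concat(content, pdf))
            Console.WriteLine("{0} {1}", h.Source, h.NodeId);
    }
}
```

Sorting the joined list by Score instead would not give a true combined ranking either, because each index computes scores from its own term statistics, so a 0.9 from one index is not directly comparable with a 0.4 from the other.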
The only thing I can't work out is
GetMediaItem(e.NodeId);
Where/what is GetMediaItem? I found a version in the UmbracoHelper library, but it doesn't seem to return the right thing.
Regards,
- Andrew
Actually if you could see your way clear to posting the class file you're using here it would really help out .NET hacks like me who are a little unsure about topics like events and delegation. I did find a tutorial that got me to about 90% comprehension of the concept but working examples are most constructive.
Cheers,
- Andrew
Andrew,
I gave cut-down code as there is other stuff in those classes, specific to the project I am working on, that would just confuse matters. With regards to the GetMedia code, it's
It's pretty simple: I get the LINQ to XML node, inject the PDF element into it, and then pass it to the indexer, which does all the rest.
Regards
Ismail
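The inject step described above can be tried in isolation with plain LINQ to XML. This is only a sketch under assumptions: the `PdfInject` class, the sample media XML and the default field name "contents" are made up for illustration. In a real handler the XML would come from umbraco.library.GetMedia and the text from the PDF indexer's fields.

```csharp
using System;
using System.Xml.Linq;

public static class PdfInject
{
    // Returns a copy of the media node with the extracted text appended
    // as a child element, ready to hand to an indexer.
    public static XElement WithPdfText(XElement mediaNode, string pdfText, string fieldName = "contents")
    {
        var copy = new XElement(mediaNode);          // deep copy; the original node is left untouched
        copy.Add(new XElement(fieldName, pdfText));  // the injected element becomes an extra child
        return copy;
    }

    public static void Main()
    {
        // Hypothetical media XML, standing in for what GetMedia would return
        var media = XElement.Parse("<File id=\"1234\" nodeName=\"brochure.pdf\" />");
        var enriched = WithPdfText(media, "text extracted from the PDF");
        Console.WriteLine(enriched);
    }
}
```

Working on a copy is a design choice for the sketch, not something the thread requires; adding the element to the parsed node directly works just as well when nothing else holds a reference to it.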
Thanks once again Ismail, I got it working. For any other .NET hacks/beginners out there, here's the class file that works for me.
MyPublicIndexer = the default index
MyPublicPDFIndexer = the supplemental PDF indexer, defined in ExamineSettings.config & ExamineIndex.config
Note that I'm adding the PDF file contents to the main index as a 'content' index type so it will show up with my main lucene search.
Thanks Ismail and Andrew, your posts got me there in the end! :) #h5yr
I'm trying to get this working in Umbraco 7. Any ideas why it wouldn't?
Dan,
What errors are you getting? Anything in the log file? Also, I thought the PDF indexer was removed at some point?
Regards
Ismail
No errors in log table or log file. The event is not firing. I updated the code to use the new 6.1.0+ method - http://our.umbraco.org/documentation/Reference/Events/application-startup
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Xml;
using System.Xml.Linq;
using umbraco.BusinessLogic;
using Examine;
using UmbracoExamine;
using umbraco.presentation.nodeFactory;
using System.Text;
using umbraco.cms.businesslogic;
using umbraco.cms.businesslogic.web;
using Umbraco.Core;
namespace Umbraco.Extensions.EventHandlers
{
    public class RegisterEvents : ApplicationEventHandler
    {
        protected override void ApplicationStarted(UmbracoApplicationBase umbracoApplication, ApplicationContext applicationContext)
        {
            // Add event handler for 'NodeIndexing' on 'MyPublicPDFIndexer'
            // Handler calls the class method 'ExamineEvents_NodeIndexing' below
            // ExamineManager.Instance.IndexProviderCollection["PDFIndexer"].NodeIndexing
            // += new EventHandler(ExamineEvents_NodeIndexing);
            ExamineManager.Instance.IndexProviderCollection["PDFIndexer"].NodeIndexing += ExamineEvents_NodeIndexing;

            // simple example of how to write to the debug log
            //Log.Add(LogTypes.Debug, 0, "in ExamineEvents Constructor");
            Umbraco.Core.Logging.LogHelper.Debug(System.Reflection.MethodBase.GetCurrentMethod().DeclaringType, "in ExamineEvents Constructor");
        }

        // helper method to fetch node as XElement
        private XElement GetMediaItem(int nodeId)
        {
            var nodes = umbraco.library.GetMedia(nodeId, false);
            return XElement.Parse(nodes.Current.OuterXml);
        }

        /// <summary>
        /// Event handler fired when Examine is indexing a node
        /// </summary>
        void ExamineEvents_NodeIndexing(object sender, IndexingNodeEventArgs e)
        {
            // "I am here" logging. Don't laugh; it helped me :)
            //Log.Add(LogTypes.Debug, 0, "in ExamineEvents_NodeIndexing");
            Umbraco.Core.Logging.LogHelper.Debug(System.Reflection.MethodBase.GetCurrentMethod().DeclaringType, "in ExamineEvents_NodeIndexing");

            // try to get the indexed PDF content; FileTextContent is where the UmbracoExamine.PDF indexer puts it by default
            string pdfContent;
            e.Fields.TryGetValue("FileTextContent", out pdfContent);

            // If we found some content, add it to the main content index in a field called "contents"
            // (TryGetValue leaves pdfContent null when the field is missing, so check for null as well as empty)
            if (!string.IsNullOrEmpty(pdfContent))
            {
                XElement mediaXml = GetMediaItem(e.NodeId);
                mediaXml.Add(new XElement("contents", pdfContent));
                ExamineManager.Instance.IndexProviderCollection["ExternalIndexer"].ReIndexNode(mediaXml, IndexTypes.Content);
            }
        }
    }
}
Does the event fire for your external index? Just want to narrow it down to whether only the pdf index is having the issue?
Regards
Ismail
ExamineEvents_NodeIndexing does not get fired at all.
I meant to say that both the external index and the PDF index have content and are indexing separately; I've checked via Examine Management.
Dan,
I don't think it's Examine that's the issue; it's ApplicationEventHandler. Just to confirm that, can you wire up a document publish event, put some logging code in it, and see if that fires? I have seen this before in v6, and I changed from ApplicationEventHandler to IApplicationEventHandler even though that is older.
Regards
Ismail
Hi Ismail,
I tried that, but as there was an error I assumed Umbraco 7 didn't support it:
Line 18: namespace Umbraco.Extensions.EventHandlers
Line 19: {
Line 20: public class RegisterEvents : IApplicationEventHandler
Line 21: {
Line 22:
Did you try document publish event?
Regards
Ismail
I did.
Dan
And that fired?
Sorry, no it didn't. I get the same error (above). It's not even getting that far. There's a compilation error.
Are you using ApplicationEventHandler, not the interface? If you can, use the original handler and see if it fires. If not, I would raise it on issues, as there is something else wrong with wiring up events.
OK, thanks. No, it's not firing using ApplicationEventHandler.
Thanks for your help.
Dan