Examine PDF index item inject into content index

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Oct 13, 2010 @ 14:03
2

Shannon,

I am trying this, as per your suggestion in the PDF indexing topic:

"You could do it a bit 'dodgy' and just listen to the indexed event of the PDF indexer and add the results to your Content Indexer using ReIndexNode

This means that you'll have PDF data in two indexes... but it would be very little code to write."

This is what I have so far:
```
 ExamineManager.Instance.IndexProviderCollection[PdfIndex].NodeIndexed += ExamineEvents_NodeIndexed;
```
and the delegate is:
```
void ExamineEvents_NodeIndexed(object sender, IndexedNodeEventArgs e)
        {

        }
```
What do I do at this point to get the PDF content and add it to the content index? IndexedNodeEventArgs does not have a Fields dictionary from which I could read the PDF content. Also, the ReIndexNode method, which will be called on the content index, takes a LINQ to XML node; where do I get that from?

Do I need to cast sender to something?

Regards

Ismail
Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Oct 13, 2010 @ 15:33
1
In answer to my own question, here is how I hacked this:

I changed the event I handle from NodeIndexed to NodeIndexing, and in that handler I have:
```
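// NodeIndexing exposes the Fields dictionary; NodeIndexed does not.
string pdfContent;

// TryGetValue sets pdfContent to null when the field is missing,
// so test for null as well as empty.
if (e.Fields.TryGetValue(PdfIndexContentFieldAlias, out pdfContent)
    && !string.IsNullOrEmpty(pdfContent))
{
    // Fetch the media node as XML and append the extracted text to it.
    XElement mediaXml = GetMediaItem(e.NodeId);
    mediaXml.Add(new XElement("contents", pdfContent));

    // Push the augmented node into the content index.
    ExamineManager.Instance.IndexProviderCollection[WoiIndexer].ReIndexNode(mediaXml, IndexTypes.Media);
}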
```
That injects the extracted PDF content nicely. I also have a handler for IndexDeleted, so when a media item is deleted it is removed from the content index as well. The one downside is that if I rebuild the content index, I also need to rebuild the PDF index to synchronise the two.
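One detail worth knowing about a `TryGetValue` call in a handler like this: when the key is absent, the `out` variable is set to `null` rather than left at its initial value, so comparing against `string.Empty` alone does not detect a missing field. The sketch below shows this with a plain `Dictionary<string, string>`; the class and method names are made up for illustration and are not part of Examine.

```csharp
using System;
using System.Collections.Generic;

public static class FieldLookupDemo
{
    // Hypothetical helper: returns the field value,
    // or null when the field is missing or empty.
    public static string GetNonEmptyField(IDictionary<string, string> fields, string alias)
    {
        string value;
        // TryGetValue returns false and sets 'value' to null when the key is absent.
        if (fields.TryGetValue(alias, out value) && !string.IsNullOrEmpty(value))
        {
            return value;
        }
        return null;
    }

    public static void Main()
    {
        // Example field set with no extracted PDF text in it.
        var fields = new Dictionary<string, string> { { "nodeName", "report.pdf" } };

        string pdfContent = string.Empty;
        fields.TryGetValue("FileTextContent", out pdfContent);

        // The initial string.Empty has been overwritten with null,
        // so an inequality test against string.Empty is true here.
        Console.WriteLine(pdfContent == null);          // True
        Console.WriteLine(pdfContent != string.Empty);  // True

        // The helper treats "missing" and "empty" the same way.
        Console.WriteLine(GetNonEmptyField(fields, "FileTextContent") == null); // True
    }
}
```

Checking the return value of `TryGetValue` together with `string.IsNullOrEmpty` avoids re-indexing a node with no PDF text.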

Regards

Ismail
Andrew Waegel 126 posts 126 karma points

Oct 13, 2010 @ 22:40

0

Ismail, what file does that code live in? Is it a class in an external project that's built and then copied over to the Umbraco installation?

EDIT: I think I answered my own question - looking at Shannon's demo code from CG 2010 he has a separate class file, ExamineEvents.cs, that looks for the various Examine events and acts on them.

Andrew Waegel 126 posts 126 karma points

Oct 14, 2010 @ 02:45

0

So WoiIndexer is your content index, and calling ReIndexNode on that with the xml you grabbed and modified from the PDF index created a new 'document' in that index with the PDF content?

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Oct 14, 2010 @ 10:18

0

Andrew,

The code lives in its own class/project and, as you have rightly deduced, goes into the Umbraco bin folder. It makes use of Examine events on the different indexes. In this example WoiIndexer is my content index and PdfIndex holds the PDF content, which I inject into the content index.

There are two things to note with this. The first is that if you rebuild the content index, you then have to rebuild the PDF index, or else the PDF content will be missing. The second is that you have two lots of data; however, so long as you don't have shed loads of PDFs, it's not really that big an issue. There are potentially three other ways round this problem:

1. Create your own indexer. You will probably need your own config so that you can tell it which indexes to mix; however, you still have the duplicate data issue.

2. Create your own searcher, which is quite a bit of coding.

3. Do two searches, one on content and the other on PDF, and concatenate the results. However, ranking by score will not work: each index's result set is ranked on its own, not as a collective.
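To make the third option concrete, here is a minimal sketch in plain C# of concatenating two already-ranked result lists. The `Hit` class, the IDs, and the scores are invented for illustration; real Examine search results carry more data than this. The point is only that the combined list keeps each index's internal order, and that re-sorting by score is not necessarily meaningful, because each score was computed against a different index.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a search hit; not an Examine type.
public class Hit
{
    public int Id { get; set; }
    public float Score { get; set; }
    public string Source { get; set; }
}

public static class TwoIndexSearchDemo
{
    // Concatenates two result lists, keeping each list's own ranking order.
    public static List<Hit> Combine(IEnumerable<Hit> contentHits, IEnumerable<Hit> pdfHits)
    {
        return contentHits.Concat(pdfHits).ToList();
    }

    public static void Main()
    {
        // Made-up results, each list already sorted by its own index's score.
        var contentHits = new List<Hit>
        {
            new Hit { Id = 1, Score = 0.9f, Source = "content" },
            new Hit { Id = 2, Score = 0.4f, Source = "content" }
        };
        var pdfHits = new List<Hit>
        {
            new Hit { Id = 10, Score = 0.7f, Source = "pdf" },
            new Hit { Id = 11, Score = 0.2f, Source = "pdf" }
        };

        // All content hits come first, then all PDF hits, regardless of score.
        foreach (var hit in Combine(contentHits, pdfHits))
        {
            Console.WriteLine(hit.Source + " " + hit.Id + " " + hit.Score);
        }
    }
}
```

In this sketch the PDF hit with score 0.7 is listed after the content hit with score 0.4, which is the ranking limitation described in point 3.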

This way, though not ideal, has been fairly straightforward to implement. If you need more info, just Skype me on ismail_mayat.

Regards

Ismail

Andrew Waegel 126 posts 126 karma points

Oct 14, 2010 @ 22:48

0

The only thing I can't work out is

GetMediaItem(e.NodeId);

Where/what is GetMediaItem? I found a version in the UmbracoHelper library, but it doesn't seem to return the right thing.

Regards,
- Andrew

Andrew Waegel 126 posts 126 karma points

Oct 14, 2010 @ 23:03

0

Actually if you could see your way clear to posting the class file you're using here it would really help out .NET hacks like me who are a little unsure about topics like events and delegation. I did find a tutorial that got me to about 90% comprehension of the concept but working examples are most constructive.

Cheers,
- Andrew

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Oct 18, 2010 @ 10:06
0
Andrew,

I gave cut-down code, as there is other stuff in those classes specific to the project I am working on that would just confuse matters. With regards to the GetMediaItem code, it's:
```
        private XElement GetMediaItem(int nodeId)
        {
            var nodes = umbraco.library.GetMedia(nodeId, false);
            return XElement.Parse(nodes.Current.OuterXml);
        }
```
It's pretty simple: I get the LINQ to XML node, inject the PDF content element into it, and then pass it to the indexer, which does all the rest.
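For anyone unsure what the injection step does to the XML, here is a small standalone sketch using only `System.Xml.Linq`. The media XML string is a made-up stand-in (a real node returned by `umbraco.library.GetMedia` has more attributes and properties), and `InjectContents` is a hypothetical helper name.

```csharp
using System;
using System.Xml.Linq;

public static class MediaXmlDemo
{
    // Hypothetical helper: appends a <contents> child holding the
    // extracted text to the given element and returns the same element.
    public static XElement InjectContents(XElement mediaXml, string pdfText)
    {
        mediaXml.Add(new XElement("contents", pdfText));
        return mediaXml;
    }

    public static void Main()
    {
        // Made-up media XML for illustration only.
        XElement mediaXml = XElement.Parse(
            "<File id=\"1234\" nodeName=\"report.pdf\">" +
            "<umbracoFile>/media/1234/report.pdf</umbracoFile>" +
            "</File>");

        InjectContents(mediaXml, "Text extracted from the PDF");

        // The element now has a second child named "contents".
        Console.WriteLine(mediaXml.Element("contents").Value);
    }
}
```

The indexer then sees `contents` as just another property of the node, which is why the field shows up in the content index.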

Regards

Ismail

Andrew Waegel 126 posts 126 karma points

Oct 18, 2010 @ 21:07

Thanks once again Ismail, I got it working. For any other .NET hacks/beginners out there, here's the class file that works for me.

MyPublicIndexer = the default index
MyPublicPDFIndexer = the supplemental PDF indexer, defined in ExamineSettings.config & ExamineIndex.config

Note that I'm adding the PDF file contents to the main index as a 'content' index type so they will show up in my main Lucene search.

```
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Xml;
using System.Xml.Linq;
using umbraco.BusinessLogic;
using Examine;
using UmbracoExamine;
using umbraco.presentation.nodeFactory;
using System.Text;
using umbraco.cms.businesslogic;
using umbraco.cms.businesslogic.web;

namespace My_Controls
{
    public class ExamineEvents : ApplicationBase
    {
        public ExamineEvents()
        {
            // Add event handler for 'NodeIndexing' on 'MyPublicPDFIndexer'
            // Handler calls the class method 'ExamineEvents_NodeIndexing' below
            ExamineManager.Instance.IndexProviderCollection["MyPublicPDFIndexer"].NodeIndexing
                += ExamineEvents_NodeIndexing;

            // simple example of how to write to the debug log
            Log.Add(LogTypes.Debug, 0, "in ExamineEvents Constructor");
        }

        // helper method to fetch node as XElement
        private XElement GetMediaItem(int nodeId)
        {
            var nodes = umbraco.library.GetMedia(nodeId, false);
            return XElement.Parse(nodes.Current.OuterXml);
        }

        /// <summary>
        /// Event handler fired when Examine is indexing a node
        /// </summary>
        void ExamineEvents_NodeIndexing(object sender, IndexingNodeEventArgs e)
        {
            // "I am here" logging. Don't laugh; it helped me :)
            Log.Add(LogTypes.Debug, 0, "in ExamineEvents_NodeIndexing");

            // try to get the indexed PDF content; FileTextContent is where the UmbracoExamine.PDF indexer puts it by default
            // (TryGetValue sets pdfContent to null when the field is missing)
            string pdfContent;

            // If we found some content, add it to the main content index in a field called "contents"
            if (e.Fields.TryGetValue("FileTextContent", out pdfContent)
                && !string.IsNullOrEmpty(pdfContent))
            {
                XElement mediaXml = GetMediaItem(e.NodeId);
                mediaXml.Add(new XElement("contents", pdfContent));
                ExamineManager.Instance.IndexProviderCollection["MyPublicIndexer"].ReIndexNode(mediaXml, IndexTypes.Content);
            }
        }
    }
}
```

Allan James 20 posts 40 karma points

Jan 22, 2012 @ 18:21

0

David Conlisk 432 posts 1008 karma points

Jun 22, 2013 @ 14:20

0

Thanks Ismail and Andrew, your posts got me there in the end! :) #h5yr

Dan Evans 631 posts 1018 karma points

Jan 28, 2014 @ 12:47

0

I'm trying to get this working in Umbraco 7. Any ideas why it wouldn't?

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Jan 28, 2014 @ 15:42

0

Dan,

What errors are you getting? Anything in the log file? Also, I thought the PDF indexer was removed at some point?

Regards

Ismail

Dan Evans 631 posts 1018 karma points

Jan 28, 2014 @ 17:32

0

No errors in log table or log file. The event is not firing. I updated the code to use the new 6.1.0+ method - http://our.umbraco.org/documentation/Reference/Events/application-startup

```
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Xml;
using System.Xml.Linq;
using umbraco.BusinessLogic;
using Examine;
using UmbracoExamine;
using umbraco.presentation.nodeFactory;
using System.Text;
using umbraco.cms.businesslogic;
using umbraco.cms.businesslogic.web;
using Umbraco.Core;

namespace Umbraco.Extensions.EventHandlers
{
    public class RegisterEvents : ApplicationEventHandler
    {
        protected override void ApplicationStarted(UmbracoApplicationBase umbracoApplication, ApplicationContext applicationContext)
        {
            // Add event handler for 'NodeIndexing' on 'MyPublicPDFIndexer'
            // Handler calls the class method 'ExamineEvents_NodeIndexing' below
            // ExamineManager.Instance.IndexProviderCollection["PDFIndexer"].NodeIndexing
            //     += new EventHandler(ExamineEvents_NodeIndexing);
            ExamineManager.Instance.IndexProviderCollection["PDFIndexer"].NodeIndexing += ExamineEvents_NodeIndexing;

            // simple example of how to write to the debug log
            //Log.Add(LogTypes.Debug, 0, "in ExamineEvents Constructor");
            Umbraco.Core.Logging.LogHelper.Debug(System.Reflection.MethodBase.GetCurrentMethod().DeclaringType, "in ExamineEvents Constructor");
        }

        // helper method to fetch node as XElement
        private XElement GetMediaItem(int nodeId)
        {
            var nodes = umbraco.library.GetMedia(nodeId, false);
            return XElement.Parse(nodes.Current.OuterXml);
        }

        /// <summary>
        /// Event handler fired when Examine is indexing a node
        /// </summary>
        void ExamineEvents_NodeIndexing(object sender, IndexingNodeEventArgs e)
        {
            // "I am here" logging. Don't laugh; it helped me :)
            //Log.Add(LogTypes.Debug, 0, "in ExamineEvents_NodeIndexing");
            Umbraco.Core.Logging.LogHelper.Debug(System.Reflection.MethodBase.GetCurrentMethod().DeclaringType, "in ExamineEvents_NodeIndexing");

            // try to get the indexed PDF content; FileTextContent is where the UmbracoExamine.PDF indexer puts it by default
            // (TryGetValue sets pdfContent to null when the field is missing)
            string pdfContent;

            // If we found some content, add it to the main content index in a field called "contents"
            if (e.Fields.TryGetValue("FileTextContent", out pdfContent)
                && !string.IsNullOrEmpty(pdfContent))
            {
                XElement mediaXml = GetMediaItem(e.NodeId);
                mediaXml.Add(new XElement("contents", pdfContent));
                ExamineManager.Instance.IndexProviderCollection["ExternalIndexer"].ReIndexNode(mediaXml, IndexTypes.Content);
            }
        }
    }
}
```

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Jan 28, 2014 @ 17:49

0

Does the event fire for your external index? Just want to narrow it down to whether only the pdf index is having the issue?

Regards

Ismail

Dan Evans 631 posts 1018 karma points

Jan 28, 2014 @ 18:02

0

ExamineEvents_NodeIndexing does not get fired at all.

I meant to say that both the external index and the PDF index have content and are indexing separately, as I've checked via Examine Management.

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Jan 29, 2014 @ 10:05

0

Dan,

I don't think it's Examine that's the issue; it's ApplicationEventHandler. Just to confirm that, can you wire up a document publish event, put some logging code in it, and see if that fires? I have seen this before in v6, and I changed from ApplicationEventHandler to IApplicationEventHandler, even though that is older.

Regards

Ismail


Dan Evans 631 posts 1018 karma points

Jan 29, 2014 @ 10:27

Hi Ismail,

I tried that, but as there was an error I assumed Umbraco 7 didn't support it:


Compiler Error Message: CS0535: 'Umbraco.Extensions.EventHandlers.RegisterEvents' does not implement interface member 'Umbraco.Core.IApplicationEventHandler.OnApplicationInitialized(Umbraco.Core.UmbracoApplicationBase, Umbraco.Core.ApplicationContext)'

Source Error:

Line 18: namespace Umbraco.Extensions.EventHandlers
Line 19: {
Line 20:     public class RegisterEvents : IApplicationEventHandler
Line 21:     {
Line 22:
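For readers who hit the same CS0535 error: the compiler is reporting that a class declaring `IApplicationEventHandler` has not implemented every member of that interface. A minimal sketch of a class that should satisfy it follows. Only `OnApplicationInitialized` and its parameter types are confirmed by the error message above; `OnApplicationStarting` and `OnApplicationStarted` are assumed from the Umbraco 6.1+/7 interface and should be checked against your Umbraco.Core version. This only addresses the compile error, not why the Examine event does not fire.

```csharp
using Umbraco.Core;

namespace Umbraco.Extensions.EventHandlers
{
    // Sketch only: implements the interface directly instead of
    // inheriting from ApplicationEventHandler.
    public class RegisterEvents : IApplicationEventHandler
    {
        // Member named in the CS0535 error message.
        public void OnApplicationInitialized(UmbracoApplicationBase umbracoApplication, ApplicationContext applicationContext)
        {
        }

        // Assumed member; verify against your Umbraco.Core version.
        public void OnApplicationStarting(UmbracoApplicationBase umbracoApplication, ApplicationContext applicationContext)
        {
        }

        // Assumed member; verify against your Umbraco.Core version.
        public void OnApplicationStarted(UmbracoApplicationBase umbracoApplication, ApplicationContext applicationContext)
        {
            // Event wiring (for example the NodeIndexing subscription) would go here.
        }
    }
}
```

Methods that need no work can be left empty; the interface only requires that they exist.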

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Jan 29, 2014 @ 10:46

0

Did you try document publish event?

Regards

Ismail

Dan Evans 631 posts 1018 karma points

Jan 29, 2014 @ 10:47

0

I did.

Dan

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Jan 29, 2014 @ 10:48

0

And that fired?

Dan Evans 631 posts 1018 karma points

Jan 29, 2014 @ 10:50

0

Sorry, no it didn't. I get the same error (above). It's not even getting that far. There's a compilation error.

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Jan 29, 2014 @ 10:51

0

Using ApplicationEventHandler, not the interface? If you can, use the original handler and see if it fires. If not, I would raise it as an issue, as there is something else wrong with wiring up events.

Dan Evans 631 posts 1018 karma points

Jan 29, 2014 @ 11:35

0

OK, thanks. No, it's not firing using ApplicationEventHandler.

Thanks for your help.

Dan
