umbraco examine pdf indexing

Chris Koiak 700 posts 2626 karma points

Apr 22, 2010 @ 10:06

Umbraco Examine - PDF Indexing

Hi,

We're looking to incorporate PDF indexing into Umbraco Examine. Has anyone done this in the past and has a suggestion for the best approach?

Or ideally, an extension/package that is already built? :-D

I've read the suggestions on http://www.farmcode.org/post/2009/04/20/Umbraco-Examine-v4x-Powerful-Umbraco-Indexing.aspx, but I was wondering if the community has a recommended approach for this.

Thanks,

Chris

Darren Ferguson 1022 posts 3259 karma points MVP c-trib

Apr 22, 2010 @ 10:46

Chris - I had a package that extracted metadata from PDF files and set the metadata as properties of the media item. It doesn't, however, extract the body text of the PDF.

Alternatively you could use my XSL FO package to generate your PDFs from Umbraco content nodes and just index the content nodes. Obviously if you have existing PDFs they'd have to be migrated.

HTH.

Chris Koiak 700 posts 2626 karma points

Apr 22, 2010 @ 11:02

Hi Darren,

These are existing PDF files, so creating the PDFs from content nodes isn't an option. It's really the indexing of PDF text I'm looking for.

Nice Package though, I can see a number of uses for it in future projects.

Chris

Dirk De Grave 4541 posts 6021 karma points MVP 3x admin c-trib

Apr 22, 2010 @ 13:27

Chris,

You can use any of a number of IFilter implementations to extract text from pretty much any document type. Here's the one for PDFs. I've seen some tweets about PDF indexing as well (it was all about PDFBox, albeit Java-based...)

Cheers,

/Dirk

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Apr 22, 2010 @ 13:29

Dirk,

Have you seen any code, or do you know where to plug the IFilter stuff into Examine? Slace, if you're reading: are you going to do an Examine grok session at CG10?

Regards

Ismail

Aaron Powell 1708 posts 3046 karma points c-trib

Apr 22, 2010 @ 14:15

In answer to whether Examine (well, Lucene) supports PDF indexing, see this post - http://www.aaron-powell.com/lucene-net-overview

Chris Koiak 700 posts 2626 karma points

Apr 27, 2010 @ 12:24

Thanks for the feedback, if we build anything that can be packaged I'll make sure to post it.

Murray Roke 503 posts 967 karma points c-trib

Apr 28, 2010 @ 05:51

Hi Chris,

I'm trying to do exactly the same thing, I've found recommendations for iTextSharp to get the text from the PDF, but I'm not sure how to get that text into the index.

I think I want to combine my node data and my pdf data into one Lucene-Document. Is this easy to do?

This will mean search results bring up the page that includes the relevant attachment thus providing context, rather than bringing up the attachment directly.

Aaron Powell 1708 posts 3046 karma points c-trib

Apr 28, 2010 @ 06:16

How are you extracting the text via iTextSharp? According to all the documentation I've read it is not possible to get back blocks of text from a PDF document.

Quote:

You can't 'parse' an existing PDF file using iText, you can only 'read' it page per page.
What does this mean?
The pdf format is just a canvas where text and graphics are placed without any structure information. As such there aren't any 'iText-objects' in a PDF file. In each page there will probably be a number of 'Strings', but you can't reconstruct a phrase or a paragraph using these strings. There are probably a number of lines drawn, but you can't retrieve a Table-object based on these lines. In short: parsing the content of a PDF-file is NOT POSSIBLE with iText. Post your question on the newsgroup news://comp.text.pdf and maybe you will get some answers from people that have built tools that can parse PDF and extract some of its contents, but don't expect tools that will perform a bullet-proof conversion to structured text.
What iText DOES provide is the possibility to READ a PDF document and copy an entire page of this file into the PDF file you are constructing from scratch. This can be useful if you want to create a new document based on (an) existing document(s). You can add a Watermark, pagenumbers,...

See: http://itextsharp.sourceforge.net/tutorial/ch01.html

Murray Roke 503 posts 967 karma points c-trib

Apr 28, 2010 @ 06:34

I was looking at this:

http://stackoverflow.com/questions/83152/reading-pdf-documents-in-net/84410#84410

I haven't actually got it working yet, so I may well run into those problems? :-\

Do you recommend any other ways to get the text data?

Murray Roke 503 posts 967 karma points c-trib

Apr 28, 2010 @ 06:40

For searching purposes the text doesn't need to be pretty nor well structured, probably doesn't even need to be in the right order?

Casey Neehouse 1339 posts 483 karma points MVP 2x admin

Apr 28, 2010 @ 07:45

Several years ago (3 or 4), I had to implement a search that indexed PDFs. I used the Searcharoo code as a starting point (I was indexing pages, not the data source). I ended up using an IFilter implementation that indexed PDF and other binary documents, provided the IFilter was installed on the machine or configured to load directly. At the time, I had to override the Adobe IFilter, as it was known to fail due to some pathing bugs.

Anyhow, I think it would be wonderful to implement the iFilter parsers where possible, and perhaps have configuration as to which files are parsed, with the ability to parse undefined "*" files with the iFilter by default.
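The kind of configuration described above could look roughly like the sketch below: a lookup from file extension to a text extractor, with a "*" entry used for any extension that has no extractor of its own. `TextExtractorRegistry` and its members are hypothetical names, not part of Examine or of any IFilter library, and the extractors are plain delegates standing in for real IFilter calls.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical registry: maps a file extension to a text extractor,
// falling back to the "*" entry for extensions that are not registered.
public class TextExtractorRegistry
{
    private readonly Dictionary<string, Func<string, string>> extractors =
        new Dictionary<string, Func<string, string>>(StringComparer.OrdinalIgnoreCase);

    // Registers an extractor for an extension such as "pdf", ".pdf" or "*".
    public void Register(string extension, Func<string, string> extractor)
    {
        extractors[extension.TrimStart('.')] = extractor;
    }

    // Returns the extracted text, or an empty string when no extractor applies.
    public string Extract(string extension, string path)
    {
        Func<string, string> extractor;
        if (extractors.TryGetValue(extension.TrimStart('.'), out extractor) ||
            extractors.TryGetValue("*", out extractor))
        {
            return extractor(path);
        }
        return string.Empty;
    }
}
```

Leaving the "*" entry unregistered means unknown file types are simply skipped, which may be the safer default on a server where not every IFilter is installed.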

Case

Aaron Powell 1708 posts 3046 karma points c-trib

Apr 28, 2010 @ 12:55

Once I get a few projects that have very immediate deadlines out of the way, I'll be creating a built-in PDF indexer for Examine.

Chris Koiak 700 posts 2626 karma points

Apr 28, 2010 @ 14:16

Great news!

Any indication of when this would be available? Next couple of months?

Chris

Sebastiaan Janssen 5061 posts 15544 karma points MVP admin hq

Apr 28, 2010 @ 15:15

This is from an e-mail I sent to Aaron recently, if anybody wants to start implementing PDF searching now, without iFilters.

It gets all of the readable text from a PDF, you could store it in some node in Umbraco and then it's searchable through Examine immediately.

Just wanted to let you know that I've found a very simple way to extract text from a PDF file through a library called PDFBox.
Found this article http://www.codeproject.com/KB/string/pdf2text.aspx and tried it out, works as advertised.

I had to copy these to my bin folder:

FontBox-0.1.0-dev.dll

IKVM.GNU.Classpath.dll

IKVM.Runtime.dll

PDFBox-0.7.3.dll

But I only had to reference IKVM.GNU.Classpath.dll and PDFBox-0.7.3.dll to be able to build the code.

This is a nice solution without those ugly iFilters, so I hope it helps you for Lucene as well!

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Apr 28, 2010 @ 15:34

Sebastiaan,

where in examine did you have to make changes to implement pdf indexing?

Regards

Ismail

Aaron Powell 1708 posts 3046 karma points c-trib

Apr 29, 2010 @ 00:32

It'll be available once I get a few more pressing projects out of the way.

Examine will be running as out-of-band releases to Umbraco (like ASP.NET MVC has done with Visual Studio), so there are no promises it'll make the 4.1 release.

But if someone wants to write their own indexer, be my guest. It is a provider model and you're completely welcome to create your own; it's what it's designed for ;)

Murray Roke 503 posts 967 karma points c-trib

Apr 29, 2010 @ 02:03

Here's what I've created so far. This indexes DOCX files because they're the simplest scenario. Document text is rolled into the node that it is attached to.

I'm not really sure what I'm doing, but it seems to work so far, so critical feedback welcome.

Code (using this CodeProject sample to extract text from DOCX files):

    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Web;
    using System.Xml.Linq;
    using umbraco.cms.businesslogic.media;

    public class AttachmentAndSecurityAwareIndexer : UmbracoExamine.LuceneExamineIndexer
    {
        protected override Dictionary<string, string> GetDataToIndex(XElement node, Examine.IndexType type)
        {
            StringBuilder fileText = new StringBuilder();

            // find all files picked in the 'related downloads' property (multiple media picker);
            // not every node has this property, so a missing one is treated as "no files"
            XElement downloads = node.Elements("data").FirstOrDefault(e => (string)e.Attribute("alias") == "relatedDownloads");
            string values = downloads == null ? string.Empty : downloads.Value;
            foreach (var value in values.Split(','))
            {
                int mediaId;
                if (int.TryParse(value, out mediaId))
                {
                    Media media = new Media(mediaId);
                    if (media.Id == 0)
                        continue; // skip this item, but still process the rest of the list

                    string extension = (string) media.getProperty("umbracoExtension").Value;
                    string filename = HttpContext.Current.Server.MapPath((string)media.getProperty("umbracoFile").Value);

                    fileText.AppendLine();
                    // depending on the extension use various methods to extract the text that will go into the lucene index.
                    switch (extension.ToUpperInvariant())
                    {
                        case "DOCX":
                            // DocxToText comes from the CodeProject sample mentioned above
                            fileText.Append((new DocxToText(filename)).ExtractText());
                            break;
                    }
                }
            }

            // Get the Base Data to index
            var result = base.GetDataToIndex(node, type);
            
            // add the file text to the data to index.
            if (!result.ContainsKey("bodyText"))
                result.Add("bodyText", fileText.ToString());
            else
                result["bodyText"] += fileText.ToString();
            return result;
        }
    }

Configuration changes to use the new class: (based on default configuration documentation)

<Examine>
    <ExamineIndexProviders enableDefaultEventHandler="true">
        <providers>
            <add name="GlobalMembersIndex" type="Terabyte.UmbracoWebsite.Models.AttachmentAndSecurityAwareIndexer, Terabyte.UmbracoWebsite"
                ...

Murray Roke 503 posts 967 karma points c-trib

Apr 29, 2010 @ 04:53

Add this case to index PDF files (this uses the libraries mentioned in Sebastiaan's post):

case "PDF":
  // Uses PDFBox via IKVM; close the document so the file handle is always released.
  PDDocument doc = PDDocument.load(filename);
  try { fileText.Append(new PDFTextStripper().getText(doc)); }
  finally { doc.close(); }
  break;

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Apr 29, 2010 @ 10:31

Murray,

Take a look at Niel's code for the old umbSearch (http://umbracoext.codeplex.com/sourcecontrol/network/Show?projectName=umbracoext&changeSetId=49680, go to umbSearch). It makes use of the factory pattern: you implement IUmbracoSearchFileFilter, so you can easily plug in your own extensions for doc, pdf, rtf, whatever.

Regards

Ismail

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Apr 30, 2010 @ 12:21

Slace,

Looking at Murray's code I can see how he has supplied his own provider. Regarding the method:

public class AttachmentAndSecurityAwareIndexer : UmbracoExamine.LuceneExamineIndexer
    {
        protected override Dictionary<string, string> GetDataToIndex(System.Xml.Linq.XElement node, Examine.IndexType type)
        {

Will GetDataToIndex only be passed nodes of type content by Examine, or will it also receive nodes of type media? If it does not receive media nodes, what do I need to do so that I can index them? Could I do it with an action handler for media after save, and somehow add the item to the index using the Examine API?

Regards

Ismail

Aaron Powell 1708 posts 3046 karma points c-trib

May 03, 2010 @ 01:05

Ismail - it'll get all the nodes (content and media) with the IndexType defining which one it is.

Shan and I just realised that there is no way to restrict an index to being just content or just media (unless you restrict the content types), so we may add that as a configuration property.
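Since a single GetDataToIndex override sees both kinds of node, a custom indexer can branch on the type it is handed. The sketch below is self-contained so it can be run outside Umbraco: the `IndexType` enum and the `NodeDataBuilder` class are local stand-ins written for illustration, not Examine's own types, and the `__nodeKind` field name is made up.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

// Local stand-in for the index type passed to GetDataToIndex (not Examine's enum).
public enum IndexType { Content, Media }

public static class NodeDataBuilder
{
    // Builds the field dictionary for a node, adding media-only fields when appropriate.
    public static Dictionary<string, string> GetDataToIndex(XElement node, IndexType type)
    {
        var fields = new Dictionary<string, string>();
        fields["nodeName"] = (string)node.Attribute("nodeName") ?? string.Empty;
        // Hypothetical marker field so a search can filter on the node kind later.
        fields["__nodeKind"] = type == IndexType.Media ? "media" : "content";

        if (type == IndexType.Media)
        {
            // Only media nodes carry an uploaded file; read it from the umbracoFile property.
            foreach (XElement data in node.Elements("data"))
            {
                if ((string)data.Attribute("alias") == "umbracoFile")
                    fields["umbracoFile"] = data.Value;
            }
        }
        return fields;
    }
}
```

In a real indexer the same `if` would sit inside the override, after the call to `base.GetDataToIndex(node, type)`.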

Murray Roke 503 posts 967 karma points c-trib

May 05, 2010 @ 05:40

An update on using this method: I spent a while figuring out that I was adding fields for protected nodes. This guard statement at the top of your GetDataToIndex method should ensure your indexer plays nice when supportProtected="false". I'm not sure if there is a more elegant way of doing this?

        protected override Dictionary<string, string> GetDataToIndex(System.Xml.Linq.XElement node, Examine.IndexType type)
        {
            // Get the Base Data to index
            var result = base.GetDataToIndex(node, type);

            // check we have a result, if we have no fields this is probably a protected node and we shouldn't add anything else.
            if (result.Count == 0)
                return result;

Andrew Waegel 126 posts 126 karma points

Sep 15, 2010 @ 20:09

Any progress on this? I need a PDF indexing solution for an upcoming site and would much prefer to use a community-supported solution. My C# skills are not stellar but I'm happy to put some time and effort into it.

Andrew Waegel 126 posts 126 karma points

Sep 15, 2010 @ 20:19

Hang on, now I see that PDF indexing has been added to Examine RC3 on CodePlex. Anyone implement this successfully yet? I'll be trying soon.

Aaron Powell 1708 posts 3046 karma points c-trib

Sep 16, 2010 @ 01:35

It works fine in our test suite :P

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Sep 16, 2010 @ 06:02

The latest code of Examine has PDF indexing support, and it also exists in RC3.

I've published the DLLs of the latest checkin (57217) which surpasses RC3 and is simplified. If you'd like to try it, you can download it from:

http://shazwazza.com/Content/Downloads/UmbracoExamine.57217.zip

The PDF indexer provider looks like this:

<add name="PDFIndexer"
      type="UmbracoExamine.PDF.PDFIndexer, UmbracoExamine.PDF" />

The PDF searcher provider looks like this:

<add name="PDFSearcher"
      type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" />

The PDF index set is simple and looks like this:

All PDF data goes into its own index because the content could be quite huge and is better left to its own index. The PDF indexer will index media items only, and will only index files that are '*.pdf' and are contained in a property called 'umbracoFile' (these two things can be overridden in the index provider if necessary). If you need it to index PDFs that are in a content node, then you'll have to use the API to do this.

Hopefully we'll get the RTM out in the next week or two.

Neal Caselton 2 posts 22 karma points

Sep 23, 2010 @ 13:08

Great stuff! I've managed to deploy the new DLLs and build up the indexes for web content and PDFs. However, I'm having trouble searching against the newly created PDF index.

Is there any documentation or examples of how to query against the PDF Index as when viewing the Index in an analyser it's not clear as to how this can be achieved?

Thanks in advance.

Please ignore the above, I was being a <DIV> as I hadn't copied across the UmbracoExamine.PDF.dll that meant the index wasn't created correctly...

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Sep 24, 2010 @ 08:47

Also, just in case you come across this... some PDFs are just not indexable/readable if they have been saved in certain ways with security, etc. You might come across this and your clients might complain, but the fact is that some PDFs just can't be read... at least with iTextSharp anyway.
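One way to cope with this in a custom indexer is to treat an extraction failure as "no text for this file" rather than letting one bad PDF abort the whole run. A minimal sketch, assuming the extractor is passed in as a delegate: `SafeExtractor` is a hypothetical helper, and the delegate stands in for whatever PDF library is actually used.

```csharp
using System;
using System.Collections.Generic;

public static class SafeExtractor
{
    // Runs the extractor over each file; files that throw are recorded and skipped.
    public static Dictionary<string, string> ExtractAll(
        IEnumerable<string> paths,
        Func<string, string> extractText,
        ICollection<string> failed)
    {
        var texts = new Dictionary<string, string>();
        foreach (string path in paths)
        {
            try
            {
                texts[path] = extractText(path);
            }
            catch (Exception)
            {
                // Secured or malformed PDFs end up here; keep the path so editors can be told.
                failed.Add(path);
            }
        }
        return texts;
    }
}
```

The `failed` list gives you something concrete to show a client when they ask why a particular document never turns up in search.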

Andrew Waegel 126 posts 126 karma points

Oct 09, 2010 @ 03:26

Shannon, your example PDF index set didn't come through, can you repost? I know it's probably really simple but it would help those of us trying to get this going. Meanwhile I'll try to sort it out myself and post an example if it works.

- Andrew

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Oct 10, 2010 @ 09:41

PDF Indexer:

<add name="PDFIndexer" type="UmbracoExamine.PDF.PDFIndexer, UmbracoExamine.PDF" />

PDF Searcher:

<add name="PDFSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" />

The PDF index set is simple; it is just:

<IndexSet SetName="PDFIndexSet" IndexPath="App_Data\PDFIndexSet" />

You don't need to define anything as it's automated. It will index all media items that have a property of umbracoFile (which is already the property name of the Image and File media types) where the umbracoFile is a PDF.

Please download latest Examine version here, there's a few bugs fixed. This will be released as v1.0 RTM this week.

http://shazwazza.com/Content/Downloads/UmbracoExamine57796.zip

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Oct 10, 2010 @ 14:26

Shannon,

When you want to search over both the content and PDF indexes, what is the Examine syntax? I know in Lucene you can do cross-index searching, but I couldn't quite see how to do it via Examine.

Regards

Ismail

Andrew Waegel 126 posts 126 karma points

Oct 11, 2010 @ 06:46

+1 for the cross-index searching as well.

It seems like we would want to specify multiple SearchProviderCollections in an ExamineManager instance, but it's not clear how to do that - we only see the SearchProviderCollection[] property.

If I happen to make this work while monkeying with it I'll post some results while waiting for the devs to check in.

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Oct 11, 2010 @ 09:46

All you'd need to do is concatenate your searches between the providers:

var combinedResults = 
    ExamineManager.Instance.SearchProviderCollection["CWSSearcher"].Search("blah", true)
    .Concat(
        ExamineManager.Instance.SearchProviderCollection["PDFSearcher"].Search("blah", true));

You can use this same concept when searching with the Fluent API too.

Please be aware, however, that the 'Score' values returned by two separate searches are not comparable. A 'Score' value is only relevant within the results of one search, regardless of the index, so you can't compare 'Score' values across the concatenated results.
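To make that caveat concrete, here is a rough sketch of combining two result lists without re-sorting by score. `Hit` is a simplified, hypothetical stand-in for a search result (it is not Examine's result type), and `ResultCombiner` is a made-up helper; each hit is tagged with the index it came from so the page can group or label results instead of ranking them against each other.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-in for a search result.
public class Hit
{
    public int Id;
    public float Score;
    public string Source; // which index the hit came from
}

public static class ResultCombiner
{
    // Concatenates the two lists, keeping each list's own order and
    // tagging every hit with its source rather than sorting by Score.
    public static List<Hit> Combine(IEnumerable<Hit> content, IEnumerable<Hit> pdf)
    {
        return content.Select(h => Tag(h, "content"))
            .Concat(pdf.Select(h => Tag(h, "pdf")))
            .ToList();
    }

    private static Hit Tag(Hit hit, string source)
    {
        return new Hit { Id = hit.Id, Score = hit.Score, Source = source };
    }
}
```

Rendering the two groups under separate headings ("Pages", "Documents") sidesteps the score problem entirely.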

Another approach would be to store your Content + PDF data into one index. The reason why we didn't implement this is because your PDF index could get really huge and we didn't want that to affect your Content/Media index. If you wanted however, you could use the API + events to get your PDF data into your Content/Media index.

Aaron Powell 1708 posts 3046 karma points c-trib

Oct 11, 2010 @ 10:44

Examine doesn't use MultiSearcher, if you needed that you'll have to implement a custom searcher.

Otherwise Shannon's solution is what you'll need to do.

Andrew Waegel 126 posts 126 karma points

Oct 11, 2010 @ 11:28

Thanks for the replies. I'd like to try combining the content & PDF data into one index; how would I do that?

Would I just make one IndexSet for everything, with two IndexProviders (one for PDF, one for regular content) and one SearchProvider?

And if so, what type would I use for the search provider?

Aaron Powell 1708 posts 3046 karma points c-trib

Oct 11, 2010 @ 11:35

Create your own indexer to combine the data, or create your own searcher that implements MultiSearcher.

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Oct 11, 2010 @ 12:57

Slace,

Or tap into media events and push into index there?

Regards

Ismail

Aaron Powell 1708 posts 3046 karma points c-trib

Oct 11, 2010 @ 14:35

What events are you thinking of using?

I think it'd be easier to create either a custom indexer or searcher

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Oct 11, 2010 @ 14:58

Slace,

The media new, delete and update events; I was thinking of tapping into those. But looking at it logically, creating your own indexer or searcher seems the better route.

Regards

Ismail

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Oct 11, 2010 @ 15:04

You could do it a bit 'dodgy' and just listen to the indexed event of the PDF indexer and add the results to your Content Indexer using ReIndexNode

This means that you'll have PDF data in two indexes... but it would be very little code to write.

Andrew Waegel 126 posts 126 karma points

Oct 11, 2010 @ 19:54

Thanks Shannon, I like the idea of a process that would put the indexed PDF text into the content indexer. The .NET part of this is a little advanced for me but I'm happy to give it a try.

I think this means I'm fetching the Umbraco Examine source, making a new UmbracoExamine.PDF.PDFIndexer that does what you say (adds the extracted text to the content indexer using ReIndexNode), then rebuilding and copying over the DLL, and using the new indexer for my IndexProvider that works on the PDF files?

Sorry for the noob questions, hope to get this worked out, and happy to share the results if I do.

Aaron Powell 1708 posts 3046 karma points c-trib

Oct 12, 2010 @ 00:22

No, Examine raises events that you can add handlers to, like you would if you were adding one to the Document object in Umbraco.

Check out Shan's CG10 slides for the event list.

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Oct 12, 2010 @ 06:57

Also, RC3 is still an RC!

This is by no means final. There are bugs in the RC and a lot has changed since in the latest version, which will become v1.0. Be mindful that there will be some breaking changes... for most people it should be painless to upgrade.

Here are some release notes for v1.0:

http://examine.codeplex.com/releases/view/50781

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Oct 12, 2010 @ 18:03

Shannon or Slace,

Is there an event for when an item is removed from the index? I am looking at implementing Shannon's dirty hack of putting the PDF data into the content index, so I am tapping into this event:

ExamineManager.Instance.IndexProviderCollection[PdfIndex].GatheringNodeData 
                += new System.EventHandler<IndexingNodeDataEventArgs>(ExamineEvents_MediaGatheringNodeData);

At that point I will put the item into the content index. However, when I remove the PDF I also need to remove it from my content index, hence I need to hook that event too. Worst case, I can tap into the Umbraco media delete event and do it from there.

Regards

Ismail

Aaron Powell 1708 posts 3046 karma points c-trib

Oct 13, 2010 @ 00:14

IndexDeleted event is fired when an index of a node is removed.

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Oct 13, 2010 @ 10:58

Slace,

The delete delegate has signature:

void ExamineEvents_IndexDeleted(object sender, DeleteIndexEventArgs e)

From e I cannot get the id of the node being deleted, so is it possible to get it from sender? If so, what can I cast sender to? Or am I missing a trick?

Regards

Ismail

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Oct 13, 2010 @ 15:34

Slace,

Ignore last post i have figured it out:

 nodeId = e.DeletedTerm.Value;

which is the nodeid of the item being deleted.
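For anyone following along, the bookkeeping behind the two handlers can be sketched like this. It is only an illustration: `MirroredPdfFields` is a hypothetical in-memory stand-in for the content index, and it assumes the deleted term is a key/value pair of strings whose value is the node id, which is what the line above suggests but which I haven't confirmed against the Examine source.

```csharp
using System.Collections.Generic;
using System.Globalization;

// In-memory stand-in for the content index, used only to show the bookkeeping.
public class MirroredPdfFields
{
    private readonly Dictionary<string, string> pdfTextByNodeId =
        new Dictionary<string, string>();

    // Call from the PDF indexer's gathering-data handler with the extracted text.
    public void OnPdfIndexed(int nodeId, string text)
    {
        pdfTextByNodeId[nodeId.ToString(CultureInfo.InvariantCulture)] = text;
    }

    // Call from the delete handler; the term's value is taken to be the node id.
    // Returns false when the node was never mirrored in the first place.
    public bool OnPdfDeleted(KeyValuePair<string, string> deletedTerm)
    {
        return pdfTextByNodeId.Remove(deletedTerm.Value);
    }

    public bool Contains(int nodeId)
    {
        return pdfTextByNodeId.ContainsKey(nodeId.ToString(CultureInfo.InvariantCulture));
    }
}
```

In the real handlers the dictionary writes would be replaced by calls against the content indexer, but the pairing of "add on index, remove on delete, keyed by node id" stays the same.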

Regards

Ismail

Euan Rae 11 posts 31 karma points

Nov 21, 2010 @ 23:35

I am using examine for site search + pdf searching; is it possible to set the PDFs so it only indexes (and searches on) the metadata for PDFs?

Aaron Powell 1708 posts 3046 karma points c-trib

Nov 22, 2010 @ 03:24

Nope, you'd have to write your own indexer to only insert the metadata from a PDF. I'm not really sure if iTextSharp (which we use) can extract metadata, I'd assume it does.

MikeD 92 posts 112 karma points

Aug 23, 2012 @ 21:02

Hi folks,

Way late to this conversation, but this thread has gotten me so close to implementing my client's request I can almost taste success. The one thing I do not understand is the combining of either indexers or searchers.

Here's what I have now (Umbraco 4.8.0):

In ExamineSettings:

<add name="RazorSiteIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
supportUnpublished="false"
supportProtected="false"
analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net"/>

<add name="PDFIndexer" type="UmbracoExamine.PDF.PDFIndexer, UmbracoExamine.PDF"
supportUnpublished="false"
supportProtected="false" />

<add name="RazorSiteSearcher"
type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"/>

<add name="PDFSearcher" 
type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" />

and in my cshtml file:

var searcher = ExamineManager.Instance.SearchProviderCollection["RazorSiteSearcher"];

Both searchers appear to be doing their jobs, but I need to combine them both into my results page.

Any assistance would be very much appreciated.

-Mike D

MikeD 92 posts 112 karma points

Aug 23, 2012 @ 23:58

Hrm, it appears there is a problem with the PDF searcher. I really shoulda tested before I posted...

When I set the searcher collection to PDFSearcher, I get an error on the results page:

Error loading MacroEngine script (file: SearchResults.cshtml)

2 questions... first, how do I get more info in the error? That might help me figure out what is wrong, and 2... what could be wrong? lol

Nathan Woulfe 447 posts 1665 karma points MVP 5x hq c-trib

Aug 24, 2012 @ 01:21

Mike - append '?umbDebugShowTrace=true' to the url, and find the angry red text...

MikeD 92 posts 112 karma points

Aug 24, 2012 @ 02:55

Nathan,

Thanks for the reply... unfortunately still nothing... is there anything else I need to do to make that work?

Nathan Woulfe 447 posts 1665 karma points MVP 5x hq c-trib

Aug 24, 2012 @ 03:50

Is the below key present in your web config? That should be enough to enable the trace, which will show you where the problem is

<appSettings>
...
<add key="umbracoDebugMode" value="true" />
...
</appSettings>

I haven't used the PDF indexer, so won't be any real help on that front!

MikeD 92 posts 112 karma points

Aug 24, 2012 @ 15:48

Still no more detail... grrr...

<add key="umbracoDebugMode" value="true" />

This is like trying to chase down a Windows error... lol

MikeD 92 posts 112 karma points

Aug 24, 2012 @ 20:25

Ok, I got my error messages... I'm an idiot... lol

I can now see what the problem is in my script... but figuring out a good way to fix it is going to require someone with much more knowledge of Examine than we possess. What I need to do is to combine 2 indexes into 1. The problem with my script is that the PDF index does not have any of the fields I need to sort out my results. If I could add the PDF index to the site content index, I could then sort and filter my results like the client wants. I could also exclude PDFs that have been "unpublished" via the content tree.

Gawds I hope that made sense... I've been looking at this code for too long and I need a beer or 12....

If there is anyone reading this thread that can help, please please contact me off list so I can try to explain what my client is looking for and how best to accomplish the task.

Thanks everyone...

-Mike D

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Aug 24, 2012 @ 20:35

Mike

The latest version does multi index search. With regards to sort field inject the field in using gatheringnode data event event.
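Until the sort field is actually in the PDF index, a results script can at least avoid falling over by reading fields defensively. A small sketch, assuming each result exposes its stored fields as a string dictionary; `FieldReader` is a hypothetical helper name, not something that ships with Examine.

```csharp
using System.Collections.Generic;

public static class FieldReader
{
    // Returns the named field, or the fallback when the index never stored it.
    public static string GetOrDefault(
        IDictionary<string, string> fields, string name, string fallback)
    {
        string value;
        return fields.TryGetValue(name, out value) && !string.IsNullOrEmpty(value)
            ? value
            : fallback;
    }
}
```

For example, `FieldReader.GetOrDefault(fields, "nodeTypeAlias", "File")` would let PDF hits sort alongside content hits under a "File" bucket instead of throwing on the missing key.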

Regards

Ismail

MikeD 92 posts 112 karma points

Aug 24, 2012 @ 20:51

Thanks for the quick response Ismail...

I am already searching 2 indexes, that's part of my issue. There are no fields in the PDF index to work with other than nodeid... when I search both indexes, I get errors on the results page... my main output is sorted based on the NodeTypeAlias and that field does not exist in the PDF index, so it blows up. If I can combine everything into 1 index, that index will include all the fields I am currently working with.

Please note that I am using built in stuff here... no custom programming. It's all UmbracoExamine and Razor.

Also be advised... I am NOT a programmer. You have to use small words when explaining stuff to me... lol I know enough about programming to grasp concepts, but "inject the field in using gatheringnode data event event" is not something I understand. If you can explain, or give examples, maybe I can grasp the idea, then I can run with it and figure out how to do it. I really need several fields in the PDF index to do what my client wants. Without getting into specifics, it would be hard to explain, and I don't want to burden everyone with all that detail. I'm trying to give enough info to get me pointed in the right direction without writing a novel... lol

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Aug 24, 2012 @ 23:11

Mike,

Are you on Skype? I can talk you through it. What you are trying to achieve is doable having done something similar.

Examine has a rich eventing system. One of the events is GatheringNodeData; you can tap into that event and inject your own fields. So in your case, when PDF indexing happens we can use the event to shove in a nodeTypeAlias field, and we can also inject anything else that is needed. My Skype is ismail_mayat; if you add me I can talk you through it on Monday.

PS: can you download Luke? It's a useful tool for looking at what is in an Examine/Lucene index. Just google "Luke for Lucene"; it's a Java app and the latest version is on the Google Code site.
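In small words, "injecting a field" just means adding an entry to the bag of fields before it is written to the index. The sketch below shows only that idea and runs on its own: `NodeDataArgs` is a local stand-in for the event args (it assumes they expose the fields as a dictionary, which I'm stating from memory rather than from the docs), `PdfFieldInjector` is a made-up name, and "File" is assumed to be the alias of the media type holding the PDFs.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the event args: the node id plus the fields about to be indexed.
public class NodeDataArgs : EventArgs
{
    public int NodeId;
    public Dictionary<string, string> Fields = new Dictionary<string, string>();
}

public static class PdfFieldInjector
{
    // Handler body: add the field the results page sorts on,
    // without overwriting anything the indexer already supplied.
    public static void AddMissingFields(object sender, NodeDataArgs e)
    {
        if (!e.Fields.ContainsKey("nodeTypeAlias"))
            e.Fields["nodeTypeAlias"] = "File";
    }
}
```

The handler would be attached to the PDF indexer's GatheringNodeData event in the same `+=` style shown earlier in this thread, typically once at application start.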

Regards

Ismail

MikeD 92 posts 112 karma points

Aug 27, 2012 @ 16:08

Ismail,

I am on Skype, I went to add you this morning and there are several Ismail Mayats... Are you the one in Preston, UK? That's the one I added... hope it's you. lol

I have Luke already downloaded, and have used it several times already in this learning process. Don't know how I would have gotten as far as I have without it. Anyone else cruising this thread should download it. Great tool to have.

Assuming I got the correct Skype account, send me a blip before you try to call. I really appreciate the fact you are willing to do this, you have no idea.

-Mike D

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Aug 27, 2012 @ 16:13

Mike

Apologies, I just realised it's a bank holiday today in the UK. I'll be online tomorrow; you got the right Skype user. I was surprised how many Skype users there already are with my name lol, thankfully I got my Twitter name bagged!

I love messing around with examine and your issue is very similar to what i did on fairbairnpb.co.uk

Regards

Ismail

MikeD 92 posts 112 karma points

Aug 27, 2012 @ 16:25

Is there a place on the web with maybe a list of all the events and stuff available in Examine? I have some really smart programmers on staff that I can bug if I just have a reference. And if I cannot get it figured out today, I'd most welcome your assistance tomorrow.

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Aug 27, 2012 @ 16:32

There are links and docs on examine.codeplex.com, plus some Umbraco TV and Codegarden videos; see stream.umbraco.org

MikeD 92 posts 112 karma points

Aug 28, 2012 @ 17:40

Many many thanks to Ismail. Not many people would go to the lengths he did to help a complete stranger.

You are a rock star sir!

Matt Taylor 873 posts 2086 karma points

Jan 15, 2013 @ 15:32

In which version of Umbraco was the PDF indexer added to examine?

I have a 4.7.1 site I'd like to add PDF search to but don't know if I need to upgrade Umbraco first.

Thanks, Matt

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Jan 15, 2013 @ 15:57

Matt,

4.7.1 has the pdfindexer out of the box.

Matt Taylor 873 posts 2086 karma points

Jan 15, 2013 @ 16:00

Thanks Ismail!

Matt Taylor 873 posts 2086 karma points

Feb 15, 2013 @ 14:52

Is the Examine PDF Indexer supposed to index the actual content of the PDF files or just the filenames?

I've tried the CogUmbracoExamineMediaIndexer package which indexes the content but the Examine PDF Indexer seems to only be returning matches on the filename and not the file content.

Cheers, Matt

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Feb 15, 2013 @ 17:31

The Examine PDF indexer should index the content as well, but not any metadata. Is the data present when you look with Luke? There should be a field called FileTextContent.

Matt Taylor 873 posts 2086 karma points

Feb 26, 2013 @ 12:40

Sorry for the delay getting back to you Ismail,
This is mainly an exercise in increasing my understanding, so it took a back seat to some work I had to do for a couple of days.

I've looked in Luke at the index created by the CogUmbracoExamineMediaIndexer package which works great and I can see all the PDF content indexed:

The examine PDF index however has just indexed a bunch of numbers:

It's strange and I can assure you that both indexes are looking at the same PDF media files:

This is how the index is configured:

<IndexSet SetName="PDFIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/PDF/" IndexParentId="-1">
  <IndexAttributeFields>
    <add Name="id" />
    <add Name="nodeName" />
    <add Name="updateDate" />
    <add Name="writerName" />
    <add Name="path" />
    <add Name="nodeTypeAlias" />
    <add Name="parentID" />
  </IndexAttributeFields>
  <IncludeNodeTypes>
    <add Name="File" />
  </IncludeNodeTypes>
</IndexSet>

The indexer:

<add name="PDFIndexer" type="UmbracoExamine.PDF.PDFIndexer, UmbracoExamine.PDF"/>

The searcher:

<add name="PDFSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" enableLeadingWildcards="true" />

Regards,

Matt

d Thomas 13 posts 33 karma points

Mar 20, 2013 @ 12:32

@Ismail

Hi Ismail,

Could you please assist with examine.pdf configuration for search in the pdf content?

I am using Umbraco 4.9. I copied the latest version of Umbraco Examine PDF from CodePlex and placed the DLLs in the bin folder, but I got stuck on the later configuration for searching PDF content.

Thanks,

David

Matt Taylor 873 posts 2086 karma points

Mar 20, 2013 @ 13:34

I still haven't managed to get it working as expected either. :-(

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Mar 20, 2013 @ 14:59

David,

Are you having problems searching or indexing? Can you paste your ExamineIndex and ExamineSettings config files? Also, can you take a look at your PDF index using Luke or http://our.umbraco.org/projects/backoffice-extensions/examine-inspector and check whether you have any documents in the index?

Regards

Ismail

d Thomas 13 posts 33 karma points

Mar 20, 2013 @ 17:46

Hi Ismail,

I think I have a problem with the searching, or maybe only with rendering the results in an Umbraco page.

I also installed Luke (and the ExamineIndexAdmin and Examine modules for the developer side), and it shows my documents indexed.

ExamineSettings:

<?xml version="1.0"?>
<!-- Umbraco examine is an extensible indexer and search engine. This configuration file can be extended to add your own search/index providers. Index sets can be defined in the ExamineIndex.config if you're using the standard provider model. More information and documentation can be found on CodePlex: http://umbracoexamine.codeplex.com -->
<Examine>
  <ExamineIndexProviders>
    <providers>
      <add name="InternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
           supportUnpublished="true"
           supportProtected="true"
           interval="10"
           analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>

      <add name="InternalMemberIndexer" type="UmbracoExamine.UmbracoMemberIndexer, UmbracoExamine"
           supportUnpublished="true"
           supportProtected="true"
           interval="10"
           analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net"/>

        <!-- default external indexer, which excludes protected and unpublished pages-->
        <add name="ExternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
            supportUnpublished="false"
            supportProtected="false"
            interval="10"
            analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>



        <add name="PDFIndexer" 
             type="UmbracoExamine.PDF.PDFIndexer, UmbracoExamine.PDF"
             extensions=".pdf"
             umbracoFileProperty="umbracoFile" runAsync="true"/>


    </providers>
  </ExamineIndexProviders>

  <ExamineSearchProviders defaultProvider="ExternalSearcher">
    <providers>
           <add name="PDFSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
             analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" enableLeadingWildcards="true"/>
      <add name="InternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
           analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>

      <add name="ExternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
             analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" enableLeadingWildcards="true"/>

      <add name="InternalMemberSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
           analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" enableLeadingWildcards="true"/>

     <add name="PDFSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
             analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" enableLeadingWildcards="true"/>

    </providers>
  </ExamineSearchProviders>




</Examine>

ExamineIndex:

<?xml version="1.0"?>
<!-- Umbraco examine is an extensible indexer and search engine. This configuration file can be extended to create your own index sets. Index/Search providers can be defined in the UmbracoSettings.config More information and documentation can be found on CodePlex: http://umbracoexamine.codeplex.com -->
<ExamineLuceneIndexSets>
  <!-- The internal index set used by Umbraco back-office - DO NOT REMOVE -->
  <IndexSet SetName="InternalIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/Internal/">
    <IndexAttributeFields>
      <add Name="id" />
      <add Name="nodeName" />
      <add Name="updateDate" />
      <add Name="writerName" />
      <add Name="path" />
      <add Name="nodeTypeAlias" />
      <add Name="parentID" />
    </IndexAttributeFields>
    <IndexUserFields />
    <IncludeNodeTypes/>
    <ExcludeNodeTypes />
  </IndexSet>

  <!-- The internal index set used by Umbraco back-office for indexing members - DO NOT REMOVE -->
  <IndexSet SetName="InternalMemberIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/InternalMember/">
    <IndexAttributeFields>
      <add Name="id" />
      <add Name="nodeName"/>
      <add Name="updateDate" />
      <add Name="writerName" />
      <add Name="loginName" />
      <add Name="email" />
      <add Name="nodeTypeAlias" />
    </IndexAttributeFields>
    <IndexUserFields/>
    <IncludeNodeTypes/>
    <ExcludeNodeTypes />
  </IndexSet>

  <!-- Default Indexset for external searches, this indexes all fields on all types of nodes-->
  <IndexSet SetName="ExternalIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/External/" />
  <IndexSet SetName="PDFIndexSet" IndexPath="~/App_Data/ExamineIndexes/PDFIndexSet" IndexParentId="-1"/>
</ExamineLuceneIndexSets>

I wish to render the results with a Razor script (render results):

@using Examine
@using Examine.SearchCriteria
@using UmbracoExamine
@using UmbracoExamine.PDF

@{
    var searchString = Request["searchString"];
    var searchResults = ExamineManager.Instance.SearchProviderCollection["PDFSearcher"].Search(searchString.ToLower(), false).ToList();
}

Please let me know how you render the search/results in umbraco.

I based it on this post, adapting the relevant part for the PDFSearcher:

http://joeriks.com/2011/03/15/ajax-enabled-search-in-umbraco-using-examine-and-razor/#comment-1269

Thanks,

David
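
A side note on the ExamineSettings posted above: PDFSearcher is registered twice in the search providers list, and neither the PDF indexer nor the searcher names its index set. As far as I know, Examine falls back to a naming convention (PDFIndexer and PDFSearcher both map to PDFIndexSet), so this may well work as posted, but a single entry with an explicit indexSet attribute is easier to reason about. The fragment below is only a sketch under that assumption; the analyzer is kept exactly as in the post and is not a recommendation.

```xml
<!-- ExamineSettings.config (fragment): goes inside ExamineIndexProviders/providers -->
<add name="PDFIndexer"
     type="UmbracoExamine.PDF.PDFIndexer, UmbracoExamine.PDF"
     indexSet="PDFIndexSet"
     extensions=".pdf"
     umbracoFileProperty="umbracoFile" />

<!-- ExamineSettings.config (fragment): goes inside ExamineSearchProviders/providers.
     Registered once only; the second PDFSearcher entry from the post is dropped. -->
<add name="PDFSearcher"
     type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
     indexSet="PDFIndexSet"
     analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"
     enableLeadingWildcards="true" />
```

The SetName in ExamineIndex.config ("PDFIndexSet") has to match the indexSet value used here.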

d Thomas 13 posts 33 karma points

Mar 20, 2013 @ 17:50

Yes I have pdf files, I added 2 pdf files in media.

Damjan 12 posts 31 karma points

May 01, 2013 @ 13:38

Hi All

I seem to have the same problem as Matt. I looked in Luke, only to see random numbers as values for FileTextContent, such as "1 1,1 1221/0 13 2 21 21/21 21/3 3 4 5 6 7" etc.
I searched for "1", printed out the actual content, and saw it was:

!"# $% &'%'()*" +,-./ 0%"& %/ ,& %/"1221/0()*%"& %/ ,& %/"21/3%$& %/ ,& %/"#-4 -&21/&3()*%$& %/ ,& %/"'21/" +,-./ & %,& %/ ,& %/"'& ()*%,& %/ ,& %/"21/%5& %/ ,& %/"6&21/%5& %/ ,& %/"" +%,-./ 21/&%7&& %/ ,& %/"6%21/21/" +,-./ %2& %/ ,& %/"%21/&21/%74 ,& %/ ,& %/"21/21/%,& %/ ,& %/"21/" +%,-./ 21/&%21/,& %/ ,& %/"'8321/" +,-./ %,& %/ ,& %/"#9$%21/,& %/ ,& %/"'$& ! ! ! !$6::,& %/ ,& %/"$2;:,:"#$"#$"#$"#$%9 ;::::,:: <::::<'-"1<#%#%#%#%'-"1<#%#%#%#%'-1! %'-6%'-$0&"$&"$&"$&"$'()!'()!'()!'()!= &****9::# >?@@%:%:++++/&>+.+A.,+-,+-,+-,+-.).).).)++++;%B& ,& ,& ;%B& ,& ,& -7C,& ,& ,& -73 1,,-7,-%/-,& %/ ,& %/"1$1,1 !"!"D -&/6 !/ = &/ > --";/,-> /))-E??>&/):-<:E/";/,:-<:>+.F +A.??/# :-3:>;3?/% :-3:>+. +A.?>+.13+A.?0>*D+.;3+A.-<EGD+.";3+A.-<E/ D-<E?/&)(E+*(EHH>?

So I assume it is ignoring all the symbols and indexing only the numbers. But the whole point of the PDFIndexer is to read the actual values, right, not the encoded version? So I would like to ask whether Matt resolved this issue. Or, if someone else has managed to index PDF content and has time to look into what I'm trying to do, that would be great.

Thanks

<add name="PdfIndexer" type="UmbracoExamine.PDF.PDFIndexer, UmbracoExamine.PDF" extensions=".pdf" umbracoFileProperty="umbracoFile" interval="10"/>

<add name="PdfSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" />

<IndexSet SetName="PdfIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/Pdf/" />

var searcher = ExamineManager.Instance.SearchProviderCollection["PdfSearcher"];
var searchCriteria = searcher.CreateSearchCriteria(BooleanOperation.Or);
var query = searchCriteria.GroupedOr(new string[] { "FileTextContent", "nodeName" }, searchTerm).Compile();
var searchResults = searcher.Search(query);
var noResults       = searchResults.Count();
<p>You searched for <em>@searchTerm</em>, and found @noResults results</p>
<ul class="search-results">
    @foreach (var result in searchResults)
    {
        <li>
            <a href="@umbraco.library.GetMedia(result.Id, false)">@result.Fields["FileTextContent"]</a>
        </li>
    }
</ul>

Matt Taylor 873 posts 2086 karma points

May 01, 2013 @ 13:43

No I didn't manage to resolve it.

It's quite frustrating but I was just researching for future projects so the priority wasn't there and eventually had to move on.

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

May 01, 2013 @ 13:52

Damjan,

Try the http://our.umbraco.org/projects/website-utilities/cogumbracoexaminemediaindexer package and see if that indexes your PDF content.

Regards

Ismail

Matt Taylor 873 posts 2086 karma points

May 01, 2013 @ 13:57

The CogUmbracoExamineMediaIndexer worked great for me.

It's just a shame the out of the box stuff doesn't.

Shannon Deminick 1530 posts 5278 karma points MVP 3x

May 02, 2013 @ 14:48

Hi guys, I'm gonna be totally honest and say that I haven't read nearly any of the 8 pages of these issues, but I will say that the Examine PDF indexer 'DOES' index PDF content. It definitely does not index only PDF file names, otherwise it would be useless. The other thing to mention is that some PDFs are protected or created with some weird protection encoding, which is why you might see the strange chars. Examine's PDF indexer uses iTextSharp to read PDFs. The later the Examine version, the later the iTextSharp version, so upgrading might help. TBH I don't know anything about the Cogworks PDF indexer, so I'm not sure what it does beyond the normal Examine PDF indexer.

There is documentation on the Examine site noting that it is not possible to index 'ALL' PDF data, because the PDF 'standard' is not standard and is pretty f%#$d in general. We rely on iTextSharp: if it can't do it, then neither can we. However, please let me know if there are other issues with the PDF indexer. We have unit tests that pass, but if it is not working for 'any' of your PDFs then maybe it's a setting I've missed.

Damjan 12 posts 31 karma points

May 02, 2013 @ 14:59

Hello,

Thank you Matt and Ismail, the CogUmbracoExamineMediaIndexer indexes the content properly for me too. I will now try to implement the actual search box etc...

I'd just complain a bit about the procedure for the installation of the package, I got the same error as here:

http://our.umbraco.org/projects/website-utilities/cogumbracoexaminemediaindexer/bugs-support/37947-Could-not-load-file-or-assembly-IKVMOpenJDKBeans

and managed to fix it after adding and removing libraries for a while, but I'd recommend putting a bit more in the Readme, since it's not enough to just install the package and add the tika-app-1.2.dll; you also need a few more IKVM libraries in the bin folder.

Anyway, thank you again

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

May 02, 2013 @ 15:00

Shannon,

The Cogworks media indexer is just a wrapper around Apache Tika, so it will index everything Tika can handle, and that includes PDF. It will also rip out metadata and shove that in the index. Not sure why some of these PDFs are failing with the PDF indexer. Ideally the people having problems should send you the failing PDFs so you can see if you can recreate the issue.

Regards

Ismail

Shannon Deminick 1530 posts 5278 karma points MVP 3x

May 02, 2013 @ 15:03

For sure. Ideally just log an issue on the tracker at examine.codeplex.com with the faulting PDF(s) and I'll see if I can replicate.

Damjan 12 posts 31 karma points

May 02, 2013 @ 15:07

Hi,

This is the PDF I tried: it gave the funny symbols and numbers with the built-in PDFIndexer but was indexed OK with the Cogworks package. It's just the Razor cheat sheet I got from here:

http://our.umbraco.org/projects/developer-tools/razor-dynamicnode-cheat-sheet

So maybe check whether it is protected. If it is, then Shannon's explanation for the PdfIndexer not working makes sense to me too, but if it's not, maybe there's a bug in the built-in PdfIndexer?

Thanks,
Damjan

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

May 02, 2013 @ 15:17

Damjan,

It's probably the charset range in that PDF. The PDF indexer does have some code that tests the range of characters, so it's possible some stuff is getting through it, whereas Apache Tika is picking the text up.

Regards

Ismail
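
To make the "range of characters" idea concrete, here is a minimal sketch of a sanity check you could run over extracted text before trusting it. It is NOT the code inside Examine's PDFIndexer or iTextSharp; the class name, method names and the 0.5 threshold are all made up for illustration.

```csharp
using System;
using System.Linq;

// Hypothetical helper: not part of Examine, UmbracoExamine.PDF or iTextSharp.
public static class ExtractedTextCheck
{
    // Fraction of the non-whitespace characters that are letters.
    // Returns 0 for null, empty or whitespace-only input.
    public static double LetterRatio(string text)
    {
        if (string.IsNullOrWhiteSpace(text)) return 0.0;
        var visible = text.Where(c => !char.IsWhiteSpace(c)).ToArray();
        return (double)visible.Count(c => char.IsLetter(c)) / visible.Length;
    }

    // Flags text as probably garbled when too few of its visible characters
    // are letters. The default threshold is an arbitrary assumption, and
    // empty text is flagged as well.
    public static bool LooksGarbled(string text, double threshold = 0.5)
    {
        return LetterRatio(text) < threshold;
    }

    public static void Main()
    {
        // Ordinary prose: every visible character is a letter.
        Console.WriteLine(LooksGarbled("Razor DynamicNode cheat sheet")); // False

        // Start of the FileTextContent value quoted earlier in the thread:
        // only digits and symbols, no letters.
        Console.WriteLine(LooksGarbled("!\"# $% &'%'()*\" +,-./ 0%\"& %/ ,& %/\"1221/0()*")); // True
    }
}
```

A check like this only tells you that extraction went wrong for a given file; it does not recover the text, which still depends on the extraction library being able to decode the PDF's font encoding.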

Matt Taylor 873 posts 2086 karma points

May 02, 2013 @ 15:26

Damjan, yes, I also had problems with missing IKVMOpenJDKBeans assemblies. There's another post somewhere where I list everything you need.

Funnily enough, I also used the PDF cheat sheet to test indexing, but having considered it could be in a strange format, I decided to create my own PDFs using OpenOffice to convert a doc I had created. I figured that must be pretty standard but alas, no joy.
