This is from an e-mail I sent to Aaron recently, if anybody wants to start implementing PDF searching now, without iFilters.
It gets all of the readable text from a PDF, you could store it in some node in Umbraco and then it's searchable through Examine immediately.
Just wanted to let you know that I've found a very simple way to extract text from a PDF file through a library called PDFBox. Found this article http://www.codeproject.com/KB/string/pdf2text.aspx and tried it out, works as advertised.
I had to copy these to my bin folder:
FontBox-0.1.0-dev.dll
IKVM.GNU.Classpath.dll
IKVM.Runtime.dll
PDFBox-0.7.3.dll
But I only had to reference IKVM.GNU.Classpath.dll and PDFBox-0.7.3.dll to be able to build the code.
This is a nice solution without those ugly iFilters, so I hope it helps you for Lucene as well!
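For anyone following along, the extraction step itself could be sketched roughly like this. It assumes the IKVM-compiled PDFBox 0.7.3 DLLs listed above, whose classes sit in the `org.pdfbox` namespaces; the `PdfBoxText` class and `Extract` method are names made up for this sketch, not part of any library.

```csharp
using org.pdfbox.pdmodel;   // from PDFBox-0.7.3.dll (IKVM build)
using org.pdfbox.util;

public static class PdfBoxText
{
    // Returns all readable text PDFBox can find in the file at 'path'.
    // 'PdfBoxText' and 'Extract' are hypothetical names used for illustration.
    public static string Extract(string path)
    {
        PDDocument doc = PDDocument.load(path);
        try
        {
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(doc);
        }
        finally
        {
            doc.close(); // release the underlying file
        }
    }
}
```

The returned string could then be saved into a property on the node so that Examine indexes it along with everything else.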
It'll be available once I get a few more pressing projects out of the way.
Examine will be running as Out-Of-Band releases to Umbraco (like ASP.NET MVC has done with Visual Studio) so there's no promises it'll make the 4.1 release.
But if someone wants to write their own indexer, be my guest. It is a provider model and you're completely welcome to create your own, it's what it's designed for ;)
Here's what I've created so far. This indexes DOCX files because they're the simplest scenario. Document text is rolled into the node that it is attached to.
I'm not really sure what I'm doing, but it seems to work so far, so critical feedback welcome.
public class AttachmentAndSecurityAwareIndexer : UmbracoExamine.LuceneExamineIndexer
{
    protected override Dictionary<string, string> GetDataToIndex(System.Xml.Linq.XElement node, Examine.IndexType type)
    {
        StringBuilder fileText = new StringBuilder();

        // find all files picked in the 'related downloads' property (multiple media picker)
        string values = node.Elements("data").Single(e => e.Attribute("alias").Value == "relatedDownloads").Value;
        foreach (var value in values.Split(','))
        {
            int mediaId;
            if (int.TryParse(value, out mediaId))
            {
                Media media = new Media(mediaId);
                if (media.Id == 0) break;

                // NOTE: 'filename' and 'extension' are not defined in the snippet as posted;
                // they have to be derived from the media item's file before this point.
                fileText.AppendLine();
                // depending on the extension use various methods to extract the text that will go into the lucene index.
                switch (extension.ToUpperInvariant())
                {
                    case "DOCX":
                        fileText.Append((new DocxToText(filename)).ExtractText());
                        break;
                }
            }
        }

        // Get the Base Data to index
        var result = base.GetDataToIndex(node, type);

        // add the file text to the data to index.
        if (!result.ContainsKey("bodyText"))
            result.Add("bodyText", fileText.ToString());
        else
            result["bodyText"] += fileText;
        return result;
    }
}
Will GetDataToIndex only be passed nodes of type content by Examine, or will it also receive nodes of type media? If it does not receive media nodes, what do I need to do so that I can index them? Could I do it with an action handler for media after save and somehow add the item to the index using the Examine API?
Ismail - it'll get all the nodes (content and media) with the IndexType defining which one it is.
Shan and I just realised that there is no way to restrict an index to being just content or just media (unless you restrict the content types), so we may add that as a configuration property.
An update on using this method: I spent a while figuring out that I was adding fields for protected nodes. This guard statement at the top of your GetDataToIndex method should ensure your indexer plays nice when supportProtected="false". I'm not sure if there is a more elegant way of doing this?
protected override Dictionary<string, string> GetDataToIndex(System.Xml.Linq.XElement node, Examine.IndexType type)
{
    // Get the Base Data to index
    var result = base.GetDataToIndex(node, type);

    // check we have a result, if we have no fields this is probably a protected node and we shouldn't add anything else.
    if (result.Count == 0)
        return result;

    // ... rest of GetDataToIndex continues here
Any progress on this? I need a PDF indexing solution for an upcoming site and would much prefer to use a community-supported solution. My C# skills are not stellar but I'm happy to put some time and effort into it.
All PDF data goes into its own index because the content could be quite huge and is better left to its own index. The PDF indexer will index media items only, and will only index files that are '*.pdf' and are contained in a property called 'umbracoFile' (these 2 things can be overridden in the index provider if necessary). If you need it to index PDFs that are in a content node, then you'll have to use the API to do this.
Hopefully we'll get the RTM out in the next week or two.
Great stuff ! I've managed to deploy the new DLLs and build up the indexes for web content and PDF's. However I'm having trouble searching against the newly created PDF Index.
Is there any documentation or examples of how to query against the PDF Index as when viewing the Index in an analyser it's not clear as to how this can be achieved?
Thanks in advance.
Please ignore the above, I was being a <DIV> as I hadn't copied across the UmbracoExamine.PDF.dll that meant the index wasn't created correctly...
Also, just in case you come across this... some PDFs are just not indexable/readable if they have been saved in certain ways, with security, etc. Your clients might complain, but the fact is that some PDFs just can't be read... at least not with iTextSharp.
Shannon, your example PDF index set didn't come through, can you repost? I know it's probably really simple but it would help those of us trying to get this going. Meanwhile I'll try to sort it out myself and post an example if it works.
You don't need to define anything as it's automated. It will index all media items that have a property of umbracoFile (which is already the property name of the Image and File media types) where the umbracoFile is a PDF.
Please download latest Examine version here, there's a few bugs fixed. This will be released as v1.0 RTM this week.
When you want to search over both the content and PDF indexes, what is the Examine syntax? I know in Lucene you can do cross-index searching, but I couldn't quite see how to do it via Examine.
It seems like we would want to specify multiple SearchProviderCollections in an ExamineManager instance, but it's not clear how to do that - we only see the SearchProviderCollection[] property.
If I happen to make this work while monkeying with it I'll post some results while waiting for the devs to check in.
All you'd need to do is concatenate your searches between the providers:
var combinedResults = ExamineManager.Instance.SearchProviderCollection["CWSSearcher"].Search("blah", true)
    .Concat(ExamineManager.Instance.SearchProviderCollection["PDFSearcher"].Search("blah", true));
You can use this same concept when searching with the Fluent API too.
Please be aware, however, that the 'Score' values returned by 2 separate searches are not comparable. The 'Score' value is only relevant to the results of one search, regardless of the index, so you can't compare 'Score' values across the concatenated results.
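To make the point about scores concrete, here is a small self-contained sketch. The `Hit` class is a hypothetical stand-in for a search result (it is not Examine's own type), and `ResultMerger` is just an illustrative name. Each list is ordered by its own scores and the second is appended to the first, so a score from one search is never ranked against a score from the other.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a search result; not Examine's own result type.
public class Hit
{
    public string Source;  // which searcher produced the hit
    public int Id;
    public float Score;    // only comparable to other hits from the same Source
}

public static class ResultMerger
{
    // Sorts each list by its own scores, then appends the second list to the first.
    // Scores are never compared across the two lists.
    public static List<Hit> Merge(IEnumerable<Hit> first, IEnumerable<Hit> second)
    {
        return first.OrderByDescending(h => h.Score)
                    .Concat(second.OrderByDescending(h => h.Score))
                    .ToList();
    }
}
```

With this shape a PDF hit with a high raw score still lands after the content hits, which is a deliberate choice for the sketch; interleaving the two lists would need some other ranking signal than 'Score'.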
Another approach would be to store your Content + PDF data into one index. The reason why we didn't implement this is because your PDF index could get really huge and we didn't want that to affect your Content/Media index. If you wanted however, you could use the API + events to get your PDF data into your Content/Media index.
Thanks Shannon, I like the idea of a process that would put the indexed PDF text into the content indexer. The .NET part of this is a little advanced for me but I'm happy to give it a try.
I think this means I'm fetching the Umbraco Examine source, making a new UmbracoExamine.PDF.PDFIndexer that does what you say (adds the extracted text to the content indexer using ReIndexNode), then rebuilding, copying over the DLL and using the new indexer for my IndexProvider that works on the PDF files?
Sorry for the noob questions, hope to get this worked out, and happy to share the results if I do.
This is by no means final; there are bugs in the RC and a lot has since changed in the latest version, which will become v1.0. Be mindful that there will be some breaking changes... for most people it should be painless to upgrade.
Is there an event for when an item is removed from the index? I am looking at implementing Shannon's dirty hack of putting PDF stuff into the content index, so I am tapping into the event
ExamineManager.Instance.IndexProviderCollection[PdfIndex].GatheringNodeData
+= new System.EventHandler<IndexingNodeDataEventArgs>(ExamineEvents_MediaGatheringNodeData);
and at that point I will put the item into the content index. However, when I remove the PDF I also need to remove it from my content index, hence I need to hit that event. Worst case, I can tap into the Umbraco media delete event and do it from there.
Nope, you'd have to write your own indexer to only insert the metadata from a PDF. I'm not really sure if iTextSharp (which we use) can extract metadata, I'd assume it does.
Way late to this conversation, but this thread has gotten me so close to implementing my client's request I can almost taste success. The one thing I do not understand is the combining of either indexers or searchers.
Ok, I got my error messages... I'm an idiot... lol
I can now see what the problem is in my script... but figuring out a good way to fix it is going to require someone with much more knowledge of Examine than we possess. What I need to do is to combine 2 indexes into 1. The problem with my script is that the PDF index does not have any of the fields I need to sort out my results. If I could add the PDF index to the site content index, I could then sort and filter my results like the client wants. I could also exclude PDFs that have been "unpublished" via the content tree.
Gawds I hope that made sense... I've been looking at this code for too long and I need a beer or 12....
If there is anyone reading this thread that can help, please please contact me off list so I can try to explain what my client is looking for and how best to accomplish the task.
I am already searching 2 indexes, that's part of my issue. There are no fields in the PDF index to work with other than nodeid... when I search both indexes, I get errors on the results page... my main output is sorted based on the NodeTypeAlias and that field does not exist in the PDF index, so it blows up. If I can combine everything into 1 index, that index will include all the fields I am currently working with.
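As a stop-gap on the rendering side, the lookup of a field that only exists in one of the two indexes can be made tolerant. This is only a sketch: it assumes the result's `Fields` collection behaves like a string dictionary (the Razor snippets in this thread index into it by name), and `FieldHelper` / `GetOrDefault` are made-up names.

```csharp
using System.Collections.Generic;

public static class FieldHelper
{
    // Returns fields[key] when the key exists, otherwise the supplied fallback.
    // Hypothetical helper for illustration; not part of Examine.
    public static string GetOrDefault(IDictionary<string, string> fields, string key, string fallback)
    {
        string value;
        return fields.TryGetValue(key, out value) ? value : fallback;
    }
}
```

Something like `FieldHelper.GetOrDefault(result.Fields, "nodeTypeAlias", "pdf")` would then give PDF hits a placeholder value to sort on instead of throwing when the field is missing.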
Please note that I am using built in stuff here... no custom programming. It's all UmbracoExamine and Razor.
Also be advised... I am NOT a programmer. You have to use small words when explaining stuff to me... lol I know enough about programming to grasp concepts, but "inject the field in using gatheringnode data event" is not something I understand. If you can explain, or give examples, maybe I can grasp the idea, then I can run with it and figure out how to do it. I really need several fields in the PDF index to do what my client wants. Without getting into specifics, it would be hard to explain, and I don't want to burden everyone with all that detail. I'm trying to give enough info to get me pointed in the right direction without writing a novel... lol
Are you on Skype? I can talk you through it. What you are trying to achieve is doable having done something similar.
Examine has a rich eventing system. One of the events is GatheringNodeData; you can tap into that event and inject your own fields. So in your case, when PDF indexing happens, we can use the event to shove in a nodeTypeAlias field, and we can also inject anything else that is needed. My Skype is ismail_mayat; if you add me I can talk you through it on Monday.
PS: can you download Luke? It's a useful tool for looking at what is in an Examine/Lucene index. Just google "Luke for Lucene"; it's a Java app and the latest version is on the Google Code site.
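In code, the idea of injecting a field through GatheringNodeData might look roughly like this. Treat it as a sketch under assumptions: the provider name "PDFIndexer" is taken from the config posted elsewhere in this thread, the event args are assumed to expose the outgoing fields as a `Fields` dictionary, and `PdfIndexFields`, `AddExtraFields`, `Register` and the "pdf" value are all invented for illustration.

```csharp
using System.Collections.Generic;
using Examine;

public static class PdfIndexFields
{
    // Adds the extra fields the search page wants to sort and filter on.
    // "pdf" is an arbitrary label chosen for this sketch.
    public static void AddExtraFields(Dictionary<string, string> fields)
    {
        fields["nodeTypeAlias"] = "pdf";
    }

    // Hook the handler up once, at application start-up.
    // Assumes an index provider named "PDFIndexer" is configured.
    public static void Register()
    {
        ExamineManager.Instance.IndexProviderCollection["PDFIndexer"].GatheringNodeData
            += (sender, e) => AddExtraFields(e.Fields);
    }
}
```

`Register()` needs to run once when the site starts (in Umbraco 4.x a class inheriting `ApplicationBase` is a common place for that), and items already in the index only pick up the new field after they are re-indexed.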
I am on Skype, I went to add you this morning and there are several Ismail Mayats... Are you the one in Preston, UK? That's the one I added... hope it's you. lol
I have Luke already downloaded, and have used it several times already in this learning process. Don't know how I would have gotten as far as I have without it. Anyone else cruising this thread should download it. Great tool to have.
Assuming I got the correct Skype account, send me a blip before you try to call. I really appreciate the fact you are willing to do this, you have no idea.
Apologies, just realised it's a bank holiday today in the UK. I'll be online tomorrow; you got the right Skype user. I was surprised how many Skype users there already are with my name lol, thankfully I got my Twitter name bagged!
I love messing around with examine and your issue is very similar to what i did on fairbairnpb.co.uk
Is there a place on the web with maybe a list of all the events and stuff available in Examine? I have some really smart programmers on staff that I can bug if I just have a reference. And if I cannot get it figured out today I most welcome your assistance tomorrow.
Is the Examine PDF Indexer supposed to index the actual content of the PDF files or just the filenames?
I've tried the CogUmbracoExamineMediaIndexer package which indexes the content but the Examine PDF Indexer seems to only be returning matches on the filename and not the file content.
The Examine PDF indexer should index the content as well, but not any metadata. Is the data present when you look with Luke? There should be a field called FileTextContent.
Sorry for the delay getting back to you, Ismail. This is mainly an exercise in increasing my understanding, so it took a back seat to some work I had to do for a couple of days.
I've looked in Luke at the index created by the CogUmbracoExamineMediaIndexer package which works great and I can see all the PDF content indexed:
The examine PDF index however has just indexed a bunch of numbers:
It's strange and I can assure you that both indexes are looking at the same PDF media files:
Could you please assist with examine.pdf configuration for search in the pdf content?
I am using Umbraco 4.9. I copied the latest version of Umbraco Examine PDF from CodePlex and placed the DLLs in the bin folder, but got stuck on the subsequent configuration for searching PDF content.
I think I have a problem with the searching or maybe even only rendering the results in umbraco page.
I also installed luke (and ExamineIndexAdmin and Examine modules for developer side) and shows my documents indexed.
ExamineSettings:
<?xml version="1.0"?>
<!--
  Umbraco examine is an extensible indexer and search engine.
  This configuration file can be extended to add your own search/index providers.
  Index sets can be defined in the ExamineIndex.config if you're using the standard provider model.

  More information and documentation can be found on CodePlex: http://umbracoexamine.codeplex.com
-->
<Examine>
  <ExamineIndexProviders>
    <providers>
      <add name="InternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
           supportUnpublished="true"
           supportProtected="true"
           interval="10"
           analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>

      <add name="InternalMemberIndexer" type="UmbracoExamine.UmbracoMemberIndexer, UmbracoExamine"
           supportUnpublished="true"
           supportProtected="true"
           interval="10"
           analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net"/>

      <!-- default external indexer, which excludes protected and published pages-->
      <add name="ExternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
           supportUnpublished="false"
           supportProtected="false"
           interval="10"
           analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>

      <add name="PDFIndexer" type="UmbracoExamine.PDF.PDFIndexer, UmbracoExamine.PDF"
           extensions=".pdf"
           umbracoFileProperty="umbracoFile"
           runAsync="true"/>
    </providers>
  </ExamineIndexProviders>

  <ExamineSearchProviders defaultProvider="ExternalSearcher">
    <providers>
      <add name="PDFSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
           analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"
           enableLeadingWildcards="true"/>

      <add name="InternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
           analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>

      <add name="ExternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
           analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"
           enableLeadingWildcards="true"/>

      <add name="InternalMemberSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
           analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net"
           enableLeadingWildcards="true"/>

      <add name="PDFSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
           analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"
           enableLeadingWildcards="true"/>
    </providers>
  </ExamineSearchProviders>
</Examine>
ExamineIndex:
<?xml version="1.0"?>
<!--
  Umbraco examine is an extensible indexer and search engine.
  This configuration file can be extended to create your own index sets.
  Index/Search providers can be defined in the UmbracoSettings.config

  More information and documentation can be found on CodePlex: http://umbracoexamine.codeplex.com
-->
<ExamineLuceneIndexSets>
  <!-- The internal index set used by Umbraco back-office - DO NOT REMOVE -->
  <IndexSet SetName="InternalIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/Internal/">
    <IndexAttributeFields>
      <add Name="id"/>
      <add Name="nodeName"/>
      <add Name="updateDate"/>
      <add Name="writerName"/>
      <add Name="path"/>
      <add Name="nodeTypeAlias"/>
      <add Name="parentID"/>
    </IndexAttributeFields>
    <IndexUserFields/>
    <IncludeNodeTypes/>
    <ExcludeNodeTypes/>
  </IndexSet>

  <!-- The internal index set used by Umbraco back-office for indexing members - DO NOT REMOVE -->
  <IndexSet SetName="InternalMemberIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/InternalMember/">
    <IndexAttributeFields>
      <add Name="id"/>
      <add Name="nodeName"/>
      <add Name="updateDate"/>
      <add Name="writerName"/>
      <add Name="loginName"/>
      <add Name="email"/>
      <add Name="nodeTypeAlias"/>
    </IndexAttributeFields>
    <IndexUserFields/>
    <IncludeNodeTypes/>
    <ExcludeNodeTypes/>
  </IndexSet>

  <!-- Default Indexset for external searches, this indexes all fields on all types of nodes-->
  <IndexSet SetName="ExternalIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/External/"/>

  <IndexSet SetName="PDFIndexSet" IndexPath="~/App_Data/ExamineIndexes/PDFIndexSet" IndexParentId="-1"/>
</ExamineLuceneIndexSets>
I want to render the results with a Razor script (render results).
@using Examine
@using Examine.SearchCriteria
@using UmbracoExamine
@using UmbracoExamine.PDF
@{
var searchString = Request["searchString"];
var searchResults = ExamineManager.Instance.SearchProviderCollection["PDFSearcher"].Search(searchString.ToLower(),false).ToList();
}
Please let me know how you render the search/results in umbraco.
I have done it after this post, adapting the part for pdfsearcher.
I seem to have the same problem as Matt, I looked into Luke only to see random numbers as values for File Text Content. Such as "1 1,1 1221/0 13 2 21 21/21 21/3 3 4 5 6 7" etc. I searched for "1" and printed out the actual content and saw it was
So, I assume it is ignoring all the symbols and indexing the numbers only. But the whole point of the PDFIndexer is to read the actual values right, not the encoded version:/ So I would like to ask if Matt resolved this issue or if someone else managed to index PDF content and has time to look into what I'm trying to do, it would be great.
var searcher = ExamineManager.Instance.SearchProviderCollection["PdfSearcher"];
var searchCriteria = searcher.CreateSearchCriteria(BooleanOperation.Or);
var query = searchCriteria.GroupedOr(new string[] { "FileTextContent", "nodeName" }, searchTerm).Compile();
var searchResults = searcher.Search(query);
var noResults = searchResults.Count();
<p>You searched for <em>@searchTerm</em>, and found @noResults results</p>
<ul class="search-results">
@foreach (var result in searchResults)
{
<li>
<a href="@umbraco.library.GetMedia(result.Id, false)">@result.Fields["FileTextContent"]</a>
</li>
}
</ul>
Hi guys, I'm gonna be totally honest and say that I haven't read nearly any of all 8 pages of these issues, but I will say that the Examine PDF indexer 'DOES' index PDF content. It definitely does not only index PDF file names, otherwise it would be useless. The other thing to mention is that some PDFs are protected or created with some weird protection encoding, which is why you might experience the strange chars. Examine's PDF indexer uses iTextSharp to read PDFs. The later the Examine version, the later the iTextSharp version, so that might help. TBH I don't know anything about the Cogworks PDF indexer, so I'm not sure what it does beyond the normal Examine PDF indexer. There is documentation on the Examine site that does reference that it is not possible to index 'ALL' PDF data, and that is because the PDF 'standard' is not standard and is pretty f%#$d in general. We rely on iTextSharp. If it can't do it, then neither can we. However, please let me know if there are other issues with the PDF indexer. We have unit tests that pass, but if it is not working for 'any' of your PDFs then maybe it's a setting I've missed.
Thank you Matt and Ismail, the CogUmbracoExamineMediaIndexer indexes the content properly for me too. I will now try to implement the actual search box etc...
I'd just complain a bit about the procedure for the installation of the package, I got the same error as here:
and managed to fix it after adding and removing the libraries for a while, but I'd recommend that you put a bit more in the Readme, since it's not enough to just install the package and add the tika-app-1.2.dll; you also need a few more IKVM libraries in the bin folder.
The Cogworks media indexer is just a wrapper around Apache Tika, so it will index everything Tika can handle, and that includes PDF. It will also rip out metadata and shove that in the index. Not sure why some of these PDFs are failing with the PDF indexer. Ideally the people having problems need to send the failing PDFs to you, to see if you can recreate the issue?
This is the PDF I tried and got the funny symbols and numbers with the built-in PDFIndexer, but got indexed OK with the Cogworks package. It's just the Razor cheat sheet I got from here:
So, maybe check if it is protected; then Shannon's reason for the PdfIndexer not working makes sense to me too. But if it's not, maybe there's a bug in the built-in PdfIndexer..?
It's probably the charset range in that PDF. The PDF indexer does have some code that tests the range of characters; it's possible some stuff is getting through there that Apache Tika is picking up correctly.
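One cheap way to spot the "only numbers got indexed" symptom before blaming the search page is to look at how much of the extracted text is actually letters. The check below is a hypothetical diagnostic, not something the Examine PDF indexer does; `ExtractionCheck` and `LetterRatio` are invented names and any cut-off you pick is a judgment call.

```csharp
using System;
using System.Linq;

public static class ExtractionCheck
{
    // Fraction of non-whitespace characters that are letters.
    // Returns 0 for null, empty or whitespace-only input.
    public static double LetterRatio(string text)
    {
        if (string.IsNullOrEmpty(text)) return 0.0;

        int visible = text.Count(c => !char.IsWhiteSpace(c));
        if (visible == 0) return 0.0;

        int letters = text.Count(c => char.IsLetter(c));
        return (double)letters / visible;
    }
}
```

Running this over the FileTextContent values seen in Luke gives a ratio near zero for output like "1 1,1 1221/0 13 2 21", which points at the extraction step rather than at the query.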
Damjan, yes I also had problems missing IKVMOpenJDKBeans assemblies. There's another post somewhere where I list everything you need.
Funnily enough, I also used the PDF cheatsheet to test indexing, but having considered it could be in a strange format I decided to create my own PDFs using OpenOffice to convert a doc I created. I figured that must be pretty standard but alas, no joy.
Umbraco Examine - PDF Indexing
Hi,
We're looking to incorporate PDF indexing into Umbraco Examine. Has anyone done this in the past and has a suggestion for the best approach?
Or ideally, an extension/package that is already built? :-D
I've read the suggestions on http://www.farmcode.org/post/2009/04/20/Umbraco-Examine-v4x-Powerful-Umbraco-Indexing.aspx, but I was wondering if the community has a recommended approach for this.
Thanks,
Chris
Chris - I had a package that extracted metadata from PDF files and set the metadata as properties of the media item. It doesn't, however, extract the body text of the PDF.
Alternatively you could use my XSL FO package to generate your PDF's from Umbraco content nodes and just index the content nodes. Obviously if you have existing PDFs they'd have to be migrated.
HTH.
Hi Darren,
These are established PDF files, so creating the PDFs from content nodes isn't an option. It's really the indexing of PDF text I'm looking for.
Nice Package though, I can see a number of uses for it in future projects.
Chris
Chris,
Can use a number of IFilter implementations for extracting data from any document. Here's the one for pdf's. Seen some tweets re pdf indexing as well (i was all about PDFBox - be it java based...)
Cheers,
/Dirk
Dirk,
Have you seen any code, or do you know where to plug the IFilter stuff into Examine? Slace, if you're reading, are you going to do an Examine grok session at CG10?
Regards
Ismail
In answer to does Examine (well, Lucene) support PDF indexing see this post - http://www.aaron-powell.com/lucene-net-overview
Thanks for the feedback, if we build anything that can be packaged I'll make sure to post it.
Hi Chris,
I'm trying to do exactly the same thing, I've found recommendations for iTextSharp to get the text from the PDF, but I'm not sure how to get that text into the index.
I think I want to combine my node data and my pdf data into one Lucene-Document. Is this easy to do?
This will mean search results bring up the page that includes the relevant attachment thus providing context, rather than bringing up the attachment directly.
How are you extracting the text via iTextSharp? According to all the documentation I've read it is not possible to get back blocks of text from a PDF document.
Quote:
You can't 'parse' an existing PDF file using iText, you can only 'read' it page per page.
What does this mean?
The pdf format is just a canvas where text and graphics are placed without any structure information. As such there aren't any 'iText-objects' in a PDF file. In each page there will probably be a number of 'Strings', but you can't reconstruct a phrase or a paragraph using these strings. There are probably a number of lines drawn, but you can't retrieve a Table-object based on these lines. In short: parsing the content of a PDF-file is NOT POSSIBLE with iText. Post your question on the newsgroup news://comp.text.pdf and maybe you will get some answers from people that have built tools that can parse PDF and extract some of its contents, but don't expect tools that will perform a bullet-proof conversion to structured text.
What iText DOES provide is the possibility to READ a PDF document and copy an entire page of this file into the PDF file you are constructing from scratch. This can be useful if you want to create a new document based on (an) existing document(s). You can add a Watermark, pagenumbers,...
See: http://itextsharp.sourceforge.net/tutorial/ch01.html
I was looking at this:
http://stackoverflow.com/questions/83152/reading-pdf-documents-in-net/84410#84410
I haven't actually got it working yet, so I may well run into the problems you describe. :-\
Do you recommend any other ways to get the text data?
For searching purposes the text doesn't need to be pretty nor well structured, probably doesn't even need to be in the right order?
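For what it's worth, a page-by-page dump along these lines might be enough for indexing. This is a sketch, not a tested recipe: it assumes an iTextSharp 5.x release, which ships a `PdfTextExtractor` class in `iTextSharp.text.pdf.parser` (the older versions covered by the tutorial quoted above do not have it), and `PdfText` / `TryExtract` are made-up names.

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfText
{
    // Returns the text of every page, or an empty string if the file cannot be read.
    // Assumes iTextSharp 5.x; 'PdfText' and 'TryExtract' are hypothetical names.
    public static string TryExtract(string path)
    {
        try
        {
            PdfReader reader = new PdfReader(path);
            try
            {
                StringBuilder sb = new StringBuilder();
                // iTextSharp page numbers start at 1
                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
                }
                return sb.ToString();
            }
            finally
            {
                reader.Close();
            }
        }
        catch (Exception)
        {
            // missing, corrupt or password-protected files end up here
            return string.Empty;
        }
    }
}
```

The text comes back in whatever order the extractor finds it on each page, which matches the "doesn't need to be pretty" requirement for search; swallowing the exception keeps one unreadable PDF from stopping a whole indexing run, at the cost of hiding why it failed.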
Several years ago (3 or 4), I had to implement a search that indexed PDFs. I used Searcharoo code as a starting point (I was indexing pages, not the data source). I ended up using an iFilter implementation that indexed PDF and other binary documents, provided the iFilter was installed on the machine or configured to load directly. At the time, I had to override the Adobe iFilter, as it was known to fail due to some pathing bugs.
Anyhow, I think it would be wonderful to implement the iFilter parsers where possible, and perhaps have configuration as to which files are parsed, with the ability to parse undefined "*" files with the iFilter by default.
Case
Once I get a few projects that have very immediate deadlines I'll be creating a built-in PDF indexer for Examine.
Great news!
Any indication of when this would be available? Next couple of months?
Chris
This is from an e-mail I sent to Aaron recently, if anybody wants to start implementing PDF searching now, without iFilters.
It gets all of the readable text from a PDF, you could store it in some node in Umbraco and then it's searchable through Examine immediately.
Just wanted to let you know that I've found a very simple way to extract text from a PDF file through a library called PDFBox.
Found this article http://www.codeproject.com/KB/string/pdf2text.aspx and tried it out, works as advertised.
Sebastiaan,
where in examine did you have to make changes to implement pdf indexing?
Regards
Ismail
It'll be available once I get a few more pressing projects out of the way.
Examine will be running as Out-Of-Band releases to Umbraco (like ASP.NET MVC has done with Visual Studio) so there's no promises it'll make the 4.1 release.
But if someone wants to write their own indexer, be my guest. It is a provider model and you're completely welcome to create your own, it's what it's designed for ;)
Here's what I've created so far. This indexes DOCX files because they're the simplest scenario. Document text is rolled into the node that it is attached to.
I'm not really sure what I'm doing, but it seems to work so far, so critical feedback welcome.
Code:(using this codeproject sample to extract text from docx files)
Configuration changes to use the new class: (based on default configuration documentation)
Add this case to index PDF files (this uses the libraries mentioned in sebastiaans post)
Murray,
Take a look at Niel's code for the old umbSearch (http://umbracoext.codeplex.com/sourcecontrol/network/Show?projectName=umbracoext&changeSetId=49680). umbSearch makes use of the factory pattern: you implement IUmbracoSearchFileFilter, so that way you can plug in your own extensions easily for doc, pdf, rtf, whatever.
Regards
Ismail
Slace,
Looking at Murray's code I can see how he has supplied his own provider.
For the GetDataToIndex method, will Examine only pass it nodes of type content, or will it also receive nodes of type media? If it does not receive media nodes, what do I need to do so that I can index them? Could I do it with an action handler for media after save and somehow add it to the index using the Examine API?
Regards
Ismail
Ismail - it'll get all the nodes (content and media) with the IndexType defining which one it is.
Shan and I just realised that there is no way to restrict an index to being just content or just media (unless you restrict the content types), so we may add that as a configuration property.
An update on using this method: I spent a while figuring out that I was adding fields for protected nodes. This guard statement at the top of your GetDataToIndex method should ensure your indexer plays nice when supportProtected="false". I'm not sure if there is a more elegant way of doing this?
Any progress on this? I need a PDF indexing solution for an upcoming site and would much prefer to use a community-supported solution. My C# skills are not stellar but I'm happy to put some time and effort into it.
Hang on, now I see that PDF indexing has been added to Examine RC3 on CodePlex. Anyone implement this successfully yet? I'll be trying soon.
It works fine in our test suite :P
The latest code of Examine has PDF indexing support, and it also exists in RC3.
I've published the DLLs of the latest checkin (57217) which surpasses RC3 and is simplified. If you'd like to try it, you can download it from:
http://shazwazza.com/Content/Downloads/UmbracoExamine.57217.zip
The PDF indexer provider looks like this:
The PDF searcher provider looks like this:
The PDF index set is simple and looks like this:
All PDF data goes into its own index because the content could be quite huge and is better kept separate. The PDF indexer will index media items only, and will only index files that are '*.pdf' and are contained in a property called 'umbracoFile' (these 2 things can be overridden in the index provider if necessary). If you need it to index PDFs that are in a content node, then you'll have to use the API to do this.
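For anyone trying to piece the configuration together, here is a minimal sketch of what the three entries might look like. The type and assembly names, the `extensions` and `umbracoFileProperty` attribute names, the `PDFIndexSet` name and the index path are all assumptions based on a typical Examine setup, so check them against the Examine documentation for the version you have installed.

```xml
<!-- ExamineSettings.config, inside the index providers list (assumed names) -->
<add name="PDFIndexer"
     type="UmbracoExamine.PDF.PDFIndexer, UmbracoExamine.PDF"
     extensions=".pdf"
     umbracoFileProperty="umbracoFile" />

<!-- ExamineSettings.config, inside the search providers list (assumed names) -->
<add name="PDFSearcher"
     type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" />

<!-- ExamineIndex.config: no fields need defining, only a set name and a path (path is an assumption) -->
<IndexSet SetName="PDFIndexSet"
          IndexPath="~/App_Data/TEMP/ExamineIndexes/PDFs/" />
```

UmbracoExamine.PDF.dll has to be in the bin folder alongside the other Examine DLLs, otherwise the PDF index will not be created correctly.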
Hopefully we'll get the RTM out in the next week or two.
Great stuff! I've managed to deploy the new DLLs and build up the indexes for web content and PDFs. However, I'm having trouble searching against the newly created PDF index.
Is there any documentation or are there any examples of how to query against the PDF index? When viewing the index in an analyser it's not clear how this can be achieved.
Thanks in advance.
Please ignore the above, I was being a <DIV>: I hadn't copied across UmbracoExamine.PDF.dll, which meant the index wasn't created correctly...
Also, just in case you come across this... some PDFs are just not indexable/readable if they have been saved in certain ways with security, etc. You might come across this and your clients might complain, but the fact is that some PDFs just can't be read... at least with iTextSharp anyway.
Shannon, your example PDF index set didn't come through, can you repost? I know it's probably really simple but it would help those of us trying to get this going. Meanwhile I'll try to sort it out myself and post an example if it works.
- Andrew
PDF Indexer:
PDF Searcher:
PDF Index set is simple it is just:
You don't need to define anything as it's automated. It will index all media items that have a property of umbracoFile (which is already the property name of the Image and File media types) where the umbracoFile is a PDF.
Please download the latest Examine version here; there are a few bugs fixed. This will be released as v1.0 RTM this week.
http://shazwazza.com/Content/Downloads/UmbracoExamine57796.zip
Shannon,
When you want to search over both the content and PDF indexes, what is the Examine syntax? I know in Lucene you can do cross-index searching, but I couldn't quite see how to do it via Examine.
Regards
Ismail
+1 for the cross-index searching as well.
It seems like we would want to specify multiple SearchProviderCollections in an ExamineManager instance, but it's not clear how to do that - we only see the SearchProviderCollection[] property.
If I happen to make this work while monkeying with it I'll post some results while waiting for the devs to check in.
All you'd need to do is concatenate your searches between the providers:
You can use this same concept when searching with the Fluent API too.
Please be aware, however, that the 'Score' value is not comparable between 2 searches. The 'Score' value is only relevant within the results of one search, regardless of the index. So you can't compare 'Score' values across the concatenated results.
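A minimal sketch of the concatenation idea, written against stand-in types so it runs without Examine. `Hit`, `ResultMerger` and `Concatenate` are hypothetical names, not part of Examine; in a real site the inputs would be the results of calling Search on each provider, for example the content searcher and "PDFSearcher".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for a search hit (hypothetical type, not Examine's own result class).
public sealed class Hit
{
    public int Id { get; set; }
    public float Score { get; set; }
    public string Source { get; set; } // which index the hit came from
}

public static class ResultMerger
{
    // Appends the result sets one after another. Each set keeps its own
    // internal ordering; the combined list is NOT re-sorted by Score,
    // because scores from different searches are not comparable.
    public static List<Hit> Concatenate(params IEnumerable<Hit>[] resultSets)
    {
        return resultSets.SelectMany(set => set).ToList();
    }
}
```

Keeping a `Source` value on each hit lets the results page group or label PDF hits separately instead of pretending the two lists share one ranking.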
Another approach would be to store your Content + PDF data into one index. The reason why we didn't implement this is because your PDF index could get really huge and we didn't want that to affect your Content/Media index. If you wanted however, you could use the API + events to get your PDF data into your Content/Media index.
Examine doesn't use MultiSearcher; if you need that you'll have to implement a custom searcher.
Otherwise Shannon's solution is what you'll need to do.
Thanks for the replies. I'd like to try combining the content & PDF data into one index; how would I do that?
Would I just make one IndexSet for everything, with two IndexProviders (one for PDF, one for regular content) and one SearchProvider?
And if so, what type would I use for the search provider?
Create your own indexer to combine the data, or create your own searcher that uses MultiSearcher.
Slace,
Or tap into media events and push into index there?
Regards
Ismail
What events are you thinking of using?
I think it'd be easier to create either a custom indexer or searcher
Slace,
The media new, delete and update events; I was thinking of tapping into those. But looking at it logically, creating your own indexer or searcher seems the better route.
Regards
Ismail
You could do it a bit 'dodgy' and just listen to the indexed event of the PDF indexer, then add the results to your content indexer using ReIndexNode.
This means that you'll have PDF data in two indexes... but it would be very little code to write.
Thanks Shannon, I like the idea of a process that would put the indexed PDF text into the content indexer. The .NET part of this is a little advanced for me but I'm happy to give it a try.
I think this means I'm fetching the Umbraco Examine source, making a new UmbracoExamine.PDF.PDFIndexer that does what you say (adds the extracted text to the content indexer using ReIndexNode), then rebuilding, copying over the DLL and using the new indexer for my IndexProvider that works on the PDF files?
Sorry for the noob questions, hope to get this worked out, and happy to share the results if I do.
No, Examine raises events that you can add handlers to, like you would if you were adding one to the Document object in Umbraco.
Check out Shan's CG10 slides for the event list.
Also, RC3 is still an RC!
It is by no means final. There are bugs in the RC, and a lot has changed since in the latest version, which will become v1.0. Be mindful that there will be some breaking changes, though for most people it should be painless to upgrade.
Here are some release notes for v1.0:
http://examine.codeplex.com/releases/view/50781
Shannon or Slace,
Is there an event for when an item is removed from the index? I am looking at implementing Shannon's dirty hack of putting the PDF stuff into the content index, so I am tapping into the event
and at that point I will put the item into the content index. However, when I remove the PDF I also need to remove it from my content index, hence I need to hit that event. Worst case, I can tap into the Umbraco media delete event and do it from there.
Regards
Ismail
The IndexDeleted event is fired when a node's entry is removed from the index.
Slace,
The delete delegate has signature:
From e I cannot get the id of the node being deleted, so is it possible to get it from sender? If so, what can I cast sender to? Or am I missing a trick?
Regards
Ismail
Slace,
Ignore my last post, I have figured it out:
which is the nodeid of the item being deleted.
Regards
Ismail
I am using examine for site search + pdf searching; is it possible to set the PDFs so it only indexes (and searches on) the metadata for PDFs?
Nope, you'd have to write your own indexer to only insert the metadata from a PDF. I'm not really sure if iTextSharp (which we use) can extract metadata, I'd assume it does.
Hi folks,
Way late to this conversation, but this thread has gotten me so close to implementing my client's request I can almost taste success. The one thing I do not understand is the combining of either indexers or searchers.
Here's what I have now (Umbraco 4.8.0).
In ExamineSettings:
and in my cshtml file:
var searcher = ExamineManager.Instance.SearchProviderCollection["RazorSiteSearcher"];
Both searchers appear to be doing their jobs, but I need to combine them both into my results page.
Any assistance would be very much appreciated.
-Mike D
Hrm, it appears there is a problem with the PDF searcher. I really shoulda tested before I posted...
When I set the searcher collection to PDFSearcher, I get an error on the results page:
Error loading MacroEngine script (file: SearchResults.cshtml)
2 questions... first, how do I get more info in the error? That might help me figure out what is wrong, and 2... what could be wrong? lol
Mike - append '?umbDebugShowTrace=true' to the url, and find the angry red text...
Nathan,
Thanks for the reply... unfortunately still nothing... is there anything else I need to do to make that work?
Is the below key present in your web config? That should be enough to enable the trace, which will show you where the problem is
I haven't used the PDF indexer, so won't be any real help on that front!
Still no more detail... grrr...
This is like trying to chase down a Windows error... lol
Ok, I got my error messages... I'm an idiot... lol
I can now see what the problem is in my script... but figuring out a good way to fix it is going to require someone with much more knowledge of Examine than we possess. What I need to do is combine 2 indexes into 1. The problem with my script is that the PDF index does not have any of the fields I need to sort my results. If I could add the PDF index to the site content index, I could then sort and filter my results like the client wants. I could also exclude PDFs that have been "unpublished" via the content tree.
Gawds I hope that made sense... I've been looking at this code for too long and I need a beer or 12....
If there is anyone reading this thread that can help, please please contact me off list so I can try to explain what my client is looking for and how best to accomplish the task.
Thanks everyone...
-Mike D
Mike
The latest version does multi-index search. With regard to the sort field, inject the field in using the GatheringNodeData event.
Regards
Ismail
Thanks for the quick response, Ismail...
I am already searching 2 indexes; that's part of my issue. There are no fields in the PDF index to work with other than the node id. When I search both indexes, I get errors on the results page: my main output is sorted on NodeTypeAlias, and that field does not exist in the PDF index, so it blows up. If I can combine everything into 1 index, that index will include all the fields I am currently working with.
Please note that I am using built-in stuff here... no custom programming. It's all UmbracoExamine and Razor.
Also be advised... I am NOT a programmer. You have to use small words when explaining stuff to me... lol. I know enough about programming to grasp concepts, but "inject the field in using the GatheringNodeData event" is not something I understand. If you can explain, or give examples, maybe I can grasp the idea, then run with it and figure out how to do it. I really need several fields in the PDF index to do what my client wants. Without getting into specifics, it would be hard to explain, and I don't want to burden everyone with all that detail. I'm trying to give enough info to get me pointed in the right direction without writing a novel... lol
Mike,
Are you on Skype? I can talk you through it. What you are trying to achieve is doable having done something similar.
Examine has a rich eventing system. One of the events is GatheringNodeData; you can tap into that event and inject your own fields. So in your case, when PDF indexing happens, we can use the event to shove in a nodeTypeAlias field, and we can also inject anything else that is needed. My Skype is ismail_mayat; if you add me I can talk you through it on Monday.
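A rough sketch of the field-injection idea, kept free of Examine types so it can be run on its own. `PdfFieldInjector`, `AddSortFields` and the exact `nodeTypeAlias` field name are illustrative assumptions; in a real site something like this would be called from a GatheringNodeData handler on the PDF indexer, assuming the event args expose the document's fields as a string dictionary.

```csharp
using System;
using System.Collections.Generic;

public static class PdfFieldInjector
{
    // Adds the field that the results page sorts on to a document's field set.
    // A value that is already present is left untouched.
    public static void AddSortFields(IDictionary<string, string> fields, string nodeTypeAlias)
    {
        if (fields == null) throw new ArgumentNullException("fields");

        if (!fields.ContainsKey("nodeTypeAlias"))
        {
            fields.Add("nodeTypeAlias", nodeTypeAlias);
        }
    }
}
```

Once every PDF document carries the same sort field as the content documents, a results page that sorts on that field no longer fails when it meets a PDF hit.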
P.S. Can you download Luke? It's a useful tool for looking at what is in an Examine/Lucene index. Just google "Luke for Lucene"; it's a Java app, and the latest version is on the Google Code site.
Regards
Ismail
Ismail,
I am on Skype, I went to add you this morning and there are several Ismail Mayats... Are you the one in Preston, UK? That's the one I added... hope it's you. lol
I have Luke already downloaded, and have used it several times already in this learning process. Don't know how I would have gotten as far as I have without it. Anyone else cruising this thread should download it. Great tool to have.
Assuming I got the correct Skype account, send me a blip before you try to call. I really appreciate the fact you are willing to do this, you have no idea.
-Mike D
Mike
Apologies, I just realised it's a bank holiday today in the UK. I'll be online tomorrow; you got the right Skype user. I was surprised how many Skype users there already are with my name lol, thankfully I got my Twitter name bagged!
I love messing around with Examine, and your issue is very similar to what I did on fairbairnpb.co.uk
Regards
Ismail
Is there a place on the web with maybe a list of all the events and stuff available in Examine? I have some really smart programmers on staff that I can bug if I just have a reference. And if I cannot get it figured out today I most welcome your assistance tomorrow.
There are links and docs on examine.codeplex.com, plus some Umbraco TV videos and CodeGarden videos; see stream.umbraco.org
Many many thanks to Ismail. Not many people would go to the lengths he did to help a complete stranger.
You are a rock star sir!
In which version of Umbraco was the PDF indexer added to examine?
I have a 4.7.1 site I'd like to add PDF search to but don't know if I need to upgrade Umbraco first.
Thanks, Matt
Matt,
4.7.1 has the pdfindexer out of the box.
Thanks Ismail!
Is the Examine PDF Indexer supposed to index the actual content of the PDF files or just the filenames?
I've tried the CogUmbracoExamineMediaIndexer package which indexes the content but the Examine PDF Indexer seems to only be returning matches on the filename and not the file content.
Cheers, Matt
The Examine PDF indexer should index content as well, but not any metadata. Is the data present when you look with Luke? There should be a field called FileTextContent.
Sorry for the delay getting back to you Ismail,
This is mainly an exercise in increasing my understanding, so it took a back seat to some work I had to do for a couple of days.
I've looked in Luke at the index created by the CogUmbracoExamineMediaIndexer package which works great and I can see all the PDF content indexed:
The examine PDF index however has just indexed a bunch of numbers:
It's strange and I can assure you that both indexes are looking at the same PDF media files:
This is how the index is configured:
The indexer:
The searcher:
Regards,
Matt
@Ismail
Hi Ismail,
Could you please assist with examine.pdf configuration for search in the pdf content?
I am using Umbraco 4.9 and copied the latest version of the Umbraco Examine PDF DLLs from CodePlex into the bin folder, but I got stuck on the configuration needed to search PDF content.
Thanks,
David
I still haven't managed to get it working as expected either. :-(
David,
Are you having problems searching or indexing? Can you paste your ExamineIndex and ExamineSettings config files? Also, can you take a look at your PDF index using Luke or http://our.umbraco.org/projects/backoffice-extensions/examine-inspector to see whether there are any documents in the index?
Regards
Ismail
Hi Ismail,
I think I have a problem with the searching or maybe even only rendering the results in umbraco page.
I also installed luke (and ExamineIndexAdmin and Examine modules for developer side) and shows my documents indexed.
ExamineSettings:
ExamineIndex:
I want to render the results with a Razor script (render results).
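@using Examine
@using Examine.SearchCriteria
@using UmbracoExamine
@using UmbracoExamine.PDF
@{
    // Guard against a missing query string value before calling ToLower().
    var searchString = Request["searchString"] ?? string.Empty;
    // "PDFSearcher" must match the searcher name defined in ExamineSettings.
    var searchResults = string.IsNullOrWhiteSpace(searchString)
        ? new List<SearchResult>()
        : ExamineManager.Instance.SearchProviderCollection["PDFSearcher"]
            .Search(searchString.ToLower(), false).ToList();
}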
Please let me know how you render the search results in Umbraco.
I based mine on this post, adapting the part for PDFSearcher:
http://joeriks.com/2011/03/15/ajax-enabled-search-in-umbraco-using-examine-and-razor/#comment-1269
Thanks,
David
Yes I have pdf files, I added 2 pdf files in media.
Hi All
I seem to have the same problem as Matt, I looked into Luke only to see random numbers as values for File Text Content. Such as "1 1,1 1221/0 13 2 21 21/21 21/3 3 4 5 6 7" etc.
I searched for "1" and printed out the actual content and saw it was
!"# $% &'%'()*" +,-./ 0%"& %/ ,& %/"1221/0()*%"& %/ ,& %/"21/3%$& %/ ,& %/"#-4 -&21/&3()*%$& %/ ,& %/"'21/" +,-./ & %,& %/ ,& %/"'& ()*%,& %/ ,& %/"21/%5& %/ ,& %/"6&21/%5& %/ ,& %/"" +%,-./ 21/&%7&& %/ ,& %/"6%21/21/" +,-./ %2& %/ ,& %/"%21/&21/%74 ,& %/ ,& %/"21/21/%,& %/ ,& %/"21/" +%,-./ 21/&%21/,& %/ ,& %/"'8321/" +,-./ %,& %/ ,& %/"#9$%21/,& %/ ,& %/"'$& ! ! ! !$6::,& %/ ,& %/"$2;:,:"#$"#$"#$"#$%9 ;::::,:: <::::<'-"1<#%#%#%#%'-"1<#%#%#%#%'-1! %'-6%'-$0&"$&"$&"$&"$'()!'()!'()!'()!= &****9::# >?@@%:%:++++/&>+.+A.,+-,+-,+-,+-.).).).)++++;%B& ,& ,& ;%B& ,& ,& -7C,& ,& ,& -73 1,,-7,-%/-,& %/ ,& %/"1$1,1 !"!"D -&/6 !/ = &/ > --";/,-> /))-E??>&/):-<:E/";/,:-<:>+.F +A.??/# :-3:>;3?/% :-3:>+. +A.?>+.13+A.?0>*D+.;3+A.-<EGD+.";3+A.-<E/ D-<E?/&)(E+*(EHH>?
So, I assume it is ignoring all the symbols and indexing only the numbers. But the whole point of the PDFIndexer is to read the actual values, right, not the encoded version? :/ So I'd like to ask whether Matt resolved this issue. Or, if someone else has managed to index PDF content and has time to look into what I'm trying to do, that would be great.
Thanks
No I didn't manage to resolve it.
It's quite frustrating but I was just researching for future projects so the priority wasn't there and eventually had to move on.
Damjan,
Try using the http://our.umbraco.org/projects/website-utilities/cogumbracoexaminemediaindexer package and see if that indexes your PDF content.
Regards
Ismail
The CogUmbracoExamineMediaIndexer worked great for me.
It's just a shame the out of the box stuff doesn't.
Hi guys, I'm gonna be totally honest and say that I've read hardly any of the 8 pages of these issues, but I will say that the Examine PDF indexer 'DOES' index PDF content. It definitely does not just index PDF file names, otherwise it would be useless. The other thing to mention is that some PDFs are protected or created with some weird protection encoding, which is why you might see the strange chars. Examine's PDF indexer uses iTextSharp to read PDFs. The later the Examine version, the later the iTextSharp version, so upgrading might help. TBH I don't know anything about the Cogworks PDF indexer, so I'm not sure what it does beyond the normal Examine PDF indexer. There is documentation on the Examine site which mentions that it is not possible to index 'ALL' PDF data, and that is because the PDF 'standard' is not standard and is pretty f%#$d in general. We rely on iTextSharp. If it can't do it, then neither can we. However, please let me know if there are other issues with the PDF indexer. We have unit tests that pass, but if it is not working for 'any' of your PDFs then maybe it's a setting I've missed.
Hello,
Thank you Matt and Ismail, the CogUmbracoExamineMediaIndexer indexes the content properly for me too. I will now try to implement the actual search box etc...
I'd just complain a bit about the installation procedure for the package. I got the same error as here:
http://our.umbraco.org/projects/website-utilities/cogumbracoexaminemediaindexer/bugs-support/37947-Could-not-load-file-or-assembly-IKVMOpenJDKBeans
and managed to fix it after adding and removing the libraries for a while, but I'd recommend that you put a bit more in the Readme, since it's not enough to just install the package and add tika-app-1.2.dll; you also need a few more IKVM libraries in the bin folder.
Anyway, thank you again
Shannon,
The Cogworks media indexer is just a wrapper around Apache Tika, so it will index everything Tika can handle, and that includes PDF. It will also rip out metadata and shove that in the index. I'm not sure why some of these PDFs are failing with the PDF indexer. Ideally the people having problems should send you the failing PDFs so you can see if you can recreate the issue?
Regards
Ismail
For sure. Ideally just log an issue on the tracker at examine.codeplex.com with the faulting PDF(s) and I'll see if I can replicate it.
Hi,
This is the PDF I tried and got the funny symbols and numbers with the built-in PDFIndexer, but got indexed OK with the Cogworks package. It's just the Razor cheat sheet I got from here:
http://our.umbraco.org/projects/developer-tools/razor-dynamicnode-cheat-sheet
So maybe check whether it is protected. If it is, then Shannon's explanation for the PDFIndexer not working makes sense to me too, but if it's not, maybe there's a bug in the built-in PDFIndexer?
Thanks,
Damjan
Damjan,
It's probably the charset range in that PDF. The PDF indexer does have some code that tests the range of characters, so it's possible some stuff is getting through that, whereas Apache Tika is picking the text up properly.
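For anyone wanting to spot these PDFs before they pollute an index, here is a rough sanity check on extracted text. This is not the check the Examine PDF indexer performs; `ExtractedTextCheck`, `LetterRatio`, `LooksLikeProse` and the 0.6 threshold are all assumptions made up for illustration.

```csharp
using System;

public static class ExtractedTextCheck
{
    // Fraction of non-whitespace characters that are letters (0.0 for empty input).
    public static double LetterRatio(string text)
    {
        if (string.IsNullOrEmpty(text)) return 0.0;

        int visible = 0;
        int letters = 0;
        foreach (char c in text)
        {
            if (char.IsWhiteSpace(c)) continue;
            visible++;
            if (char.IsLetter(c)) letters++;
        }
        return visible == 0 ? 0.0 : (double)letters / visible;
    }

    // True when the text is mostly letters, which readable prose usually is.
    public static bool LooksLikeProse(string text, double threshold = 0.6)
    {
        return LetterRatio(text) >= threshold;
    }
}
```

Text that fails the check could be logged with the media node id so the offending PDF can be attached to an issue on the tracker.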
Regards
Ismail
Damjan, yes I also had problems missing IKVMOpenJDKBeans assemblies. There's another post somewhere where I list everything you need.
Funnily enough, I also used the PDF cheat sheet to test indexing, but having considered it could be in a strange format, I decided to create my own PDFs using OpenOffice to convert a doc I created. I figured that must be pretty standard, but alas, no joy.