  • karen 186 posts 461 karma points
    Feb 24, 2016 @ 17:20

    Examine PDF not indexing PDFs

    Using Umbraco v6.2.5.

    I just installed Umbraco Examine PDF using nuget. (I also tried uninstalling and reinstalling it, as some other posts suggested.)

    I am looking at the index using Luke, and it does not look like it is actually indexing any PDFs.

    For example, I see a NodeId of 1221, but NodeId=1221 is a folder containing PDFs. None of the PDFs in that folder are indexed (based on comparing each PDF's node ID with what Luke displays).

    Going down the list of NodeIds listed in Luke, it looks like they are all folders in the media section.

    Any thoughts on why that is happening? I have seen lots of unanswered questions regarding Examine and PDFs, but haven't seen any comments describing this (for example, perhaps the people saying they only see the NodeId and no other fields are having this same issue without realizing it).

    I am using the default values put in the config from the nuget installer:

    <add name="PDFIndexer" type="UmbracoExamine.PDF.PDFIndexer, UmbracoExamine.PDF" extensions=".pdf" umbracoFileProperty="umbracoFile"/>
    <add name="PDFSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"/>
    <IndexSet SetName="PDFIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/PDFs"/>
  • karen 186 posts 461 karma points
    Feb 26, 2016 @ 19:26

    Well, an interesting follow-up. I deleted the nuget package and added the dlls manually from this page:

    Now I have a different set of indexes, including some with the 'FileTextContent' field. However, it is still not indexing ALL the PDFs. I uploaded a new PDF and the index did not change; I tried re-indexing via the developer tab, but there were still no changes (same number of items). Some of the indexed items are still folders, not actual PDF files.

    Going to grab the source and see what I can find out from there.

  • colin gray 16 posts 56 karma points
    May 23, 2016 @ 16:22

    I had got my development site to index PDFs, but after a CMS refresh from nuget (for another issue), PDFs would no longer index (dev > examine > indexing). So I "added the dlls manually from this page:" as you did, and it would re-index again!

    Your issue: if it's indexing some but not all, I suspect it's the PDF structure of the failing files. Perhaps the PDF text is broken up by formatting within words? Try publishing an unformatted version of the same PDF. If you need the fancy layout, try repeating all the text as plain text, e.g. white on white, 4pt, on the last 3 pages.
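
One rough way to test that suspicion outside Umbraco is to compare a PDF that indexes with one that does not. The sketch below is a heuristic only, written in Python with the standard library, and it is not how Examine PDF extracts text. It assumes the page content streams are either uncompressed or FlateDecode (zlib) compressed and simply counts the PDF text-showing operators (`Tj`, `TJ`). The function name `count_text_operators` is made up for illustration.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of each PDF stream object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)

# Text-showing operators: "(...) Tj" or "<...> Tj" for one string, "[...] TJ" for an array.
TEXT_OP_RE = re.compile(rb"[)>]\s*Tj|\]\s*TJ")


def count_text_operators(pdf_bytes):
    """Return how many Tj/TJ operators appear in the file's streams (heuristic)."""
    total = 0
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # Assumption: compressed content streams use the FlateDecode (zlib) filter.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not zlib data: scan the bytes as-is (covers uncompressed streams).
            data = raw
        total += len(TEXT_OP_RE.findall(data))
    return total


# Usage (hypothetical file name):
#     with open("failing.pdf", "rb") as handle:
#         print(count_text_operators(handle.read()))
```

A count of zero would suggest the file has no text layer this check can see (for example a scanned, image-only PDF), while a very high count for a short document may point to text that is split into many small fragments. Encrypted files and other stream filters are not handled, and binary streams can occasionally produce a false match, so treat the number as a hint rather than a verdict.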
