
fino 14 posts 36 karma points

Sep 21, 2012 @ 10:25

Examine PDF : FileTextContent not present in SearchResult

Hello

I'm trying to implement Umbraco Examine PDF to search inside PDF files, following this example: http://examine.codeplex.com/wikipage?title=Full%20Configuration%20Markup%20%26%20Options&referringTitle=UmbracoExamine.
When I inspect the index with Luke, I can see the "FileTextContent" field, which contains the PDF's content.

When I search with the ExamineManager in my C# project, I get no results for the search term I used. When I search by node id, I do get a result, but strangely the "FileTextContent" field is not present in the SearchResult.

Is there anything special I need to do to get the FileTextContent field in my SearchResult?

Umbraco 4.8.1
Luke : 1.0.1 , 3.5
Examine (and PDF binaries) : 1.4.2

Fino


Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Sep 21, 2012 @ 15:38

Fino,

Can you paste your Examine config files, please? I'm just interested in the PDF indexer bits.

Regards

Ismail


fino 14 posts 36 karma points

Sep 24, 2012 @ 13:53

Hello Ismail

Thanks for your help. Here are my two Examine files:

ExamineSettings.config :

<Examine>
  <ExamineIndexProviders>
    <providers>
      <add name="InternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine" supportUnpublished="true" supportProtected="true" interval="10" analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" />
      <add name="InternalMemberIndexer" type="UmbracoExamine.UmbracoMemberIndexer, UmbracoExamine" supportUnpublished="true" supportProtected="true" interval="10" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
      <add name="ExternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine" supportUnpublished="false" supportProtected="false" interval="10" analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" />
      <add name="PDFIndexer" type="UmbracoExamine.PDF.PDFIndexer, UmbracoExamine.PDF" extensions=".pdf" umbracoFileProperty="umbracoFile"/>      
    </providers>
  </ExamineIndexProviders>
  <ExamineSearchProviders defaultProvider="ExternalSearcher">
    <providers>
      <add name="InternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" />
      <add name="ExternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" enableLeadingWildcards="true" />
      <add name="InternalMemberSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" enableLeadingWildcards="true" />
      <add name="PDFSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" />      
    </providers>
  </ExamineSearchProviders>
</Examine>

ExamineIndex.config :

<ExamineLuceneIndexSets>
  <!-- The internal index set used by Umbraco back-office - DO NOT REMOVE -->
  <IndexSet SetName="InternalIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/Internal/">
    <IndexAttributeFields>
      <add Name="id" />
      <add Name="nodeName" />
      <add Name="updateDate" />
      <add Name="writerName" />
      <add Name="path" />
      <add Name="nodeTypeAlias" />
      <add Name="parentID" />
    </IndexAttributeFields>
    <IndexUserFields />
    <IncludeNodeTypes />
    <ExcludeNodeTypes />
  </IndexSet>
  <!-- The internal index set used by Umbraco back-office for indexing members - DO NOT REMOVE -->
  <IndexSet SetName="InternalMemberIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/InternalMember/">
    <IndexAttributeFields>
      <add Name="id" />
      <add Name="nodeName" />
      <add Name="updateDate" />
      <add Name="writerName" />
      <add Name="loginName" />
      <add Name="email" />
      <add Name="nodeTypeAlias" />
    </IndexAttributeFields>
    <IndexUserFields />
    <IncludeNodeTypes />
    <ExcludeNodeTypes />
  </IndexSet>
  <!-- Default Indexset for external searches, this indexes all fields on all types of nodes-->
  <IndexSet SetName="ExternalIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/External/" />
  <IndexSet SetName="PDFIndexSet" IndexPath="~/App_Data/PDFIndexSet" />  
</ExamineLuceneIndexSets>

I finally found something:
If I use the Search method, I get 3 fields: FileTextContent, __IndexType, __NodeId.

var resultsSearch = ExamineManager.Instance.SearchProviderCollection["PDFSearcher"].Search("Celtic", true).ToList();

But if I use the searchCriteria, I get only Umbraco's node attributes (about 22 attributes):

var searchCriteria = ExamineManager.Instance.SearchProviderCollection["PDFSearcher"].CreateSearchCriteria(UmbracoExamine.IndexTypes.Media);
searchCriteria.RawQuery("+FileTextContent:Celtic~");
var resultsRawQuery = ExamineManager.Instance.Search(searchCriteria).ToList();

Is there a way to also get the FileTextContent field when I use the searchCriteria?

Fino


Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Sep 24, 2012 @ 15:36

fino,

When using searchCriteria, what do you get if you don't use a raw query? And what does ~ do in a Lucene query?

Regards

Ismail


fino 14 posts 36 karma points

Sep 25, 2012 @ 16:15

Hello Ismail

The ~ is used for fuzzy searches (https://lucene.apache.org/core/3_6_0/queryparsersyntax.html).
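
For reference, here is a small sketch of how the fuzzy operator looks in raw query strings. The field name and term are the ones used in this thread; the class name, constant names, and the 0.8 value are only illustrative. According to the query parser documentation linked above, the optional number after ~ is the required similarity (between 0 and 1), and 0.5 is used when it is omitted.

```csharp
// Illustrative constants only; the class and constant names are hypothetical.
public static class FuzzyQueryExamples
{
    // Matches the exact term only.
    public const string ExactTerm = "+FileTextContent:Celtic";

    // Fuzzy match with the default minimum similarity (0.5).
    public const string FuzzyDefault = "+FileTextContent:Celtic~";

    // Fuzzy match that only accepts closer matches (similarity of at least 0.8).
    public const string FuzzyStrict = "+FileTextContent:Celtic~0.8";
}
```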

I've found my error. The code I used to search was not correct.
Here is what I use now, and it works:

var provider = (LuceneSearcher)ExamineManager.Instance.SearchProviderCollection["PDFSearcher"];
var criteria = provider.CreateSearchCriteria().RawQuery("+FileTextContent:Celtic~");
var results = provider.Search(criteria);

Thanks for your help

fino
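
A note on why the two snippets behave differently, as far as can be told from the config posted above: ExamineManager.Instance.Search(...) appears to run against the default search provider, which is ExternalSearcher in that ExamineSettings.config, so the criteria never reached the PDF index. Calling Search on the PDFSearcher provider itself avoids that. The sketch below assumes the provider and field names from this thread; the class and method names are hypothetical, and the search term is not escaped, so it is only suitable for simple single-word terms.

```csharp
using System;
using Examine;
using Examine.LuceneEngine.Providers;

// Hypothetical helper for illustration; not part of Examine or Umbraco.
public static class PdfSearchSketch
{
    public static void PrintPdfMatches(string term)
    {
        // Resolve the PDF searcher explicitly rather than calling
        // ExamineManager.Instance.Search(...), which uses the default provider.
        var provider = (LuceneSearcher)ExamineManager.Instance.SearchProviderCollection["PDFSearcher"];

        // Fuzzy raw query against the field written by the PDF indexer.
        var criteria = provider.CreateSearchCriteria().RawQuery("+FileTextContent:" + term + "~");
        var results = provider.Search(criteria);

        foreach (SearchResult result in results)
        {
            string text;
            // Fields is a string dictionary, so guard against a missing key.
            if (result.Fields.TryGetValue("FileTextContent", out text))
            {
                Console.WriteLine("Node {0} (score {1}): {2} characters of PDF text",
                    result.Id, result.Score, text.Length);
            }
        }
    }
}
```

Calling PdfSearchSketch.PrintPdfMatches("Celtic") would then list the matching PDF media nodes, provided the PDF index has been built.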


d Thomas 13 posts 33 karma points

Mar 20, 2013 @ 12:51

Hi Ismail,

Could you please assist with the Examine PDF configuration for searching PDF content?

I am using Umbraco 4.9. I copied the latest version of Umbraco Examine PDF from CodePlex and placed the DLLs in the bin folder, but I got stuck on the remaining configuration for searching PDF content.

Thanks,

David

