  • fino 14 posts 36 karma points
    Sep 21, 2012 @ 10:25
    fino
    0

    Examine PDF : FileTextContent not present in SearchResult

    Hello

    I'm trying to implement Umbraco Examine PDF to search inside PDF files by following this example: http://examine.codeplex.com/wikipage?title=Full%20Configuration%20Markup%20%26%20Options&referringTitle=UmbracoExamine.
    When I inspect the index with Luke, I can see the "FileTextContent" field, which contains the PDF's content.

    When searching with the ExamineManager in my C# project, I can't get any results for the search term I used. When searching by node id, I do get a result, but strangely the "FileTextContent" field is not present in the SearchResult.

    Is there anything special I need to do to have the FileTextContent field present in my SearchResult?

    Umbraco 4.8.1
    Luke : 1.0.1 , 3.5
    Examine (and PDF binaries) : 1.4.2

    Fino

  • Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib
    Sep 21, 2012 @ 15:38
    Ismail Mayat
    0

    Fino,

    Can you paste your Examine config files, please? I'm just interested in the PDF indexer bits.

    Regards

    Ismail

  • fino 14 posts 36 karma points
    Sep 24, 2012 @ 13:53
    fino
    0

    Hello Ismail

    Thanks for your help. Here are my two Examine files:

    ExamineSettings.config:

    <Examine>
      <ExamineIndexProviders>
        <providers>
          <add name="InternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine" supportUnpublished="true" supportProtected="true" interval="10" analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" />
          <add name="InternalMemberIndexer" type="UmbracoExamine.UmbracoMemberIndexer, UmbracoExamine" supportUnpublished="true" supportProtected="true" interval="10" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
          <add name="ExternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine" supportUnpublished="false" supportProtected="false" interval="10" analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" />
          <add name="PDFIndexer" type="UmbracoExamine.PDF.PDFIndexer, UmbracoExamine.PDF" extensions=".pdf" umbracoFileProperty="umbracoFile" />
        </providers>
      </ExamineIndexProviders>
      <ExamineSearchProviders defaultProvider="ExternalSearcher">
        <providers>
          <add name="InternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" />
          <add name="ExternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" enableLeadingWildcards="true" />
          <add name="InternalMemberSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" enableLeadingWildcards="true" />
          <add name="PDFSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" />
        </providers>
      </ExamineSearchProviders>
    </Examine>

    ExamineIndex.config:

    <ExamineLuceneIndexSets>
      <!-- The internal index set used by Umbraco back-office - DO NOT REMOVE -->
      <IndexSet SetName="InternalIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/Internal/">
        <IndexAttributeFields>
          <add Name="id" />
          <add Name="nodeName" />
          <add Name="updateDate" />
          <add Name="writerName" />
          <add Name="path" />
          <add Name="nodeTypeAlias" />
          <add Name="parentID" />
        </IndexAttributeFields>
        <IndexUserFields />
        <IncludeNodeTypes />
        <ExcludeNodeTypes />
      </IndexSet>
      <!-- The internal index set used by Umbraco back-office for indexing members - DO NOT REMOVE -->
      <IndexSet SetName="InternalMemberIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/InternalMember/">
        <IndexAttributeFields>
          <add Name="id" />
          <add Name="nodeName" />
          <add Name="updateDate" />
          <add Name="writerName" />
          <add Name="loginName" />
          <add Name="email" />
          <add Name="nodeTypeAlias" />
        </IndexAttributeFields>
        <IndexUserFields />
        <IncludeNodeTypes />
        <ExcludeNodeTypes />
      </IndexSet>
      <!-- Default Indexset for external searches, this indexes all fields on all types of nodes-->
      <IndexSet SetName="ExternalIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/External/" />
      <IndexSet SetName="PDFIndexSet" IndexPath="~/App_Data/PDFIndexSet" /> 
    </ExamineLuceneIndexSets>


    I finally found something:
    If I use the Search method, I get 3 fields: FileTextContent, __IndexType, __NodeId.

    var resultsSearch = ExamineManager.Instance.SearchProviderCollection["PDFSearcher"].Search("Celtic", true).ToList();

    But if I use the searchCriteria, I get only Umbraco's node attributes (about 22 attributes):

    var searchCriteria = ExamineManager.Instance.SearchProviderCollection["PDFSearcher"].CreateSearchCriteria(UmbracoExamine.IndexTypes.Media);
    searchCriteria.RawQuery("+FileTextContent:Celtic~");
    var resultsRawQuery = ExamineManager.Instance.Search(searchCriteria).ToList();


    Is there a trick to also get the FileTextContent attribute when I use the searchCriteria?

    Fino

  • Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib
    Sep 24, 2012 @ 15:36
    Ismail Mayat
    0

    fino,

    When using searchCriteria, what do you get if you don't use a raw query? What does ~ do in a Lucene query?

    Regards

    Ismail

  • fino 14 posts 36 karma points
    Sep 25, 2012 @ 16:15
    fino
    0

    Hello Ismail

    The ~ is used for fuzzy searches (https://lucene.apache.org/core/3_6_0/queryparsersyntax.html).

    I've found my error. The code I used to search was not correct.
    Here is what I use now, and it works:

    var provider = (LuceneSearcher)ExamineManager.Instance.SearchProviderCollection["PDFSearcher"];
    var criteria = provider.CreateSearchCriteria().RawQuery("+FileTextContent:Celtic~");
    var results = provider.Search(criteria);
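
    A likely explanation, given the configuration posted above: the earlier attempt created its criteria from PDFSearcher but executed them through ExamineManager.Instance.Search(...), which goes to the default search provider. ExamineSettings.config sets defaultProvider="ExternalSearcher", so that query probably ran against the external content index and returned ordinary Umbraco node fields. The sketch below keeps both steps on the same provider and lists the field names of each hit. The class and method names are hypothetical, and it assumes the Examine 1.4.x API (SearchResult.Id and SearchResult.Fields) used elsewhere in this thread.

    ```csharp
    using System;
    using Examine;
    using Examine.LuceneEngine.Providers;

    // Hypothetical helper for illustration only; not part of Examine or Umbraco.
    public static class PdfSearchSketch
    {
        public static void PrintFieldNames(string term)
        {
            // Resolve the PDF searcher by name so the query targets PDFIndexSet.
            var provider = (LuceneSearcher)ExamineManager.Instance.SearchProviderCollection["PDFSearcher"];

            // Same raw fuzzy query as above. The term is not escaped here,
            // so this assumes it contains no Lucene special characters.
            var criteria = provider.CreateSearchCriteria().RawQuery("+FileTextContent:" + term + "~");

            // Run the search on the same provider that built the criteria,
            // not on ExamineManager.Instance (which uses the default provider).
            foreach (var result in provider.Search(criteria))
            {
                Console.WriteLine(result.Id + ": " + string.Join(", ", result.Fields.Keys));
            }
        }
    }
    ```

    Calling PdfSearchSketch.PrintFieldNames("Celtic") should list FileTextContent among the fields if the PDF index is the one being queried.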

    Thanks for your help

    fino

  • d Thomas 13 posts 33 karma points
    Mar 20, 2013 @ 12:51
    d Thomas
    0

    Hi Ismail, 

    Could you please assist with the Examine PDF configuration for searching PDF content?

    I am using Umbraco 4.9. I copied the latest version of Umbraco Examine PDF from CodePlex and placed the DLLs in the bin folder, but I got stuck on the remaining configuration for searching PDF content.

    Thanks,

    David
