  • anthony hall 222 posts 536 karma points
    Apr 02, 2021 @ 14:33

    How to index PDF text against the node that references the media item

    Consider the following.

    I have a page named "My example page", of doctype "article", that picks the media item "lorem-ipsum.pdf".

    When the user searches for the text "lorem ipsum", I would like to display the article "My example page". I could use "pdf-index" to search the PDF, but that index has no knowledge of the article page.

    I'm interested in thoughts on how best to support this.

    One option I'm considering is:

    1. Create a new index "MySearchIndex"
    2. In the article's save event, query "pdf-index" for the text of the referenced PDFs and write it to a new field, "pdf-text".

    Is the above heavy-handed, and would it cause performance issues? I could skip this event during a "publish all".

    Are there any other options here?

  • Marc Goodson 2141 posts 14344 karma points MVP 8x c-trib
    Apr 06, 2021 @ 13:00

    Hi Anthony

    It depends a little on the content of the PDFs and the context of the search, and I appreciate you might not have control over this.

    Generally speaking, it depends on how important it is for a site search to match every single word in a PDF.

    If, for example, the PDFs supplement the article or contain technical data, then an external user is probably searching for only a small subset of words that they would expect to match the article or the link to the PDF document.

    So if it's a report on housing population densities in the North West of England in 2021, having every word of that report indexed might be counterproductive. If somebody searches for 'Housing Manchester', matching hundreds of reports might not be helpful, but if somebody searches for 'population density', you do want the report to be matched.

    So often, if the PDF contains this kind of data, I'll add a data summary field to the article document type. It holds a summary of the data, including keywords, which isn't displayed to the outside world but is matched when searching. It gives editors control over how the PDF is surfaced in search: not every word of the PDF is indexed, only the important ones that people are likely to search for.

    But yes, if the nature of the PDFs is such that every word must be searchable, then add a new field for the article to Examine's ExternalIndex, populate it with the text content of the PDF during the TransformingIndexValues event, and include this custom field in any Examine searches.

    https://our.umbraco.com/documentation/reference/searching/examine/examine-events
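
    The enrichment step described above can be modelled in a few lines. This is a minimal, language-neutral sketch in Python, not Umbraco or Examine code: the field names `pdfPicker` and `pdfText`, the media id `1234`, and both helper functions are hypothetical stand-ins. In a real site the same logic would sit in a C# handler for the TransformingIndexValues event, with the PDF text coming from the PDF index.

```python
from typing import Dict, List

# Hypothetical field names; a real site would use its own property aliases.
PDF_PICKER_FIELD = "pdfPicker"
PDF_TEXT_FIELD = "pdfText"


def enrich_article(
    values: Dict[str, List[str]],
    pdf_text_by_media_id: Dict[str, str],
) -> Dict[str, List[str]]:
    """Return a copy of the article's index values with the picked PDFs' text added."""
    enriched = {field: list(vals) for field, vals in values.items()}
    texts = [
        pdf_text_by_media_id[media_id]
        for media_id in values.get(PDF_PICKER_FIELD, [])
        if media_id in pdf_text_by_media_id
    ]
    if texts:
        enriched[PDF_TEXT_FIELD] = [" ".join(texts)]
    return enriched


def search(index: List[Dict[str, List[str]]], term: str, fields: List[str]) -> List[str]:
    """Naive case-insensitive substring search; returns the names of matching nodes."""
    term = term.lower()
    hits = []
    for doc in index:
        if any(term in value.lower() for f in fields for value in doc.get(f, [])):
            hits.append(doc["nodeName"][0])
    return hits


# Stand-in for the text already extracted into the PDF index, keyed by media id.
pdf_text = {"1234": "Lorem ipsum dolor sit amet"}
article = {
    "nodeName": ["My example page"],
    "bodyText": ["An article"],
    PDF_PICKER_FIELD: ["1234"],
}
index = [enrich_article(article, pdf_text)]
```

    Searching `["bodyText", "pdfText"]` for "lorem ipsum" then returns the article itself, while searching `["bodyText"]` alone finds nothing, which is why the custom field has to be included in the search.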

    regards

    Marc

  • anthony hall 222 posts 536 karma points
    Apr 06, 2021 @ 13:39

    Thanks Marc!

    Appreciate the example you gave. Very helpful.

    Thanks Anthony
