How to index PDF text based on the node that references the media item

anthony hall 222 posts 536 karma points

Apr 02, 2021 @ 14:33

Consider the following.

I have an "article" doctype named "My example page" that picks a media item "lorem-ipsum.pdf".

When the user searches for the text "lorem ipsum", I would like to display the "article" "My example page". I could use "pdf-index" to search the PDF. However, that index would have no knowledge of the article page.

I'm interested in thoughts on how best to support this.

One option I'm considering is:
1. Create a new index, "MySearchIndex".
2. On the article "save event", query "pdf-index" to get the text from the referenced PDFs and apply it to a new field, "pdf-text".
Is the above "heavy-handed", and would there be performance issues? I could disable this event on "publish all".

Are there any other options here?
Marc Goodson 2157 posts 14434 karma points MVP 9x c-trib

Apr 06, 2021 @ 13:00

Hi Anthony

It depends a little on the content of the PDFs and the context of the search, and I appreciate you might not have control over this.

Generally speaking it depends on how important it is to 'match' every single word in a PDF, in a search on the site.

If, for example, the PDFs are supplementary to the article, or show technical data etc., then it might actually be only a small subset of 'words' that an external user would search for and expect to match the article or the link to the PDF document.

So if it's a report on housing population densities in the North West of England 2021, having 'every word' of that report indexed might be counterproductive. If somebody is searching 'Housing Manchester', matching hundreds of reports might not be helpful, but if somebody searches 'population density', suddenly you want the report to be matched.

So often, if the PDF contains this kind of data, I'll add a data summary field to the article document type. It holds a summary of the data, including keywords, which isn't displayed to the outside world but is 'matched' when searching. It gives editors 'control' over how the PDF is surfaced when people are searching: not every word of the PDF is indexed, but the important ones that people might be searching for are.
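
To make the summary-field idea concrete, here is a minimal, language-neutral sketch in Python. It is not Umbraco or Examine code: the `Article` class, the `search_keywords` field name, and the naive substring matching are all assumptions made for illustration. It only shows how a hidden, editor-curated field changes which queries match the article.

```python
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    body: str
    # Hypothetical hidden field: an editor-written summary of the linked PDF.
    # It is searched but never rendered on the page.
    search_keywords: str = ""


def searchable_text(article: Article) -> str:
    """Combine visible fields and the hidden keyword field for matching."""
    return " ".join([article.title, article.body, article.search_keywords]).lower()


def search(articles: list, query: str) -> list:
    """Return articles whose searchable text contains every query term
    (naive substring matching, for illustration only)."""
    terms = query.lower().split()
    return [a for a in articles if all(t in searchable_text(a) for t in terms)]


report = Article(
    title="North West housing report 2021",
    body="Download the full report as a PDF.",
    search_keywords="population density housing statistics",
)
news = Article(
    title="Manchester housing news",
    body="New homes announced in Manchester.",
)
articles = [report, news]

# The report is found through its curated keywords...
print([a.title for a in search(articles, "population density")])
# ...but does not clutter a search it was never meant to match.
print([a.title for a in search(articles, "Housing Manchester")])
```

The trade-off in this sketch is that search quality depends on editors keeping the keyword field meaningful, which is exactly the 'control' described above.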

But yes, if the nature of the PDFs is such that every word must be searchable, then add a new field to Examine's ExternalIndex for the article, populate it with the text contents of the PDF during the TransformingIndexValues event, and then include this custom field in any Examine searches.

https://our.umbraco.com/documentation/reference/searching/examine/examine-events
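
The shape of that index-time approach can be modelled with a small, language-neutral sketch in Python. None of this is the Examine API: `ToyIndex`, `add_pdf_text`, the `PDF_TEXT` lookup, and the `pdfIds` / `pdfText` field names are hypothetical stand-ins, and the PDF text is assumed to have been extracted already. The real implementation would hook TransformingIndexValues as described in the linked documentation.

```python
# Hypothetical stand-in for text already extracted from PDF media items,
# keyed by media id.
PDF_TEXT = {101: "Lorem ipsum dolor sit amet"}


def add_pdf_text(values: dict) -> dict:
    """Transform hook: copy the text of every referenced PDF into a
    custom 'pdfText' field on the article's index entry."""
    if values.get("docType") != "article":
        return values
    texts = [PDF_TEXT.get(media_id, "") for media_id in values.get("pdfIds", [])]
    return {**values, "pdfText": " ".join(t for t in texts if t)}


class ToyIndex:
    """Toy index that runs a transform hook before storing each entry."""

    def __init__(self, transform):
        self.transform = transform
        self.docs = {}

    def index(self, node_id: int, values: dict) -> None:
        self.docs[node_id] = self.transform(values)

    def search(self, query: str, fields: list) -> list:
        """Return ids of nodes where any listed field contains the query."""
        q = query.lower()
        return [
            node_id
            for node_id, doc in self.docs.items()
            if any(q in str(doc.get(f, "")).lower() for f in fields)
        ]


index = ToyIndex(add_pdf_text)
index.index(1, {"docType": "article", "nodeName": "My example page", "pdfIds": [101]})
index.index(2, {"docType": "article", "nodeName": "Another page", "pdfIds": []})

# The article is only found when the custom field is included in the search.
print(index.search("lorem ipsum", ["nodeName", "pdfText"]))
print(index.search("lorem ipsum", ["nodeName"]))
```

In this model the PDF text is copied onto the article at index time, so if the PDF is later replaced, the article has to be re-indexed to pick up the new text.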

regards

Marc

anthony hall 222 posts 536 karma points

Apr 06, 2021 @ 13:39

Thanks Marc!

Appreciate the example you gave. Very helpful.

Thanks Anthony
