How to index PDF text based on the node that references the media item
Consider the following.
I have an "article" doctype named "My example page" that picks a media item "lorem-ipsum.pdf".
When the user searches for the text "lorem ipsum", I would like to display the "article" "My example page". I could use "pdf-index" to search the PDF. However, that index has no knowledge of the article page.
Interested in thoughts on how best to support this?
One option I'm considering is:
Create a new index "MySearchIndex"
On the article "save event", I could query "pdf-index" to get the text from the referenced PDFs and store it in a new field, "pdf-text".
Is the above "heavy-handed", and would there be performance issues? I could disable this event on "publish all".
Are there any other options here?
Hi Anthony
It depends a little on the content of the PDFs and the context of the search, and I appreciate you might not have control over this.
Generally speaking, it depends on how important it is to 'match' every single word in a PDF in a search on the site.
If, for example, the PDFs are supplementary to the article, or show technical data etc., then it might actually be only a small subset of 'words' that an external user would search for and expect to match the article / link to the PDF document.
So if it's a report on housing population densities in the North West of England 2021... having 'every word' of that report indexed might be counterproductive... if somebody is searching 'Housing Manchester', matching hundreds of reports might not be helpful, but if somebody searches 'population density', suddenly you want the report to be matched.
So often, if the PDF is this kind of data, I'll add a data summary field to the article document type that holds a summary of the data, including keywords. The field isn't displayed to the outside world but is 'matched' when searching. It gives editors 'control' over how the PDF is surfaced when people are searching... not every word of the PDF is indexed... but the important ones, the ones people might be searching for, are...
But yes, if the nature of the PDFs is such that every word must be searchable, then add a new field to Examine's ExternalIndex for the article, populate it during the TransformingIndexValues event with the text contents of the PDF, and then include this custom field in any Examine searches...
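To make the summary-field idea concrete, here is a small toy model in Python. It is not Umbraco or Examine code: the `Article` class, the `search_keywords` field name and the `matches` helper are all made up for illustration, and real Examine matching (analyzers, scoring) behaves differently. It only shows how an editor-curated, non-displayed field decides which searches surface the article.

```python
import re
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    body: str
    # Hypothetical hidden field: filled in by editors, never rendered on the page.
    search_keywords: str = ""


def tokens(text: str) -> set:
    """Lower-cased word tokens of a piece of text."""
    return set(re.findall(r"\w+", text.lower()))


def matches(article: Article, query: str) -> bool:
    """True when every query term appears in one of the searchable fields."""
    searchable = (
        tokens(article.title)
        | tokens(article.body)
        | tokens(article.search_keywords)
    )
    return tokens(query) <= searchable


report = Article(
    title="Housing report 2021",
    body="Download the full report as a PDF.",
    search_keywords="population density north west england",
)

print(matches(report, "population density"))   # matched via the hidden keywords
print(matches(report, "housing manchester"))   # "manchester" was never added
```

The editor decides that 'population density' should find this article, while a place name buried somewhere inside the PDF does not.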
https://our.umbraco.com/documentation/reference/searching/examine/examine-events
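As a rough illustration of that flow, here is a language-neutral toy model in Python rather than real Umbraco code. Everything in it is an assumption made for the sketch: `PDF_TEXT` stands in for the PDF index, `transform_index_values` stands in for a TransformingIndexValues handler, and the `pdfText` field name is invented. In a real site the handler would be C# and the exact Examine calls depend on the Umbraco version.

```python
from typing import Dict, List

# Stand-in for the PDF index: media item name -> extracted text (made-up data).
PDF_TEXT: Dict[str, str] = {"lorem-ipsum.pdf": "Lorem ipsum dolor sit amet"}


def transform_index_values(values: Dict[str, str], picked_pdfs: List[str]) -> Dict[str, str]:
    """Return a copy of the article's index values with a 'pdfText' field added.

    Plays the role of the index-time hook: it runs once when the article is
    indexed, so searches do not need to touch the PDF index at query time.
    """
    enriched = dict(values)
    enriched["pdfText"] = " ".join(PDF_TEXT.get(name, "") for name in picked_pdfs)
    return enriched


def search(index: List[Dict[str, str]], query: str, fields: List[str]) -> List[str]:
    """Return the node names of documents where any listed field contains the query."""
    q = query.lower()
    return [
        doc["nodeName"]
        for doc in index
        if any(q in doc.get(field, "").lower() for field in fields)
    ]


article = {"nodeName": "My example page", "bodyText": "An article that links to a PDF."}
index = [transform_index_values(article, ["lorem-ipsum.pdf"])]

# The custom field has to be included in the search, otherwise nothing changes.
print(search(index, "lorem ipsum", ["nodeName", "bodyText"]))
print(search(index, "lorem ipsum", ["nodeName", "bodyText", "pdfText"]))
```

Because the PDF text is copied onto the article's own index entry, a hit comes back as the article itself and no lookup from media item to referencing node is needed at search time.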
regards
Marc
Thanks Marc!
Appreciate the example you gave. Very helpful.
Thanks Anthony