I'm using UmbracoExamine.PDF to index PDFs, but I'm now aware that it indexes PDFs in the Media section whether they've been published on a content page or not.
Is there a way I can index only the PDFs that have been published on a content node?
So you could tap into the gathering node event, then somehow figure out if the PDF is being used on a content node that is published. If the content node is not published, then cancel the indexing event.
Thanks Ismail. Was hoping there would be a simple config setting I could use. I'll check out that Nexu package. Do you know if it has an API I can code to?
It does. It's actually amazeballs. You install the package, then you rebuild the usage via the dashboard. For most of your common data types, like media picker and RTE, it will find usages of content and media and fill the relations table. The API uses the Relations API to show you usage. If you have custom data types then you will need to build resolvers for them, but it's all documented.
If you can't use my API directly, you can always code against the normal Umbraco Relations API. So, like Ismail said, you can probably use it for your case.
Let us know how you get on, because this is a very interesting use case, and if it all works I am going to include a note on it in the Examine course I wrote, under the PDF indexing exercise.
So, I had a stab at it and it seems to work! Not much code and I'm sure it can be improved, but here it is...
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using Examine;
    using Umbraco.Core;
    using Umbraco.Core.Services;
    using Umbraco.Web;

    public class PdfExamineEvents : ApplicationEventHandler
    {
        private IRelationService _relationService;

        protected override void ApplicationStarted(UmbracoApplicationBase umbracoApplication, ApplicationContext applicationContext)
        {
            // Assign the service before subscribing so the handler never sees a null field
            _relationService = applicationContext.Services.RelationService;

            var helper = new UmbracoHelper(UmbracoContext.Current);
            ExamineManager.Instance.IndexProviderCollection["PDFIndexer"].NodeIndexing += (sender, e) => NodeIndexing(sender, e, helper);
        }

        private void NodeIndexing(object sender, IndexingNodeEventArgs args, UmbracoHelper helper)
        {
            // Only ever set Cancel to true, so a cancellation made by another handler is preserved
            if (!ShouldIndex(args.NodeId, helper)) args.Cancel = true;
        }

        private bool ShouldIndex(int nodeId, UmbracoHelper helper)
        {
            // Check if the media item is a PDF
            if (!IsPdf(nodeId, helper)) return false;

            // Check if the PDF appears in the relations at all
            if (!_relationService.IsRelated(nodeId)) return false;

            // If any document with this PDF is published, add to index. If not, cancel
            var relations = _relationService.GetByChildId(nodeId);
            return AnyPublished(relations.Select(r => r.ParentId), helper);
        }

        private static bool IsPdf(int nodeId, UmbracoHelper helper)
        {
            var mediaItem = helper.TypedMedia(nodeId);

            // The media item may not be found, and not every media type has umbracoExtension
            return mediaItem != null
                && mediaItem.HasProperty("umbracoExtension")
                && string.Equals(mediaItem.GetPropertyValue<string>("umbracoExtension"), "pdf", StringComparison.OrdinalIgnoreCase);
        }

        private static bool AnyPublished(IEnumerable<int> nodeIds, UmbracoHelper helper)
        {
            // TypedContent reads the published cache, so a non-null result means the page is published
            return nodeIds.Any(id => helper.TypedContent(id) != null);
        }
    }
One more thing: you will need to handle content unpublish and delete events. So if page A contains PDF B and page A is then unpublished, you will need to tap into that event and remove PDF B from the index.
For the sake of completion, here is my code to remove a PDF from an index if the page it is on becomes unpublished...
    // Register in ApplicationStarted: ContentService.UnPublished += ContentService_UnPublished;
    // _helper is assumed to be an UmbracoHelper field on the same class.
    private void ContentService_UnPublished(IPublishingStrategy sender, PublishEventArgs<IContent> e)
    {
        var pdfIndexer = ExamineManager.Instance.IndexProviderCollection["PDFIndexer"];

        foreach (var item in e.PublishedEntities)
        {
            // Get all relations where the current node is the parent
            var relations = _relationService.GetByParentId(item.Id).ToList();

            foreach (var relation in relations)
            {
                var mediaNode = _helper.TypedMedia(relation.ChildId);
                if (mediaNode == null) continue;

                // Avoids a NullReferenceException when the media item has no umbracoExtension value
                var extension = mediaNode.GetPropertyValue<string>("umbracoExtension");
                if (!string.Equals(extension, "pdf", StringComparison.OrdinalIgnoreCase)) continue;

                // Get all relations for this PDF not including the current relation
                var otherRelations = _relationService.GetByChildId(mediaNode.Id).Where(rel => rel.ParentId != relation.ParentId);

                // If this PDF is used on other published pages, do nothing and leave PDF in index
                if (otherRelations.Any(i => _helper.TypedContent(i.ParentId) != null)) continue;

                pdfIndexer.DeleteFromIndex(mediaNode.Id.ToString());
            }
        }
    }
Any suggestions for improvements to this code are most welcome.
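One possible improvement, offered as a sketch rather than a tested change: pull the "is this PDF still used on a published page?" decision out of the event handler so it can be unit tested without Umbraco, and so that several pages unpublished in one batch are treated together. `PdfUnpublishRules` and `MediaToRemove` are hypothetical names, and the relations are passed in as plain (page ID, media ID) pairs instead of `IRelation` objects.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PdfUnpublishRules
{
    // Hypothetical helper. relations: Key = content page ID, Value = media ID,
    // e.g. built from IRelation.ParentId / IRelation.ChildId.
    // Returns the media IDs that are no longer used on any published page.
    public static List<int> MediaToRemove(
        ICollection<int> unpublishedPageIds,
        IEnumerable<KeyValuePair<int, int>> relations,
        Func<int, bool> isPublished)
    {
        var all = relations.ToList();
        return all
            // Media referenced by the pages that were just unpublished
            .Where(r => unpublishedPageIds.Contains(r.Key))
            .Select(r => r.Value)
            .Distinct()
            // Keep only media with no other published page pointing at it
            .Where(mediaId => !all.Any(r => r.Value == mediaId
                                         && !unpublishedPageIds.Contains(r.Key)
                                         && isPublished(r.Key)))
            .ToList();
    }
}
```

In the `UnPublished` handler, `unpublishedPageIds` would come from `e.PublishedEntities`, the pairs from `GetByParentId` and `GetByChildId`, and `isPublished` could be `id => _helper.TypedContent(id) != null`. The PDF extension check still has to be applied to each returned ID before calling `DeleteFromIndex`.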
UmbracoExamine.PDF indexing published Media
Hi guys
I'm using UmbracoExamine.PDF to index PDFs, but I'm now aware that it indexes PDFs in the Media section whether they've been published on a content page or not.
Is there a way I can index only the PDFs that have been published on a content node?
Mitch,
So you could tap into the gathering node event, then somehow figure out if the PDF is being used on a content node that is published. If the content node is not published, then cancel the indexing event.
You may be able to determine media usage using the Nexu package: https://our.umbraco.org/projects/backoffice-extensions/nexu/
Regards
Ismail
Thanks Ismail. Was hoping there would be a simple config setting I could use. I'll check out that Nexu package. Do you know if it has an API I can code to?
Thanks
Mitch,
It does. It's actually amazeballs. You install the package, then you rebuild the usage via the dashboard. For most of your common data types, like media picker and RTE, it will find usages of content and media and fill the relations table. The API uses the Relations API to show you usage. If you have custom data types then you will need to build resolvers for them, but it's all documented.
So in theory, after installing Nexu and rebuilding, you can use the gathering node event on the PDF index and do a lookup via the ID of the media item you are currently indexing. See https://github.com/dawoe/umbraco-nexu/blob/develop/Source/Our.Umbraco.Nexu.Core.Tests/NexuApiControllerTests.cs for an API of sorts, but worst case you can use the Relations API, as the package will create relations in the DB.
I have not done this before but reckon with a bit of work you can solve your issue.
Regards
Ismail
Hi Mitch,
If you can't use my API directly, you can always code against the normal Umbraco Relations API. So, like Ismail said, you can probably use it for your case.
Dave
Thank you so much guys. Very useful advice. I'll give it a go right now.
Mitch,
Let us know how you get on, because this is a very interesting use case, and if it all works I am going to include a note on it in the Examine course I wrote, under the PDF indexing exercise.
Regards
Ismail
I also have plans to extend the API with methods that can be used in your project for just these kinds of use cases.
But I need to fix some other things first... and of course find time :-)
Dave
Hi Mitch,
Nice to see that it works.
Dave
Mitch,
One improvement you could make is to test for the PDF extension using the args provided; then you don't need to instantiate a new media item. So:
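The snippet that followed is not preserved in this copy of the thread. A minimal sketch of the idea, assuming the indexing event args expose the gathered values as a string dictionary (in Examine this would be `args.Fields`) and that `umbracoExtension` is among the fields the PDF index stores; `PdfFieldCheck` is a hypothetical helper name, not part of Examine or Umbraco:

```csharp
using System;
using System.Collections.Generic;

public static class PdfFieldCheck
{
    // Hypothetical helper: true when the fields gathered for the node say it is a PDF.
    // Assumes "umbracoExtension" is one of the fields available at indexing time.
    public static bool IsPdf(IDictionary<string, string> fields)
    {
        string extension;
        return fields != null
            && fields.TryGetValue("umbracoExtension", out extension)
            && string.Equals(extension, "pdf", StringComparison.OrdinalIgnoreCase);
    }
}
```

In the `NodeIndexing` handler this could stand in for the `IsPdf(nodeId, helper)` call, for example `if (!PdfFieldCheck.IsPdf(args.Fields)) args.Cancel = true;`, so no media lookup is needed for items that are not PDFs.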
Something along those lines.
Other than that, it looks good. Good to see you got it working. I'm adding this as notes to my Examine course, ftw!
Regards
Ismail
Mitch,
One more thing: you will need to handle content unpublish and delete events. So if page A contains PDF B and page A is then unpublished, you will need to tap into that event and remove PDF B from the index.
Regards
Ismail
Thanks Ismail. I'll make those improvements. Glad you can find some use for it too!