Mitch 44 posts 159 karma points

Jul 19, 2017 @ 10:38

UmbracoExamine.PDF indexing published Media

Hi guys

I'm using UmbracoExamine.PDF to index PDFs, but I'm now aware that it indexes PDFs in the Media section whether they've been published on a content page or not.

Is there a way I can index only the PDFs that have been published on a content node?

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Jul 19, 2017 @ 11:04

Mitch,

You could tap into the gathering node event, then figure out whether the PDF is being used on a content node that is published. If no published content node uses it, cancel the indexing event for that PDF.

You may be able to determine media usage with the Nexu package: https://our.umbraco.org/projects/backoffice-extensions/nexu/

Regards

Ismail

Mitch 44 posts 159 karma points

Jul 20, 2017 @ 09:29

Thanks Ismail. Was hoping there would be a simple config setting I could use. I'll check out that Nexu package. Do you know if it has an API I can code to?

Thanks

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Jul 20, 2017 @ 09:57

Mitch,

It does. It's actually amazeballs. You install the package, then rebuild the usage via the dashboard. For most of your common data types, like the media picker and the RTE, it finds usages of content and media and fills the relations table. The API uses the relations API to show you usage. If you have custom data types you will need to build resolvers for them, but it's all documented.

So in theory, after installing Nexu and rebuilding, you can use the gathering node event on the PDF index to do a lookup via the ID of the media item you are currently indexing. See https://github.com/dawoe/umbraco-nexu/blob/develop/Source/Our.Umbraco.Nexu.Core.Tests/NexuApiControllerTests.cs. There is an API of sorts, but worst case you can use the relations API, as the package will create relations in the DB.

I have not done this before but reckon with a bit of work you can solve your issue.

Regards

Ismail

Dave Woestenborghs 3504 posts 12135 karma points MVP 9x admin c-trib

Jul 20, 2017 @ 10:00

Hi Mitch,

If you can't use my API directly, you can always code against the normal Umbraco Relations API. So, like Ismail said, you can probably use it for your case.

Dave

Mitch 44 posts 159 karma points

Jul 20, 2017 @ 10:20

Thank you so much guys. Very useful advice. I'll give it a go right now.

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Jul 20, 2017 @ 10:44

Mitch,

Let us know how you get on, because this is a very interesting use case. If it all works, I am going to include a note on it in the Examine course I wrote, under the PDF indexing exercise.

Regards

Ismail

Dave Woestenborghs 3504 posts 12135 karma points MVP 9x admin c-trib

Jul 20, 2017 @ 10:46

I also have plans to extend the API with methods that can be used in your project for just these kinds of use cases.

But I need to fix some other things first... and of course find time :-)

Dave

Mitch 44 posts 159 karma points

Jul 20, 2017 @ 16:03

So, I had a stab at it and it seems to work! Not much code and I'm sure it can be improved, but here it is...

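using System.Collections.Generic;
using System.Linq;
using Examine;
using Umbraco.Core;
using Umbraco.Core.Services;
using Umbraco.Web;

public class PdfExamineEvents : ApplicationEventHandler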
{
    private IRelationService _relationService;

    protected override void ApplicationStarted(UmbracoApplicationBase umbracoApplication, ApplicationContext applicationContext)
    {
        var helper = new UmbracoHelper(UmbracoContext.Current);

        ExamineManager.Instance.IndexProviderCollection["PDFIndexer"].NodeIndexing += (sender, e) => NodeIndexing(sender, e, helper);

        _relationService = applicationContext.Services.RelationService;
    }

    private void NodeIndexing(object sender, IndexingNodeEventArgs args, UmbracoHelper helper)
    {
        args.Cancel = !ShouldIndex(args.NodeId, helper);
    }

    private bool ShouldIndex(int nodeId, UmbracoHelper helper)
    {
        // Check if media item is PDF
        if (!IsPdf(nodeId, helper)) return false;

        // Check if the PDF has any relations at all
        if (!_relationService.IsRelated(nodeId)) return false;

        // If any document with this PDF is published, add to index. If not, cancel
        var relations = _relationService.GetByChildId(nodeId);
        if (!AnyPublished(relations.Select(r => r.ParentId), helper)) return false;

        return true;
    }

    private static bool IsPdf(int nodeId, UmbracoHelper helper)
    {
        var mediaItem = helper.TypedMedia(nodeId);

        // TypedMedia returns null if the media item cannot be found
        if (mediaItem == null) return false;

        // Not sure if every IPublishedContent has this property
        return mediaItem.HasProperty("umbracoExtension") && mediaItem.GetPropertyValue<string>("umbracoExtension") == "pdf";
    }

    private static bool AnyPublished(IEnumerable<int> nodeIds, UmbracoHelper helper)
    {
        // TypedContent only returns published content, so non-null means published
        return nodeIds.Any(id => helper.TypedContent(id) != null);
    }
}

Dave Woestenborghs 3504 posts 12135 karma points MVP 9x admin c-trib

Jul 20, 2017 @ 16:07

Hi Mitch,

Nice to see that it works.

Dave

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Jul 20, 2017 @ 16:23

Mitch,

One improvement you could make is to test for the PDF extension using the args provided, so you don't need to instantiate a new media item:

args.Fields["umbracoExtension"] == "pdf"

Something along those lines.
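
A minimal sketch of that check, written as a standalone helper so it can be tried without Umbraco. The names PdfFieldCheck and IsPdf are hypothetical, and it assumes the "umbracoExtension" field is actually present in the fields handed to the handler for the PDF index, which the thread does not confirm. It uses TryGetValue so a missing field does not throw, and it ignores case.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Examine or Umbraco.
public static class PdfFieldCheck
{
    // Returns true only when the field bag carries an "umbracoExtension"
    // value equal to "pdf" (case-insensitive). A null bag or a missing
    // field is treated as "not a PDF" instead of throwing.
    public static bool IsPdf(IDictionary<string, string> fields)
    {
        string extension;
        return fields != null
            && fields.TryGetValue("umbracoExtension", out extension)
            && string.Equals(extension, "pdf", StringComparison.OrdinalIgnoreCase);
    }
}
```

In the NodeIndexing handler this would be called as PdfFieldCheck.IsPdf(args.Fields) in place of the TypedMedia lookup.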

Other than that it looks good. Good to see you got it working. I'm adding this as notes to my Examine course ftw!

Regards

Ismail

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Jul 21, 2017 @ 08:30

Mitch,

One more thing: you will need to handle the content unpublish and delete events. So if page A contains PDF B and page A is then unpublished, you will need to tap into that event and remove PDF B from the index.
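
The rule itself can be sketched as a small pure function, independent of Umbraco, which makes it easy to test. Everything here is hypothetical: the names PdfIndexCleanup and PdfsToRemove, and the representation of relations as (page ID, media ID) pairs, are assumptions for illustration. Removing the returned IDs from the Examine index is left to the caller.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Umbraco or Nexu.
public static class PdfIndexCleanup
{
    // relations: Key = content page ID (parent), Value = media ID (child).
    // publishedPageIds: pages that are still published.
    // Returns the media IDs used on the unpublished page that are not
    // used on any other page that is still published.
    public static List<int> PdfsToRemove(
        int unpublishedPageId,
        IEnumerable<KeyValuePair<int, int>> relations,
        ISet<int> publishedPageIds)
    {
        var all = relations.ToList();

        return all
            .Where(r => r.Key == unpublishedPageId)
            .Select(r => r.Value)
            .Distinct()
            .Where(mediaId => !all.Any(r =>
                r.Value == mediaId
                && r.Key != unpublishedPageId
                && publishedPageIds.Contains(r.Key)))
            .ToList();
    }
}
```

The same function would serve for a delete event, since a deleted page no longer counts as a published user of the PDF.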

Regards

Ismail

Mitch 44 posts 159 karma points

Jul 25, 2017 @ 08:30

Thanks Ismail. I'll make those improvements. Glad you can find some use for it too!

Mitch 44 posts 159 karma points

Jul 25, 2017 @ 13:32

For the sake of completeness, here is my code to remove a PDF from the index if the page it is on becomes unpublished...

// Assumes _helper is an UmbracoHelper stored in a field of the class, and that
// this handler is subscribed in ApplicationStarted, e.g.
// ContentService.UnPublished += ContentService_UnPublished;
private void ContentService_UnPublished(IPublishingStrategy sender, PublishEventArgs<IContent> e)
{
    var pdfIndexer = ExamineManager.Instance.IndexProviderCollection["PDFIndexer"];

    foreach (var item in e.PublishedEntities)
    {
        // Get all relations where the current node is the parent
        var relations = _relationService.GetByParentId(item.Id).ToList();

        foreach (var relation in relations)
        {
            var mediaNode = _helper.TypedMedia(relation.ChildId);
            if (mediaNode == null) continue;

            // GetPropertyValue<string> avoids a null reference if the property has no value
            if (mediaNode.GetPropertyValue<string>("umbracoExtension") != "pdf") continue;

            // Get all relations for this PDF not including the current relation
            var otherRelations = _relationService.GetByChildId(mediaNode.Id).Where(rel => rel.ParentId != relation.ParentId);

            // If this PDF is used on other published pages, do nothing and leave PDF in index
            if (otherRelations.Any(i => _helper.TypedContent(i.ParentId) != null)) continue;

            pdfIndexer.DeleteFromIndex(mediaNode.Id.ToString());
        }
    }
}

Any suggestions for improvements to this code are most welcome.
