  • Topic author was deleted

    Apr 06, 2016 @ 14:36

    Examine, indexing rich and complex property editors

    Hi,

    For a current project I'm looking at indexing pretty complex pages that consist of deeply nested archetype properties.

    And I'm looking for a generic way to index the pages.

    Tips and code examples appreciated.

    Cheers, Tim

  • Ismail Mayat 4511 posts 10091 karma points MVP 2x admin c-trib
    Apr 06, 2016 @ 15:39
    Ismail Mayat
    2

    Tim,

    I did this for pages that had a grid. The grid contains all sorts of complex editors, not just simple title and content fields but things like promo boxes, and getting all of that into the index with GatheringNodeData code alone would have been messy.

    So I still used GatheringNodeData, but for any doctype that had a grid I made a web request to the page, parsed the response with Html Agility Pack, extracted the grid divs, stripped the HTML from that content, and injected the resulting text into Examine.

    I guess it's a poor man's web crawler.

    So in the application event handler class I have:

    public void OnApplicationStarted(UmbracoApplicationBase umbracoApplication, ApplicationContext applicationContext)
    {
        var indexer = (UmbracoContentIndexer)ExamineManager.Instance.IndexProviderCollection["ExternalIndexer"];

        var helper = new UmbracoHelper(UmbracoContext.Current);

        // run the custom handler whenever Examine gathers data for a node
        indexer.GatheringNodeData += (sender, e) => ExternalIndexerGatheringNodeData(sender, e, helper);
    }
    

    And in the event handler:

    if (e.IndexType == IndexTypes.Content)
    {
        ExamineHelper.AddAllFieldsToContentField(e, helper);
    }
    

    Then

     public static void AddAllFieldsToContentField(IndexingNodeDataEventArgs indexingNodeDataEventArgs, UmbracoHelper helper)
        {
            StringBuilder sb = new StringBuilder();
    
            try
            {
                //gets the node with the search settings
                var searchSettings = ServiceFactory.GetGlobalConfig(indexingNodeDataEventArgs.NodeId).SiteSearchSettings;
    
                //search is turned on therefore index
                if (searchSettings.ShowSearch)
                {
                    IWebScraper scraper = ServiceFactory.GetWebScraper();
    
                    foreach (var field in indexingNodeDataEventArgs.Fields)
                    {
                        if (IsGridField(field.Key))
                        {
                            LogHelper.Debug(typeof(ExamineHelper), string.Format("processing node {0} found grid property", indexingNodeDataEventArgs.NodeId));
    
                            // we have the full url here with the first assigned domain; this could be anything and may be incorrect
                            string contentUrlWithDomain = helper.NiceUrl(indexingNodeDataEventArgs.NodeId);
    
                            Uri siteUri = new Uri(contentUrlWithDomain);
    
                            //cannot get this from the IPublishedContent Url property because for some reason it's null; could be something to do with
                            //the fact that we are in GatheringNodeData?
    
                            string pageUrl = siteUri.AbsolutePath;
    
                            //SiteUrlForIndexingScraper comes from global config 
                            string urlToScrape = searchSettings.SiteUrlForIndexingScraper + pageUrl;
    
                            LogHelper.Debug(typeof(ExamineHelper), string.Format("scraping url: {0}", urlToScrape));
    
                            string scrapedGridContent = scraper.ScrapeByClass(urlToScrape, "umb-grid");
    
                            sb.AppendLine(scrapedGridContent);
                        }
                        else
                        {
                            if (IsContent(field.Value))
                            {
                                sb.AppendLine(field.Value);
                            }
                        }
                    }
                }
    
    
            }
            catch (Exception ex)
            {
                //common error could be missing global config stuff
                //for market 
                LogHelper.Error<Exception>("error indexing, double check global config search settings", ex);
            }
    
            indexingNodeDataEventArgs.Fields.Add("contents",sb.ToString());
        }
    
        private static bool IsGridField(string key)
        {
            return key == "grid";
        }
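
    The concatenation of SiteUrlForIndexingScraper and pageUrl above relies on the configured base URL not ending in a slash. A minimal sketch of a more forgiving join follows; ScrapeUrlBuilder and BuildUrlToScrape are hypothetical names for illustration, not part of the original code or of any library.

    ```csharp
    public static class ScrapeUrlBuilder
    {
        // Hypothetical helper: joins the configured scraper base URL and the
        // page's absolute path so that exactly one slash separates them,
        // whether or not the config value has a trailing slash.
        public static string BuildUrlToScrape(string siteUrlForIndexingScraper, string pageUrl)
        {
            string baseUrl = siteUrlForIndexingScraper.TrimEnd('/');
            string path = "/" + pageUrl.TrimStart('/');
            return baseUrl + path;
        }
    }
    ```

    It could be called as ScrapeUrlBuilder.BuildUrlToScrape(searchSettings.SiteUrlForIndexingScraper, pageUrl) in place of the plain string concatenation.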
    

    The webscraper looks like:

    public interface IWebScraper
    {
        string ScrapeByClass(string url, string cssClass);
    }
    
    public class WebScraper:IWebScraper
    {
    
        public string ScrapeByClass(string url, string cssClass)
        {
            StringBuilder sb=new StringBuilder();
    
            try
            {
                var html = new HtmlDocument();
    
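                // WebClient is IDisposable, so release it once the page is downloaded
                using (var client = new WebClient())
                {
                    html.LoadHtml(client.DownloadString(url));
                }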
    
                var root = html.DocumentNode;
    
                //remove picture and svg nodes                
                RemoveNodes(root, "picture");
    
                RemoveNodes(root, "svg");
    
                var grids = root.Descendants().Where(n => n.GetAttributeValue("class", "").Equals(cssClass));
    
                LogHelper.Debug(typeof(WebScraper),string.Format("found {0} grid/s", grids.Count()));
    
                foreach (var grid in grids)
                {
                    sb.AppendLine(grid.InnerText);
                    sb.Append(" "); //need these for tokenisation of say li content
                    if (grid.HasChildNodes)
                    {
                        ProcessChildNodes(sb, grid.ChildNodes);
                    }
                }
            }
            catch (Exception ex)
            {
                LogHelper.Error<Exception>(string.Format("error scraping url {0}", url), ex);
            }
    
            var content = System.Web.HttpUtility.HtmlDecode(sb.ToString().Replace("\r", string.Empty).Replace("\n", string.Empty));
    
            return content;            
        }
    
        private void RemoveNodes(HtmlNode root, string elementTypeToRemove)
        {
            var emptyImages = root.Descendants(elementTypeToRemove)  
                                  .Select(x => x.XPath)
                                  .ToList();
    
            emptyImages.ForEach(xpath => {
                var node = root.SelectSingleNode(xpath);
                if (node != null) { node.Remove(); }
            });
        }
    
        private void ProcessChildNodes(StringBuilder sb, HtmlNodeCollection childNodes)
        {
            foreach (var childNode in childNodes)
            {
                sb.AppendLine(childNode.InnerText);
                sb.Append(" ");
                if (childNode.HasChildNodes)
                {
                    ProcessChildNodes(sb,childNode.ChildNodes);
                }
            }
        }
    }
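
    One thing to be aware of: InnerText on a parent node already contains the text of all of its descendants, so appending it at every level of the recursion adds the same words to the index several times. If that repetition is not wanted, a possible alternative is to collect only the text nodes and join them with spaces. This is a sketch only, assuming Html Agility Pack; GridTextExtractor and ExtractText are made-up names, and unlike the code above it matches elements whose class list contains the given class rather than requiring an exact match.

    ```csharp
    using System.Linq;
    using HtmlAgilityPack;

    public static class GridTextExtractor
    {
        // Hypothetical helper: returns the text found inside every element
        // whose class list contains cssClass, with each text node emitted
        // once and separated by a single space.
        public static string ExtractText(string html, string cssClass)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var pieces = doc.DocumentNode
                .Descendants()
                .Where(n => n.GetAttributeValue("class", "").Split(' ').Contains(cssClass))
                .SelectMany(n => n.DescendantsAndSelf())
                .Distinct() // guards against nested matches yielding a node twice
                .Where(n => n.NodeType == HtmlNodeType.Text)
                .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
                .Where(t => t.Length > 0);

            return string.Join(" ", pieces);
        }
    }
    ```

    The space separator does the same job as the sb.Append(" ") calls above: it keeps the content of adjacent elements such as li items from running together into one token.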
    

    Regards

    Ismail

  • Comment author was deleted

    Apr 06, 2016 @ 18:29

    Super, thanks for sharing!

  • Tom van Enckevort 107 posts 429 karma points
    Apr 07, 2016 @ 08:05
    Tom van Enckevort
    2

    Instead of using the WebClient to request the page, could you use the RenderTemplate method in the UmbracoHelper class?

    So it would be something like this inside the GatheringNodeData:

    var umbHelper = new UmbracoHelper(UmbracoContext.Current);
    var content = umbHelper.RenderTemplate(indexingNodeDataEventArgs.NodeId);
    

    You can then pass that into the HtmlAgilityPack classes to extract the required HTML content.

    It would save having to make an HTTP request to the server, and instead keep it all within the current code execution cycle.
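
    A rough sketch of how the two pieces could fit together, assuming Umbraco 7 (where RenderTemplate returns an IHtmlString) and Html Agility Pack. RenderedGridReader and GetGridText are hypothetical names, and the "umb-grid" class is taken from the scraper code earlier in the thread. It has not been tested against pages with macros or custom controllers.

    ```csharp
    using System.Linq;
    using System.Text;
    using HtmlAgilityPack;
    using Umbraco.Web;

    public static class RenderedGridReader
    {
        // Hypothetical helper: renders the node with its template in-process
        // and returns the text of every element whose class is exactly "umb-grid".
        public static string GetGridText(UmbracoHelper umbHelper, int nodeId)
        {
            // RenderTemplate returns the rendered page markup as an IHtmlString
            string markup = umbHelper.RenderTemplate(nodeId).ToString();

            var doc = new HtmlDocument();
            doc.LoadHtml(markup);

            var sb = new StringBuilder();
            var grids = doc.DocumentNode.Descendants()
                .Where(n => n.GetAttributeValue("class", "") == "umb-grid");

            foreach (var grid in grids)
            {
                // trailing space keeps text from separate grids apart
                sb.Append(grid.InnerText).Append(' ');
            }

            return HtmlEntity.DeEntitize(sb.ToString()).Trim();
        }
    }
    ```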

  • Ismail Mayat 4511 posts 10091 karma points MVP 2x admin c-trib
    Apr 07, 2016 @ 08:14
    Ismail Mayat
    0

    Tom,

    Genius, I will remember that for next time.

    Regards

    Ismail

  • Ismail Mayat 4511 posts 10091 karma points MVP 2x admin c-trib
    Apr 07, 2016 @ 08:16
    Ismail Mayat
    0

    Tom,

    Have you tried this within the context of the GatheringNodeData event? Are you able to get an Umbraco context? I think you have to call EnsureContext first, if I remember rightly.

    Regards

    Ismail

  • Tom van Enckevort 107 posts 429 karma points
    Apr 07, 2016 @ 09:07
    Tom van Enckevort
    0

    Yes, it's safest to make sure EnsureContext is called. It seemed to work without when I tried it, but better to be safe than sorry :-)
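
    For reference, a pattern often used on Umbraco 7 looks roughly like the sketch below. It assumes the six-argument UmbracoContext.EnsureContext overload found in later 7.x releases; the available overloads differ between versions, so check the signature against the version you are running. ContextHelper and EnsureUmbracoContext are made-up names.

    ```csharp
    using System.IO;
    using System.Web;
    using System.Web.Hosting;
    using Umbraco.Core;
    using Umbraco.Core.Configuration;
    using Umbraco.Web;
    using Umbraco.Web.Routing;
    using Umbraco.Web.Security;

    public static class ContextHelper
    {
        // Hypothetical helper: returns the current UmbracoContext, creating
        // one (backed by a dummy HTTP request if needed) when none exists.
        public static UmbracoContext EnsureUmbracoContext()
        {
            if (UmbracoContext.Current != null)
            {
                return UmbracoContext.Current;
            }

            // Outside a real request there is no HttpContext, so fake one
            var httpContext = new HttpContextWrapper(
                HttpContext.Current ??
                new HttpContext(new SimpleWorkerRequest("/", string.Empty, new StringWriter())));

            return UmbracoContext.EnsureContext(
                httpContext,
                ApplicationContext.Current,
                new WebSecurity(httpContext, ApplicationContext.Current),
                UmbracoConfig.For.UmbracoSettings(),
                UrlProviderResolver.Current.Providers,
                false); // do not replace a context that already exists
        }
    }
    ```

    Calling it at the top of the GatheringNodeData handler, before creating the UmbracoHelper, would cover indexing that runs outside a normal page request.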

  • Comment author was deleted

    Apr 07, 2016 @ 08:12

    Ah yeah that's a nicer solution, instead of scraping the page, will give it a shot, thanks!

  • Jay 425 posts 652 karma points
    Feb 28, 2017 @ 16:23
    Jay
    0

    Hey Ismail,

    When you say check for EnsureContext, do you have an example on how to do that?

    Building something similar

    Thanks

  • Jay 425 posts 652 karma points
    Feb 28, 2017 @ 18:09
    Jay
    1

    I found out how to do the EnsureContext.

    Another question:

    I'm using umbHelper.RenderTemplate(indexingNodeDataEventArgs.NodeId)

    instead of WebClient. Any idea how to use it when your pages have a custom RenderMvcController?

    It keeps saying that there's no parameterless constructor.

  • Ismail Mayat 4511 posts 10091 karma points MVP 2x admin c-trib
    Mar 01, 2017 @ 09:26
    Ismail Mayat
    0

    JLon,

    When I tried RenderTemplate on a page with a grid and a macro it kept blowing up, hence I did the screen scrape.

    Regards

    Ismail

  • Tom 713 posts 954 karma points
    Sep 15, 2017 @ 00:52
    Tom
    0

    Would this approach also work where you have angular driven custom property editors? I.e. I have a repeated collection of form elements as an editor that is not in the grid editor but a custom data type.. Would you suggest scraping as well?

    Did anyone get it working using RenderTemplate and macros? Thanks :)

  • Craig100 1136 posts 2523 karma points c-trib
    Aug 13, 2018 @ 16:37
    Craig100
    0

    Hi,

    Just resurrecting this thread because I'm working on an old site that has this code in it. I'm wondering if anyone can shed any light on why, when debugging, the debugger asks for Timing.cs at the point marked **** below. It can't find the file, and when browsing for it, it says the file should be in c:\github\SamSaffron\MiniProfiler\StackExchange.Profiling\Timing.cs.

    namespace myproject.co.uk.ExamineAddons
    {
        public class UmbracoEvents : ApplicationEventHandler
        {
            protected override void ApplicationStarted(UmbracoApplicationBase umbracoApplication, ApplicationContext applicationContext)
            {
                var helper = new UmbracoHelper(UmbracoContext.Current);
                ExamineManager.Instance.IndexProviderCollection["ExternalIndexer"].GatheringNodeData += (sender, e) => ExternalIndexerGatheringNodeData(sender, e, helper);
            }
    ****
            void ExternalIndexerGatheringNodeData(object sender, IndexingNodeDataEventArgs e, UmbracoHelper helper)
            {
                if (e.IndexType == IndexTypes.Content)
                    ExamineHelper.AddAllFieldsToContentField(e, helper);
            }
        }
    }
    

    I understand it's something to do with MiniProfiler. The Dll is there but I can't seem to get past this.

    Any advice would be appreciated.

    Craig
