So this was for pages that had a grid. The grid has all sorts of complex editors, not just simple title, content etc. but things like promo boxes, and getting that into the index using GatheringNodeData code alone would have been messy.
So what I did was still use GatheringNodeData, but for any doctype that had a grid I would make a web request to the page, parse it using Html Agility Pack, extract the grid divs, strip the HTML out of that content and inject it into Examine.
I guess it's a poor man's web crawler.
So in the application event handler class I have:
public void OnApplicationStarted(UmbracoApplicationBase umbracoApplication, ApplicationContext applicationContext)
{
    var indexer = (UmbracoContentIndexer)ExamineManager.Instance.IndexProviderCollection["ExternalIndexer"];
    var helper = new UmbracoHelper(UmbracoContext.Current);
    // hook the external indexer so extra content can be added for each node
    indexer.GatheringNodeData += (sender, e) => ExternalIndexerGatheringNodeData(sender, e, helper);
}
And in event:
if (e.IndexType == IndexTypes.Content)
{
    ExamineHelper.AddAllFieldsToContentField(e, helper);
}
Then
public static void AddAllFieldsToContentField(IndexingNodeDataEventArgs indexingNodeDataEventArgs, UmbracoHelper helper)
{
    StringBuilder sb = new StringBuilder();
    try
    {
        // gets the node holding the search settings
        var searchSettings = ServiceFactory.GetGlobalConfig(indexingNodeDataEventArgs.NodeId).SiteSearchSettings;
        // search is turned on, therefore index
        if (searchSettings.ShowSearch)
        {
            IWebScraper scraper = ServiceFactory.GetWebScraper();
            foreach (var field in indexingNodeDataEventArgs.Fields)
            {
                if (IsGridField(field.Key))
                {
                    LogHelper.Debug(typeof(ExamineHelper), string.Format("processing node {0} found grid property", indexingNodeDataEventArgs.NodeId));
                    // we have the full url here with the first assigned domain; this could be anything and may be incorrect
                    string contentUrlWithDomain = helper.NiceUrl(indexingNodeDataEventArgs.NodeId);
                    Uri siteUri = new Uri(contentUrlWithDomain);
                    // cannot get this from the IPublishedContent Url property because for some reason it is null,
                    // could be something to do with the fact we are in GatheringNodeData??
                    string pageUrl = siteUri.AbsolutePath;
                    // SiteUrlForIndexingScraper comes from global config
                    string urlToScrape = searchSettings.SiteUrlForIndexingScraper + pageUrl;
                    LogHelper.Debug(typeof(ExamineHelper), string.Format("scraping url: {0}", urlToScrape));
                    string scrapedGridContent = scraper.ScrapeByClass(urlToScrape, "umb-grid");
                    sb.AppendLine(scrapedGridContent);
                }
                else
                {
                    if (IsContent(field.Value))
                    {
                        sb.AppendLine(field.Value);
                    }
                }
            }
        }
    }
    catch (Exception ex)
    {
        // a common error could be missing global config settings for the market
        LogHelper.Error<Exception>("error indexing, double check global config search settings", ex);
    }
    indexingNodeDataEventArgs.Fields.Add("contents", sb.ToString());
}

private static bool IsGridField(string key)
{
    // the grid property alias on the doctypes is "grid"
    return key == "grid";
}
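A minimal sketch of just the URL handling from the method above, using only System.Uri. `ScrapeUrlBuilder` and `Build` are made-up names, not part of the original code. It assumes NiceUrl might hand back either an absolute URL (when a domain is assigned) or a relative one, and that SiteUrlForIndexingScraper may or may not end with a slash; `new Uri(string)` on its own would throw on a relative URL under .NET Framework.

```csharp
using System;

public static class ScrapeUrlBuilder
{
    // niceUrl: whatever helper.NiceUrl(nodeId) returned (absolute or relative).
    // scraperBaseUrl: the configured SiteUrlForIndexingScraper value.
    public static string Build(string niceUrl, string scraperBaseUrl)
    {
        string path = niceUrl;
        Uri parsed;

        // Only an absolute http(s) URL needs its domain stripped;
        // a relative URL is already just the path.
        if (Uri.TryCreate(niceUrl, UriKind.Absolute, out parsed)
            && (parsed.Scheme == Uri.UriSchemeHttp || parsed.Scheme == Uri.UriSchemeHttps))
        {
            path = parsed.AbsolutePath;
        }

        // Join with exactly one slash, whichever way the config value was entered.
        return scraperBaseUrl.TrimEnd('/') + "/" + path.TrimStart('/');
    }
}
```

With this, `ScrapeUrlBuilder.Build(helper.NiceUrl(nodeId), searchSettings.SiteUrlForIndexingScraper)` could stand in for the four URL lines in the grid branch. As in the original, any query string is dropped.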
The webscraper looks like:
public interface IWebScraper
{
    string ScrapeByClass(string url, string cssClass);
}

public class WebScraper : IWebScraper
{
    public string ScrapeByClass(string url, string cssClass)
    {
        StringBuilder sb = new StringBuilder();
        try
        {
            string markup;
            using (var client = new WebClient())
            {
                markup = client.DownloadString(url);
            }
            var html = new HtmlDocument();
            html.LoadHtml(markup);
            var root = html.DocumentNode;
            // remove picture and svg nodes
            RemoveNodes(root, "picture");
            RemoveNodes(root, "svg");
            // only matches elements whose class attribute is exactly cssClass
            var grids = root.Descendants().Where(n => n.GetAttributeValue("class", "").Equals(cssClass)).ToList();
            LogHelper.Debug(typeof(WebScraper), string.Format("found {0} grid/s", grids.Count));
            foreach (var grid in grids)
            {
                sb.AppendLine(grid.InnerText);
                sb.Append(" "); // need these for tokenisation of say li content
                if (grid.HasChildNodes)
                {
                    // note: child text is appended again here, so words appear more than once
                    ProcessChildNodes(sb, grid.ChildNodes);
                }
            }
        }
        catch (Exception ex)
        {
            LogHelper.Error<Exception>(string.Format("error scraping url {0}", url), ex);
        }
        // line breaks are stripped, so the appended spaces are the only separators left
        var content = System.Web.HttpUtility.HtmlDecode(sb.ToString().Replace("\r", string.Empty).Replace("\n", string.Empty));
        return content;
    }

    private void RemoveNodes(HtmlNode root, string elementTypeToRemove)
    {
        // collect the xpaths first so nodes are not removed while enumerating
        var xpaths = root.Descendants(elementTypeToRemove)
            .Select(x => x.XPath)
            .ToList();
        xpaths.ForEach(xpath =>
        {
            var node = root.SelectSingleNode(xpath);
            if (node != null) { node.Remove(); }
        });
    }

    private void ProcessChildNodes(StringBuilder sb, HtmlNodeCollection childNodes)
    {
        foreach (var childNode in childNodes)
        {
            sb.AppendLine(childNode.InnerText);
            sb.Append(" ");
            if (childNode.HasChildNodes)
            {
                ProcessChildNodes(sb, childNode.ChildNodes);
            }
        }
    }
}
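The scraper above appends InnerText for each grid div and then again for every descendant, so the same words can reach the index several times. A rough alternative sketch (not the original poster's code) collects only the text nodes and joins them with single spaces, and treats the class attribute as a space-separated list. `GridTextExtractor` and `ExtractByClass` are made-up names. It assumes the HtmlAgilityPack package is referenced, and it takes the HTML as a string so the download step stays separate. The picture/svg removal is left out here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

public static class GridTextExtractor
{
    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    public static string ExtractByClass(string html, string cssClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Match cssClass as one token, so class="umb-grid extra" also counts.
        var containers = doc.DocumentNode.Descendants()
            .Where(n => n.GetAttributeValue("class", "")
                .Split(Whitespace, StringSplitOptions.RemoveEmptyEntries)
                .Contains(cssClass));

        var parts = new List<string>();
        foreach (var container in containers)
        {
            // Each text node is visited once, so nothing is repeated.
            foreach (var textNode in container.Descendants()
                .Where(n => n.NodeType == HtmlNodeType.Text))
            {
                var text = WebUtility.HtmlDecode(textNode.InnerText).Trim();
                if (text.Length > 0)
                {
                    parts.Add(text);
                }
            }
        }

        // A space between parts keeps adjacent <li> items as separate tokens.
        return string.Join(" ", parts);
    }
}
```

If one matching container were nested inside another, its text would still be collected twice; the sketch assumes that does not happen with the grid markup.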
Just resurrecting this thread because I'm working on an old site that has this code in it. Can anyone shed any light on why, when debugging, at ** the debugger asks for timing.cs, which it can't find, and when browsing for it says it should be in c:\github\SamSaffron\MiniProfiler\StackExchange.Profiling\Timing.cs?
namespace myproject.co.uk.ExamineAddons
{
    public class UmbracoEvents : ApplicationEventHandler
    {
        protected override void ApplicationStarted(UmbracoApplicationBase umbracoApplication, ApplicationContext applicationContext)
        {
            var helper = new UmbracoHelper(UmbracoContext.Current);
            ExamineManager.Instance.IndexProviderCollection["ExternalIndexer"].GatheringNodeData += (sender, e) => ExternalIndexerGatheringNodeData(sender, e, helper);
        }

        // ** (the point referred to above)
        void ExternalIndexerGatheringNodeData(object sender, IndexingNodeDataEventArgs e, UmbracoHelper helper)
        {
            if (e.IndexType == IndexTypes.Content)
                ExamineHelper.AddAllFieldsToContentField(e, helper);
        }
    }
}
I understand it's something to do with MiniProfiler. The DLL is there but I can't seem to get past this.
Examine, indexing rich and complex property editors
Hi,
For a current project I'm looking at indexing pretty complex pages that consist of deeply nested archetype properties.
And I'm looking for a generic way to index the pages.
Tips and code examples appreciated.
Cheers, Tim
Tim,
So this was for pages that had a grid. The grid has all sorts of complex editors, not just simple title, content etc. but things like promo boxes, and getting that into the index using GatheringNodeData code alone would have been messy.
So what I did was still use GatheringNodeData, but for any doctype that had a grid I would make a web request to the page, parse it using Html Agility Pack, extract the grid divs, strip the HTML out of that content and inject it into Examine.
I guess it's a poor man's web crawler.
So in the application event handler class I have:
And in event:
Then
The webscraper looks like:
Regards
Ismail
Super, thanks for sharing!
Instead of using the WebClient to request the page, could you use the RenderTemplate method in the UmbracoHelper class? So it would be something like this inside the GatheringNodeData:
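The snippet from this post did not survive, so here is a hedged sketch of the idea rather than the poster's code. It assumes Umbraco 7, where `UmbracoHelper.RenderTemplate(int)` returns an `IHtmlString`, and the HtmlAgilityPack package. `RenderTemplateIndexing` and `GetGridNodes` are made-up names, and "umb-grid" is the class used by the scraper earlier in the thread.

```csharp
using System.Collections.Generic;
using System.Linq;
using Examine;
using HtmlAgilityPack;
using Umbraco.Web;

public static class RenderTemplateIndexing
{
    // Renders the node's template in-process and returns the grid containers,
    // instead of requesting the page over HTTP.
    public static List<HtmlNode> GetGridNodes(IndexingNodeDataEventArgs e, UmbracoHelper helper)
    {
        // RenderTemplate returns an IHtmlString; ToString() gives the markup.
        string markup = helper.RenderTemplate(e.NodeId).ToString();

        var doc = new HtmlDocument();
        doc.LoadHtml(markup);

        // Same exact-match selection the scraper uses.
        return doc.DocumentNode.Descendants()
            .Where(n => n.GetAttributeValue("class", "") == "umb-grid")
            .ToList();
    }
}
```

This needs a usable UmbracoContext at indexing time, and replies further down report problems with macros and with custom controllers, so treat it as a starting point only.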
You can then pass that into the HtmlAgilityPack classes to extract the required HTML content.
It would save having to make an HTTP request to the server, and instead keep it all within the current code execution cycle.
Tom,
Genius, will remember that for next time.
Regards
Ismail
Tom,
Have you tried this within the context of the GatheringNodeData event? Are you able to get an Umbraco context? I think you have to call EnsureContext first, if I remember rightly.
Regards
Ismail
Yes, it's safest to make sure EnsureContext is called. It seemed to work without when I tried it, but better to be safe than sorry :-)
Ah yeah that's a nicer solution, instead of scraping the page, will give it a shot, thanks!
Hey Ismail,
When you say check for EnsureContext, do you have an example on how to do that?
Building something similar
Thanks
Found how to do the EnsureContext.
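The poster's own snippet is not in the thread. The pattern usually shared for Umbraco 7 looks roughly like the sketch below; treat the exact `EnsureContext` overload as an assumption to check against the Umbraco version in use, since the signature changed between releases. `ContextHelper`, `EnsureUmbracoContext` and the "dummy.aspx" page name are made up for illustration.

```csharp
using System.IO;
using System.Web;
using System.Web.Hosting;
using Umbraco.Core;
using Umbraco.Core.Configuration;
using Umbraco.Web;
using Umbraco.Web.Routing;
using Umbraco.Web.Security;

public static class ContextHelper
{
    // Makes sure UmbracoContext.Current is available, e.g. when indexing
    // runs outside of a normal web request.
    public static UmbracoContext EnsureUmbracoContext()
    {
        if (UmbracoContext.Current != null)
        {
            return UmbracoContext.Current;
        }

        // Fake HTTP context; assumption: no real request exists at this point.
        var httpContext = new HttpContextWrapper(
            HttpContext.Current ?? new HttpContext(
                new SimpleWorkerRequest("dummy.aspx", "", new StringWriter())));

        // Assumed overload (Umbraco 7.3+ style); verify before relying on it.
        return UmbracoContext.EnsureContext(
            httpContext,
            ApplicationContext.Current,
            new WebSecurity(httpContext, ApplicationContext.Current),
            UmbracoConfig.For.UmbracoSettings(),
            UrlProviderResolver.Current.Providers,
            false);
    }
}
```

Calling `ContextHelper.EnsureUmbracoContext()` before creating the UmbracoHelper inside the GatheringNodeData handler would be the intended use.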
Another question:
I'm using umbHelper.RenderTemplate(indexingNodeDataEventArgs.NodeId) instead of WebClient. Any idea how to use it when your pages have a custom RenderMvcController?
It keeps saying that there's no parameterless constructor.
JLon,
When I tried RenderTemplate on a page with a grid and a macro it kept blowing up, hence I did the screen scrape.
Regards
Ismail
Would this approach also work where you have Angular-driven custom property editors? I.e. I have a repeated collection of form elements as an editor that is not in the grid editor but a custom data type. Would you suggest scraping as well?
Did anyone get it working using RenderTemplate and macros? Thanks :)
Hi,
Just resurrecting this thread because I'm working on an old site that has this code in it. Can anyone shed any light on why, when debugging, at ** the debugger asks for timing.cs, which it can't find, and when browsing for it says it should be in c:\github\SamSaffron\MiniProfiler\StackExchange.Profiling\Timing.cs?
I understand it's something to do with MiniProfiler. The DLL is there but I can't seem to get past this.
Any advice would be appreciated.
Craig