namespace Bodenko.Umbraco.ApplicationEvents
{
    using System;
    using System.Text.RegularExpressions;
    using umbraco.BusinessLogic;
    using umbraco.cms.businesslogic;
    using umbraco.cms.businesslogic.property;
    using umbraco.cms.businesslogic.web;

    public class ExtractImageAssignProperty : ApplicationBase
    {
        public ExtractImageAssignProperty()
        {
            Document.BeforePublish += new Document.PublishEventHandler(Document_BeforePublish);
        }

        void Document_BeforePublish(Document sender, PublishEventArgs e)
        {
            try
            {
                // get the article property from the document
                Property bodyText = sender.getProperty("article");

                // check that the property exists
                if (bodyText != null && bodyText.Value != null)
                {
                    // grab the value
                    String html = bodyText.Value.ToString();

                    // set the regular expressions
                    Regex regImages = new Regex(@"<img\s[^>]*>", RegexOptions.IgnoreCase);
                    Regex regSrc = new Regex(@"src=(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))", RegexOptions.IgnoreCase | RegexOptions.Singleline);

                    // get the matches from the regular expressions
                    MatchCollection images = regImages.Matches(html);

                    // if it has any matches, then continue
                    if (images.Count > 0)
                    {
                        // loop through each of the image matches (we can't assume the first one is valid)
                        foreach (Match image in images)
                        {
                            // check if it has a 'src' attribute
                            if (regSrc.IsMatch(image.Groups[0].Value))
                            {
                                // get the 'src' attribute
                                Match src = regSrc.Match(image.Groups[0].Value);

                                // check if the 'src' attribute has a value
                                if (!String.IsNullOrEmpty(src.Groups["src"].Value))
                                {
                                    // grab the value (which should be the image URL)
                                    String url = src.Groups["src"].Value;

                                    // get the image property from the document
                                    Property docImage = sender.getProperty("image");

                                    // check that the property exists
                                    if (docImage != null)
                                    {
                                        // assign the image URL to the document property.
                                        docImage.Value = url;

                                        // since we are only interested in the first image tag,
                                        // break out of the foreach loop
                                        break;
                                    }
                                }
                            }
                        }
                    }
                }
            }
            catch
            {
                // if we catch an exception - we still want the document to be published
                // and we don't want a YSoD - so handle however you prefer here. (i.e. ELMAH or other logging)
            }
        }
    }
}
Extract image from bodyText using XSLT
Hi guys,
I was wondering whether it's possible to get an image that was inserted into the bodyText property of my "article" document type. I'd like to keep the "article" document type as simple as possible, and I'm trying to avoid adding a custom "image" property to it, mostly because I'd like the article editors to work in WLW (or MS Word), where I was unable to find a way to assign a picture to document type properties that are not simple and/or richtext fields.
A best-practice solution would be great, but one that doesn't require the article editors to use Umbraco's backend, please. Thanks in advance.
Unfortunately the feature to edit posts seems to be missing, so just in case it's unclear: if there are multiple images in the article, I'd like to get the first one :)
I think I would go for creating an xslt extension for that purpose, and use RegEx to find the first img element and get the src attribute from that.
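A rough sketch of what such an extension method could look like, for anyone following along. This is not code from the thread: the class name `FirstImageExtension` and the method name `GetFirstImageSrc` are made up for illustration, and it assumes the rich text HTML is passed in as a plain string. It uses only `System.Text.RegularExpressions` from the .NET framework.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (names are illustrative, not part of Umbraco).
public static class FirstImageExtension
{
    // Matches the first <img ...> tag that carries a src attribute and captures
    // the value, whether it is double-quoted, single-quoted or unquoted.
    // The \s before "src" keeps attributes such as data-src from matching.
    private static readonly Regex ImgSrc = new Regex(
        @"<img\b[^>]*?\ssrc\s*=\s*(?:""(?<src>[^""]*)""|'(?<src>[^']*)'|(?<src>[^\s>]+))",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the src of the first image in the HTML, or an empty string if none is found.
    public static string GetFirstImageSrc(string html)
    {
        if (String.IsNullOrEmpty(html))
        {
            return String.Empty;
        }

        Match match = ImgSrc.Match(html);
        return match.Success ? match.Groups["src"].Value : String.Empty;
    }
}
```

A regular expression is fine for the fairly regular markup a rich text editor produces, but it is not a real HTML parser, so unusual markup may slip through.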
Umm huh :( ...figured as much...but was kinda hoping that there was some umbraco.library GetMedia-like thingy :)
Could I propose that the Umbraco db gets extended so that you can figure out which media was posted along with which article (content node)? That way this could be done easily. It would be an additional (rather useful, IMHO) feature, although if media items are reused across articles you'd have to fall back on something else.
Thanks anyway, Morten
Sorry to disappoint you :-)
But the only reference saved to the media is the string in the html.
Hi stc,
Here's an idea for a solution to your specific problem. Code can be developed to hook into the Document.BeforePublish event to examine the "article" (body text) value for any HTML images, extract the first one and assign it to a different property.
http://our.umbraco.org/wiki/reference/api-cheatsheet/using-applicationbase-to-register-events
Usually, I'd suggest using a regular expression to get the <img> tags from the HTML... but now I'd recommend the Html Agility Pack:
http://www.codeplex.com/htmlagilitypack
Here's a quick snippet from StackOverflow on how to extract <img> tags from HTML:
http://stackoverflow.com/questions/790559/how-to-extract-image-urls-from-html-file-in-c/790566#790566
Obviously this is just an idea... I haven't written any code to do this ... and if you're not a .NET developer, then it can seem very very daunting!
I don't think this is something that is required in the Umbraco core, but is specific to your problem (which many others would probably find useful).
Cheers, Lee.
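For reference, the Html Agility Pack idea from the post above might look roughly like this. It is only a sketch, not code from the thread: the `HtmlImageHelper` class and `GetFirstImageSrc` method are hypothetical names, and it assumes the project references the HtmlAgilityPack assembly.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper (names are illustrative); requires a reference to HtmlAgilityPack.
public static class HtmlImageHelper
{
    // Returns the src of the first <img> that has a src attribute,
    // or an empty string if the HTML contains no such image.
    public static string GetFirstImageSrc(string html)
    {
        if (String.IsNullOrEmpty(html))
        {
            return String.Empty;
        }

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches the XPath.
        HtmlNode img = doc.DocumentNode.SelectSingleNode("//img[@src]");
        return img == null ? String.Empty : img.GetAttributeValue("src", String.Empty);
    }
}
```

The returned value could then be assigned to the "image" property inside a Document.BeforePublish handler, in place of the regular expression matching.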
Hi stc,
I couldn't help myself! I've gone with the Regular Expression approach - only to keep all the code self-contained in this snippet... and within the .NET framework. Personally I'd go with Html Agility Pack, but that's too much effort (explaining references, etc) for this code snippet.
I haven't tested this in any way - it should work ... but I'd suggest that you test it out on a dev site/server first!!! (that is if you want to try it out? Feel free to say no).
For anyone else who finds this code useful... then WTFPL applies nicely! ;-)
Cheers, Lee.
Lee,
Nice solution. A quick suggestion, maybe overkill, but it would reduce the size of the image extraction method: you could load the HTML into Html Agility Pack and XPath it out.
Regards
Ismail
Hi Ismail, I mention HTML Agility Pack just before the code snippet. ;-)
I used RegEx in the snippet as an example, and to keep it self-contained within the .NET framework! But yes, HTML Agility Pack is awesome for this kind of thing!
Cheers, Lee.
Lee,
Doh, I need to read things properly, LOL!