Extract image from bodyText using XSLT

stc 72 posts 101 karma points

Feb 28, 2010 @ 17:02

Hi guys,

I was wondering whether it is possible to get an image that was inserted into the bodyText property of my "article" document type. I would like to keep the "article" document type as simple as possible, and I am trying to avoid adding a custom "image" property to it, mostly because I'd like the article editors to work in WLW (or MS Word), where I was unable to find a way to assign a picture to document type properties that are not simple text and/or richtext fields.

A best-practice solution would be great, but please one that does not require the article editors to use Umbraco's backend. Thanks in advance.

stc 72 posts 101 karma points

Feb 28, 2010 @ 17:04

Unfortunately the feature to edit posts seems to be missing. Just in case it's unclear: if there are multiple images in the article, I'd like to get the first one :)

Morten Bock 1867 posts 2140 karma points MVP 2x admin c-trib

Feb 28, 2010 @ 17:15

I think I would go for creating an xslt extension for that purpose, and use RegEx to find the first img element and get the src attribute from that.
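
A minimal sketch of the approach described above, using only the .NET framework's Regex class. The class and method names (`BodyTextImages`, `FirstImageSrc`) are hypothetical and not part of Umbraco, and registering the class as an XSLT extension is not shown. It assumes reasonably well-formed markup; an attribute value containing ">" would defeat the tag pattern.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative and not part of Umbraco.
public static class BodyTextImages
{
    // Matches a whole <img ...> tag.
    private static readonly Regex ImgTag =
        new Regex(@"<img\s[^>]*>", RegexOptions.IgnoreCase);

    // Matches a src attribute with a double-quoted, single-quoted or unquoted value.
    // The leading \s keeps attributes such as data-src from matching.
    private static readonly Regex SrcAttr = new Regex(
        @"\ssrc\s*=\s*(?:""(?<src>[^""]*)""|'(?<src>[^']*)'|(?<src>[^\s>]+))",
        RegexOptions.IgnoreCase);

    // Returns the src of the first <img> tag that has a non-empty src,
    // or an empty string when there is none.
    public static string FirstImageSrc(string html)
    {
        if (String.IsNullOrEmpty(html))
        {
            return String.Empty;
        }

        foreach (Match img in ImgTag.Matches(html))
        {
            Match src = SrcAttr.Match(img.Value);
            if (src.Success && src.Groups["src"].Value.Length > 0)
            {
                return src.Groups["src"].Value;
            }
        }

        return String.Empty;
    }
}
```

Returning an empty string rather than null keeps the result easy to test from XSLT, where an empty string is falsy.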

stc 72 posts 101 karma points

Feb 28, 2010 @ 19:07

Umm huh :( ...figured as much...but was kinda hoping that there was some umbraco.library GetMedia-like thingy :)

Could I propose that the Umbraco db be extended so that you can figure out which media was posted along with which article (content node)? That way this could be done easily. It would be an additional (and rather useful, IMHO) feature, although for reuse of media items you'd still have to fall back on something else.

Thanks anyway Morten

Morten Bock 1867 posts 2140 karma points MVP 2x admin c-trib

Feb 28, 2010 @ 19:11

Sorry to disappoint you :-)

But the only reference saved to the media is the string in the html.

Lee Kelleher 4026 posts 15837 karma points MVP 13x admin c-trib

Mar 01, 2010 @ 06:05

Hi stc,

Here's an idea for a solution to your specific problem. Code can be developed to hook into the Document.BeforePublish event to examine the "article" (body text) value for any HTML images, extract the first one and assign it to a different property.

http://our.umbraco.org/wiki/reference/api-cheatsheet/using-applicationbase-to-register-events

Usually, I'd suggest using a regular expression to get the <img> tags from the HTML... but now I'd recommend the Html Agility Pack:

http://www.codeplex.com/htmlagilitypack

Here's a quick snippet from StackOverflow on how to extract <img> tags from HTML:

http://stackoverflow.com/questions/790559/how-to-extract-image-urls-from-html-file-in-c/790566#790566

Obviously this is just an idea... I haven't written any code to do this ... and if you're not a .NET developer, then it can seem very very daunting!

I don't think this is something that is required in the Umbraco core, but is specific to your problem (which many others would probably find useful).

Cheers, Lee.

Lee Kelleher 4026 posts 15837 karma points MVP 13x admin c-trib

Mar 01, 2010 @ 06:47

Hi stc,

I couldn't help myself! I've gone with the Regular Expression approach - only to keep all the code self-contained in this snippet... and within the .NET framework. Personally I'd go with Html Agility Pack, but that's too much effort (explaining references, etc) for this code snippet.

namespace Bodenko.Umbraco.ApplicationEvents
{
    using System;
    using System.Text.RegularExpressions;
    using umbraco.BusinessLogic;
    using umbraco.cms.businesslogic;
    using umbraco.cms.businesslogic.property;
    using umbraco.cms.businesslogic.web;

    public class ExtractImageAssignProperty : ApplicationBase
    {
        public ExtractImageAssignProperty()
        {
            Document.BeforePublish += new Document.PublishEventHandler(Document_BeforePublish);
        }

        void Document_BeforePublish(Document sender, PublishEventArgs e)
        {
            try
            {
                // get the article property from the document
                Property bodyText = sender.getProperty("article");

                // check that the property exists
                if (bodyText != null && bodyText.Value != null)
                {
                    // grab the value
                    String html = bodyText.Value.ToString();

                    // set the regular expressions
                    Regex regImages = new Regex(@"<img\s[^>]*>", RegexOptions.IgnoreCase);
                    Regex regSrc = new Regex(@"src=(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))", RegexOptions.IgnoreCase | RegexOptions.Singleline);

                    // get the matches from the regular expressions
                    MatchCollection images = regImages.Matches(html);

                    // if it has any matches, then continue
                    if (images.Count > 0)
                    {
                        // loop through each of the image matches (we can't assume the first one is valid)
                        foreach (Match image in images)
                        {
                            // check if it has a 'src' attribute
                            if (regSrc.IsMatch(image.Groups[0].Value))
                            {
                                // get the 'src' attribute
                                Match src = regSrc.Match(image.Groups[0].Value);

                                // check if the 'src' attribute has a value
                                if (!String.IsNullOrEmpty(src.Groups["src"].Value))
                                {
                                    // grab the value (which should be the image URL)
                                    String url = src.Groups["src"].Value;

                                    // get the image property from the document
                                    Property docImage = sender.getProperty("image");

                                    // check that the property exists
                                    if (docImage != null)
                                    {
                                        // assign the image URL to the document property.
                                        docImage.Value = url;

                                        // since we are only interested in the first image tag,
                                        // break out of the foreach loop
                                        break;
                                    }
                                }
                            }
                        }
                    }
                }
            }
            catch
            {
                // if we catch an exception - we still want the document to be published
                // and we don't want a YSoD - so handle however you prefer here. (i.e. ELMAH or other logging)
            }
        }
    }
}

I haven't tested this in any way - it should work ... but I'd suggest that you test it out on a dev site/server first!!! (that is if you want to try it out? Feel free to say no).

For anyone else who finds this code useful... then WTFPL applies nicely! ;-)

Cheers, Lee.

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Jun 04, 2010 @ 11:53

Lee,

Nice solution. A quick suggestion, which may be overkill but would reduce the size of the image extraction method: you could load the HTML into Html Agility Pack and XPath the image out.

Regards

Ismail
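
A rough sketch of the Html Agility Pack variant suggested above, assuming the project references the third-party HtmlAgilityPack assembly. The class and method names (`BodyTextImagesHap`, `FirstImageSrc`) are hypothetical; only `HtmlDocument.LoadHtml`, `SelectSingleNode` and `GetAttributeValue` come from the library.

```csharp
using System;
using HtmlAgilityPack; // third-party library, assumed to be referenced by the project

// Hypothetical helper; the names are illustrative and not part of Umbraco.
public static class BodyTextImagesHap
{
    // Returns the src of the first <img> element that has a src attribute,
    // or an empty string when there is none.
    public static string FirstImageSrc(string html)
    {
        if (String.IsNullOrEmpty(html))
        {
            return String.Empty;
        }

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        HtmlNode img = doc.DocumentNode.SelectSingleNode("//img[@src]");

        return img == null
            ? String.Empty
            : img.GetAttributeValue("src", String.Empty);
    }
}
```

Compared with the regular expressions, the parser copes with unusual attribute order and quoting, at the cost of an extra assembly reference.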

Lee Kelleher 4026 posts 15837 karma points MVP 13x admin c-trib

Jun 04, 2010 @ 12:02

Hi Ismail, I mention HTML Agility Pack just before the code snippet. ;-)

I used RegEx in the snippet as an example, and to keep it self-contained within the .NET framework! But yes, HTML Agility Pack is awesome for this kind of thing!

Cheers, Lee.

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Jun 04, 2010 @ 16:05

Lee,

Doh, I need to read things properly LOL!
