strip certain html tags

Stefan 117 posts 215 karma points

Nov 23, 2011 @ 10:17

Hi.

I'm in a situation where I would like to strip certain HTML tags from the output of bodyText with XSLT. More precisely, it's the header tags I don't want to show.

This is the compromise I could come up with.

            <p class="description">
              <span class="description">
                <xsl:value-of select="umbraco.library:StripHtml(umbraco.library:TruncateString(bodyText,250,'...'))" />
              </span>
            </p>

An example could be something like this:

<h2>Header</h2><p>Paragraph text</p>

Which should be:

Paragraph text

By the way, I did find this thread, but I wonder if it can be done in an easier way?
http://our.umbraco.org/forum/developers/xslt/10272-Remove-attributes-from-html-tags-in-xslt

Thanks in advance!

Lee Kelleher 4026 posts 15837 karma points MVP 13x admin c-trib

Nov 23, 2011 @ 10:52

Hi Stefan,

If you happen to be using uComponents, then you could try the XML XsltExtension method called Parse(). This will take the HTML from your 'bodyText' property and convert it to XML.

<xsl:variable name="html" select="ucomponents.xml.Parse(bodyText)" />

Then you can use the variable to select the XML (HTML) nodes that you want...

<xsl:value-of select="$html/p" />

Cheers, Lee.

Rodion Novoselov 694 posts 859 karma points

Nov 23, 2011 @ 10:52

Hmmm. My first idea:

#some-content-wrapper h2 {
  display: none;
}

:-)

Stefan 117 posts 215 karma points

Nov 23, 2011 @ 16:15

Thank you for your replies!

Lee, I think uComponents is the way to do it. I have installed the package, but I can't get it to work.
I have registered the extension in xsltExtensions.config as:

<ext assembly="uComponents.Core" type="uComponents.Core.XsltExtensions.Xml" alias="ucomponents.xml" />

and added the following prefix attributes to the xsl:stylesheet element:

xmlns:ucomponents.xml="urn:ucomponents.xml"
ucomponents.xml

The error I'm getting when trying to save my xslt file is:

System.Xml.Xsl.XslLoadException: 'ucomponents.xml.Parse()' is an unknown
XSLT function. An error occurred at C:\Users\Stefan\Documents\My Web 
Sites\CD\xslt\634576611715742106_temp.xslt(15,1).
at System.Xml.Xsl.XslCompiledTransform.LoadInternal(Object stylesheet, XsltSettings settings, XmlResolver stylesheetResolver)
at umbraco.presentation.webservices.codeEditorSave.SaveXslt(String 
fileName, String oldName, String fileContents, Boolean ignoreDebugging)

What am I missing?

Rodion, I didn't even think of using CSS for that - keeping that in mind will prove useful in other situations, but unfortunately it can't be done in this situation :(

Lee Kelleher 4026 posts 15837 karma points MVP 13x admin c-trib

Nov 23, 2011 @ 16:20

Hi Stefan,

Sorry, it was a typo in my example (I was coding by hand) ... it should be:

<xsl:variable name="html" select="ucomponents.xml:Parse(bodyText)" />

(I'd put a period "." instead of a colon ":" - doh!)
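
For context, a minimal sketch of where that call sits in a stylesheet. It assumes the xsltExtensions.config alias and the xmlns declaration quoted earlier in the thread; the exclude-result-prefixes value and the root template are only illustrative, not taken from anyone's actual macro:

```xslt
<xsl:stylesheet
    version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ucomponents.xml="urn:ucomponents.xml"
    exclude-result-prefixes="ucomponents.xml"
>

    <xsl:param name="currentPage" />

    <xsl:template match="/">
        <!-- prefix:Function() - the part before the colon is the namespace prefix declared above -->
        <xsl:variable name="html" select="ucomponents.xml:Parse($currentPage/bodyText)" />
        <xsl:copy-of select="$html" />
    </xsl:template>

</xsl:stylesheet>
```

With a period instead of the colon, the processor sees one long unprefixed function name, which is why it reports an unknown XSLT function.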

Cheers, Lee.

Stefan 117 posts 215 karma points

Nov 23, 2011 @ 16:49

Well, that's what happens when you (= me) copy-paste without paying attention...!

I'm getting some strange errors that I can't interpret.

I have put this textarea right after the beginning of a for-each loop for testing purposes:

<textarea><xsl:copy-of select="ucomponents.xml:Parse(bodyText)" /></textarea>

When bodyText only contains a paragraph with text inside (let's say <p>This is a test</p>),
everything works fine and <p>This is a test</p> shows up in the textarea.

When bodyText contains any other html inside the <p></p> tags, I get an error saying:

<Exception Type="System.Xml.XmlException">
    <Message>There are multiple root elements. Line 4, position 2.</Message>
        <StackTrace>
            <Frame>System.Xml.XmlTextReaderImpl.Throw(Exception e)</Frame>
            <Frame>System.Xml.XmlTextReaderImpl.Throw(String res, String arg)</Frame>
            <Frame>System.Xml.XmlTextReaderImpl.ParseDocumentContent()</Frame>
            <Frame>System.Xml.XmlTextReaderImpl.Read()</Frame>
            <Frame>System.Xml.XPath.XPathDocument.LoadFromReader(XmlReader reader, XmlSpace space)</Frame>
            <Frame>System.Xml.XPath.XPathDocument..ctor(TextReader textReader)</Frame>
            <Frame>uComponents.Core.XsltExtensions.Xml.Parse(String xml)</Frame>
        </StackTrace>
</Exception>

Do you have any clues about what's causing that?

Thanks again!

Stefan 117 posts 215 karma points

Nov 23, 2011 @ 17:02

I have just learned that it's because bodyText contains more than one root element.

Can I overcome this in any way, and still strip all tags other than the paragraphs?

For example, this will fail on line 4, position 2:

<bodyText>
<p>Paragraph text paragraph text paragraph text paragraph text 
 paragraph text paragraph text paragraph text...</p>

<ul class="bullet-rt">
   <li>Test list 1</li>
   <li>Test list 2</li>
</ul>
</bodyText>

Lee Kelleher 4026 posts 15837 karma points MVP 13x admin c-trib

Nov 23, 2011 @ 17:43

Hi Stefan,

Ah yes, it must be valid XML, so would need a single root tag... try this:

<textarea><xsl:copy-of select="ucomponents.xml:Parse(concat('&lt;html&gt;', bodyText, '&lt;/html&gt;'))" /></textarea>

It's a little bit hacky, but I had to encode the angle brackets :-$

Cheers, Lee.

Chriztian Steinmeier 2800 posts 8791 karma points MVP 8x admin c-trib

Nov 23, 2011 @ 17:50

Hi guys,

I'll just chip in with another gotcha you might run into (sorry Lee, I KNOW I should have submitted bugs long ago for these :-)

- bodyText may at some point contain the dreaded &nbsp; non-breaking space, and THAT will wreak havoc again...

I've wrapped up most of this into a nice little include file that I use - it's available as a Gist for now: https://gist.github.com/1171897

/Chriztian

Lee Kelleher 4026 posts 15837 karma points MVP 13x admin c-trib

Nov 23, 2011 @ 17:53

@Chriztian: With the next (major) version of uComponents (v4.x) I'm planning on using HtmlAgilityPack to parse the HTML - that should handle all the quirks much better! In the meantime, any bugs, etc ... CodePlex me! (oooo how rude! LOL)

Lee Kelleher 4026 posts 15837 karma points MVP 13x admin c-trib

Nov 23, 2011 @ 17:56

@Chriztian: Forgot to say - about your gist snippet ... the "EditorContent" entity is very very clever and cool!

Chriztian Steinmeier 2800 posts 8791 karma points MVP 8x admin c-trib

Nov 23, 2011 @ 18:03

Hi Lee,

Now look - I just went and reported TWO issues in the same day (even same hour :-). "How do you like them apples?"

Thanks!

/Chriztian

Lee Kelleher 4026 posts 15837 karma points MVP 13x admin c-trib

Nov 23, 2011 @ 18:45

oooh I like apples!

Stefan 117 posts 215 karma points

Nov 23, 2011 @ 19:27

Thanks again for your replies!

And Chriztian, you were right, the &nbsp; sure wreaked havoc again!

I have included the xslt file, but because of my lack of experience with templates in xslt, I can't figure out how to include it :/

Chriztian Steinmeier 2800 posts 8791 karma points MVP 8x admin c-trib

Nov 23, 2011 @ 19:42

Hi Stefan,

OK - here's a complete sample that should get you going:

<?xml version="1.0" encoding="utf-8" ?>
<xsl:stylesheet
    version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:umb="urn:umbraco.library"
    exclude-result-prefixes="umb"
>

    <xsl:output method="xml" indent="yes" omit-xml-declaration="yes" />

    <xsl:param name="currentPage" />

    <xsl:template match="/">
        <div class="maincontent">
            <xsl:apply-templates select="$currentPage/bodyText" mode="WYSIWYG" />
        </div>
    </xsl:template>

    <!-- Call the cavalry -->
    <xsl:include href="_WYSIWYG.xslt" />

</xsl:stylesheet>

The crucial line is the <xsl:apply-templates select="$currentPage/bodyText" mode="WYSIWYG" /> one, which tells the processor to basically use the entry template in the _WYSIWYG.xslt file (because it also has the mode="WYSIWYG" specified).

From there, you can add templates for specific things, e.g. you wanted to skip the <h2>'s - just add an empty template for them then:

<xsl:template match="h2" /><!-- Sorry, no room for you... -->

/Chriztian

Stefan 117 posts 215 karma points

Nov 23, 2011 @ 23:51

Thank you yet again :-)

Unfortunately I'm still left in the dark with a few questions - which I hope you will answer.

1. What can I do to make the _WYSIWYG.xslt skip every html tag but the paragraphs? Now it will only skip <h2> and include everything else (images, lists etc.).
2. What about stripping paragraph classes?
3. How can I apply the template on bodyText in conjunction with umbraco.library:TruncateString?
4. Before using the xslt include, I tried using uComponents as suggested by Lee.
Can the two solutions be used together (for example to take care of &nbsp; when using the Parse() function from uComponents) to prevent this from failing?

<xsl:variable name="html" select="ucomponents.xml:Parse(concat('&lt;html&gt;', bodyText, '&lt;/html&gt;'))" />
<xsl:value-of select="umbraco.library:TruncateString($html/html/p,500,'...')" />

Sorry for asking all these questions, but templates and xslt extensions are pretty new to me. Hopefully these questions will prove useful for others too!

PS: Learning a lot of useful stuff right now :-)

Stefan 117 posts 215 karma points

Nov 24, 2011 @ 21:13

Anyone?

Chriztian Steinmeier 2800 posts 8791 karma points MVP 8x admin c-trib

Nov 24, 2011 @ 21:44

Hi Stefan,

Thanks for the nudge :-)

Here goes:

1. One way to do this is to replace the Identity Template (match="* | text()") with a new template that basically just bypasses elements and text - then add another one for those elements you *do* want to copy:

<xsl:template match="*">
    <xsl:apply-templates select="*" />
</xsl:template>

<xsl:template match="p | strong">
    <xsl:copy>
        <xsl:apply-templates />
    </xsl:copy>
</xsl:template>

2. Already solved with the above...

3. That's rather tricky - the template with mode="WYSIWYG.excerpt" tries to do a similar thing by selecting only the first paragraph - but it needs tweaking to your particular situation.

4. The _WYSIWYG.xslt already takes care of those two issues (multiple root elements and the &nbsp; thing) if you're executing it like in the <xsl:apply-templates ... mode="WYSIWYG" /> line in my previous answer.
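
To see the two templates from point 1 working in isolation, here is a self-contained sketch that can be run with any XSLT 1.0 processor against already well-formed input. It deliberately leaves out the mode attribute and everything else from _WYSIWYG.xslt, so treat it as an illustration of the technique rather than a drop-in change to that file (inside the include, the templates would presumably need the same mode as the ones around them):

```xslt
<?xml version="1.0" encoding="utf-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" omit-xml-declaration="yes" />

    <!-- Any element that is not whitelisted: drop the tag and its own text,
         but keep looking through its child elements -->
    <xsl:template match="*">
        <xsl:apply-templates select="*" />
    </xsl:template>

    <!-- Whitelisted elements: xsl:copy copies the element itself but not its
         attributes, so class="..." disappears; the text inside is kept by
         the built-in rule for text nodes -->
    <xsl:template match="p | strong">
        <xsl:copy>
            <xsl:apply-templates />
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
```

Given an input such as <html><h2>Header</h2><p class="intro">Paragraph <strong>text</strong></p><ul><li>Item</li></ul></html>, this should leave only the paragraph, without its class attribute. Note that a <strong> nested inside a skipped element (a list item, say) would still be copied, because the first template keeps descending into child elements.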

Let us know how it goes!

/Chriztian

Ashley Andersen 45 posts 88 karma points

Sep 25, 2013 @ 20:35

I know this is an old topic and I apologize. But I am using this for our client's mobile site due to the design. Everything works except the instances where we have macros in the RTE. These are unavoidable due to the clients' design restrictions and desire for control.

Is there a way I can render the RTE content fully before parsing it in my macro?
If not, can I target it to be excluded as well?

Maybe I do not understand the protocol. But currently all pages but those work fine and they are throwing this error:

Unexpected end of file while parsing PI has occurred. Line 6, position 613.
System.Xml.XmlTextReaderImpl.Throw(String res, String arg)
System.Xml.XmlTextReaderImpl.ParsePIValue(Int32& outStartPos, Int32& outEndPos)
System.Xml.XmlTextReaderImpl.ParsePI(StringBuilder piInDtdStringBuilder)
System.Xml.XmlTextReaderImpl.ParseElementContent()
System.Xml.XPath.XPathDocument.LoadFromReader(XmlReader reader, XmlSpace space)
System.Xml.XPath.XPathDocument..ctor(TextReader textReader)
uComponents.XsltExtensions.Xml.ParseXml(String xml, String xpath)

Chriztian Steinmeier 2800 posts 8791 karma points MVP 8x admin c-trib

Sep 25, 2013 @ 21:09

Hi Ashley,

I've had the same problem once in a while and I just dug out one of the "solutions" I've been using - basically, I sacrifice the WYSIWYG handling when there's a macro on the page, which of course is a call you can only make when you know your solution well.

Here goes:

<!-- Let's make a variable for this -->
<xsl:variable name="macroStart" select="'&lt;?UMBRACO_MACRO '" />

<!-- Any macros on the page? -->
<xsl:if test="contains($currentPage/bodyText, $macroStart)">
    <xsl:value-of select="umbraco.library:RenderMacroContent($currentPage/bodyText, $currentPage/@id)" disable-output-escaping="yes" />
</xsl:if>

<!-- Otherwise, handle WYSIWYG content... -->
<xsl:apply-templates select="$currentPage/bodyText[normalize-space()][not(contains(., $macroStart))]" mode="WYSIWYG" />

(Yes, I know about the <xsl:choose> construct — I just try not to use it for simple stuff like this A/B case :-)
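
For anyone who prefers the explicit form, the same A/B decision could be sketched with <xsl:choose> roughly like this. It is a fragment meant to sit inside a template, and it assumes the same things as the snippet above: $currentPage is available, the umbraco.library prefix is declared on the stylesheet, and _WYSIWYG.xslt is included:

```xslt
<xsl:variable name="macroStart" select="'&lt;?UMBRACO_MACRO '" />

<xsl:choose>
    <!-- Macros present: let Umbraco render them and output the result as-is -->
    <xsl:when test="contains($currentPage/bodyText, $macroStart)">
        <xsl:value-of select="umbraco.library:RenderMacroContent($currentPage/bodyText, $currentPage/@id)" disable-output-escaping="yes" />
    </xsl:when>
    <!-- No macros: hand non-empty content to the WYSIWYG templates -->
    <xsl:otherwise>
        <xsl:apply-templates select="$currentPage/bodyText[normalize-space()]" mode="WYSIWYG" />
    </xsl:otherwise>
</xsl:choose>
```

The behaviour should be the same for a page with a single bodyText property; the only difference is that the macro test is written once instead of being repeated in the predicate.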

Hope it helps,

/Chriztian

Ashley Andersen 45 posts 88 karma points

Sep 25, 2013 @ 21:14

I was afraid of that but it makes sense. Thank you!

Chriztian Steinmeier 2800 posts 8791 karma points MVP 8x admin c-trib

Sep 25, 2013 @ 21:42

Come to think of it - it should actually be possible to have the _WYSIWYG.xslt handle this automatically, by detecting the macro(s) and then using RenderMacroContent() first - Hmmmm???!!... (evil laughter ensues :-)

/Chriztian
