using System.Text.RegularExpressions;

// XSLT extension method: strips Word-style tags and presentational attributes from an HTML string.
public static string CleanHtml(string html)
{
    // Start by completely removing all unwanted tags (the text between them is kept).
    html = Regex.Replace(html, @"<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>", "", RegexOptions.IgnoreCase);
    // Then strip unwanted attributes. Each pass removes one attribute per tag, so run it twice.
    // The \s+ before the attribute name also consumes the separating space, so the result is <p> rather than <p >.
    string attributePattern = @"<([^>]*)\s+(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>";
    html = Regex.Replace(html, attributePattern, "<$1$2>", RegexOptions.IgnoreCase);
    html = Regex.Replace(html, attributePattern, "<$1$2>", RegexOptions.IgnoreCase);
    return html;
}
Remove attributes from html-tags in xslt
Hello,
I'm trying to strip all attributes (specifically 'style') from the HTML tags in my XSLT for my RSS feed. I want to keep the HTML tags themselves (<p>, <strong>, etc.), so umbraco.library.StripHtml won't do it.
i.e. I want "<p style='margin:10px'>some text</p>" to become "<p>some text</p>". How can I achieve this?
Thanks.
Sledger,
Just found this on Google. It's not tested, but you need something like:
<xsl:template match="p">
  <p>
    <xsl:for-each select="@*">
      <!-- intentionally empty: no attribute is written out -->
    </xsl:for-each>
    <xsl:value-of select="./text()"/>
  </p>
</xsl:template>
That loops through all the attributes, and because we don't write anything out inside the for-each, they get ignored.
Regards
Ismail
Thanks for the reply. But the string I want to format is from the bodyText-field, like this:
...
<content:encoded>
<xsl:value-of select="concat('&lt;![CDATA[ ', ./data [@alias='bodyText'], ']]&gt;')" disable-output-escaping="yes"/>
</content:encoded>
...
Is it possible to apply your method to this as well?
Sledger,
Not sure if this will work, but could you do something like
<xsl:copy-of select="./data [@alias='bodyText']"/>
and then do what you need to do? Again, not tested, just an idea.
Regards
Ismail
Hm, I will give that a try, but I need it to remove attributes from all HTML tags and not just <p> tags.
Hi Sledger,
Not sure that you're going to be able to do this purely with XSLT. (It might be possible, but I reckon you'll burn hours trying to achieve it!)
My suggestion is to write an XSLT extension to perform a RegEx against the bodyText, removing specific attributes.
i.e.
Reference to source: http://tim.mackey.ie/CleanWordHTMLUsingRegularExpressions.aspx
Good luck, Lee.
I've never used it but isn't http://htmlagilitypack.codeplex.com/ ideal for this type of thing?
Rich
Yes, HTML Agility Pack is excellent for navigating, traversing and manipulating HTML as a DOM (and more). You could remove all attributes with it, but it's an extra dependency when a quick-n-dirty RegEx could take care of it, since RegEx is already in the .NET framework.
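For comparison, here is a rough sketch of what the HTML Agility Pack route might look like. It assumes the HtmlAgilityPack assembly is referenced in the project; the class name RssHtml and the method name StripAttributes are made up for illustration and are not part of Umbraco or of the library.

```csharp
using HtmlAgilityPack; // third-party dependency, not part of the .NET framework

// Hypothetical helper class; the name is only for illustration.
public class RssHtml
{
    // Removes every attribute from every element, keeping the tags and text.
    public static string StripAttributes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants() walks all nodes below the document node.
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            if (node.HasAttributes)
            {
                node.Attributes.RemoveAll();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Unlike the RegEx version, this drops all attributes rather than a fixed list, which is closer to the original question but would also remove things like href and src, so it may need a whitelist for a real feed.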
Lee,
Would my suggestion not work? Or is it that bodyText, unless it's in CDATA, will have entities etc. that will cause it to go boom? PS: I remembered that idea with @* from a Tridion project where we had to clean out some Word crap.
Regards
Ismail
Guys,
Thinking about this some more, are we not overcomplicating things? We could just update the TinyMCE config file so that the only attribute allowed on p elements is class. True, you would have issues with updates because you would end up overwriting the TinyMCE config, but in theory that should sort it?
Regards
Ismail
The XSLT extension seems to me to be the best solution (I really need to start an Umbraco video subscription so I can see the end of Niels's video ;).
Ismail: Thanks for your help, but modifying the RTE is not a solution for me because I need the style attributes for the web presentation; the stripped version is only for the RSS.
Thanks to all.
@Ismail: I've had a quick test of trying to parse the 'bodyText' string as XML, but kept hitting various entity-encoding errors. I'm sure there is a way to do it, but I keep getting cross-eyed with the entities. I recall an old forum post about trying to achieve the same thing, and whoever it was ended up using an XSLT extension to convert the content/string to an XPathNodeIterator. (If I find the topic, I'll post here).
Cheers, Lee.
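The string-to-XPathNodeIterator idea could be sketched roughly like this. It is a guess at the approach, not the code from the old forum post: the class name RssXml and method name ParseFragment are hypothetical, and it assumes the bodyText is otherwise well-formed XHTML with &nbsp; as the only named HTML entity.

```csharp
using System.Xml;
using System.Xml.XPath;

// Hypothetical XSLT extension class; the name is only for illustration.
public class RssXml
{
    // Parses a well-formed XHTML fragment and returns it as a node-set.
    public static XPathNodeIterator ParseFragment(string xhtml)
    {
        // Assumption: &nbsp; is the only named HTML entity in the content.
        // Plain XML does not declare it, so swap it for its numeric form.
        string safe = xhtml.Replace("&nbsp;", "&#160;");

        var doc = new XmlDocument();
        // A fragment may have several top-level elements; XML needs one root.
        doc.LoadXml("<root>" + safe + "</root>");

        return doc.CreateNavigator().Select("/root");
    }
}
```

Once the content is a real node-set, the @* idea from earlier in the thread becomes workable: templates can copy each element and simply not copy its attributes. Any other named entity (or unclosed tag) in the bodyText would still make LoadXml throw, which is the entity problem described above.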