Parse body text as XML, and errors due to non-standard XML entities like &nbsp; - XSLT

Mike Chambers 636 posts 1253 karma points c-trib

Sep 02, 2010 @ 00:58

Parse body text as XML... (and errors due to non-standard XML entities like &nbsp;)

Umbraco 4.5.2, .net 3.5, windows 2008 server, iis7

I'm trying to load the bodytext into an XML object in a C# class. All is OK unless non-standard XML entities occur (e.g. &nbsp;, which is fine for XHTML but not for XML). So I thought I could set the entity-encoding on TinyMCE to numeric and that would solve the issue.

It looks OK in the HTML source view of TinyMCE: &nbsp; gets correctly represented as &#160;

However, when it then comes to the front end, I think HtmlTidy is getting in the way, as that &#160; is back as &nbsp;

Looking into HtmlTidy, there seem to be a couple of options that I could set to get around this.

output-xml
Type: Boolean Default: no Example: y/n, yes/no, t/f, true/false, 1/0
This option specifies if Tidy should pretty print output, writing it as well-formed XML. Any entities not defined in XML 1.0 will be written as numeric entities to allow them to be parsed by a XML parser. The original case of tags and attributes will be preserved, regardless of other options.

quote-nbsp
Type: Boolean Default: yes Example: y/n, yes/no, t/f, true/false, 1/0
This option specifies if Tidy should output non-breaking space characters as entities, rather than as the Unicode character value 160 (decimal).

However, I can't see anywhere in the Umbraco configs that lets me set these. The only option I have is whether HtmlTidy is used or not...

Is there anywhere to specify these more granular settings? I don't really want to reinvent the wheel and implement something like Alain COUTHURES' lightHTMLtoXML, which shouldn't be necessary since both TinyMCE and HtmlTidy can be set to produce XML-compliant markup.
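
For reference, a minimal sketch of the fallback I'd like to avoid: rewriting named entities as numeric ones in C# just before parsing. The class name `BodyTextXml`, the wrapper element `root` and the short entity table are assumptions for illustration only; nothing here is an Umbraco or HtmlTidy API.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;

public static class BodyTextXml
{
    // The five entities XML 1.0 predefines; these must be left untouched.
    private static readonly HashSet<string> XmlPredefined =
        new HashSet<string> { "amp", "lt", "gt", "quot", "apos" };

    // Hypothetical, deliberately small map of HTML-only entities to code points.
    // Extend it to cover whatever the editor actually emits.
    private static readonly Dictionary<string, int> HtmlEntities =
        new Dictionary<string, int>
        {
            { "nbsp", 160 }, { "copy", 169 }, { "reg", 174 },
            { "pound", 163 }, { "ndash", 8211 }, { "mdash", 8212 }
        };

    // Rewrites named entities such as &nbsp; as numeric references such as &#160;.
    public static string ToNumericEntities(string html)
    {
        return Regex.Replace(html, @"&([A-Za-z][A-Za-z0-9]*);", delegate(Match m)
        {
            string name = m.Groups[1].Value;
            int codePoint;
            if (!XmlPredefined.Contains(name) && HtmlEntities.TryGetValue(name, out codePoint))
                return "&#" + codePoint + ";";
            return m.Value; // predefined or unknown entity: leave as-is
        });
    }

    // Wraps the fragment in a single root element so that body text with
    // several top-level nodes still loads as one well-formed document.
    public static XmlDocument Parse(string bodyText)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<root>" + ToNumericEntities(bodyText) + "</root>");
        return doc;
    }
}
```

Calling `BodyTextXml.Parse(bodyText)` on the raw property value would then give a DOM to traverse, but any entity missing from the table would still make `LoadXml` throw, which is why I'd rather have Tidy emit numeric entities in the first place.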

Steen Tøttrup 191 posts 291 karma points c-trib

Sep 02, 2010 @ 06:52
I'm not really sure what your problem is, but I think you might have to look at the place where you put the text into your XML document in C#.

When you put text into an XML document, it's generally a good idea to put it into a CDATA section; that should handle any "illegal" characters without failing.
```
  // Wrap the raw text in a CDATA section so the parser treats it as literal character data
  node.AppendChild(xml.CreateCDataSection(textWithIllegalChars));
```
And this should only be a challenge if the text is actually xml and should be available as xml in xslt.
Mike Chambers 636 posts 1253 karma points c-trib

Sep 02, 2010 @ 10:21

Steen, thanks for the response.

I must not have explained myself too well. It's not a problem with rendering the page. What I want to do is take the content that was entered into the TinyMCE area and parse it as XML, so that in the XSLT I can traverse the DOM to pull out things like the first image or the first paragraph.

So the node for the page is something like:

```
<MasterPage id="1087" parentID="1086" level="4" writerID="0" creatorID="0" nodeType="1042" template="1043" sortOrder="0" createDate="2010-08-23T14:26:48" updateDate="2010-09-02T00:06:43" nodeName="File1" urlName="file1" writerName="Administrator" creatorName="Administrator" path="-1,1047,1085,1086,1087" isDoc="">
 <abstract><![CDATA[
Content to Apear in Lists for example
]]></abstract>
 <pdfOfPage>1094</pdfOfPage>
 <hideInNavigation>0</hideInNavigation>
 <title>Briefing Note - Shareholders' Rights</title>
 <description><![CDATA[]]></description>
 <keywords><![CDATA[]]></keywords>
 <titlebarText />
 <altTitle />
 <showInFooter>0</showInFooter>
 <content><![CDATA[
<a href="/media/1010/testdoc.pdf" target="_blank"
title="PDF:  view briefing note on Shareholders' Rights (opens in a new window)">
PDF: view briefing note on Shareholders' Rights (opens in a new
window)</a>

Download our briefing note detailing Companies (Shareholders'
Rights) Regulations that came into force on 3 August 2009.

<img src="/media/1562/civil_construc_fibs2.jpg" width="142" height="111" alt="Civil Construction fibs"/><img src="/media/306/home_498x310.jpg" width="498" height="310" alt="HomePageFlashAlternative"/>
]]></content>
```

If you try to parse the content node as XML, it errors because the stricter XML DOM says &nbsp; is not a known entity. (XHTML allows it, so there are no XHTML validation issues on the front end.)

But actually I have already altered TinyMCE so that in the admin the content has no &nbsp;; it is &#160;, which is valid XML. So something seems to be going on between what is in the database and what ends up in the umbraco.config XML structure. I think it is HtmlTidy, whose default is to change &#160; back to &nbsp;, so as I say, I want to change the setting on Tidy via either the output-xml or quote-nbsp options.

Hopefully that explains what I'm after more verbosely....
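
To make the goal concrete, here is a rough sketch of the kind of C# helper I have in mind. It is an illustration under assumptions, not working Umbraco code: the class `BodyTextDom`, its method names and the wrapper element `root` are all made up. It declares &nbsp; in an internal DTD subset so the parser can resolve it, wraps the fragment in a single root element, and pulls out the first image with XPath.

```csharp
using System;
using System.IO;
using System.Xml;

public static class BodyTextDom
{
    // Internal DTD subset that declares &nbsp; so the parser can resolve it.
    // "root" is a hypothetical wrapper element name.
    private const string Prolog = "<!DOCTYPE root [<!ENTITY nbsp \"&#160;\">]>";

    public static XmlDocument Load(string bodyText)
    {
        XmlReaderSettings settings = new XmlReaderSettings();
        // .NET 4 and later; on .NET 3.5 the equivalent is settings.ProhibitDtd = false.
        settings.DtdProcessing = DtdProcessing.Parse;
        settings.XmlResolver = null; // nothing external needs resolving

        XmlDocument doc = new XmlDocument();
        using (StringReader text = new StringReader(Prolog + "<root>" + bodyText + "</root>"))
        using (XmlReader reader = XmlReader.Create(text, settings))
        {
            doc.Load(reader);
        }
        return doc;
    }

    // Returns the src attribute of the first <img>, or null if there is none.
    public static string FirstImageSrc(XmlDocument doc)
    {
        XmlElement img = doc.SelectSingleNode("/root//img") as XmlElement;
        return img == null ? null : img.GetAttribute("src");
    }
}
```

Only &nbsp; is declared here, so any other HTML-only entity would still fail to parse; that is another reason I would prefer Tidy to write numeric entities itself.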
