Parse body text as XML... (and errors due to non-standard XML entities like &nbsp;)
Umbraco 4.5.2, .NET 3.5, Windows Server 2008, IIS7
I'm trying to get the bodytext into an XML object in a C# class. All is OK unless non-standard XML entities occur (e.g. &nbsp;, which is fine for XHTML but not for XML). So I thought I could set the entity-encoding on TinyMCE to numeric and that would solve the issue.
So it looks OK: in the HTML source of TinyMCE, &nbsp; gets correctly represented as &#160;.
However, when it then comes to the front end, I think HtmlTidy is getting in the way, as that &#160; is back as &nbsp;.
Looking into HtmlTidy, there seem to be a couple of options that I could set to get around this.
output-xml
Type: Boolean
Default: no
Example: y/n, yes/no, t/f, true/false, 1/0
This option specifies if Tidy should pretty print output, writing it as well-formed XML. Any entities not defined in XML 1.0 will be written as numeric entities to allow them to be parsed by a XML parser. The original case of tags and attributes will be preserved, regardless of other options.
quote-nbsp
Type: Boolean
Default: yes
Example: y/n, yes/no, t/f, true/false, 1/0
This option specifies if Tidy should output non-breaking space characters as entities, rather than as the Unicode character value 160 (decimal).
However, I can't see anywhere in the Umbraco configs that allows me to set these. The only option I have is whether HtmlTidy is used or not...
Is there anywhere to specify these more granular settings? I don't really want to reinvent the wheel and implement something like Alain COUTHURES' lightHTMLtoXML, which shouldn't be necessary as both TinyMCE and HtmlTidy can be set to produce XML-compliant markup.
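For illustration only, here is roughly what "entities not defined in XML 1.0 will be written as numeric entities" amounts to. This is a minimal Python sketch rather than anything Umbraco or Tidy ships; the function name `to_numeric_entities` and the sample fragment are made up for the example, and the same idea could be applied to the string in C# before loading it into an XML document.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities predefined by XML 1.0; a strict parser already knows these.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def to_numeric_entities(markup):
    """Rewrite named HTML entities (e.g. &nbsp;) as numeric ones (e.g. &#160;)."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave predefined or unknown names untouched
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, markup)


# Hypothetical sample fragment, not taken from the real page content.
fragment = "<p>Shareholders'&nbsp;Rights &amp; more &copy; 2009</p>"
converted = to_numeric_entities(fragment)
print(converted)

# The converted fragment is now acceptable to a strict XML parser.
parsed = ET.fromstring(converted)
```

Note that this rewrites the text blindly, so it is only a stopgap for the case where the Tidy settings themselves cannot be changed.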
I'm not really sure what your problem is, but I think you might have to look at the place where you put the text into your XML document in C#.
When you put text into an XML document, it's generally a good idea to wrap it in a CDATA section, which should handle any "illegal" characters without failing.
And this should only be a challenge if the text is actually XML and should be available as XML in XSLT.
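To make the CDATA point concrete, here is a small Python sketch (the sample document is invented for the example). It shows that a CDATA section lets the document parse even with &nbsp; inside, but the wrapped markup then arrives as plain text rather than as child elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: body markup wrapped in CDATA, as in umbraco.config.
doc = "<content><![CDATA[<p>First&nbsp;paragraph</p>]]></content>"
root = ET.fromstring(doc)

# The CDATA section protects the markup, so parsing succeeds...
cdata_text = root.text
# ...but the <p> is character data only, so <content> has no child elements.
child_count = len(root)

print(repr(cdata_text), child_count)
```

So CDATA avoids the parse error, but anything that needs to walk the markup as a DOM still has to parse that text a second time.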
Steen, thanks for the response.
I must not have explained myself too well. It's not a problem with rendering the page. What I want to do is take the content that was entered into the TinyMCE area and parse it as XML, so that I can traverse the DOM in the XSLT to pull out things like the first image or the first paragraph.
So the node for the page is something like:
<MasterPage id="1087" parentID="1086" level="4" writerID="0" creatorID="0" nodeType="1042" template="1043" sortOrder="0" createDate="2010-08-23T14:26:48" updateDate="2010-09-02T00:06:43" nodeName="File1" urlName="file1" writerName="Administrator" creatorName="Administrator" path="-1,1047,1085,1086,1087" isDoc="">
<abstract><![CDATA[
<p>Content to Apear in Lists for example</p>
]]></abstract>
<pdfOfPage>1094</pdfOfPage>
<hideInNavigation>0</hideInNavigation>
<title>Briefing Note - Shareholders' Rights</title>
<description><![CDATA[]]></description>
<keywords><![CDATA[]]></keywords>
<titlebarText />
<altTitle />
<showInFooter>0</showInFooter>
<content><![CDATA[
<p><a href="/media/1010/testdoc.pdf" target="_blank"
title="PDF: view briefing note on Shareholders' Rights (opens in a new window)">
PDF: view briefing note on Shareholders' Rights (opens in a new
window)</a></p>
<p>Download our briefing note detailing Companies (Shareholders'
Rights) Regulations that came into force on 3 August 2009.</p>
<p><img src="/media/1562/civil_construc_fibs2.jpg" width="142" height="111" alt="Civil Construction fibs"/><img src="/media/306/home_498x310.jpg" width="498" height="310" alt="HomePageFlashAlternative"/></p>
]]></content>
If you try to parse the content node into XML it errors, as the stricter XML DOM says &nbsp; is not a known entity. (XHTML allows it, so there are no XHTML validation issues on the front end.)
But actually I have already altered TinyMCE so that in the admin the content has no &nbsp;; it would be &#160;, which is valid XML. So there seems to be something going on between what is in the database and what ends up in the umbraco.config XML structure. I think it is HtmlTidy, whose default is to change &#160; to &nbsp;, so like I say I want to change the setting on Tidy via either the output-xml or quote-nbsp options.
Hopefully that explains what I'm after more verbosely....
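As a rough sketch of what I mean, in Python purely for illustration: the `body` string is a cut-down version of the content above with an &nbsp; added by hand to trigger the problem, and `parse_body` and the `<root>` wrapper are names invented for the example.

```python
import xml.etree.ElementTree as ET

# Cut-down body text; the &nbsp; is added here to reproduce the error.
body = (
    '<p><a href="/media/1010/testdoc.pdf">PDF:&nbsp;view briefing note</a></p>'
    '<p><img src="/media/1562/civil_construc_fibs2.jpg" alt="Civil Construction fibs"/>'
    '<img src="/media/306/home_498x310.jpg" alt="HomePageFlashAlternative"/></p>'
)

# A strict XML parser rejects the raw body because &nbsp; is undefined in XML.
try:
    ET.fromstring("<root>" + body + "</root>")
    raw_parses = True
except ET.ParseError:
    raw_parses = False


def parse_body(markup):
    """Swap &nbsp; for its numeric form and wrap the fragment in a single root."""
    fixed = markup.replace("&nbsp;", "&#160;")
    return ET.fromstring("<root>" + fixed + "</root>")


root = parse_body(body)
first_paragraph = root.find("p")   # first <p> directly under the wrapper
first_image = root.find(".//img")  # first <img> anywhere in the body
first_text = "".join(first_paragraph.itertext())

print(raw_parses, first_image.get("src"), repr(first_text))
```

Once the entity is numeric, the first image and first paragraph come out with ordinary path lookups, which is all I'm after.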