Razor with unnecessary HTML entities
Razor outputs all text with HTML encoding, unless Html.Raw is used. For example:
<h1>@Model.PageTitle</h1>
This will automatically HTML-encode the contents of PageTitle. That is fine; however, Razor also encodes a number of special characters, such as German umlauts, as HTML numeric entities, which is unnecessary, as these can be represented perfectly well in UTF-8. The only characters that really need HTML entities are usually <, > and &.
Is there a way to configure this behavior?
This is, as they say, by design. When Razor renders strings, it automatically HTML-encodes them; it is a security measure.
I would advise doing this rather than using Html.Raw:
@(new HtmlString(stringWithMarkup))
I don't think it is "by design", and it is also not a safety feature, as any and all characters outside the ASCII range get encoded to a numeric entity without checking whether the page encoding could carry the character unencoded.
Imagine you're publishing Chinese text: encoding every single character as an entity would waste a good amount of space and bandwidth, while browsers could easily read it as UTF-8 or UTF-16.
So, is there a way to influence the behavior, or do I actually have to provide my own encoding, and thus completely defeat the purpose of this feature?
It is by design: Razor encodes strings automatically; speak to Microsoft if you disagree. However, Razor should not break umlauts or other special characters. I have a site in MVC Razor that quite happily displays Arabic.
Just a guess, but maybe the issue is that Umbraco should not be encoding pageTitle, as you are basically encoding the output twice.
It's only a single conversion; & is displayed fine, with a single entity encoding. It's just encoding as if the output were ASCII-only and every character outside the ASCII code page had to be encoded.
Razor encoding strings in general is fine; the problem is just that it encodes characters that don't need any encoding at all.
I can't say I have had this issue before. Do you have an example of some characters that are rendered incorrectly? I will do some tests in my other app to see whether it behaves the same.
Well, all the German äöü ÄÖÜ ß characters, but also stuff like ®. The only characters that actually require encoding would be <, > and &.
I've never had issues with German characters, but yes, you could have issues with copyright, trademark etc., because Razor will encode the & which is part of the &copy; if they are entered that way.
Why would & be part of the trademark symbol? And if you were to enter &copy;, that would put the actual text &copy; into the output, as the & gets encoded as &amp;, i.e. the markup becomes &amp;copy;. Not sure you understand how encoding works. Anyway, I'm still looking for a way to configure Razor encoding specifics.
I understand perfectly how encoding works, no need to be insulting.
I said that if you entered &trade; in a string, Razor would encode the & and display &trade; rather than ™.
Razor should not be encoding these: äöü ÄÖÜ ß. Are you sure it is Razor doing the encoding? What does the raw pageTitle string look like in the database?
I created a plain Razor app in VS to test it: I made a string property in the code-behind that returns special characters and put it into the page with @Model.Test. And yes, it also gets encoded to HTML entities, although hexadecimal numeric ones rather than decimal, probably because the app is .NET Core and not .NET Framework.
I'm trying to investigate. The compiled version of the page calls RazorPageBase.Write to output the string, and that uses an instance of System.Text.Encodings.Web.HtmlEncoder to encode the string. Will investigate further on the call chain.
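Code behind:
public string Test
{
    get { return "Dies ist ein Test ÄÖÜ äöü ß"; }
}
Output:
<h1>Dies ist ein Test &#xC4;&#xD6;&#xDC; &#xE4;&#xF6;&#xFC; &#xDF;</h1>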
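The behaviour described above can be reproduced outside Razor with a small console program. This is only a sketch: it assumes the page's encoder behaves like HtmlEncoder.Default, which may not hold if the app registers a different instance, and the EncoderDemo class and EncodeDefault helper are hypothetical names used for illustration.

```csharp
using System;
using System.Text.Encodings.Web;

public static class EncoderDemo
{
    // Hypothetical helper: encodes text with the framework's default HTML encoder.
    public static string EncodeDefault(string text)
    {
        return HtmlEncoder.Default.Encode(text);
    }

    public static void Main()
    {
        // Non-ASCII letters are expected to come out as hexadecimal numeric entities.
        Console.WriteLine(EncodeDefault("Dies ist ein Test ÄÖÜ äöü ß"));

        // Markup-significant characters are encoded as named entities.
        Console.WriteLine(EncodeDefault("<b>Tom & Jerry</b>"));
    }
}
```

Running this in a .NET Core console project should make it easy to check whether the entities come from the encoder itself rather than from Umbraco or the database.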
Weird, I just tried the same as you and got a completely different result: no encoding.
In your Razor, did you just do @Model.Test? If you do @Html.Raw(Model.Test), what does that output?
So, I dug a bit deeper, and the API, at least in .NET Core is:
That removes the entity encoding for special characters, while <, > and & still get encoded.
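A minimal sketch of how such a setting is typically registered in ASP.NET Core, assuming a conventional Startup class with a ConfigureServices method. The class layout and the choice of UnicodeRanges.All are assumptions for illustration, not taken from the post above; a narrower set of ranges may be more appropriate for a given site.

```csharp
using System.Text.Encodings.Web;
using System.Text.Unicode;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.WebEncoders;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddRazorPages();

        // Assumption: allow all Unicode ranges to pass through unencoded.
        // Characters such as <, > and & remain encoded regardless of this setting.
        services.Configure<WebEncoderOptions>(options =>
        {
            options.TextEncoderSettings = new TextEncoderSettings(UnicodeRanges.All);
        });
    }
}
```

The encoders that Razor resolves from the service container should then be created with these settings, so no per-view changes are needed.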
Now I only need to find out how to inject this into Umbraco without recompiling. I miss the old days when configuration was simply done through an XML file.
Here is a relevant ticket on GitHub: https://github.com/aspnet/HttpAbstractions/issues/315
Dug deeper:
For .NET Framework, it's quite different. The Razor page calls System.Web.WebPages.WebPageBase.Write, which calls System.Web.WebPages.WebPageExecutingBase.WriteTo, which calls System.Web.HttpUtility.HtmlEncode to encode the string, which in turn calls HttpEncoder.Current.HtmlEncode.
The HttpEncoder.Current is initialized by
<httpRuntime encoderType="" />
in web.config, and if not specified, it uses System.Web.Security.AntiXss.AntiXssEncoder with newer versions of .NET Framework, which encodes a lot more characters than necessary.
But even the default HttpEncoder encodes non-ASCII characters as entities, by calling System.Net.WebUtility.HtmlEncode:
ASCII characters from 160 to 255: &#NNN;, where NNN is the three-digit decimal character code
https://docs.microsoft.com/en-us/dotnet/api/system.web.util.httpencoder.htmlencode?view=netframework-4.8
So this is all pretty ugly: basically, you have to provide your own HttpEncoder implementation that overrides HtmlEncode and HtmlAttributeEncode. It's possible to put it into App_Code and reference it from web.config without a strong assembly name. I won't post code here, as doing the encoding yourself instead of having AntiXssEncoder do it probably weakens security, and the whole API is going to change anyway when Umbraco moves to .NET Core some time in the future.
Hi Alexander Gräf,
I don't think there is much we can do about this issue in the Umbraco forum. Also, I guess encoding the characters in the way that you mention would be from the XHTML days, when HTML had to be XML compliant :)
2021 looks to be the year Umbraco moves to .NET Core - https://umbraco.com/blog/status-of-migration-to-net-core-december-2020/. This would open the way to using some of the first tweaks you mention.
I think that we will also soon see a version on .NET 5, depending on how much needs to be changed.
HTH :)
I don't see how that has anything to do with XML, as it has the same encoding rules: <, > and & have to be encoded, and everything else depends on the charset of the file, which is usually UTF-8 and can accommodate most characters without encoding.
As per my last comment, one can override the HttpEncoder, which I am already doing, and thus fix the problem.
When Umbraco moves to .NET Core, the IServiceCollection API is used instead, which means one will not be able to change encoding settings without recompiling Umbraco. Umbraco might choose to allow configuring the encoding via a different route, but I doubt it's a big priority right now. I mean it's really only going to be a problem with Cyrillic, Arabic and Asian pages, where it causes a lot of bloat, and that seems not to be an important market for Umbraco.
So, I dug even deeper. Turns out that the .NET Framework even has a compiler macro called
ENTITY_ENCODE_HIGH_ASCII_CHARS
that influences the encoding.
Anyway, it turns out that, despite what the documentation says, AntiXssEncoder is NOT the default encoder, even with newer versions of .NET Framework, and it actually mitigates the situation by not encoding high ASCII characters to numeric entities.
Explicitly setting AntiXssEncoder as the encoderType in web.config will therefore fix the issue without having to provide a custom HttpEncoder implementation, although I am not yet sure whether AntiXssEncoder will interfere with the Umbraco backend in some way.
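For reference, a sketch of what selecting AntiXssEncoder in web.config usually looks like. It assumes the standard assembly-qualified name for System.Web 4.x; the version and public key token should be checked against the framework actually in use, and any existing attributes on httpRuntime need to be kept.

```xml
<configuration>
  <system.web>
    <!-- Assumption: System.Web 4.x; keep any other httpRuntime attributes already present -->
    <httpRuntime encoderType="System.Web.Security.AntiXss.AntiXssEncoder, System.Web, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a" />
  </system.web>
</configuration>
```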