Profanity Filter for Umbraco
A long time ago (Sept 2008) I posted on the old forum about a Profanity Filter for Umbraco. I ended up writing one. It was quite quick-n-dirty (sic) but it did the job.
Been thinking that I should package up the code and release it on Our Umbraco - which I'll do soon.
So, my question is: at present the bad words are all hard-coded (in English); obviously this needs to be i18n/L10n-ized and customisable. Where should this be done? Via a .config file, or a custom section (appTree) in the back-office admin?
Also should I be considering that it might be used on multi-lingual sites? (i.e. would applying the English profanity filter on German content cause any undesired effects?)
Any suggestions?
Thanks, Lee.
F***ing good idea, just be sure not to make the clbuttic mistake when making it ;-)
As for configuration, I'd be happy using a config file for this kind of thing.
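To make the "clbuttic" warning concrete, here is a minimal sketch (Python purely for illustration, not the package's actual code) comparing a naive substring replace with a whole-word, case-insensitive match. The helper names `build_filter` and `censor` are made up for this example.

```python
import re

def build_filter(bad_words):
    """Compile one case-insensitive pattern that matches whole words only."""
    # Longest first, so a phrase wins over a shorter word it starts with.
    ordered = sorted(bad_words, key=len, reverse=True)
    alternatives = "|".join(re.escape(word) for word in ordered)
    # (?<!\w) and (?!\w) demand a non-word character (or the string edge)
    # on both sides, so "ass" is never matched inside "classic".
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)

def censor(text, pattern, replacement="***"):
    """Replace every whole-word match with the replacement string."""
    return pattern.sub(replacement, text)

# The naive approach rewrites innocent words too.
naive = "a classic assessment".replace("ass", "butt")

# The whole-word pattern leaves them alone.
pattern = build_filter(["ass"])
careful = censor("a classic assessment of an ass", pattern)

print(naive)    # a clbuttic buttessment
print(careful)  # a classic assessment of an ***
```

Word boundaries are only a first line of defence; deliberate obfuscation (spacing, digits for letters) would need more than this.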
I'd go for a .config file. It could always be edited via the config file package.
And, unless there's a big performance penalty for lots of words in the filter, I think a person could just have one file with words for all langs in it. One word/phrase per line, case-insensitive:
blah
yada
howdy
good oh
good-o
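A rough sketch of how such a flat file might be read, assuming one entry per line and a handful of harmless placeholder words. Python is used only for illustration, and `load_words`, `compile_words` and `find_matches` are hypothetical names. Compiling all entries into a single pattern once is one way to keep the cost of a long list down, although that has not been measured here.

```python
import re

# Stand-in for the contents of the word-list file.
WORD_LIST = """\
blah
yada
howdy
good oh
good-o
"""

def load_words(text):
    """Return the non-empty lines, trimmed and lower-cased, as a set."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}

def compile_words(words):
    """Build one whole-word, case-insensitive pattern for all entries."""
    ordered = sorted(words, key=len, reverse=True)
    body = "|".join(re.escape(word) for word in ordered)
    return re.compile(r"(?<!\w)(?:" + body + r")(?!\w)", re.IGNORECASE)

def find_matches(content, pattern):
    """List the listed words/phrases found in content, lower-cased, in order."""
    return [match.group(0).lower() for match in pattern.finditer(content)]

words = load_words(WORD_LIST)
pattern = compile_words(words)
print(find_matches("Howdy, good-o then", pattern))
```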
If you really wanted to be thorough you could have a config file that allows for the culture as well, and you'd use the appropriate culture based on the page being served. And a section that holds words for all cultures, to save duplication if there is any. Something like:
Or for those who just have to have full XML:
cheers,
doug.
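The culture-aware idea in the post above could be sketched like this. It is only an illustration in Python: the table layout, the `"all"` key, the sample words and the `words_for` helper are assumptions, not the markup Doug proposed. The page's culture picks up the shared section, its neutral language, and its specific culture, so English entries are never applied to a German page.

```python
# Hypothetical layout: a shared "all" section plus one section per culture.
WORDS_BY_CULTURE = {
    "all": {"blah"},
    "en": {"howdy"},
    "en-GB": {"good oh"},
    "de": {"quatsch"},
}

def words_for(culture, table=WORDS_BY_CULTURE):
    """Collect shared words plus those for the culture and its parent language."""
    selected = set(table.get("all", ()))
    parts = culture.split("-")
    # "en-GB" also picks up the neutral "en" section.
    for i in range(1, len(parts) + 1):
        selected |= table.get("-".join(parts[:i]), set())
    return selected

print(sorted(words_for("en-GB")))  # ['blah', 'good oh', 'howdy']
print(sorted(words_for("de-DE")))  # ['blah', 'quatsch']
```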
I love that even Doug's swear words are no worse than Howdy ;-)
Just because I'm a pedant - if you go the XML route (which I'd also vote for) please, please, please use the xml:lang attribute for designating the culture, as in:
XPath has a companion function, lang(), which selects nodes based on their language, e.g. to select all the English (whether UK- or US-variant) Word elements:
or to grab only the US-variant:
Anyway, you get the idea...
/Chriztian
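For anyone wanting to try the xml:lang idea outside XSLT, here is a small sketch using Python's standard library. The `Words`/`Word` element names and the sample entries are assumptions for illustration. ElementTree's XPath subset has no lang() function, so `lang_matches` below imitates its rule by hand: a node matches when its language equals the requested one or is a sub-language of it, ignoring case, with xml:lang inherited from ancestors.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """\
<Words>
  <Word xml:lang="en">howdy</Word>
  <Word xml:lang="en-GB">good oh</Word>
  <Word xml:lang="en-US">yada</Word>
  <Word xml:lang="de">quatsch</Word>
</Words>
"""

def lang_matches(node_lang, wanted):
    """Imitate XPath lang(): equal to, or a sub-language of, `wanted`."""
    node_lang, wanted = node_lang.lower(), wanted.lower()
    return node_lang == wanted or node_lang.startswith(wanted + "-")

def select_words(root, wanted):
    """Return the text of every Word element whose language matches."""
    found = []

    def walk(element, inherited):
        # Use the element's own xml:lang, else the nearest ancestor's.
        current = element.get(XML_LANG, inherited)
        if element.tag == "Word" and current and lang_matches(current, wanted):
            found.append((element.text or "").strip())
        for child in element:
            walk(child, current)

    walk(root, None)
    return found

root = ET.fromstring(SAMPLE)
print(select_words(root, "en"))     # ['howdy', 'good oh', 'yada']
print(select_words(root, "en-US"))  # ['yada']
```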
Thanks for the responses guys... I've gone with Doug's suggestion of the XML config (with Chriztian's xml:lang attribute suggestion, although I doubt the bad-words will ever be accessible via XSLT).
I haven't gone for the "default" set of stop-words... (maybe in a future version?) The words are newline/tab delimited - I find it easier to read (and parse in code) ... otherwise there's too much XML (IMHO).
Next question.... what default words should I release it with?
I have a long list of en-GB bad words (which I won't publish here - too rude!) ... anyone know of a good resource for bad-words in other languages? e.g. Dutch, German, French, etc.
Thanks, Lee.
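A guess at what reading that kind of config could look like: one element per language carrying xml:lang, with the words inside separated by newlines or tabs. The `ProfanityFilter` and `Words` element names and the `parse_config` helper are hypothetical (the released package's real schema is not shown in this thread), and Python is used only for illustration.

```python
import re
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical config; "\t" below is a real tab between two words.
CONFIG = """\
<ProfanityFilter>
  <Words xml:lang="en-GB">
    blah\tyada
    howdy
    good oh
  </Words>
</ProfanityFilter>
"""

def parse_config(xml_text):
    """Map each xml:lang value to its list of newline/tab-delimited entries."""
    result = {}
    for element in ET.fromstring(xml_text).iter("Words"):
        raw = element.text or ""
        # Split on newlines and tabs only, so phrases keep their inner spaces.
        entries = [part.strip() for part in re.split(r"[\r\n\t]+", raw)]
        result[element.get(XML_LANG, "")] = [entry for entry in entries if entry]
    return result

print(parse_config(CONFIG))
```

Splitting on newlines and tabs rather than on all whitespace is what lets a phrase such as "good oh" stay a single entry.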
Using the list in DansGuardian seems to be the basic starting point for most. Here's a more detailed description and list (English): http://stackoverflow.com/questions/273516/how-do-you-implement-a-good-profanity-filter
cheers,
doug.
Thanks Doug.
I ended up releasing with just the en-GB version ... otherwise I wouldn't have got the package out for the weekend.
I'll look at adding extra language profanities for next version. (Hopefully that will be soon... i.e. "Release early, release often").
Cheers, Lee.