Examine/Lucene Searching for Multi-Lingual Site
I am finishing up on what is probably my largest and most complex development to date. It is a multi-site, multi-language install with more to follow after delivery. One of my last remaining issues is with the search facility on non-English sites, in particular the French one.
We have a tag search that is returning the tags without the original punctuation, so "Oeuf d'or" becomes "Oeuf dor", which is obviously not the same thing. We are using the StandardAnalyzer, which I understood to support such punctuation?
We have subscribed to the GatheringNodeData event in order to insert tags without the delimiters and to replace spaces for indexing as follows:
So as you can see, we are not changing the original tags in any way other than to replace spaces with an underscore and remove the comma delimiters.
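To make the described transformation concrete, here is a minimal sketch in Python, used for illustration only; it is not the C# GatheringNodeData handler itself. The function name `tags_for_index` and the choice to join the resulting tags with a single space are assumptions. It shows that this step alone leaves the apostrophe intact, which suggests the punctuation is lost later, during analysis.

```python
def tags_for_index(raw_tags: str) -> str:
    """Turn a comma-delimited tag string into space-separated index tokens.

    Hypothetical helper: the name and the space-joined output are assumptions.
    """
    tokens = []
    for tag in raw_tags.split(","):
        tag = tag.strip()
        if tag:
            # Spaces inside a tag become underscores so the tag stays one token.
            tokens.append(tag.replace(" ", "_"))
    return " ".join(tokens)


print(tags_for_index("Oeuf d'or, Chocolat noir"))
```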
I should probably also mention that there is a single search index for the site and the configuration is as follows:
ExamineIndex.config
ExamineSettings.config
Any help would be much appreciated. I'm sure I'm not the first to encounter this, but the documentation for searching with Examine is quite fragmented, so I have not yet found a solution.
Thanks, Simon
I am unable to edit the post; the version above shows WhitespaceAnalyzer because of something I was testing, but the current version is actually using Lucene.Net.Analysis.Standard.StandardAnalyzer.
Simon,
Ideally, shouldn't each language in the site have its own index? Also, what language is it? It will probably need its own analyser for that language.
Regards
Ismail
It's French in this case; however, German, Dutch and Italian will follow closely behind. The reason for having it all in one index is that they are all part of a "group", and the group site will end up aggregating the data from all the others. With a single index we can either use the siteId as a filter or grab all tags regardless of which site they originated from.
What is the need for separate indexes? To be able to use different Analyzers per index?
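As a rough illustration of that design, here is a minimal Python sketch with a plain list standing in for the single index. The site IDs, tags and the `tags` helper are all made up for the example; only the idea of filtering on siteId comes from the description above.

```python
# Hypothetical stand-in for documents in one shared index.
# Site IDs and tag values are invented for illustration.
DOCUMENTS = [
    {"siteId": 1, "tag": "Oeuf d'or"},
    {"siteId": 2, "tag": "Goldenes Ei"},
    {"siteId": 1, "tag": "Chocolat noir"},
]


def tags(documents, site_id=None):
    """Return tags for one site, or for every site when site_id is None."""
    return [
        doc["tag"]
        for doc in documents
        if site_id is None or doc["siteId"] == site_id
    ]


print(tags(DOCUMENTS, site_id=1))  # a single site's tags
print(tags(DOCUMENTS))             # the aggregated "group" view
```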
Thanks Ismail
Simon,
Could it be that the replace is having encoding issues? The foreign characters in the content should also be in the index, as far as I am aware. The analysers are more for ignoring stop words. Step through the code and see what your before and after values are.
Regards
Ismail
Morning,
The foreign characters are going into the index and coming back out fine; it's the punctuation that is not, in this specific case the apostrophe. I'll step through the code shortly and report back with the result.
Cheers, Simon
Problem solved, perhaps indirectly by changing the analyzer and then rebuilding the index again.
Thanks for the pointers Ismail.
Hi Simon Dingley, I am trying to use Examine to search documents, but they are in Italian rather than English.
Did you find an ItalianAnalyzer or something like that?
Thanks
Simon, my search results are not good. For example, I have a node with this name: "Festa dell'aquilone", which translates word by word as "festival" "of the" "kite". With WhitespaceAnalyzer, if I search for "aquilone", I get no results. If I search for "dell'aquilone", however, I can find the node.
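The behaviour described here can be reproduced with a small Python sketch, used purely for illustration and not involving Examine or Lucene. Splitting on whitespace keeps "dell'aquilone" as a single term, so a search for "aquilone" has nothing to match. Dropping the elided article before the apostrophe gives the term the user expects. The prefix list is an assumed, incomplete sample, and both function names are hypothetical. Lucene includes an ElisionFilter that works along similar lines; whether it is available in the Lucene.Net version bundled with a given Examine release would need checking.

```python
# Assumed, incomplete sample of elided forms; not taken from any library.
ELIDED_PREFIXES = {"l", "d", "un", "dell", "all", "dall", "nell", "sull", "quell"}


def whitespace_tokens(text):
    """Lower-case and split on whitespace only, keeping punctuation in tokens."""
    return text.lower().split()


def elision_tokens(text):
    """Like whitespace_tokens, but drop a known elided prefix and its apostrophe."""
    tokens = []
    for token in whitespace_tokens(text):
        # Treat the typographic apostrophe the same as the ASCII one.
        prefix, apostrophe, rest = token.replace("\u2019", "'").partition("'")
        if apostrophe and rest and prefix in ELIDED_PREFIXES:
            tokens.append(rest)
        else:
            tokens.append(token)
    return tokens


print(whitespace_tokens("Festa dell'aquilone"))
print(elision_tokens("Festa dell'aquilone"))
```

Whatever rule is chosen has to be applied both when indexing and when parsing the query, otherwise the terms will not line up.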
Another issue is accented characters: à é è ì ò ù. I can find the node "Identità" with the same text, but not with "identita".
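One common way to make "identita" match "Identità" is to fold accents on both the indexed text and the query. The sketch below shows the idea with Python's standard `unicodedata` module; it is an illustration of the technique, not Examine or Lucene code, and `fold_accents` is a hypothetical name. Lucene has an ASCIIFoldingFilter for the same purpose, though wiring it into an Examine analyzer is not shown here.

```python
import unicodedata


def fold_accents(text):
    """Lower-case the text and strip combining accent marks."""
    # NFD splits "à" into "a" plus a combining grave accent.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


# Both the stored term and the user's query fold to the same form.
print(fold_accents("Identità") == fold_accents("identita"))
```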
In Italian (as in any language) some words are very common: il lo la i gli le di a da in con su per tra fra (like "in for as is are where when this that the..." in English). It would be better if Lucene ignored these words when users search.
How did you solve these issues?
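For illustration, stop-word removal amounts to filtering tokens against a fixed set. The Python sketch below uses exactly the words listed above as the set; a real Italian stop-word list would be longer, and the function name is hypothetical. In Lucene this is normally done by giving the analyzer a stop-word set rather than by hand.

```python
# Stop words taken from the list above; a production list would be larger.
ITALIAN_STOP_WORDS = frozenset(
    "il lo la i gli le di a da in con su per tra fra".split()
)


def remove_stop_words(tokens):
    """Drop tokens that are in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in ITALIAN_STOP_WORDS]


print(remove_stop_words("la festa di primavera".split()))
```

The same filtering needs to run on the query text, so that a search containing "di" or "la" is not penalised for terms that were never indexed.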
Thanks very much
I'm no expert on this but can you try opening your index with Luke and seeing if you can achieve the desired results?
https://code.google.com/p/luke/
OK, I am looking at the index with Luke.
But I don't understand: what should I be looking for?
I can see that the name field contains "dell'aquilone", and many occurrences of the terms "di", "la" and "del".