Multilingual Examine index with a dynamic number of languages
Hi everyone.
I'm looking to implement search on a multilingual site with both European and Asian languages.
The number of languages is dynamic, since the editors can create new ones whenever they want, so the indexing of each language should not rely on configuration files.
Since I cannot use the same analyzer for English and Chinese, I need to do something different from the standard indexes.
Looking at Lucene.Net articles, I can see that some people recommend adding language-specific fields to the index and specifying a different analyzer per field. Others recommend having a separate index per language.
So my questions are:
1: Can I create a custom index with Examine that has different analyzers for different fields?
or
2: Can I dynamically specify the number of indexes I want, without changing my config files?
or
3: Should I skip Examine for this task and go directly to the Lucene.Net APIs?
Experiences are very welcome.
Morten,
I think option 1 is out with Examine, although you may be able to do it with Lucene.Net directly. Option 2 is doable: you would need to tap into Umbraco events so that, when a new root language node is created, you update the config files programmatically. The config files would still need updating, but you would do it in code. Knowing up front which analyzer to use for a non-standard language may be a challenge, though.
Thanks for the feedback Ismail.
I don't think updating the config files programmatically will solve it, since we also run a multi-server setup and deploy those files from TFS. I think having a dynamic config file like that would hurt in that respect. I might try to see if I can inject my own "config reader" to fake reading the actual config files.
With regard to selecting the correct analyzer, I think I would make a map of the analyzers we have, and then default to something like the StandardAnalyzer, which works to some degree with most languages. Then we could add or remap analyzers if we find a better option later.
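The analyzer map with a fallback could look roughly like the sketch below. It is plain Python rather than Examine or Lucene.Net code, the analyzer names are just strings standing in for real analyzer instances, and the specific language-to-analyzer pairs are assumptions for illustration only.

```python
# Hypothetical mapping: this models the lookup only, not the Examine or Lucene.Net API.
DEFAULT_ANALYZER = "StandardAnalyzer"

# Language code -> analyzer we believe suits it better than the default (assumed pairs).
ANALYZER_BY_LANGUAGE = {
    "zh": "CJKAnalyzer",
    "ja": "CJKAnalyzer",
    "de": "GermanAnalyzer",
}


def analyzer_for(culture: str) -> str:
    """Return the analyzer name for a culture such as 'zh-CN'.

    Only the language part of the culture is used; unknown languages
    fall back to DEFAULT_ANALYZER.
    """
    language = culture.split("-")[0].lower()
    return ANALYZER_BY_LANGUAGE.get(language, DEFAULT_ANALYZER)
```

A language created by an editor later on simply resolves to the default until someone adds a better entry to the map.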
The DocumentWriting event gives you direct access to the Lucene document, so you should be able to index however you want in that handler. You could also create your own analyzer that wraps the underlying ones you want.
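To make the wrapping idea concrete, here is a minimal Python sketch of an analyzer that delegates to a different underlying analyzer per field name. Lucene has a PerFieldAnalyzerWrapper class that plays this role; the code below only models the idea. The two toy analyzers and the field names (bodyText_en, bodyText_zh) are hypothetical.

```python
from typing import Callable, Dict, List

# An "analyzer" here is just a function from text to tokens (a simplification).
Analyzer = Callable[[str], List[str]]


def whitespace_analyzer(text: str) -> List[str]:
    """Lower-case and split on whitespace; stand-in for a European-language analyzer."""
    return text.lower().split()


def bigram_analyzer(text: str) -> List[str]:
    """Emit overlapping character bigrams; stand-in for a CJK-style analyzer."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


class PerFieldAnalyzer:
    """Wraps several analyzers and picks one by field name, with a default."""

    def __init__(self, default: Analyzer) -> None:
        self._default = default
        self._by_field: Dict[str, Analyzer] = {}

    def add(self, field: str, analyzer: Analyzer) -> None:
        self._by_field[field] = analyzer

    def analyze(self, field: str, text: str) -> List[str]:
        # Fields without an explicit analyzer use the default one.
        return self._by_field.get(field, self._default)(text)


wrapper = PerFieldAnalyzer(default=whitespace_analyzer)
wrapper.add("bodyText_zh", bigram_analyzer)
```

With this shape, the handler that writes the document only has to choose a language-specific field name; the wrapper decides how each field is tokenized.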
Thanks Shannon, I will take a look at that.
One thing I just thought of is that I could create one index per analyzer, and then dispatch my documents and searches to the appropriate index based on the language. That way I can define my indexes at design time, using the standard config. Then I just need to add a field with the actual language to each indexed document, so I can restrict my search to that language.
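A rough sketch of that dispatch, in plain Python with in-memory lists standing in for the real indexes. The index names (StandardIndex, CjkIndex) and the language mapping are assumptions, and the substring match is only a placeholder for real analyzed search.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical index names, one per analyzer, fixed at design time.
INDEX_BY_LANGUAGE = {"zh": "CjkIndex", "ja": "CjkIndex"}
DEFAULT_INDEX = "StandardIndex"


def index_for(culture: str) -> str:
    """Pick the index (and therefore the analyzer) for a culture such as 'zh-CN'."""
    return INDEX_BY_LANGUAGE.get(culture.split("-")[0].lower(), DEFAULT_INDEX)


class IndexRouter:
    """Routes documents and searches to one index per analyzer."""

    def __init__(self) -> None:
        self._indexes: Dict[str, List[dict]] = defaultdict(list)

    def add(self, node_id: int, culture: str, text: str) -> None:
        # Store the language on the document so searches can be restricted to it.
        doc = {"id": node_id, "language": culture, "text": text}
        self._indexes[index_for(culture)].append(doc)

    def search(self, culture: str, term: str) -> List[int]:
        docs = self._indexes.get(index_for(culture), [])
        # Several languages can share one index, so filter on the language as well.
        return [
            d["id"]
            for d in docs
            if d["language"] == culture and term.lower() in d["text"].lower()
        ]
```

Because several languages can share an index, the language filter is what keeps, say, Danish hits out of an English search even though both live in the same default index.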