Indexing richtext fields with Examine
I'm trying to use Examine on a slightly older Umbraco site (Umbraco 4.0.0 with Examine 0.10.0.292). It works well except for two issues:
1: Special characters entered in a richtext field are replaced with their HTML entity names, e.g. ö is stored as &ouml;. They are also indexed that way, which makes it impossible to find words containing these characters in the index.
2: Searching for multiple words always returns 0 hits. If this topic were in my index, I could find it by searching for "multiple" or "words", but not "multiple words". I'm searching using code like this:
searchCriteria = ExamineManager.Instance.SearchProviderCollection["mySearcher"].CreateSearchCriteria(BooleanOperation.Or);
filter = searchCriteria.GroupedOr(new string[] { "ShortDescription", "Description" }, Examine.LuceneEngine.SearchCriteria.LuceneSearchExtensions.Fuzzy(searchString));
searchResults = ExamineManager.Instance.SearchProviderCollection["mySearcher"].Search(filter.Compile());
However, if I copy the query from filter.Compile().ToString() and enter it directly as a search in Luke, I get the results I would expect.
Both the index provider and the search provider are configured to use the StandardAnalyzer, and so is my search in Luke.
Any help resolving these issues will be greatly appreciated.
I found an old forum post describing how to change entity encoding from 'named' to 'raw' in the TinyMCE JavaScript file. This almost fixed the problem with character encoding, but it only seems to work for new content. If I save some of the existing content, it is still being encoded. Any pointers on how to prevent this?
It seems TinyMCE isn't the only place where content was/is encoded.
Now that TinyMCE is set to raw, I can type a word like "søster" in a richtext field, and when viewing the field's HTML, I can see that it still says "søster". When published and displayed on the site, it says s&oslash;ster in the source, and in the Lucene index it also says s&oslash;ster. The field's HTML remains "søster" when viewed in Umbraco.
Any help to solve this is highly appreciated.
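To illustrate what I'm after, the decoding step I would like to happen before a field value reaches the index could be sketched roughly like this. This is only a sketch under assumptions: it relies on .NET's HttpUtility.HtmlDecode (System.Web.dll), the class and method names are hypothetical, and it says nothing about where in Umbraco or Examine such a step would have to be hooked in.

```csharp
using System;
using System.Web; // HttpUtility lives in System.Web.dll

// Hypothetical helper, not part of Umbraco or Examine.
public static class EntityDecodingSketch
{
    // Turns named HTML entities (e.g. &oslash;) back into raw characters
    // so the indexed text matches what a user would type in a search box.
    public static string DecodeForIndexing(string fieldValue)
    {
        if (string.IsNullOrEmpty(fieldValue))
        {
            return fieldValue;
        }
        return HttpUtility.HtmlDecode(fieldValue);
    }

    public static void Main()
    {
        // DecodeForIndexing returns "søster" for this input.
        string decoded = DecodeForIndexing("s&oslash;ster");
        Console.WriteLine(decoded == "søster");
    }
}
```

If the published value were passed through something like this before indexing, a search for "søster" should be able to match the stored term.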