In your notes above, how are you timing this 6 seconds?
Examine it takes 6 seconds to execute for each
You keep mentioning it's when you are debugging/breakpointing. You cannot effectively measure any timing while you are debugging/breakpointing, because Visual Studio will be busy doing a zillion things during this time, like loading in symbols, etc...
Use a System.Diagnostics.Stopwatch or Umbraco's DisposableTimer, like:
using (DisposableTimer.DebugDuration<ContentTests>("Timing some code to execute...."))
{
//TODO: Put code here to time (i.e. your search code)
}
I just need to know how you are timing this so that, if it's an issue that can be replicated, the timing is accurate and consistent.
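As a rough illustration of the Stopwatch suggestion, a minimal sketch could look like the following. RunSearch is a hypothetical stand-in for the real search call (it is not part of Examine or Umbraco); only System.Diagnostics.Stopwatch is a real API here.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class SearchTiming
{
    // Hypothetical stand-in for the real search call; in the actual
    // application the body would be the call to searcher.Search(...).
    public static int RunSearch(string rawQuery)
    {
        Thread.Sleep(5); // simulate some work
        return rawQuery.Length;
    }

    // Runs the action once and returns the elapsed wall-clock time in milliseconds.
    public static long TimeMilliseconds(Action action)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.ElapsedMilliseconds;
    }

    public static void Main()
    {
        long elapsed = TimeMilliseconds(() => RunSearch("fullName:\"flemming andreasen\""));
        Console.WriteLine("Search took " + elapsed + " ms");
    }
}
```

Measured this way, outside the debugger, the one-word and multi-word numbers can be compared directly.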
More important is to know how you are:
Creating and executing this query
What are you trying to achieve with these multi-word queries?
Multi-word queries probably don't do what you expect them to do and what you are doing is going to lead to results you don't want. For example:
fullName:flemming andreasen*
I'm assuming you think that this is going to match anything starting with "flemming andreasen", but it is not. This is going to match the field 'fullName' where:
the term 'flemming' should exist
the term 'andreasen' should exist
any term beginning with 'andreasen' should exist
So you can see that your multi-word queries are not constructed correctly and will lead to irrelevant results. In your example only part of your total query yields:
This is going to do all sorts of strange things that you probably don't want:
the term 'flemming' should exist
the term 'andreasen' should exist
any term beginning with 'andreasen' should exist
DUPLICATED: the term 'flemming' should exist
DUPLICATED: the term 'andreasen' should exist
DUPLICATED: the term 'flemming' should exist
DUPLICATED: the term 'andreasen' should exist
DUPLICATED: the term 'flemming' should exist
the term 'andreasen' should exist as a fuzzy match
DUPLICATED: the term 'flemming' should exist
any term beginning with 'flemming' should exist
DUPLICATED: any term beginning with 'andreasen' should exist
DUPLICATED: the term 'flemming' should exist
DUPLICATED: the term 'andreasen' should exist
DUPLICATED: the term 'flemming' should exist
DUPLICATED: the term 'andreasen' should exist
Then boost all of the above
Obviously this can be massively simplified. If you want to match an exact phrase like flemming andreasen, it needs to be in quotes. See: https://lucene.apache.org/core/294/queryparsersyntax.html
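To make the quoting point concrete, here is a small sketch of the difference between plain concatenation and a quoted phrase. According to the Lucene query parser syntax page, a field prefix only applies to the term directly after it, so the unquoted form ties only the first word to the field. The class and method names are made up for illustration.

```csharp
using System;

public static class PhraseSyntax
{
    // Naive concatenation: only the first word ends up tied to the field.
    public static string Naive(string field, string text)
    {
        return field + ":" + text;
    }

    // Wraps the text in double quotes so the parser treats it as one phrase.
    // Backslashes and double quotes inside the text are escaped with a backslash.
    public static string Phrase(string field, string text)
    {
        string escaped = text.Replace("\\", "\\\\").Replace("\"", "\\\"");
        return field + ":\"" + escaped + "\"";
    }

    public static void Main()
    {
        Console.WriteLine(Naive("fullName", "flemming andreasen"));  // fullName:flemming andreasen
        Console.WriteLine(Phrase("fullName", "flemming andreasen")); // fullName:"flemming andreasen"
    }
}
```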
Thanks for your thorough response, Shannon. I will make a proper response later today.
For now I will give you a quick response to some of the things you are mentioning.
In your notes above, how are you timing this 6 seconds?
Examine it takes 6 seconds to execute for each
You keep mentioning it's when you are debugging/breakpointing. You cannot effectively measure any timing while you are debugging/breakpointing, because Visual Studio will be busy doing a zillion things during this time, like loading in symbols, etc...
I'm aware of that. I might not have explained it clearly enough; that is my fault.
What I noticed was that, in the compiled product, it took <20ms to get an HTTP response when searching for one word, and >5s to get an HTTP response with several words. I see this by opening Developer Tools in Chrome and looking at the "Network" tab. (I can go into more depth on this if needed.)
I then turned to my code, debugging it so that I could step through every method to see what was taking so long. Here I found that the Search() method in the BaseLuceneSearcher was taking its time to respond. I know that everything takes longer when debugging, so I debugged both a one-word search and a multi-word search: stepping over the Search() method takes considerably longer with multiple words than with only one.
The next part is what I will take my time to write a proper response to:
More important is to know how you are:
Creating and executing this query
What are you trying to achieve with these multi-word queries?
The response will come later today.
EDIT: Regarding your link, it seems that the forum doesn't like "_" in URLs. I guess you meant to refer to this link: Apache Lucene - Query Parser Syntax
Let's start this one backwards.
What are you trying to achieve with these multi-word queries?
The end goal is that when a user searches for the query "flemming andreasen", I want to look for content that matches any of these requirements:
"flemming andreasen" as a direct hit
flemming as a direct hit
andreasen as a direct hit
any term that begins with flemming
any term that begins with andreasen
In the end result I also want to weight results differently, so that they are prioritized according to the list above.
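For what it's worth, the list above could be turned into a raw Lucene query roughly like this. It is only a sketch: the boost values (10, 5, 2) are arbitrary placeholders, the class name is made up, and it assumes the search words contain no Lucene special characters, so no escaping is done.

```csharp
using System;
using System.Collections.Generic;

public static class NameQueryBuilder
{
    // Builds one OR query: exact phrase first, then each word as an exact term,
    // then each word as a prefix. Every clause appears exactly once.
    public static string Build(string field, string userInput)
    {
        string[] words = userInput.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        var clauses = new List<string>();

        if (words.Length > 1)
        {
            // Whole input as a quoted phrase, highest (placeholder) boost.
            clauses.Add(field + ":\"" + string.Join(" ", words) + "\"^10");
        }
        foreach (string word in words)
        {
            clauses.Add(field + ":" + word + "^5");   // exact term
        }
        foreach (string word in words)
        {
            clauses.Add(field + ":" + word + "*^2");  // any term starting with the word
        }
        return string.Join(" OR ", clauses);
    }

    public static void Main()
    {
        Console.WriteLine(Build("fullName", "flemming andreasen"));
    }
}
```

The resulting string could then be handed to RawQuery; whether these boosts give the desired ordering would have to be checked against the real index.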
Next part:
... how you are creating and executing this query
I'm not sure how thorough I should be here. I will try to make it a mix of abstract and concrete, and hope that you guys will ask me to clarify if something is not clear enough.
Generating the query
When something is typed into the search bar, it is sent to the searchController via JSON. Here the string is split up into different keywords (and a failed attempt at search strings). When this is done, the queries are created; this is where the sub-query is created.
To generate the query I have written my own library which, simply put, lets me do things like:
This piece of pseudocode would then create the previously discussed sub-query:
Searching
When the final query is created, it is handed to Examine's BaseLuceneSearcher Search() method like this:
var searcher = (MultiIndexSearcher)ExamineManager.Instance.SearchProviderCollection["CombinedIndexSearcher"];
var searchCriteria = searcher.CreateSearchCriteria(BooleanOperation.And);
...
var results = searcher.Search(searchCriteria.RawQuery(rawQuery));
Showing results
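When the results are ready, they are returned to the view as a JSON object like this:
Here I have some Angular code ready to receive and display it.
Is this something that you can use?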
If you want full control over your query, then you should manually create it with RawQuery by casting your search criteria to LuceneSearchCriteria
The thing you are doing with new LuceneQuery and GroupedOr is doing you no favors and is causing your queries to be created in a way that you don't want. I've described the output of what your query is actually doing and it does not match your intentions.
Next, you are executing your query across many index sets. I'm assuming you are not doing this with Luke and are simply executing a query across a single index set. I don't know how many index sets you are querying against at once, and I'm not sure what the exact performance penalty of doing so is, but I would imagine the penalty grows with the number of index sets you query against (since that would make logical sense).
I would suggest:
Manually create your query
Ensure it's doing exactly what you think it is (if unsure, then ask)
Run some benchmarks between executing this against a single index set vs executing over multiple index sets
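A benchmark along those lines could be sketched as below. The two delegates in Main are placeholders that just sleep; in a real test they would wrap the actual single-index and multi-index Search() calls. The helper name and the run count are arbitrary choices for illustration.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class SearchBenchmark
{
    // Calls the search once as an unmeasured warm-up, then measures it
    // 'runs' times and returns the median duration in milliseconds.
    public static double MedianMilliseconds(Action search, int runs)
    {
        if (runs < 1) throw new ArgumentOutOfRangeException("runs");

        search(); // warm-up, not measured

        double[] samples = new double[runs];
        for (int i = 0; i < runs; i++)
        {
            Stopwatch stopwatch = Stopwatch.StartNew();
            search();
            stopwatch.Stop();
            samples[i] = stopwatch.Elapsed.TotalMilliseconds;
        }

        Array.Sort(samples);
        int mid = runs / 2;
        return runs % 2 == 1 ? samples[mid] : (samples[mid - 1] + samples[mid]) / 2.0;
    }

    public static void Main()
    {
        // Placeholders: replace with the real searches against one index set
        // and against the combined searcher.
        Action singleIndexSearch = () => Thread.Sleep(2);
        Action multiIndexSearch = () => Thread.Sleep(6);

        Console.WriteLine("single index: " + MedianMilliseconds(singleIndexSearch, 5) + " ms");
        Console.WriteLine("multi index:  " + MedianMilliseconds(multiIndexSearch, 5) + " ms");
    }
}
```

Using the median of several runs, rather than one measurement, keeps a single slow run from skewing the comparison.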
I have been working on shaving off the duplicates and trying to make the query do what I wanted it to. I ended up with the following query:
Still no change; the query still took several seconds to execute (it did get a little faster, down to 4-5 seconds in execution time, but nothing like what I had hoped). Then I looked more closely at the "Apache Lucene - Query Parser Syntax" page and saw that I couldn't do a "fuzzy search" on a sentence, but could instead do a "proximity search", which seems to require a value of one or greater...
You can't do this:
title:"flemming andreasen"~0.7
but you can do this:
title:"flemming andreasen"~1
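Removing my attempt to do fuzzy search on a sentence made a BIG difference. Now I'm down to 600-800ms.
This is puzzling me: could this part, title:"flemming andreasen"~0.7, really be doing that much harm?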
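The distinction between the two uses of the tilde could be captured in a pair of small helpers, sketched below. This assumes, per the query parser syntax page, that a fractional value such as 0.7 belongs after a single term (fuzzy search) while a whole number after a quoted phrase is a word distance (proximity search). The class and method names are hypothetical, and no escaping of special characters is done.

```csharp
using System;
using System.Globalization;

public static class TildeQueries
{
    // Fuzzy search: a single term followed by ~similarity (between 0 and 1).
    public static string Fuzzy(string field, string term, double similarity)
    {
        if (term.Contains(" "))
            throw new ArgumentException("Fuzzy search applies to a single term, not a phrase.", "term");
        if (similarity < 0.0 || similarity >= 1.0)
            throw new ArgumentOutOfRangeException("similarity");
        return field + ":" + term + "~" + similarity.ToString("0.0#", CultureInfo.InvariantCulture);
    }

    // Proximity search: a quoted phrase followed by ~N, a whole-number word distance.
    public static string Proximity(string field, string phrase, int maxDistance)
    {
        if (maxDistance < 0)
            throw new ArgumentOutOfRangeException("maxDistance");
        return field + ":\"" + phrase + "\"~" + maxDistance.ToString(CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        Console.WriteLine(Fuzzy("title", "andreasen", 0.7));             // title:andreasen~0.7
        Console.WriteLine(Proximity("title", "flemming andreasen", 1));  // title:"flemming andreasen"~1
    }
}
```

Building the clauses through helpers like these would make it harder to attach a fuzzy value to a phrase by accident.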
Hi, I'm not sure if you read my other statement:
Run some benchmarks between executing this against a single index set vs executing over multiple index sets
Because you are running this query over multiple index sets, this could have some sort of performance penalty; I'm not sure. We are just using the Lucene multi-index searcher for this. When you are searching in Luke, I'm pretty sure you'll only be searching against a single index set.
I did read it, but I haven't gotten to test that part yet. I plan on doing that later on, but for the time being my primary focus has been on optimizing the way the query is generated. It was in this quest that I found the above results.
Slow response from BaseLuceneSearcher in Examine
I am creating my own queries and feeding them to the BaseLuceneSearcher. Everything works fine and dandy when searching for a single word (HTTP response times below 20ms).
But when I search for two or more words, my response times quickly rise above 5-6 seconds.
Analyzing my code, it runs fast until it hits the Search() method in the BaseLuceneSearcher class. When I step over this part of the code, it takes several seconds for the debugger to get any further.
When searching for users
Pastebin for Lucene query with a single search word: http://pastebin.com/5PFqkDzC
Pastebin for Lucene query with multiple search words: http://pastebin.com/CHFMsnvk
Another more extensive example
Trying out these queries in Luke shows no alarming timings (between 300 and 2000us).
Optimizing queries
Furthermore, I have found some other and even more frustrating examples that lead me to think that something is entirely off with the BaseLuceneSearcher class.
I was optimizing my queries, removing a lot of *-symbols and removing searches for words that didn't provide any useful data. I was shocked to find that in Luke the query went from 180ms to 100ms, but when feeding it to Examine it takes 6 seconds to execute for each. I had expected it to take 6 seconds for the non-optimized query, and maybe 4 for the optimized one. But no, Examine doesn't care; it takes 6 seconds for each query.
Old query: http://pastebin.com/bcjK2FHi
Optimized query: http://pastebin.com/2bhhK75K
Closing up
I am running "Umbraco version 7.1.8 assembly: 1.0.5394.16131" and the Examine.dll is v0.1.57.2941
Can anyone see something alarming in my queries that might be my Achilles heel? Or does anyone have an idea why there might be an execution time difference from <20ms to >5s?
Have you tried executing the query with Luke, to see whether it's Lucene or something in between?
https://code.google.com/p/luke/
http://umbraco.com/follow-us/blog-archive/2011/9/16/examining-examine
http://www.getopt.org/luke/
Yes, using Luke the query takes between 300 and 2000us (so, nothing alarming here)
So I think you are saying that it's not the index query that is causing the delay...
What's your code for rendering the results doing? Is there anything there that might be causing the delay?
What do you mean by the following?
Rendering the results is not the problem. Debugging my code, stepping through everything and timing everything, I see that it is in fact the Search() method from the BaseLuceneSearcher class that is taking its time. When Search() finally sends me a response, my code executes fairly quickly afterwards (it is not the bottleneck).
So, I guess nobody knows anything about what could be wrong here? :/
Might help if you posted code examples and the raw lucene search query?
as I'm guessing your code is more involved than
May I point your attention to these lines in the first post:
gotcha.. and your razor code? :-)
I will try to create a simple working code example that replicates my issue.
Wow, I got majorly sidetracked by other projects, but now I'm back.
Looking at the thread, I really don't know why you want the Razor code; the view file has nothing to do with this. As stated previously, it is the BaseLuceneSearcher in Examine that is taking its time.
How do I know that this is where the problem lies? Well, when running the code in debug mode and stepping through every step of the search, everything runs smoothly until I reach the Search() method in the BaseLuceneSearcher class. When I step over this part of the code, it takes several seconds for the debugger to get any further. Here I also see a big difference between searching with one word and with several.
Furthermore, I have found some other and even more frustrating examples that lead me to think that something is entirely off with the BaseLuceneSearcher class.
I was optimizing my queries, removing a lot of *-symbols and removing searches for words that didn't provide any useful data. I was shocked to find that in Luke the query went from 180ms to 100ms, but when feeding it to Examine it takes 6 seconds to execute for each. I had expected it to take 6 seconds for the non-optimized query, and maybe 4 for the optimized one. But no, Examine doesn't care; it takes 6 seconds for each query.
Old query: http://pastebin.com/bcjK2FHi
Optimized query: http://pastebin.com/2bhhK75K
(I know that the queries aren't pretty, I'll try to make them easier to look at when I get the time)
FYI: I tried to create a code example, but I couldn't really figure out how to create the test in a way that would be reproducible for you guys. If you have any ideas for how to create such a test, please do let me know, and I will throw some time at it.