Copied to clipboard

Flag this post as spam?

This post will be reported to the moderators as potential spam to be looked at


  • Squazz 35 posts 111 karma points
    Sep 24, 2015 @ 13:17
    Squazz
    0

    Slow response from BaseLuceneSearcher in Examine

    I am creating my own queries, and feeds the BaseLuceneSearcher my queries. Everything works fine and dandy when searching for a single word (HTTP response times below 20ms).

    But, when I search for two or more words, my response times quickly gets above 5-6 seconds.

    Analyzing my code, it runs fast until it hits the Search() method in the BaseLuceneSearcher class. When I step over this part of the code, it takes several seconds for the debugger to get any further.

    When searching for users

    Another more extensive example

    Trying out these queries in Luke shows no alarming timings (between 300 and 2000us).

    Optimizing queries

    Furthermore, I have found some other and even more frustrating examples that leads me think that something is entirely off with the BaseLuceneSearcher class.

    I was optimizing my queries, removing a lot of *-symbols, and removing search for words that didn't provide any useful data. I was shocked to find that in Luke the query went from 180ms to 100ms, but when feeding it to Examine it takes 6 seonds to execute for each. Here I had expected it to take 6 seconds for the non-optimized query, and maybe 4 for the optimized. But no, Examine doesn't care, it takes 6 seconds for each query.

    Old query: http://pastebin.com/bcjK2FHi

    Optimized query: http://pastebin.com/2bhhK75K

    Closing up

    I am running "Umbraco version 7.1.8 assembly: 1.0.5394.16131" and the Examine.dll is v0.1.57.2941

    Can anyone see something alarming in my queries that might be my Achilles heel? Or does anyone have an idea for why there might be a execution time difference from <20ms to >5s?

  • Mike Chambers 636 posts 1253 karma points c-trib
    Sep 25, 2015 @ 13:43
    Mike Chambers
    0

    Have you tried execution with luke? to see if it's lucene or something in betwen?

    https://code.google.com/p/luke/

    http://umbraco.com/follow-us/blog-archive/2011/9/16/examining-examine http://www.getopt.org/luke/

  • Squazz 35 posts 111 karma points
    Sep 25, 2015 @ 14:03
    Squazz
    0

    Yes, using Luke the query takes between 300 and 2000us (so, nothing alarming here)

  • Mike Chambers 636 posts 1253 karma points c-trib
    Sep 25, 2015 @ 15:31
    Mike Chambers
    0

    So think you are saying that it's not the index query that is causing the delay...

    What's your code for rendering the results doing, anything there that might causing the delay?

  • Squazz 35 posts 111 karma points
    Sep 25, 2015 @ 21:53
    Squazz
    0

    What do you mean by the following?

    So think you are saying that it's not the index query that is causing the delay...

    Rendering the results is not the problem. Debugging my code, and stepping through everything, timing everything, I see that it is in fact the search() method from the BaceLuceneSearcher class that is taking its time. When search() finally sends me a response, my code executes fairly quick afterwords (not the bottleneck).

  • Squazz 35 posts 111 karma points
    Oct 01, 2015 @ 07:00
    Squazz
    0

    So, I guess nobody knows anything about what could be wrong here? :/

  • Mike Chambers 636 posts 1253 karma points c-trib
    Oct 02, 2015 @ 08:36
    Mike Chambers
    0

    Might help if you posted code examples and the raw lucene search query?

    as I'm guessing your code is more involved than

    @if (!string.IsNullOrEmpty(Request.QueryString["query"]))    
    {
        <ul>
            @foreach (var result in Umbraco.TypedSearch(Request.QueryString["query"]))
            {
                <li>
                    <a href="@result.Url">@result.Name</a>
                </li>
            }
        </ul>
    }
    
  • Squazz 35 posts 111 karma points
    Oct 02, 2015 @ 08:42
    Squazz
    0

    May I point your attention for these lines in the first post:

    Pastebin for Lucene query with a single searchword: http://pastebin.com/5PFqkDzC

    Pastebin for Lucene query with multiple searchword: http://pastebin.com/CHFMsnvk

  • Mike Chambers 636 posts 1253 karma points c-trib
    Oct 02, 2015 @ 08:50
    Mike Chambers
    0

    gotcha.. and your razor code? :-)

  • Squazz 35 posts 111 karma points
    Oct 02, 2015 @ 08:56
    Squazz
    0

    I will try to create a simple working code example that replicates my issue.

  • Squazz 35 posts 111 karma points
    Jan 26, 2016 @ 08:50
    Squazz
    0

    Wow, I got majorly sidetracked by other projects, but now I'm back.

    Looking at the thread, I really don't know what you wan't the razor code, the view file has nothing to do with this. As stated previously, it is the BaseLuceneSearcher in Examine that is taking its time.

    How do I know that this is where the problem lies? Well, when running the code in debug mode, stepping through every step in the search everything runs smoothly until I reach the Search() method in the BaseLuceneSearcher class. When I step over this part of the code, it takes several seconds for the debugger to get any further. Here I also see a big difference in searching with one word, or with serveral.

    Furthermore, I have found some other and even more frustrating examples that leads me think that something is entirely off with the BaseLuceneSearcher class.

    I was optimizing my queries, removing a lot of *-symbols, and removing search for words that didn't provide any useful data. I was shocked to find that in Luke the query went from 180ms to 100ms, but when feeding it to Examine it takes 6 seonds to execute for each. Here I had expected it to take 6 seconds for the non-optimized query, and maybe 4 for the optimized. But no, Examine doesn't care, it takes 6 seconds for each query.

    Old query: http://pastebin.com/bcjK2FHi

    Optimized query: http://pastebin.com/2bhhK75K

    (I know that the queries aren't pretty, I'll try to make them easier to look at when I get the time)

    FYI: I tried to create a code-example, but I couldn't really figure out how to create the test in a way that it would be reproduce able for you guys. If you have any ideas for how to create such a test, please do let me know, and I will throw some time at it.

  • Shannon Deminick 1526 posts 5272 karma points MVP 3x
    Jan 26, 2016 @ 09:38
    Shannon Deminick
    1

    In your notes above, how are you timing this 6 seconds?

    Examine it takes 6 seonds to execute for each

    You keep mentioning it's when you are debugging/breakpointing. You cannot effectively measure any timing while you are debugging/breakpointing, Visual Studio will be busy doing a zillion things during this time like loading in symbols, etc...

    Use a System.Diagnostics.Stopwatch or using Umbraco's DisposableTimer like:

    using (DisposableTimer.DebugDuration<ContentTests>("Timing some code to execute...."))
    {
        //TODO: Put code here to time (i.e. your search code)
    }
    

    I just need to know how you are timing this so that if it's an issue that can be replicated, that the timing is accurate/consistent.

    More importantly is to know how you are:

    • Creating and executing this query
    • What are you trying to achieve with these multi-word queries?

    Multi-word queries probably don't do what you expect them to do and what you are doing is going to lead to results you don't want. For example:

    fullName:flemming andreasen*

    I'm assuming you think that this is going to match anything starting with "flemming andreasen" but it is not. This is going to match the field 'fullName' where the term:

    • the term 'flemming' should exist
    • the term 'andreasen' should exist
    • any term beginning with 'andreasen' should exist

    So you can see that your multi-word queries are not constructed correctly and will lead to irrelevant results. In your example only part of your total query yields:

    ((((fullName:flemming andreasen* fullName:"flemming andreasen" fullName:"flemming andreasen" fullName:flemming andreasen~0.7 fullName:flemming* fullName:andreasen* fullName:flemming fullName:andreasen fullName:flemming fullName:andreasen))^5.0)
    

    This is going to do all sorts of strange things that you probably don't want:

    • the term 'flemming' should exist
    • the term 'andreasen' should exist
    • any term beginning with 'andreasen' should exist
    • DUPLICATED: the term 'flemming' should exist
    • DUPLICATED: the term 'andreasen' should exist
    • DUPLICATED: the term 'flemming' should exist
    • DUPLICATED: the term 'andreasen' should exist
    • DUPLICATED: the term 'flemming' should exist
    • the term 'andreasen' should exist as a fuzzy match
    • DUPLICATED: the term 'flemming' should exist
    • any term beginning with 'flemming' should exist
    • DUPLICATED: any term beginning with 'andreasen' should exist
    • DUPLICATED: the term 'flemming' should exist
    • DUPLICATED: the term 'andreasen' should exist
    • DUPLICATED: the term 'flemming' should exist
    • DUPLICATED: the term 'andreasen' should exist
    • Then boost all of the above

    Obviously this can be massively simplified. If you want to mach an exact phrase like flemming andreasen it needs to exist in quotes. See: https://lucene.apache.org/core/294/queryparsersyntax.html

  • Squazz 35 posts 111 karma points
    Jan 26, 2016 @ 10:01
    Squazz
    0

    Thanks for you thorough response Shannon. I will make a proper response later today.

    For now I will give you a quick response to some of the things you are mentioning.

    In your notes above, how are you timing this 6 seconds?

    Examine it takes 6 seonds to execute for each

    You keep mentioning it's when you are debugging/breakpointing. You cannot effectively measure any timing while you are debugging/breakpointing, Visual Studio will be busy doing a zillion things during this time like loading in symbols, etc...

    I'm aware of that, and I might not have been explaining that clearly enough, that is my fault.

    What I noticed that was, that in the compiled product, it took me <20ms to get a http-response when searching for one word, and >5s to get a http-response with several words. This I see by opening Developer Tools in chrome, and looking at the "Network" tab. (I can go more into depth with this if needed)

    I then turned to my code, debugging it so that I could step through every method to see what it was that was taking so long. Here I found that the Search() method in the BaseLuceneSearcher was taking it's time to respond. I know that it takes a lot of time when debugging, so I have tested both debugging when searching for one word, and when searching for multiple, and when searching for multiple words it takes considerably longer to execute the Search() method then when I step through it, searching for only one word.

    Next part is what I will take my time to write a proper response to:

    More importantly is to know how you are:

    • Creating and executing this query
    • What are you trying to achieve with these multi-word queries?

    The response will come later today.

    EDIT: Regarding your link, it seem that the forum doesn't like "_" in the URLS, I guess you ment to refer to this link: Apache Lucene - Query Parser Syntax

  • Squazz 35 posts 111 karma points
    Jan 26, 2016 @ 11:40
    Squazz
    0

    More importantly is to know how you are:

    • Creating and executing this query
    • What are you trying to achieve with these multi-word queries?

    Lets start this one backwords.

    • What are you trying to achieve with these multi-word queries?

    The end goal is that when a user searches for the query "flemming andreasen", I want to look for content that matches any of these requirements:

    1. "flemming andreasen" as a direct hit
    2. flemming as a direct hit
    3. andreasen as a direct hit
    4. any term that begins with flemming
    5. any term that begins with andreasen

    In the end-result I also want to weight results differently so that results are prioritized according to the above list

    Next part:

    ... how you are creating and executing this query

    I'm not sure how thorough I should be here, I will try to make it a mix of abstract and concrete, and then hope that you guys will ask me to clarify if something is not clear enough.

    Generating the query

    When something is typed into the search-bar it is send to the searchController via JSON. Here the string is split up into different keywords (and an failed attempt for search-strings). When this is done queries are created, this is where the sub-query is created.

    To generate the query I have written my own library which simply put lets me do things like:

    new LuceneQuery(allowLeadingWildcards: true, occurance: BooleanClause.Occur.SHOULD, useStopWords: false).GroupedOr(new[] { "fullName" }, terms)
    

    This piece of pseudoCode would then create the previously discussed sub-query:

    (((fullName:flemming andreasen* fullName:"flemming andreasen" fullName:"flemming andreasen" fullName:flemming andreasen~0.7 fullName:flemming* fullName:andreasen* fullName:flemming fullName:andreasen fullName:flemming fullName:andreasen))^5.0)
    

    Searching

    When the final query is created it is then handed to Examines BaseLuceneSearcher Search() method like this:

    var searcher = (MultiIndexSearcher)ExamineManager.Instance.SearchProviderCollection["CombinedIndexSearcher"];
    var searchCriteria = searcher.CreateSearchCriteria(BooleanOperation.And);
    ...
    var results = searcher.Search(searchCriteria.RawQuery(rawQuery))
    

    Showing results

    When the results are ready they are returned to the view as a JSON object like this:

    return Json(
                new
                {
                    success = true,
                    results = results
                }, JsonRequestBehavior.AllowGet
            );
    

    Here I have some Angular code ready to receive and display it

    Is this something that you can use?

  • Shannon Deminick 1526 posts 5272 karma points MVP 3x
    Jan 26, 2016 @ 15:02
    Shannon Deminick
    0

    Hi,

    If you want full control over your query, then you should manually create it with RawQuery by casting your search criteria to LuceneSearchCriteria

    The thing you are doing with new LuceneQuery and GroupedOr is doing you no favors and is causing your queries to be created in a way that you don't want. I've described the output of what your query is actually doing and it does not match your intentions.

    Next, you are executing your query across many index sets. I'm assuming you are not doing this with Luke and simply executing a query across a single index set. I don't know how many index sets you are query against at once and I'm not sure what the exact performance penalty is of doing so but I would imagine you would have performance penalties based on the more index sets you are querying against (since that would make logical sense).

    I would suggest:

    • Manually create your query
    • Ensure it's doing exactly what you think it is (if unsure, then ask)
    • Run some benchmarks between executing this against a single index set vs executing over multiple index sets
  • Squazz 35 posts 111 karma points
    Feb 12, 2016 @ 12:41
    Squazz
    0

    Hi

    I have been working on shaving off the duplicates and trying to make the query do what I wanted it to. I ended up with the following query:

    (title:"flemming andreasen") (((title:"flemming andreasen"~0.7))^0.4) (((title:*flemming*))^0.9) (((title:andreasen*))^0.9) (((title:flemming title:andreasen))^0.2)
    

    Still no change, the query still took several seconds du execute (it did get a little faster, it got down to 4-5 seconds in execute time, but nothing like what I had hoped). Then I looked more closely at the "Apache Lucene - Query Parser Syntax" and saw that I couldn't do "fuzzy search" on a sentence, but instead could do a "proximity search", which seems to have the requirement of being at least of one or greater...

    You cant do this:

    title:"flemming andreasen"~0.7

    but you can do this:

    title:"flemming andreasen"~1

    Removing my attempt to do fuzzy search on a sentence made a BIG difference. Now I'm down to 600-800ms.

    (title:"flemming andreasen") (((title:*flemming*))^0.9) (((title:andreasen*))^0.9) (((title:flemming title:andreasen))^0.2)
    

    This is puzzling me, could this part: title:"flemming andreasen"~0.7 really be doing that much harm?

  • Shannon Deminick 1526 posts 5272 karma points MVP 3x
    Feb 12, 2016 @ 13:02
    Shannon Deminick
    0

    Hi, i'm not sure if you read my other statement:

    Run some benchmarks between executing this against a single index set vs executing over multiple index sets

    Because you are running this query over multiple index sets, this could have some sort of performance penalty, i'm not sure. We are just using the lucene multi index searcher for this. When you are searching in Luke, i'm pretty sure you'll only be searching against a single index set.

  • Squazz 35 posts 111 karma points
    Feb 12, 2016 @ 13:13
    Squazz
    0

    Hi

    I did read it, but I havn't got to test that part yet. I plan on doing that later on, but for the time being my primary focus have been on optimizing the way the query is generated. And it was in this quest I found the above results.

Please Sign in or register to post replies

Write your reply to:

Draft