Provider model for search and indexing

Morten Bock

Nov 30, 2018 @ 16:05

The goal

I would like to look into creating a clean provider model for switching the search and indexing engine used by Umbraco.

Lucene/Examine is a great default option, but for some solutions you want to use a centralized index such as Elasticsearch or Azure Search.

I would like it if a provider model would take care of:
- Common configuration for which fields to index
- An interface describing the indexing capabilities Umbraco Core needs
- An interface describing the searching capabilities Umbraco Core needs
- An interface describing some basic search capabilities that would be exposed to site builders through Umbraco Core
Advanced search for site builders should just go directly to whatever search engine they have chosen to use.

The current options

I have taken a look at the "Moriyama Azure Search" package. It is implemented by creating "dummy providers" and subscribing to a lot of events in order to gather data for the index. It then intercepts Angular requests to the search API and sends them to a different controller instead. This does not feel like a clean interface for implementing new providers.

Examine also has a concept of providers. However, it seems that the UmbracoIndexers inherit from the Examine ones, which means that in order to implement an indexer for a different engine, you would also need to rewrite or override the existing indexers in Umbraco. The Examine provider also requires you to understand what data might be in the XElement that is passed to it. This option does not feel like a clean interface either.

Proposal

I would like to propose that we make a set of interfaces and classes that describe in a structured way, which data structures the provider is expected to index, and which query operations the provider should support.

The operations should be kept relatively simple, to make it possible to use most engines, maybe supporting just boosting and fuzziness.

Umbraco Core would supply the configuration to the provider if needed. Core could also handle filtering out properties that should not be indexed before the data is sent to the provider.

The advantage of this approach is that Core would handle all logic around when and what to index, and the providers would only need to handle persistence and querying. This helps avoid package changes when, for example, event models change in Core. It also allows other packages to keep subscribing to Core indexing events, regardless of which engine the data is eventually stored in.
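To make that split concrete, here is a minimal sketch of the idea. Every name in it (`IndexDocument`, `IIndexProvider`, `InMemoryProvider`, `CoreIndexer`) is hypothetical and does not exist in Umbraco or Examine. It only illustrates Core filtering the fields while the provider stores and queries whatever it is handed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical document shape handed from Core to a provider.
public sealed class IndexDocument
{
    public IndexDocument(string id, IDictionary<string, string> fields)
    {
        Id = id;
        Fields = new Dictionary<string, string>(fields);
    }

    public string Id { get; }
    public IReadOnlyDictionary<string, string> Fields { get; }
}

// Hypothetical provider contract: persistence and querying only.
public interface IIndexProvider
{
    void Index(IndexDocument document);
    IEnumerable<string> Search(string term);
}

// Toy provider used to show the flow; a real one would talk to a search engine.
public sealed class InMemoryProvider : IIndexProvider
{
    private readonly Dictionary<string, IndexDocument> _docs = new Dictionary<string, IndexDocument>();

    public void Index(IndexDocument document) => _docs[document.Id] = document;

    // Returns the ids of documents where any stored field contains the term.
    public IEnumerable<string> Search(string term) =>
        _docs.Values
            .Where(d => d.Fields.Values.Any(v => v.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0))
            .Select(d => d.Id)
            .ToList();
}

// Core-side logic (assumed): decides which fields ever reach the provider.
public sealed class CoreIndexer
{
    private readonly IIndexProvider _provider;
    private readonly HashSet<string> _allowedFields;

    public CoreIndexer(IIndexProvider provider, IEnumerable<string> allowedFields)
    {
        _provider = provider;
        _allowedFields = new HashSet<string>(allowedFields, StringComparer.OrdinalIgnoreCase);
    }

    public void Index(string id, IDictionary<string, string> allFields)
    {
        // Drop every field that is not configured for indexing.
        var filtered = allFields
            .Where(kv => _allowedFields.Contains(kv.Key))
            .ToDictionary(kv => kv.Key, kv => kv.Value);
        _provider.Index(new IndexDocument(id, filtered));
    }
}
```

With this shape, swapping `InMemoryProvider` for another engine would not change which fields get indexed, since that decision stays in `CoreIndexer`.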

What do you think?

Would it be possible for a PR like this to get merged to V8?

Would it be accepted in general?

Anders Bjerner

Nov 30, 2018 @ 16:13

Hi Morten,

I think this is the same thing being discussed here: https://github.com/umbraco/Umbraco-CMS/issues/3780

It seems that Shannon is already building this for v8.

Morten Bock

Nov 30, 2018 @ 16:29

That does indeed look like the same goals. I would love to take a stab at making an Elasticsearch provider, if I can find the branch containing these changes.

Thomas Rydeen Skyldahl

Nov 30, 2018 @ 20:27

I tried getting Examine working with Umbraco in V6, but back then there were too many hard bindings on raw Lucene queries between Umbraco and Examine to make it feasible. I did get Shannon to make the hard instancing of specific providers go away, though.

I would be glad to give a pure provider a stab, but I would make a clean interface from the backend and let the implementer decide what to index. There is no need to include the field configuration and all that in the core provider.

Morten Bock

Dec 01, 2018 @ 10:41

I don't like that approach, because that moves the responsibility for the behaviour from Core to the provider, which means that behaviour would change when changing provider.

Also, each provider would basically duplicate the logic for when/what to index, which does not seem like it should be required for implementing a provider.

I guess a middle ground could be making a BaseIndexer in core that you could inherit to just make a simple indexer for a different data store. Then that base would contain all the default behaviour.
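As a rough illustration of that middle ground, the sketch below uses a template-method style base class. `BaseIndexer`, `DictionaryIndexer` and the "trashed items are removed" rule are assumptions made up for the example, not existing Umbraco types or behaviour.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical base class: Core would own the "when/what to index" rules.
public abstract class BaseIndexer
{
    // Default behaviour shared by every provider.
    // The trashed check is only an example of a Core-owned rule.
    public void IndexItem(string id, IDictionary<string, string> fields, bool isTrashed)
    {
        if (isTrashed)
        {
            Delete(id);
            return;
        }

        Store(id, fields);
    }

    // Provider-specific persistence: the only part an implementer writes.
    protected abstract void Store(string id, IDictionary<string, string> fields);
    protected abstract void Delete(string id);
}

// Toy data store standing in for a real engine.
public sealed class DictionaryIndexer : BaseIndexer
{
    public Dictionary<string, IDictionary<string, string>> Items { get; } =
        new Dictionary<string, IDictionary<string, string>>();

    protected override void Store(string id, IDictionary<string, string> fields) => Items[id] = fields;

    protected override void Delete(string id) => Items.Remove(id);
}
```

The subclass never decides whether an item should be indexed, so that behaviour would stay the same across providers.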

In any case, it all depends on what Shannon is currently doing for V8, and what direction they want to go with it.

Thomas Rydeen Skyldahl

Dec 01, 2018 @ 16:04

If you want to provide a way to map the fields, it has to be open to extension, as that would enable each provider to get the specific mappings it supports.

But there are some big gaps between what different search solutions support, e.g. special language mapping for text strings, facets/aggregations, indexing and the use of index aliases, and geo search. Those could never all be supported from a common configuration.

Of course the default provider should support basic configuration options, but don't force them on the other providers.

Abstract the different places Examine is used inside Umbraco Core with interfaces, and let the provider implement them independently (put every scenario into its own interface).

The interfaces should be as simple as ResultModel Search(string terms);

Then it's easy to implement search for the backend in Umbraco without forcing a specific configuration model on implementers.
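A small sketch of what "one interface per scenario" could look like follows. Only the `ResultModel Search(string terms)` signature comes from the post. The interface names, the shape of `ResultModel` and the toy implementation are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical result type; the post only names it "ResultModel".
public sealed class ResultModel
{
    public ResultModel(IEnumerable<int> nodeIds) => NodeIds = nodeIds.ToList();

    public IReadOnlyList<int> NodeIds { get; }
}

// One small interface per place where Core needs search (names are assumptions).
public interface IContentTreeSearch
{
    ResultModel Search(string terms);
}

public interface IMediaTreeSearch
{
    ResultModel Search(string terms);
}

// A provider can implement only the scenarios it cares about.
public sealed class NameLookupContentSearch : IContentTreeSearch
{
    private readonly IReadOnlyDictionary<int, string> _namesById;

    public NameLookupContentSearch(IReadOnlyDictionary<int, string> namesById) => _namesById = namesById;

    // Case-insensitive substring match on the node name, ordered by id.
    public ResultModel Search(string terms) =>
        new ResultModel(_namesById
            .Where(kv => kv.Value.IndexOf(terms, StringComparison.OrdinalIgnoreCase) >= 0)
            .Select(kv => kv.Key)
            .OrderBy(id => id));
}
```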

Morten Bock

Dec 02, 2018 @ 21:18

I think the provider model should be as simple as possible, so that any provider should be able to support all back office features without additional configuration. This ensures site builders can swap provider, without being an expert in the underlying engine for a given provider.

If the site implementation then needs geo search, or other more complex features, then the implementation should just go straight to that search engine and not through the provider. I don't see a need for the provider interface to provide anything more than simple text search capabilities for the frontend.

Each provider could also just have its own set of config options to enhance the indexing of data, such as language-specific analyzers. But I don't think that part should be abstracted into the core config.

Core config should probably just be the list of indexes and fields/doctypes to index. I think that would cover 90% of all use cases. Providers can then offer additional config options for cases that need more tweaking.

Shannon Deminick

Dec 03, 2018 @ 10:40

Hi all,

I've replied about this a little on this thread: https://github.com/umbraco/Umbraco-CMS/issues/3780

But I'll keep the chat in here. So here's the info:

In v7 it is hard to decouple Examine and Lucene, mostly because the media cache is Lucene. That said, it is possible, but...

There's some things to know first :)

1) First thing to know is that Examine is an abstraction, we don't really need to go create a whole other abstraction for indexing and searching but we do need to decouple Umbraco itself from the specific Lucene provider implementation in Umbraco.

2) The back office search is decoupled from Examine and Lucene using things called ISearchableTree and these are what power the main back office search. You can implement your own and replace the ones that exist. This goes for every tree that is searchable.

3) Right now, list view searches are powered by the database, not by Examine or Lucene

Due to v7's media cache requirements, we can't get away from Lucene syntax without forcing you to use entirely different APIs for dealing with media, but this is actually "OK" because both Elastic Search and Azure Search are based on Lucene (more or less) and support Lucene syntax and analyzers. I have a working prototype of Umbraco v7 running with a custom Examine version without any breaking changes (which means there's some ugliness) that uses Azure Search and the only thing that needed changing was to use a custom Azure Search Examine provider. Because Umbraco just uses Examine APIs for indexing and searching it 'just works'.

You can see the prototype in these code bases
- https://github.com/Shazwazza/Umbraco-CMS/tree/azure-search
- https://github.com/Shazwazza/Examine/tree/azure-search

If we use Examine as the abstraction then there's not a lot of implementation to be done and Umbraco will just work with it. If you want to go outside of Examine, then you would have to implement all of your own indexing with all of your own logic, event handlers, etc., and there are definitely a few 'gotchas' involved with doing this. It would of course be possible to create custom abstractions to deal with all of this from within Umbraco, but the effort involved in doing that in v7 is very substantial, especially considering breaking changes.

I don't really plan to pursue these changes for v7 due to the inflexibility of non-breaking changes.

So here's what I've been working on in v8

Some things to know first:

1) First thing to know is that v8 does not cache media in Lucene, so we don't have this underlying problem of indexes + media cache + hard dependency on Lucene syntax.

2) Index rebuilding on startup isn't part of Examine, it's up to Umbraco to do this and this now happens on a delayed background thread since it's not critical for your indexes to be built for your site to startup like it is in v7

3) Examine comes with a FluentAPI which is an abstraction and Umbraco still exposes this API with the IPublishedContentQuery.Search method

What's happening now:
- Examine 1.0.0 major version will be released, which runs Lucene 3.0.3 and is massively simplified, trimmed back, faster, easier to work with and exposes a much better abstraction
- Umbraco v8 has no hard dependency on any specific Examine provider, only the abstractions
- A lot of the work is already done but it's still a WIP and I'm hoping to have it merged into the temp8 branch within a couple of weeks, see https://github.com/umbraco/Umbraco-CMS/pull/3760 + https://github.com/shazwazza/examine/tree/v1.0

So where do we go from here?

It should be possible to create an Azure Search and Elastic Search provider for Examine and swap the built in ones for those ones in Umbraco v8 ... but we'll need to confirm this viability soon. We can use some of the code from the prototype I made.

If you really don't want to use Examine even if you can have it interfacing with hosted search engines, then your options are:
- Disable Examine on startup - you will be able to do this if you wanted
- You can already replace the ISearchableTrees in the back office so you can do that with your own searches
- For the front-end, you won't be able to use the IPublishedContentQuery.Search methods, since those are tied to Examine (they use Examine APIs with ISearchCriteria), so you'll just have to use your own search logic

So all that would be feasible if you wanted. The only optional thing that we could do is create a detailed interface for indexing actions based on events that occur in Umbraco, so that you don't have to jump through so many hoops to make that happen. This isn't planned, since I think that the Examine abstractions should suffice, but we can always add this later if it becomes really important.

The main thing that I want to make happen is that it is easy to swap the indexing/searching for Umbraco to use the main hosted search tools like Azure Search, Elastic Search and probably Solr.

What about more advanced things like facets, etc.? Well, like Thomas mentions, creating an abstraction for all aspects of search is near impossible. The ISearchCriteria really tries to cover all common scenarios to have a strongly typed search language, but it can't cover everything, and that's OK! The back office doesn't use these features and it probably won't, nor do Umbraco APIs, so if you need to use these features on your website, then you would just use the provider directly.

Thomas Rydeen Skyldahl

Dec 04, 2018 @ 07:09

Hi Shannon

V8 sounds a lot better, especially that we get rid of Examine for media queries and the whole indexing on startup.

I haven't had a play with the ISearchableTree yet.

In V7, the issue with the raw Lucene queries inside Umbraco is more the fact that they include specific fields in the queries, which means that the implementer either has to replace the fields in the raw query or have the exact same field mapping as Examine in their index. That kind of forces the way the implementation has to look. This is where I would like the abstractions to go, to avoid this kind of binding.

Shannon Deminick

Dec 04, 2018 @ 07:25

ISearchableTree is a very simple interface; it just takes some search text. It might be extended for list view searches at some stage, but it will remain a simple interface with no field requirements, etc.

Umbraco v8 only constructs raw Lucene queries for its own ISearchableTree implementations to pass to Examine. Since you can replace ISearchableTree, that should be OK, and since the 3 main hosted Examine providers that I want to see working (Azure Search, Elastic Search and Solr) will support this syntax and will automatically be indexing the correct fields, it should work for those without having to replace ISearchableTree. Potentially these queries can be built with ISearchCriteria instead, but I'm not sure yet.

It should still be possible to replace Examine entirely if you really want to, so long as you don't use the UmbracoHelper.Search method tied to ISearchCriteria, and you don't need to know about fields being searched on, etc.

Confirming this is possible is part of the current work being done.

Morten Bock

Dec 04, 2018 @ 19:44

It would be great if the core ISearchableTree implementations did not use raw Lucene queries. Then the Examine provider would work as a single abstraction for search.

I will definitely give it a go with an Elasticsearch provider.

Shannon Deminick

Dec 05, 2018 @ 04:11

If we can make the ISearchableTree implementations use ISearchCriteria instead of raw Lucene queries then yes, that will be fine. But like I said, even with raw Lucene queries passed to Examine, Elastic Search, Azure Search and Solr Examine providers will still work with this. It will probably be rare to use something that is not one of these three, but in that case you could swap the ISearchableTree.

Morten Bock

Dec 05, 2018 @ 07:50

Wouldn't a raw Lucene query make assumptions about the structure of the document in Elasticsearch?

Another question: How will the new language versions be handled with regard to indexing? Will there be an index per language? Are the fields going to be prefixed? Will a query contain the intended language?

Shannon Deminick

Dec 05, 2018 @ 08:21

Yes, but Examine is in charge of indexing and searching so the correct stuff would go into the index to match any of the searches. Even if ISearchCriteria is used assumptions are made that there are certain fields and data that need to exist. If you want a totally different/custom index structure, etc.. then you'll need to replace ISearchableTree and keep your own index based on your own events.

Variants/languages are interesting. For the first iteration, all languages will be indexed into a single document with suffixed names (e.g. nodeName_en-us, nodeName_es-es). Each field can have its own analyzer.

Down the road we might decide to have an index per language or have it as an option, but it would be possible to configure separate indexes per language on your own if you wanted.

There are pros/cons to each of these but we've chosen a single index because this is easier to implement for now and it more easily allows contextual searching. What a single index doesn't allow for is to show a search result row per language whereas multiple indexes would allow that and vice-versa.

"Will a query contain the intended language?"

If you are referring to ISearchableTree then at the moment the plan is to not have that, however based on my work next week that could change so stay tuned for updates on that. As far as Examine goes, searching for values via languages is just done by searching on specific language fields.
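To picture what "searching on specific language fields" means with the suffixed names mentioned above, here is a tiny sketch. `CultureFields.ForCulture` is a hypothetical helper, not an Umbraco or Examine API. It only reproduces the `nodeName_en-us` naming pattern from this post.

```csharp
using System;

public static class CultureFields
{
    // Builds a culture-suffixed field name, e.g. "nodeName" + "en-US" gives "nodeName_en-us".
    public static string ForCulture(string fieldName, string cultureName)
    {
        if (string.IsNullOrWhiteSpace(fieldName))
            throw new ArgumentException("A field name is required.", nameof(fieldName));
        if (string.IsNullOrWhiteSpace(cultureName))
            throw new ArgumentException("A culture name is required.", nameof(cultureName));

        return fieldName + "_" + cultureName.ToLowerInvariant();
    }
}
```

A query for a German visitor would then target `CultureFields.ForCulture("nodeName", "de-DE")` instead of the bare field name.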

Morten Bock

Dec 05, 2018 @ 08:30

I was mostly thinking of the search helpers used by site builders for the frontend. So if a user is browsing the German site, would any calls to IPublishedContentQuery.Search get the de-de postfixes added to the field names before the query is sent to the Examine provider?

Morten Bock

Dec 05, 2018 @ 08:45

Hmm, my reply seems to have gone missing :D

I was mostly thinking of when users are searching on the site. So if a visitor is on the German site, will calls to IPublishedContentQuery.Search automatically get the de-de postfixes added before hitting the Examine provider?

Shannon Deminick

Dec 05, 2018 @ 08:59

Ah right, yup, something like that would make sense. I haven't got around to that API yet, but I'll make a note about that, thanks!

Morten Bock

Jan 03, 2019 @ 09:51

Hi Shannon

I was just wondering what the status is on this? I forked the current temp8 repo/branch, and as far as I can see, the LuceneIndex is still used directly from Umbraco?

Is it still the plan to remove that hard dependency, to enable other Examine providers to be used?

Morten Bock

Jan 03, 2019 @ 14:35

I've now also spent some time with the Examine V1.0 branch.

The index provider interface now seems cleaner. No more XML, which is nice :)

When looking at the IQuery interface, which supports the fluent query creation, it started getting harder to follow the logic. There is a whole lot of logic going on with regard to nesting and/or/not queries.

Taking on the task of implementing an Elasticsearch version of IQuery seems like a rather large chunk of work. I would be reluctant to start that task before also having created a test suite that proves that a query would deliver the same results across providers.

There also seems to be a broken abstraction where the IBooleanOperation inherits IOrdering, which means that you can add sorting on a nested boolean query, which does not really make sense?

I completely understand the idea of abstracting the query from the engine, but I do wonder if it adds enough value, as opposed to the complexity it adds? Would it make sense to make a simpler shared query interface that should work across providers, and let any more advanced queries be done with provider-specific queries?

All IQuery usages in Core could be isolated in a replaceable search facade in Umbraco Core, which could then be implemented by engine specific versions.

Packages would need to restrict themselves to the simpler search interface, but I don't know how many packages are utilizing advanced search currently?

These are just some thoughts from diving a bit further into the task. I don't think I would be able to complete an Examine provider as fast as I would have liked to :)

Shannon Deminick

Jan 07, 2019 @ 05:10

Some answers inline:

"Taking on the task of implementing an Elasticsearch version of IQuery seems like a rather large chunk of work. I would be reluctant to start that task before also having created a test suite that proves that a query would deliver the same results across providers."

Sure, it's a bit of a process to implement; search is complicated ;) A test suite would be near impossible to make, because any implementation will internally create a different query mechanism. The Lucene index will create a Lucene query; an Elasticsearch provider might not. It might build up some other query to be sent with the REST request, or maybe it will just stick with a Lucene query, not sure.

"There also seems to be a broken abstraction where the IBooleanOperation inherits IOrdering, which means that you can add sorting on a nested boolean query, which does not really make sense?"

Good catch! I will update this. This is in flux at the moment; the interfaces were being mashed together to come up with a nicer fluent syntax (i.e. no longer having a strange Compile method or being able to order by in the middle of a query).

"I completely understand the idea of abstracting the query from the engine, but I do wonder if it adds enough value, as opposed to the complexity it adds? Would it make sense to make a simpler shared query interface that should work across providers, and let any more advanced queries be done with provider-specific queries?"

Not really sure what you are expecting here tbh. You want to be able to implement a simple search functionality with no bells and whistles, but why? Maybe you are thinking along the lines of having a generic implementation like ISearcher + ISearcher<TQuery> where TQuery : IQuery, where IQuery is a stripped-down version of what it currently is; then you'd have LuceneSearcher<LuceneQuery>. I made a POC for this a couple of years ago and it ended up being extremely complex with generics, but lots has changed since then, so it might still be viable... is this what you are referring to?
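For readers trying to picture the generic shape described here, the following is one possible reading of it, written as a sketch. `IQuery`, `ISearcher` and `ISearcher<TQuery>` below are simplified stand-ins declared for the example, not Examine's real interfaces. `ToyLuceneQuery` and `ToyLuceneSearcher` are invented names, and the "fuzzy" matching is a deliberately crude placeholder.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stripped-down query contract (hypothetical; not Examine's IQuery).
public interface IQuery
{
    string Text { get; }
}

// Non-generic entry point for simple text search.
public interface ISearcher
{
    IEnumerable<string> Search(string text);
}

// Generic entry point for provider-specific query types.
public interface ISearcher<TQuery> : ISearcher where TQuery : IQuery
{
    IEnumerable<string> Search(TQuery query);
}

// Provider-specific query carrying an extra, engine-specific option.
public sealed class ToyLuceneQuery : IQuery
{
    public ToyLuceneQuery(string text, bool fuzzy)
    {
        Text = text;
        Fuzzy = fuzzy;
    }

    public string Text { get; }
    public bool Fuzzy { get; }
}

public sealed class ToyLuceneSearcher : ISearcher<ToyLuceneQuery>
{
    private readonly IReadOnlyList<string> _titles;

    public ToyLuceneSearcher(IReadOnlyList<string> titles) => _titles = titles;

    // The simple overload delegates to the provider-specific one.
    public IEnumerable<string> Search(string text) => Search(new ToyLuceneQuery(text, false));

    public IEnumerable<string> Search(ToyLuceneQuery query)
    {
        // "Fuzzy" is a crude stand-in: match on the first three characters only.
        var needle = query.Fuzzy && query.Text.Length > 3 ? query.Text.Substring(0, 3) : query.Text;
        return _titles.Where(t => t.IndexOf(needle, StringComparison.OrdinalIgnoreCase) >= 0).ToList();
    }
}
```

Code that only needs plain text search can depend on `ISearcher`, while code that knows the engine can use the typed overload.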

There are 2 things: Examine and Umbraco, by default Umbraco will use Examine and Examine itself is an abstraction. In Umbraco the 'simple' abstraction that can be replaced is ISearchableTree. The only place Examine APIs are referenced publicly is in the 2 overload methods: IPublishedContentQuery.Search with IQueryExecutor. It will be possible to replace the IPublishedContentQuery implementation too if you wanted which means you can wire up anything behind the simple search overload method that accepts a string term. If you want to replace the simple back office search you can easily with ISearchableTree. If you then want to use your own search functionality on your front-end you can do whatever you want but then you are also in charge of indexing which is currently an ugly task since you'll have to replicate all of the event logic.

I have a WIP branch in Examine to implement an Azure Search provider, which is partially working. It re-uses much of the Lucene searching logic because it also just builds up a Lucene query, see the branch: https://github.com/shazwazza/examine/tree/azure-search-1.0

Ultimately I would want the implementation to be easier. Part of the problem is that the current implementation is more complex than it needs to be, but I don't have time to unwind all of that, and much of it is many years old. The Azure Search branch starts simplifying it a little bit (as far as the Azure Search provider goes).

Morten Bock

Jan 10, 2019 @ 23:41

I'm not sure I have any good suggestions that are realistic to do before v8 is released.

I think my main issue is that I find a fluent interface, such as IQuery, really hard to implement. If I were to reinvent search in Umbraco, I think I would prefer that Core expose an ISearchProvider interface, with methods such as:
```
IndexContent(IContent content)
IndexPublishedContent(IPublishedContent publishedContent)    
IndexMediaContent(IMedia media)
```
And for searching something like
```
SearchContent(string searchTerms, string[] doctypes, string[] properties)
SearchPublishedContent(string searchTerms, string[] doctypes, string[] properties)
SearchMediaContent(string searchTerms, string[] mediatypes, string[] properties)
```
I think that kind of interface would be enough to support the indexing and searches performed by core, as well as cover 80% of sites implementations when it comes to searching site content.
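To show how small an implementation of such an interface could be, here is a self-contained sketch for the content case only. `ContentStub` is a stand-in for Umbraco's `IContent` so the example compiles on its own. The return type and the matching rules (any term, case-insensitive, within the listed properties) are assumptions, since the signatures above leave them open.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for Umbraco's IContent so the sketch is self-contained.
public sealed class ContentStub
{
    public ContentStub(int id, string docType, IDictionary<string, string> properties)
    {
        Id = id;
        DocType = docType;
        Properties = new Dictionary<string, string>(properties);
    }

    public int Id { get; }
    public string DocType { get; }
    public IReadOnlyDictionary<string, string> Properties { get; }
}

// Return type is an assumption; the signatures above do not specify one.
public interface ISearchProvider
{
    void IndexContent(ContentStub content);
    IReadOnlyList<int> SearchContent(string searchTerms, string[] doctypes, string[] properties);
}

public sealed class InMemorySearchProvider : ISearchProvider
{
    private readonly Dictionary<int, ContentStub> _items = new Dictionary<int, ContentStub>();

    public void IndexContent(ContentStub content) => _items[content.Id] = content;

    public IReadOnlyList<int> SearchContent(string searchTerms, string[] doctypes, string[] properties)
    {
        var terms = searchTerms.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Keep items of a requested doctype where any listed property contains any term.
        return _items.Values
            .Where(c => doctypes.Contains(c.DocType))
            .Where(c => properties.Any(p =>
                c.Properties.TryGetValue(p, out var value) &&
                terms.Any(t => value.IndexOf(t, StringComparison.OrdinalIgnoreCase) >= 0)))
            .Select(c => c.Id)
            .OrderBy(id => id)
            .ToList();
    }
}
```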

It would also be quite easy to implement providers for different search engines, regardless of whether they are based on Lucene or some cloud-based API.

And if you need to do more complex queries on a site, just go directly to the provider you're using on the site and use whatever DSL they have exposed for querying their platform.

I realize this is not a small change, but just wanted to explain what I meant by having a simpler mandatory interface. I think the IQuery interface is too complex for most of us to write an implementation for. And I do think there is a need for non-file based search providers.

Shannon Deminick

Jan 10, 2019 @ 23:59

I think I mentioned the indexing part above. I agree that it would be nice to expose an interface for indexing, and that can still be possible in 8.x, but there's no time to do that at the moment. The interface won't be as simple as you suggest though ;) There is a lot to consider when indexing based on what happens within Umbraco, like whether an entire branch needs to be removed, updated, etc. But in any case an interface like that would be good.

For searching... I feel like I'm repeating myself... you can just replace the ISearchableTrees for any of the trees, and it's a very simple interface.

That would cover all searching that is done within Umbraco. Then on your front-end you can do whatever you want.

Implementing IQuery can be a great thing because it will just happily work with Umbraco directly without modifying anything, but like I've said before, you are not forced to do this. ISearchableTree, and potentially the indexing interface stuff mentioned above, will get you to where you want to be.

Shannon Deminick

Jan 11, 2019 @ 00:04

Even without the indexing interface mentioned here, you can still implement an Examine index provider to handle the indexing; that's a reasonably easy thing to implement. Then you have your indexing working with your own provider. For search, if you don't implement an Examine search provider with IQuery, you can throw a NotImplementedException if you want, then replace the ISearchableTree with your own, and on your front-end do whatever you want.
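The "index only" idea can be sketched like this. `IToyIndex` and `IToySearcher` are simplified, hypothetical contracts written for the example. They are not the Examine interfaces, whose members are not shown in this thread. Only `NotImplementedException` is a real .NET type here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-ins for an index contract and a searcher contract.
public interface IToyIndex
{
    void IndexItems(IEnumerable<KeyValuePair<string, string>> items);
}

public interface IToySearcher
{
    IEnumerable<string> Search(string text);
}

// Supports indexing, but deliberately refuses to search.
public sealed class IndexOnlyProvider : IToyIndex, IToySearcher
{
    private readonly Dictionary<string, string> _store = new Dictionary<string, string>();

    public int Count => _store.Count;

    public void IndexItems(IEnumerable<KeyValuePair<string, string>> items)
    {
        // Upsert each item by key.
        foreach (var item in items)
        {
            _store[item.Key] = item.Value;
        }
    }

    // Searching happens elsewhere (for example in a replaced tree search),
    // so any call that reaches this method is a wiring mistake.
    public IEnumerable<string> Search(string text) =>
        throw new NotImplementedException("Search is handled outside this provider.");
}
```

Throwing makes a missed replacement fail loudly, where returning an empty result would hide it.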

Morten Bock

Jan 11, 2019 @ 09:26

I'm sure the interfaces would need more than what is in my sample. It was just an indication of what type of interface :)

I did think about just not supporting the IQuery as you said. I was just not sure of the consequences.

If I were to implement ISearchableTree, can I do that without reimplementing the entire ContentTreeController, for example? That class contains a lot of code that has nothing to do with search.

EDIT: Looks like that is indeed possible: https://our.umbraco.com/documentation/Extending/Section-Trees/Searchable-Trees/#replacing-an-existing-section-tree-search-searchabletreeresolver

I still think that it is a bit weird that replacing the search engine is a combo of making half an Examine provider, and replacing X number of ISearchableTree implementations. It would be nice for example to also have an interface for the Examine Dashboard features. Supporting reindexing etc. for any provider.

Shannon Deminick

Jan 15, 2019 @ 02:45

Yes ISearchableTree is just a simple interface and you can replace it for any tree.

I'm just trying to give you some options that you 'could' do. I'm not saying that you should just go ahead and make half of an Examine provider... but I am saying that you could.

Each tree can have a different search method. Just like what we ship with: some trees search via the DB, some search via Examine. We can't just create a single interface to search all trees, and we have to consider that developers have their own trees. You don't want to replace all searching for all trees all of the time; that would be worse than "replacing X number of ISearchableTree implementations".
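The per-tree idea can be pictured as a registry keyed by tree alias, where only the trees you care about get a different searcher. This is a sketch under stated assumptions: `ITreeSearcher`, `DelegateTreeSearcher` and `TreeSearcherRegistry` are hypothetical names, and Umbraco's real ISearchableTree members are not reproduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical per-tree search contract (not the real ISearchableTree).
public interface ITreeSearcher
{
    string TreeAlias { get; }
    IEnumerable<string> Search(string query);
}

// Small adapter so a searcher can be defined inline from a function.
public sealed class DelegateTreeSearcher : ITreeSearcher
{
    private readonly Func<string, IEnumerable<string>> _search;

    public DelegateTreeSearcher(string treeAlias, Func<string, IEnumerable<string>> search)
    {
        TreeAlias = treeAlias;
        _search = search;
    }

    public string TreeAlias { get; }

    public IEnumerable<string> Search(string query) => _search(query);
}

public sealed class TreeSearcherRegistry
{
    private readonly Dictionary<string, ITreeSearcher> _byAlias =
        new Dictionary<string, ITreeSearcher>(StringComparer.OrdinalIgnoreCase);

    // Registering a second searcher for the same alias replaces the first.
    public void Register(ITreeSearcher searcher) => _byAlias[searcher.TreeAlias] = searcher;

    // Unknown trees simply return no results.
    public IEnumerable<string> Search(string treeAlias, string query) =>
        _byAlias.TryGetValue(treeAlias, out var searcher)
            ? searcher.Search(query)
            : Enumerable.Empty<string>();
}
```

Replacing the "content" searcher leaves a database-backed "users" searcher untouched, which matches the per-tree replacement described above.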

Ideally we have an indexing interface baked into the CMS that you can implement to index your own data and then replace whatever ISearchableTree you need to (in this case, the only ones using Examine are content, media, members IIRC)

Shannon Deminick

Jan 15, 2019 @ 02:48

I also don't know what this means:

"It would be nice for example to also have an interface for the Examine Dashboard features. Supporting reindexing etc. for any provider."

We do have that. The reindexing button is decided based on an IIndexPopulator, and other things that configure the dashboard come down to the provider itself and whether it implements IIndexDiagnostics.
