Status of Examine and Azure blob storage providers

Darren Ferguson 1022 posts 3259 karma points MVP c-trib

Sep 05, 2014 @ 09:39

Hello All,

Previously when deploying Umbraco to Azure it was recommended to use a different set of provider DLLs - https://github.com/Shandem/Examine/wiki/ExamineWithAzure

The first thing we would like to query is why this was necessary. Is this a hangover from WebRoles, or is there still a reason to use these different DLLs in an Azure VM or Azure Websites setup?

Secondly - I know that in a load balanced environment it was recommended to exclude Examine indexes from DFS file replication and have each instance create its own Examine index (I think because writes to Lucene indexes are not thread safe). Is this still the recommendation for multiple instance Azure deployments, and if so, what is the solution for Azure Websites (multi instance), as they use a common filesystem and one can't "exclude" stuff? I assume the solution is to use the Azure providers mentioned in the first question?

Lastly, are the Azure DLLs still maintained as part of the project? I've pulled the current binary release from github and they aren't included - do I need to build those DLLs separately?

Thanks for any light that you can shed upon this.

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Sep 08, 2014 @ 08:19

Hi,

I'll update the docs regarding the Examine/Azure stuff. The library that it uses is still active: https://azuredirectory.codeplex.com/ but they've taken it off of Nuget for whatever reason - maybe it has its own issues, I'm not sure.

Here are some things to know:
- WAWS files run on a shared file share, which means there is network latency, and Lucene has problems with this. The AzureDirectory stuff solved this problem because it would copy from Blob (central) storage to the local CodeGen/Temp folder on the local server. If you try to run Lucene off of the default App_Data, you'll see latency issues if you have a very active index
- Azure on VMs is different: if you are running off local storage, then it's just a normal server/setup and you can use normal Lucene/Examine

If you are load balancing (LB), then of course things become tricky:
- You cannot properly LB on WAWS - I have been working on prototypes that could work on that environment but I haven't had much time to get it fully operational yet
- LB on Azure VMs is possible since you are in charge of how it works, so you just follow the normal LB docs

If you are using WAWS on a single server setup:
- You can continue to use Examine.Azure and UmbracoExamine.Azure and everything should work as before, but you will have to build the libs from the codebase since the ones that have been published are quite out of date. Note: I've taken down the listing of Examine.Azure from Nuget since the AzureDirectory package it references has also been taken down - so you'll have to build it all from code
- You can try something new: https://github.com/Shandem/UmbracoExamine.TempStorage - just don't try the 'syncStorage' option for now as there are some issues to iron out with that. What this will do is store your index in your normal App_Data folder, but it will copy the index to your local CodeGen/Temp folder in local storage. When the index is written to, it will write to both places, but reading only happens from the local storage so you don't have latency issues. This is very similar to the AzureDirectory, but it doesn't use blob storage
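
As a rough illustration of the "write to both places, read only from local storage" idea described above, here is a minimal sketch. It is written in Python purely to show the concept; it is not code from UmbracoExamine.TempStorage (which is .NET and works through Lucene's own Directory abstraction), and the class name `DualWriteStore` and its methods are hypothetical.

```python
import tempfile
from pathlib import Path


class DualWriteStore:
    """Hypothetical sketch: write to two folders, read from the fast local one."""

    def __init__(self, shared_dir: Path, local_dir: Path) -> None:
        self.shared_dir = shared_dir  # stands in for App_Data on the network share
        self.local_dir = local_dir    # stands in for the server-local CodeGen/Temp folder
        self.shared_dir.mkdir(parents=True, exist_ok=True)
        self.local_dir.mkdir(parents=True, exist_ok=True)

    def write(self, name: str, data: bytes) -> None:
        # Every write goes to both locations so the shared copy stays current.
        for folder in (self.local_dir, self.shared_dir):
            (folder / name).write_bytes(data)

    def read(self, name: str) -> bytes:
        # Reads never touch the network share, which avoids the latency problem.
        return (self.local_dir / name).read_bytes()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        store = DualWriteStore(Path(root) / "App_Data", Path(root) / "CodeGen")
        store.write("segments_1", b"index bytes")
        print(store.read("segments_1"))
```

The trade-off in a design like this is that writes cost twice as much, while reads (by far the more frequent operation for a search index) stay on local disk.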

Darren Ferguson 1022 posts 3259 karma points MVP c-trib

Sep 08, 2014 @ 09:11

Thanks Shannon - for the very detailed response.

We'll try and set up in the fashion detailed by your latest docs. I guess my last question is: is the Azure Directory stuff "thread safe"? E.g. if multiple instances in a load balanced cluster get a publish event and try to update their Examine indexes, are writes through Azure Directory marshalled correctly?

I'm asking because we have a workaround in place to load balance on WAWS.

Thanks again!

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Sep 08, 2014 @ 09:12

That's what their docs say but you'd want to test it as it's a bit vague on the level of LB support.

Darren Ferguson 1022 posts 3259 karma points MVP c-trib

Sep 08, 2014 @ 10:03

Thanks Shannon - our customer is reporting a missing method when using the Azure Directory provider.

They report:

When you modify a document type (e.g. add a property or change ‘allow at root’) and click Save, the UI doesn’t update properly, the AJAX request returns a 500 response visible in the Chrome developer tools, and you get an error in the event log (attached). When you refresh you find the document type was updated. The error is: “Method not found: 'Void UmbracoExamine.UmbracoContentIndexer.RefreshIndexerDataFromDataService()'”. When you look at that method in the source on github, it was introduced in Umbraco 7.1.

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Sep 08, 2014 @ 10:30

Hi Darren,

What 'provider' are you referring to? Did you build it all from code or?

Darren Ferguson 1022 posts 3259 karma points MVP c-trib

Sep 08, 2014 @ 12:38

I'll go grab the config and post it here!

Darren Ferguson 1022 posts 3259 karma points MVP c-trib

Sep 09, 2014 @ 15:28

So we are set up using:

UmbracoExamine.Azure.AzureContentIndexer and UmbracoExamine.Azure.AzureSearcher

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Sep 10, 2014 @ 03:50

Darren, I need to know where you got these DLLs. As I mentioned in my first reply, you will need to build these from the sources; we don't ship these DLLs, and any DLLs that have been previously shipped will be drastically out of date.

Darren Ferguson 1022 posts 3259 karma points MVP c-trib

Sep 10, 2014 @ 12:58

Thanks Shannon, I'll go and make sure that they are built against the latest version of Examine!

Darren Ferguson 1022 posts 3259 karma points MVP c-trib

Nov 26, 2014 @ 16:54

Hi Shannon,

I've picked up on this after a while now. We are trying to build an environment where the back office is a VM and the front end(s) are multiple Azure websites.

The AzureDirectory seems ideal for this - the back office can write to a blob and the multiple front ends can query it (we've dealt with content cache, media and session state across Azure websites already).

I took Umbraco 6.2.4 and built it from source to give me: UmbracoExamine.Azure.dll and it also grabbed

AzureDirectory.dll, Microsoft.WindowsAzure.ServiceRuntime.dll and Microsoft.WindowsAzure.StorageClient.dll

I then had to go to the Examine repository on github to build: Examine.Azure.dll

After setting up configuration the container for my index is created in blob storage - but no content.

A lot of digging later, it seems that the EnsureIndex routine that creates the index leaves a write.lock in place.

If I manually clear the lock and try to index some content, the lock file re-appears.

It seems that every instantiation of an IndexWriter using an AzureDirectory locks the index.

Is this something you've encountered? I notice that the Azure DLLs are removed from Umbraco 7 and I guess that this isn't any longer a supported way of doing things?

Lastly, I know that you are working to make Umbraco work with Azure websites - what will the supported solution for Examine be in the end?

Thanks for all of your help on this!

Darren.

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Nov 27, 2014 @ 02:33

Hi Darren,

The AzureDirectory package has been taken down from Nuget, whether or not that is because it's not supported anymore or just has issues I'm not sure. The source still exists here: http://azuredirectory.codeplex.com/

But the latest also targets lucene 3.x whereas Examine does not (not until v2 which will be out later on... but won't really be usable in Umbraco until v8 due to compatibility issues).

The AzureDirectory implementation was always a proof of concept, the testing that I did with it was always very simple and never done on any live site. I think it would always be tricky to keep blob storage in sync with local storage especially in a load balanced form... since I'm not sure how you would guarantee what true order things should be written to the index since there will be latency issues.

Umbraco will work on Azure websites but not in a load balanced kind of way. I've been meaning to write a blog post on this for quite some time but haven't got around to it yet. If you don't already know, Azure Websites in a load balanced configuration (in a single region) will use the same file system share for all IIS instances. This means that you cannot have Lucene exist 'locally' since 'locally' isn't really true: it's the same files between all servers.

A temporary solution is by using this library: https://github.com/Shazwazza/UmbracoExamine.TempStorage and ensuring that syncStorage="false". This will store the Lucene indexes in the true server local CodeGen folder. This is also what AzureDirectory attempted to do as well btw. The caveat is that the CodeGen folder can be blown away whenever the /bin or global.asax files change which means when that happens, the indexes will rebuild themselves.

An alternative solution is to implement your own Examine providers to ensure that the indexes are written to server name specific folders so they are not shared. So the indexes would be stored in ~/App_Data/TEMP/ExamineIndexes/{ComputerName}/. I could probably update that library to support that as well, then you don't need to write your own (or you can send a PR :)
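
To make the folder layout concrete, here is a minimal sketch of building a per-server index path. It is a Python illustration only, not an Examine provider: the function name `index_folder` is hypothetical, and `socket.gethostname()` merely stands in for the machine name a .NET provider would read.

```python
import socket
from pathlib import PurePosixPath
from typing import Optional

# Root folder taken from the path pattern mentioned above.
INDEX_ROOT = PurePosixPath("~/App_Data/TEMP/ExamineIndexes")


def index_folder(index_name: str, machine_name: Optional[str] = None) -> str:
    """Return an index path that is unique per server (hypothetical helper)."""
    # Fall back to this machine's host name when no name is passed in.
    machine = machine_name or socket.gethostname()
    return str(INDEX_ROOT / machine / index_name)


if __name__ == "__main__":
    print(index_folder("Internal", "WEB01"))
    print(index_folder("Internal"))
```

Because the machine name is part of the path, two servers sharing one file system never write to the same index folder.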

Lastly, a long term solution is a combination of this. Examine will store the index in both the machine name specific folder in the main file storage and will sync to the local CodeGen folder. When reading from the index, it will always be done from the local CodeGen folder to improve speed because there are latency issues with a shared file store since it is over a network. When writing to the index, it will be written to the local CodeGen folder and to the main file storage. On app startup, the index will be restored to the local CodeGen folder from the main file storage.

Then what we do is use the logic in the latest couple of commits of this branch: https://github.com/umbraco/Umbraco-CMS/tree/7.1.0-batchdistcalls to sync all of the servers. This sync happens based on a polling structure. With Azure websites, it's difficult to designate a 'master' server and then dumb front-ends unless you set up 2 environments and use some serious trickery to get the servers to talk to each other (would be interested to see how you've set that up, Matt Muller has done it this way: http://our.umbraco.org/projects/backoffice-extensions/rbcumbracoonazure ). So with this new setup, instead of sending out instructions to each registered server, we'd rather not have any servers even have to know about the other ones. When a distributed call needs to be made, this information is serialized and stored in a db instructions table. When another server receives a request (this is throttled btw), it checks if it's up-to-date with the instructions table; if it isn't, it receives all of the instructions (which is done exactly the same as how it receives a dist call). This means that you can theoretically scale your server numbers up/down and it will just work. I just need a lot more time to finalize this. The code in that branch works though btw.
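
The instructions-table approach can be sketched as follows. This is a simplified Python illustration using an in-memory SQLite database, not the code in the Umbraco branch; the table name, column names, and the `Server` class are all assumptions made for the example, and the request throttling mentioned above is left out.

```python
import json
import sqlite3


def create_table(db: sqlite3.Connection) -> None:
    # Hypothetical schema: an auto-incrementing id plus a serialized instruction.
    db.execute(
        "CREATE TABLE IF NOT EXISTS instructions ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
    )


def publish(db: sqlite3.Connection, instruction: dict) -> None:
    # A server records a distributed call instead of pushing it to its peers.
    db.execute(
        "INSERT INTO instructions (payload) VALUES (?)",
        (json.dumps(instruction),),
    )
    db.commit()


class Server:
    def __init__(self, name: str) -> None:
        self.name = name
        self.last_seen_id = 0  # highest instruction id this server has applied
        self.applied = []      # stand-in for cache refreshes already performed

    def catch_up(self, db: sqlite3.Connection) -> int:
        # Fetch only the rows this server has not processed yet.
        rows = db.execute(
            "SELECT id, payload FROM instructions WHERE id > ? ORDER BY id",
            (self.last_seen_id,),
        ).fetchall()
        for row_id, payload in rows:
            self.applied.append(json.loads(payload))
            self.last_seen_id = row_id
        return len(rows)


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    create_table(db)
    publish(db, {"refresh": "content", "id": 1234})
    late_joiner = Server("WEB07")
    print(late_joiner.catch_up(db), late_joiner.applied)
```

Note that no server needs a list of its peers: a newly added server starts with `last_seen_id = 0` and simply replays whatever is in the table, which is why scaling the server count up or down does not need extra configuration in this model.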

Darren Ferguson 1022 posts 3259 karma points MVP c-trib

Nov 27, 2014 @ 12:51

Hi Shannon,

Thanks for the detailed response - we've used and contributed to Matt Muller's package and also did our own initial POC here: https://github.com/darrenferguson/cloud-umbraco-cache-refresh (which took the approach of each instance checking a queue for cache updates).

We are pretty up to date with the single filesystem of Azure websites and the issue of having a single back office (with web hosting plans, you can scale a site independently, so you can have one Azure website for the back office (single instance) and another for the front end (multi instance), with media in CDN etc etc).

Making Examine write to a folder with the machine name sounds like a sensible approach and relatively easy to implement - though I'm not sure how to access the local CodeGen folders (unless this is just Path.GetTempPath?).

For me using AzureDirectory and blob was ideal because the back office could do the index writing and the multiple front ends could query it - no duplication of index writing. I did write some tests yesterday that just use raw Lucene to hammer a few hundred random documents into an index with no issues. It just occurs to me that maybe I should turn off Examine async and see if that solves the locking issue.

Thanks again for taking the time to help. Obviously if I make any headway with AzureDirectory or any of the other options I'll post them here and happily submit a PR if anything is useful.

Best,

D

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Nov 27, 2014 @ 22:58

Just be aware that you definitely get network latency with Lucene when reading from a central file store - I've had a few people tell me about this issue (including ourselves) and that is why I made this project:

https://github.com/Shazwazza/UmbracoExamine.TempStorage

I've also added the functionality of this project to Umbraco 7.1.9 with no extra option to 'syncStorage', this is true by default. I need to port back the updated syncing code from 7.1.9 to the TempStorage project and the syncStorage option should work. Combine that with storing the indexes on the main file store with machine name'd folders and this project would probably work well for you.

I don't think changing examine async will fix anything, not sure why it would. And be sure to never run examine in non-async on any server that is writing to the index since you'll get write problems. I have a feeling that maybe the current implementation of AzureDirectory on Lucene 2.9 has some issues.

To access the CodeGen folder, use: HttpRuntime.CodegenDir

Darren Ferguson 1022 posts 3259 karma points MVP c-trib

Nov 28, 2014 @ 10:03

Thanks Shannon, I'll take a look at the code in 7.1.9 then.

My only question would be, what is the overhead in copying from shared storage to codegen at startup time? Could it cause a large delay in new instances coming to life? I guess not, as Examine indexes are usually megabytes rather than hundreds of megabytes.

I'll let you know how I get on.

Interestingly - I switched Examine to async and got some documents into an index - but sporadically/randomly. It just wasn't reliable. I'll push on with the solution above and let you know how I go.

Cheers.

D

Darren Ferguson 1022 posts 3259 karma points MVP c-trib

Dec 03, 2014 @ 19:47

As a further update on this, I can have Examine write into the CodeGen folder - I can also have a second indexer writing into ~/App_Data/Temp/ExamineIndexes/<Environment.MachineName>

I think I still have a fundamental issue though.

When a new website instance spins up it can take a long time to generate its indexes from scratch when you have lots of content. This results in the new instance not coming to life for a couple of minutes, which isn't really workable.

I wanted to mess around with a new instance copying its index from the back office instance - but it gets very tricky. I need to know which instance is my back office and I'm unsure of the implication of trying to copy an index while it is "live". In this scenario

I'm aware that you can stop Examine building indexes at startup and have it run async in the background, but our app is all based around Examine Search so the instance is useless until it is indexed.

At this stage - I think we need to stick with a VM deployment with a finite number of instances in play and we can manually scale the individual instances (or add more).

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Dec 03, 2014 @ 22:53

That is exactly what this syncStorage option is doing - and what has been coded into the core of 7.1.9 if you enable this option. I'll try to reiterate as an ideal circumstance:
- Each of your servers stores its index in a machine name specific folder @ ~/App_Data/TEMP/ExamineIndexes/{MachineName}/...
  - These indexes are stored on your main file system - which in this case is a shared FS between all of your servers
  - Each server is responsible for its own index
- Then each server on startup does a backup/restore of Lucene to its own local CodeGen folder... this is not just a straight file copy, this is done using the Lucene internals to copy the required files and is performed before the indexers/searchers start using the index. If the index is already in sync, then nothing needs to happen.
- Each server then operates from this local CodeGen folder for all index reading
- For any index writing, these indexers are using a custom Lucene Directory that writes to both the CodeGen index and the main index on the FS

So if you take this concept, you could in theory backup/restore an index from a different machine's folder if the new machine doesn't have a machine specific folder.

It's also worth noting that a new server will still serve requests while the index is rebuilding, this happens on a different thread.

Darren Ferguson 1022 posts 3259 karma points MVP c-trib

Dec 04, 2014 @ 09:58

Thanks Shannon. So I guess the bit that is missing is the ability of a new instance to acquire its index from another instance's ~/App_Data folder on startup.

I did manage to get Examine indexing into CodeGen and also ~/App_Data, using the Environment.MachineName property to name the folder the index is placed in.

I guess in the indexer, when you ensure the index folder exists, you would then scan for other instances of the index and copy one across. Is the Lucene internal index copying stuff you mention in the 7.1.9 branch?

Cheers!

D

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Dec 04, 2014 @ 09:59

Not sure if you've been reading my replies, but all of this stuff is in this project already :)

https://github.com/Shazwazza/UmbracoExamine.TempStorage

apart from the machine name folders that use App_Data.

The 7.1.9 branch has a slight fix for the syncing of the storage that needs to be back ported into the above library.

Darren Ferguson 1022 posts 3259 karma points MVP c-trib

Dec 04, 2014 @ 10:12

But an instance can't create its local CodeGen index based on the ~/App_Data folder of another instance, right? That would need to be added?

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Dec 04, 2014 @ 10:13

Yup, that part doesn't exist. You'd have to check if the local machine's folder existed in App_Data, if it didn't, you'd have to choose another machine's folder to restore from... how you choose would be up to you :)
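
One possible way to make that choice is sketched below. This is only an illustration in Python, not part of Examine or TempStorage: the function `choose_restore_source` is hypothetical, and "most recently modified peer folder" is just one assumed policy among many.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def choose_restore_source(index_root: Path, machine_name: str) -> Optional[Path]:
    """Pick the folder a server should restore its local index from."""
    own = index_root / machine_name
    if own.is_dir():
        return own  # normal case: this server already has its own copy
    if not index_root.is_dir():
        return None  # nothing to restore from, so a full rebuild is needed
    # Assumed policy: fall back to the most recently modified peer folder.
    peers = [p for p in index_root.iterdir() if p.is_dir()]
    return max(peers, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        index_root = Path(root)
        (index_root / "WEB01").mkdir()
        os.utime(index_root / "WEB01", (1000, 1000))
        source = choose_restore_source(index_root, "WEB02")
        print(source.name if source else "rebuild from scratch")
```

A real implementation would also have to decide whether the chosen peer index is safe to copy while that peer is still writing to it, which this sketch does not address.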

Darren Ferguson 1022 posts 3259 karma points MVP c-trib

Dec 04, 2014 @ 10:15

Cool - Ok. Will dig more in temp storage, probably next week.

Thanks!

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Dec 11, 2014 @ 06:46

Hey Darren, these issues will be of interest to you:

http://issues.umbraco.org/issue/U4-5993 http://issues.umbraco.org/issue/U4-5995

It's all in the 7.2.1 branch. If you get any time at all it would be insanely awesome to get some tests done :)

Darren Ferguson 1022 posts 3259 karma points MVP c-trib

Dec 12, 2014 @ 13:34

Nice! But will they be back ported to 6.2.x? That is where I am with this project just now, I am afraid!

If these can be safely moved back - I'll try 'em out tomorrow.

Best,
D

Blue299 12 posts 71 karma points

Jul 02, 2015 @ 10:54

Hi Darren,

Any updates on Lucene working within Azure Web apps?

We are looking to deploy to Azure web apps and would love to utilize the auto scaling feature of it.

Thanks

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Jul 02, 2015 @ 11:36

Auto scaling will not work with Umbraco until Umbraco 7.3. I explain all about this in my conference talk, there are many factors and that is not just limited to Lucene.

http://stream.umbraco.org/video/11665943/umbraco-load-balancing

In 7.2.2, Examine index paths can be tokenized: http://issues.umbraco.org/issue/U4-5995. In 7.2.5, Examine indexes can be synced to local storage: http://issues.umbraco.org/issue/U4-5993. The latter is required because Azure uses a remote file server, and Lucene doesn't like reading/writing over the network due to latency, which will cause CPU issues.

Blue299 12 posts 71 karma points

Jul 02, 2015 @ 21:02

Great presentation and solution to get Umbraco running in an Azure web app environment.

Question: do you see any issues with using Azure storage for media files in an Azure web app environment using 7.3? Also, when do you think 7.3 will be out of beta?

Thanks

Blue299 12 posts 71 karma points

Jul 06, 2015 @ 14:17

Do you have any instructions on how to configure 7.3 in terms of load balancing? I'm attempting an install within azure.

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Jul 06, 2015 @ 15:30

Sure, there's no reason you wouldn't be able to use a custom IFileSystem for media (i.e. like the Azure storage package).

I haven't had time to document LB setup on azure web apps yet.

Blue299 12 posts 71 karma points

Jul 06, 2015 @ 15:32

Can you give me a quick guide? I'm eager to test in my environment.

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Jul 06, 2015 @ 15:32

Azure web apps use shared (not replicated) storage, so all the information here is still pretty valid: https://our.umbraco.org/documentation/Getting-Started/Setup/Server-Setup/load-balancing

but the Examine stuff is part of a core change, so you don't need the UmbracoExamine.TempStorage provider. These tasks should get you going though:
- http://issues.umbraco.org/issue/U4-5993
- http://issues.umbraco.org/issue/U4-5995

Blue299 12 posts 71 karma points

Jul 07, 2015 @ 21:53

Thanks for the links. I'm having an issue of just getting 7.3.0 running. It installs but when I install a starter kit and try to view the page I get this error.

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Jul 08, 2015 @ 07:54

Maybe you can post another topic for this problem? Or, if you can reproduce it every time, you can log an issue.

Shannon Deminick 1530 posts 5278 karma points MVP 3x

Aug 16, 2018 @ 07:58

Hi all,

I know this is super old but it's worth noting that I have had this Blob Storage Examine provider running on my site https://shazwazza.com for the last 2 months without issue. That said, my site gets very few updates so it's not like it's being stress tested or anything. I've deployed the change live and it's still running fine.

I've put together some docs here if you want to test this out: Examine Azure Directory Docs

Darren - I know you've already built your own work-arounds and that is super awesome!
