overflew 87 posts 110 karma points

Dec 12, 2010 @ 00:46

Examine - Understanding triggering a complete re-indexing

Hi there,

I'm trying to diagnose why our client's Examine indexes aren't rebuilding, and I keep deleting our own indexes to replicate the issue they're having. I'd like to eventually turn this into a Wiki page, so it can help anyone else who encounters the same issue in the future.

In short - Deleting the index dir, then publishing a node, should trigger a rebuild of the entire index. However, I'm seeing it add only the items that have been individually published since the index was deleted. Any information on settings I may have missed, or how I can gauge the status/condition of the index, would be appreciated.

The long version:

The environment:

We perform a migration of 20k nodes of the site we're developing, and then pass this to the client. The deployment zip has the Examine indexes + Client Dependency folders cleared out - Both use file names tied to the machine name, and the app seems to return odd things when it has 2x sets of master files.

The indexes we create have been fine, so we've never seen an issue in our testing environments. However, after constant testing, I have now somehow got it to a point where I see two things:

Publishing a node will only index that node (rather than trigger the rebuild, as I've been able to do before)
Re-indexing will start,

My understanding of what happens:

Once the /App_Data/ExamineIndex/Internal dir is cleared, a search in the top-left box in the back office will leave the spinning gif running, and underneath, the call to QuickSearchHandler.ashx gives an HTTP 500 error, detailing:

no segments* file found in Lucene.Net.Store.SimpleFSDirectory@C:\Umbraco\App_Data\ExamineIndexes\Internal\Index: files:

The same Yellow Screen of Death error can be seen when pressing enter in the search box, bringing up the search dialog. Any reload of the back office will then regenerate the /App_Data/ExamineIndex/Internal folder, and publishing a single node will generate a segments.gen + segments_1 file in /Index, and a .del file in /Queue.

At this point, using the search will return an empty result set for nodes it hasn't yet indexed (so no underlying HTTP 500s I can see...)
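
In case it helps anyone reproduce this, here is a rough Python sketch (a standalone script, nothing to do with the Examine API) for telling apart the on-disk states described above: folder missing, folder empty, files present but no segments* file, and segments* present. The function name and the state labels are my own invention. It only looks at file names, so it cannot tell you whether the index is complete.

```python
from pathlib import Path


def index_state(index_dir):
    """Classify an index folder purely by the file names it contains.

    The returned labels are hypothetical names chosen for this sketch.
    """
    path = Path(index_dir)
    if not path.is_dir():
        return "missing"
    names = [p.name for p in path.iterdir() if p.is_file()]
    if not names:
        return "empty"
    # Lucene reports "no segments* file found" when no file with this prefix exists.
    if any(name.startswith("segments") for name in names):
        return "has-segments"
    return "no-segments"
```

Calling index_state on the /Index folder right after clearing it should come back as "missing" or "empty", which lines up with the HTTP 500 above; after the first successful publish I'd expect "has-segments".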

The issue I get here:

...is that the first publish operation will time out. On the first one, one of the DB CPUs will sit at 80%. On the second attempt to publish, the 2x web app CPUs sit at 70% & 30%, then another timeout.

The third publish works, and I can now find that one page in the Examine search. There is a '_0.cfs' file added to /Index, but no further activity there for 5 minutes. (The polling interval is set to 30 seconds in the config)

I've then added a new content item, which goes through the AutoFolders process, and publish it. Both the new item, and the auto-created folder can now be found with the search. There are now 12 files in /Index, but only totaling 11.6KB, and /Queue is empty. There is no further activity in this dir for 5 minutes.
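
To confirm the "no further activity" observation without watching Explorer, something like the following Python sketch could be used. It is only an assumption about how to check this: take a snapshot of the folder, wait (say, a few polling intervals), take another, and compare. The function names are made up for this post.

```python
from pathlib import Path


def snapshot(folder):
    """Map each file name in the folder to its (size in bytes, mtime in ns)."""
    result = {}
    for p in Path(folder).iterdir():
        if p.is_file():
            st = p.stat()
            result[p.name] = (st.st_size, st.st_mtime_ns)
    return result


def changes(before, after):
    """Report which files were added, removed or modified between two snapshots."""
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    modified = sorted(
        name for name in set(before) & set(after) if before[name] != after[name]
    )
    return {"added": added, "removed": removed, "modified": modified}
```

If all three lists are empty for both /Index and /Queue after several intervals, the indexer really has gone idle rather than working slowly.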

Back when it worked:

Although I'll do more testing once it works again, a previous index delete + regen took 23 minutes (for 20k nodes), gauging by the first and last dates of the '_' files in /Index, and weighed 78 MB.
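
The way I gauged that can be scripted. This is a minimal Python sketch, assuming the "date" of each file is its last-modified time (Explorer may show a different date column) and that only files starting with '_' count. The function name and the returned keys are hypothetical.

```python
from pathlib import Path


def rebuild_summary(index_dir, prefix="_"):
    """Estimate rebuild duration and size from files whose names start with prefix.

    Uses last-modified times, which is an assumption about which date matters.
    Returns None when no matching files exist.
    """
    files = [
        p for p in Path(index_dir).iterdir()
        if p.is_file() and p.name.startswith(prefix)
    ]
    if not files:
        return None
    stats = [p.stat() for p in files]
    mtimes = [s.st_mtime for s in stats]
    return {
        "files": len(files),
        "total_mb": sum(s.st_size for s in stats) / (1024 * 1024),
        "minutes": (max(mtimes) - min(mtimes)) / 60,
    }
```

This only measures the span between the earliest and latest write, so it understates the rebuild time if the indexer spends a while working before the first file appears.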

I have a few notes about when it was DB vs. web app CPU intensive, but I'd need to test again to

Machine switching:

Is it possible to just do a comprehensive rename of the machine name in the file name + contents of:

(MachineName).exa
(MachineName).lck

Rather than getting new environments to rebuild it? Obviously it'd be handier to perform a reliable re-index, but it'd be handy to know.

Settings:

This is a Umbraco 4.5.2 install, running on a SQL 2008 R2 web server, with a SQL 2008 DB server.

The config files are bog standard from the 4.5.2 release, with the exception of changing the interval to 30 seconds.

Full read/write permissions have been given to the service account that the IIS instance is running as to the Umbraco directory.

The only other thing I've seen of note, is that occasionally when deleting the /Index folder, it's given me an error about the file being in use (open reader?), which hasn't let up, and required an IIS reset.

The properties of the /Index dir have 'Read-only' checked, but in grey, indicating there is/has been a read-only file in the dir? (None of the files themselves have read-only set)

Questions:

Please, any insight into why it's only indexing new nodes?
When moving the site across machines, is it possible to do a rename of the filename + contents of the .exa + .lck file, and have it work?

Thanks so much for reading this, and any help is very definitely appreciated. I'd like to turn this into a Wiki page/book once I understand it, as I see there are a few scattered questions about the place asking about re-indexing.

overflew 87 posts 110 karma points

Dec 12, 2010 @ 22:37

K - to update, running the indexer with async off will allow it to write to the umbracoLog table.

I don't know if it's a red herring, but some of the errors I then saw in the log indicated that it doesn't like items that are in the trash. Cleared those out, and we have our smaller site rebuilding fine, but not the larger one.

The error we're seeing from items in the trash is:

Error adding node with url 'Test Product' to SiteMapProvider: System.Collections.Generic.KeyNotFoundException: The given key was not present in the dictionary.
     at System.ThrowHelper.ThrowKeyNotFoundException()
     at System.Collections.Generic.Dictionary`2.get_Item(TKey key)
     at umbraco.presentation.nodeFactory.UmbracoSiteMapProvider.UpdateNode(Node node)

Where 'Test Product' is the first item in the trash. There were no other users on the system at the time these errors were logged.

The other error we see (though it's still present on the smaller site, where the indexes are able to rebuild), is:

Error adding to SiteMapProvider: System.InvalidOperationException: Multiple nodes with the same URL '/products/other-product.aspx' were found. XmlSiteMapProvider requires that sitemap nodes have unique URLs.
     at System.Web.StaticSiteMapProvider.AddNode(SiteMapNode node, SiteMapNode parentNode)
     at umbraco.presentation.nodeFactory.UmbracoSiteMapProvider.loadNodes(String parentId, SiteMapNode parentNode)

This may be a result of an Umbraco bug, where if multiple versions of a document exist with the same 'updateDate' (a result of auto-foldering) the UI will display multiple copies of the same item. It is discussed in the forum here, and there's a submitted patch on Codeplex here. The patch affects the m_SQLOptimizedMany stored query - though it may need to stretch further if this is an issue?

Note - We still see the 'Multiple nodes' error when indexing is successful on the smaller site.
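
If the same-'updateDate' theory is right, the affected nodes should be easy to spot once the version rows are exported. The Python sketch below assumes each row is a (node id, version id, update date) tuple. That shape is my assumption for illustration and not a confirmed schema, and the function name is made up.

```python
from collections import defaultdict


def nodes_with_tied_versions(rows):
    """Find nodes that have more than one version sharing the same update date.

    rows: iterable of (node_id, version_id, update_date) tuples (assumed shape).
    Returns {(node_id, update_date): [version_id, ...]} for the ties only.
    """
    versions = defaultdict(set)
    for node_id, version_id, update_date in rows:
        versions[(node_id, update_date)].add(version_id)
    return {key: sorted(ids) for key, ids in versions.items() if len(ids) > 1}
```

An empty result would suggest the 'Multiple nodes' error has some other cause than tied version dates.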

As always - any input appreciated.

Christo 23 posts 44 karma points

Dec 13, 2010 @ 12:11

Hi overflew,

Were you able to solve the problem? If so, please let me know how.

I experience the same issue.

Thanks

Christo
