Under "Examine Management" in the Developer section, we have a media section indexer that had the following:
Has deletions? / Optimized? = true (21) / false
We had a document that was showing up in search results (we both index and query the Lucene index with StandardAnalyzer). This document had been removed from the Media section (by deleting it and then emptying the recycle bin).
We rebuilt the index, and then the values changed to:
Has deletions? / Optimized? = false (0) / true
Then the document no longer showed up in search results, i.e. it was removed from the index.
I assume the document wasn't one of the items marked as deleted, as Lucene wouldn't return those documents from a search. But before rebuilding the index, the indexer had the following setting:
Documents in index = 3011
But afterwards it had:
Documents in index = 2461
That is a drop of 550 documents, far more than the 21 documents marked as deleted.
I'm not clear how Umbraco or Lucene handles the removal of documents that are marked for deletion, nor how it handles optimization. Perhaps someone can explain. But I would like to know why the index had over 500 documents hanging around that were not marked for deletion. Any ideas?
The umbracoLog table has no error entries attributed to media item deletions. Does the GatheringNodeData event automatically log errors to the umbracoLog table if an error occurs?
Is there a way to automate the rebuilding of an index?
We have just discovered a similar issue with the InternalIndexer. Files that had been deleted from the Media section (and the recycle bin emptied) still existed in the Lucene index, so it seems to be a general problem. Is there a solution?
I'm planning on writing a scheduled task to periodically rebuild the index, but it would be good to know if there's an alternative. I would expect the Lucene indexes to match the items in the Media section.
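To make that expectation concrete, here is a rough sketch (in Python, purely for illustration) of the check I have in mind before triggering a rebuild: compare the node IDs found in the index with the node IDs that still exist in the Media section. The function names and the sample ID lists are hypothetical; nothing here is an Umbraco or Examine API, and I am assuming both ID lists can be exported somehow.

```python
def find_stale_index_entries(index_ids, media_ids):
    """Return IDs that are in the index but no longer in the Media section."""
    return sorted(set(index_ids) - set(media_ids))


def find_missing_index_entries(index_ids, media_ids):
    """Return IDs that are in the Media section but absent from the index."""
    return sorted(set(media_ids) - set(index_ids))


# Hypothetical sample data: IDs exported from the index and from the Media section.
index_ids = [1050, 1051, 1052, 1060, 1071]
media_ids = [1050, 1052, 1071, 1080]

stale = find_stale_index_entries(index_ids, media_ids)
missing = find_missing_index_entries(index_ids, media_ids)

# A rebuild is only worth scheduling when the two sides disagree.
needs_rebuild = bool(stale or missing)

print("stale index entries:", stale)
print("missing index entries:", missing)
print("needs rebuild:", needs_rebuild)
```

If the stale list is non-empty, the index is holding items that were deleted from the Media section, which is the situation described above.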
Examine Index: Media items not being removed
Umbraco 6.1.5
Looks like I'll have to post on Stack Overflow...