I did some quick investigation, and maybe the following helps point you in the right direction.
When you implement your own ContentValueSetValidator you are able to exclude fields from being indexed in the Validate(ValueSet valueSet) method.
With valueSet.Values.Remove(key) you can remove values from the valueset.
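To illustrate the idea of removing keys during validation, here is a minimal sketch that runs without Umbraco. It assumes the values behave like a mutable `IDictionary<string, List<object>>`, and `FieldFilterSketch` and `KeepOnly` are hypothetical names, not part of Umbraco or Examine.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Umbraco or Examine: it only mimics the
// "remove keys that are not on the include list" idea on a plain dictionary.
public static class FieldFilterSketch
{
    // Returns true when at least one key was removed.
    public static bool KeepOnly(IDictionary<string, List<object>> values, IEnumerable<string> includeFields)
    {
        var include = includeFields.ToList();
        var removedAny = false;

        // Copy the keys first; removing while enumerating the dictionary would throw.
        foreach (var key in values.Keys.ToList())
        {
            // Keep exact matches and culture variants such as "nodeName_en-us".
            var keep = include.Any(f => key.Equals(f, StringComparison.Ordinal)
                                     || key.StartsWith(f + "_", StringComparison.Ordinal));
            if (!keep)
            {
                values.Remove(key);
                removedAny = true;
            }
        }

        return removedAny;
    }
}
```

If a validator goes on to check for keys such as "path" after filtering, those keys probably need to be on the include list as well, otherwise the filter removes them before the check runs.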
I don't think the UmbracoFieldDefinitionCollection determines which fields are actually included in the index. It is more of a mapping between Umbraco fields and Examine field types. The comment above this code says:
A type that defines the type of index for each Umbraco field (non user
defined fields)
I got it working by making my own ContentValueSetValidator, but I don't know if I am happy with the way it is done.
public class ContentValueSetValidator : ValueSetValidator, IContentValueSetValidator
{
private readonly IPublicAccessService _publicAccessService;
private const string PathKey = "path";
private static readonly IEnumerable<string> ValidCategories = new[] { IndexTypes.Content, IndexTypes.Media };
protected override IEnumerable<string> ValidIndexCategories => ValidCategories;
public bool PublishedValuesOnly { get; }
public bool SupportProtectedContent { get; }
public int? ParentId { get; }
public ContentValueSetValidator(bool publishedValuesOnly, int? parentId = null, IEnumerable<string> includeItemTypes = null, IEnumerable<string> excludeItemTypes = null, IEnumerable<string> includeFields = null, IEnumerable<string> excludeFields = null)
: this(publishedValuesOnly, true, null, parentId, includeItemTypes, excludeItemTypes, includeFields, excludeFields)
{
}
public ContentValueSetValidator(bool publishedValuesOnly, bool supportProtectedContent, IPublicAccessService publicAccessService, int? parentId = null, IEnumerable<string> includeItemTypes = null, IEnumerable<string> excludeItemTypes = null, IEnumerable<string> includeFields = null, IEnumerable<string> excludeFields = null)
: base(includeItemTypes, excludeItemTypes, includeFields, excludeFields)
{
PublishedValuesOnly = publishedValuesOnly;
SupportProtectedContent = supportProtectedContent;
ParentId = parentId;
_publicAccessService = publicAccessService;
}
public bool ValidatePath(string path, string category)
{
//check if this document is a descendent of the parent
if (ParentId.HasValue && ParentId.Value > 0)
{
// we cannot return FAILED here because we need the value set to get into the indexer and then deal with it from there
// because we need to remove anything that doesn't pass by parent Id in the cases that umbraco data is moved to an illegal parent.
if (!path.Contains(string.Concat(",", ParentId.Value, ",")))
return false;
}
return true;
}
public bool ValidateRecycleBin(string path, string category)
{
var recycleBinId = category == IndexTypes.Content ? Constants.System.RecycleBinContent : Constants.System.RecycleBinMedia;
//check for recycle bin
if (PublishedValuesOnly)
{
if (path.Contains(string.Concat(",", recycleBinId, ",")))
return false;
}
return true;
}
public bool ValidateProtectedContent(string path, string category)
{
if (category == IndexTypes.Content
&& !SupportProtectedContent
// if the service is null we can't look this up so we'll return false
&& (_publicAccessService == null || _publicAccessService.IsProtected(path)))
{
return false;
}
return true;
}
public override ValueSetValidationResult Validate(ValueSet valueSet)
{
// Removed base.Validate(valueSet) in order to manipulate the valueSet.Values the way we want to.
if (ValidIndexCategories != null && !ValidIndexCategories.InvariantContains(valueSet.Category))
{
return ValueSetValidationResult.Failed;
}
// check if this document is of a correct type of node type alias
if (IncludeItemTypes != null && !IncludeItemTypes.InvariantContains(valueSet.ItemType))
{
return ValueSetValidationResult.Failed;
}
// if this node type is part of our exclusion list
if (ExcludeItemTypes != null && ExcludeItemTypes.InvariantContains(valueSet.ItemType))
{
return ValueSetValidationResult.Failed;
}
ValueSetValidationResult baseValidateResult = ValueSetValidationResult.Valid;
// Checking IncludeFields and ExcludeFields for exact key name or culture name.
foreach (var key in valueSet.Values.Keys.ToList())
{
if (IncludeFields != null && !IncludeFields.Any(x => x.Equals(key) || key.StartsWith($"{x}_")))
{
valueSet.Values.Remove(key); //remove any value with a key that doesn't match the inclusion list
baseValidateResult = ValueSetValidationResult.Filtered;
}
if (ExcludeFields != null && ExcludeFields.Any(x => x.Equals(key) || key.StartsWith($"{x}_")))
{
valueSet.Values.Remove(key); //remove any value with a key that matches the exclusion list
baseValidateResult = ValueSetValidationResult.Filtered;
}
}
var isFiltered = baseValidateResult == ValueSetValidationResult.Filtered;
//check for published content
if (valueSet.Category == IndexTypes.Content && PublishedValuesOnly)
{
if (!valueSet.Values.TryGetValue(UmbracoExamineIndex.PublishedFieldName, out var published))
return ValueSetValidationResult.Failed;
if (!published[0].Equals("y"))
return ValueSetValidationResult.Failed;
//deal with variants, if there are unpublished variants than we need to remove them from the value set
if (valueSet.Values.TryGetValue(UmbracoContentIndex.VariesByCultureFieldName, out var variesByCulture)
&& variesByCulture.Count > 0 && variesByCulture[0].Equals("y"))
{
//so this valueset is for a content that varies by culture, now check for non-published cultures and remove those values
foreach (var publishField in valueSet.Values.Where(x => x.Key.StartsWith($"{UmbracoExamineIndex.PublishedFieldName}_")).ToList())
{
if (publishField.Value.Count <= 0 || !publishField.Value[0].Equals("y"))
{
//this culture is not published, so remove all of these culture values
var cultureSuffix = publishField.Key.Substring(publishField.Key.LastIndexOf('_'));
foreach (var cultureField in valueSet.Values.Where(x => x.Key.InvariantEndsWith(cultureSuffix)).ToList())
{
valueSet.Values.Remove(cultureField.Key);
isFiltered = true;
}
}
}
}
}
//must have a 'path'
if (!valueSet.Values.TryGetValue(PathKey, out var pathValues)) return ValueSetValidationResult.Failed;
if (pathValues.Count == 0) return ValueSetValidationResult.Failed;
if (pathValues[0] == null) return ValueSetValidationResult.Failed;
if (pathValues[0].ToString().IsNullOrWhiteSpace()) return ValueSetValidationResult.Failed;
var path = pathValues[0].ToString();
// We need to validate the path of the content based on ParentId, protected content and recycle bin rules.
// We cannot return FAILED here because we need the value set to get into the indexer and then deal with it from there
// because we need to remove anything that doesn't pass by protected content in the cases that umbraco data is moved to an illegal parent.
if (!ValidatePath(path, valueSet.Category)
|| !ValidateRecycleBin(path, valueSet.Category)
|| !ValidateProtectedContent(path, valueSet.Category))
return ValueSetValidationResult.Filtered;
return isFiltered ? ValueSetValidationResult.Filtered : ValueSetValidationResult.Valid;
}
}
Then I use it in a custom LuceneIndexCreator.
public class ContentSearchIndexCreator : LuceneIndexCreator, IUmbracoIndexesCreator
{
private readonly IProfilingLogger _profilingLogger;
private readonly ILocalizationService _languageService;
public ContentSearchIndexCreator(IProfilingLogger profilingLogger, ILocalizationService languageService)
{
_profilingLogger = profilingLogger;
_languageService = languageService;
}
public override IEnumerable<IIndex> Create()
{
return new[]
{
CreateContentIndex(
"ContentSearchIndex",
"ContentSearch",
new UmbracoFieldDefinitionCollection(),
new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30),
new ValueSetValidators.ContentValueSetValidator(true, true, null, null, new string[] { "TextPage", "RedirectNode" }, null, new string[] { "__IndexType", "__Published", "__Key", "__Path", "__VariesByCulture", "__NodeId", "id", "path", "nodeName", "pageGrid", "searchTags", "searchablePath" }, null)
)
};
}
private IIndex CreateContentIndex(
string name,
string folderName,
FieldDefinitionCollection fieldDefinitionCollection,
Lucene.Net.Analysis.Analyzer luceneAnalyzer,
IContentValueSetValidator contentValueSetValidator)
{
var index = new UmbracoContentIndex(
name,
CreateFileSystemLuceneDirectory(folderName),
fieldDefinitionCollection,
luceneAnalyzer,
_profilingLogger,
_languageService,
contentValueSetValidator);
return index;
}
}
As bonus info, I added the searchablePath field in an IComponent.
public class ExamineLuceneComponent : IComponent
{
private readonly IExamineManager _examineManager;
private readonly ILogger _logger;
public ExamineLuceneComponent(IExamineManager examineManager, ILogger logger)
{
_logger = logger;
_examineManager = examineManager;
}
public void Initialize()
{
var externalIndex = _examineManager.Indexes.FirstOrDefault(x => x.Name == "ContentSearchIndex");
if (externalIndex != null)
{
((BaseIndexProvider)externalIndex).TransformingIndexValues += ExamineLuceneComponent_TransformingIndexValues;
}
}
private void ExamineLuceneComponent_TransformingIndexValues(object sender, IndexingItemEventArgs e)
{
if (e.ValueSet.Category == IndexTypes.Content)
{
try
{
var value = e.ValueSet.Values.Where(x => x.Key == "path").Select(x => x.Value).FirstOrDefault();
if (value != null && value.Any())
{
var list = new List<object>();
var path = value.First().ToString().Replace(",", " ");
list.Add(path);
var searchablePath = e.ValueSet.Values.FirstOrDefault(x => x.Key == "searchablePath");
if (searchablePath.Key != null)
{
searchablePath.Value.Clear();
searchablePath.Value.Add(path); // add the string itself; adding 'list' would nest a List<object> inside the values
}
else
{
e.ValueSet.Values.Add("searchablePath", list);
}
}
}
catch (Exception ex)
{
_logger.Error<ExamineLuceneComponent>(ex, "Error munging fields for " + e.ValueSet.Id);
}
}
}
public void Terminate() { }
}
Then it gives these results:
Next step is to figure out how to include PDF and WORD files.
public class CustomIndexCreator : LuceneIndexCreator, IUmbracoIndexesCreator
{
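private readonly IProfilingLogger _profilingLogger;
private readonly ILocalizationService _localizationService; // used in Create() below
private readonly IPublicAccessService _publicAccessService;
public CustomIndexCreator(IProfilingLogger profilingLogger,
ILocalizationService localizationService,
IPublicAccessService publicAccessService)
{
_profilingLogger = profilingLogger;
_localizationService = localizationService;
_publicAccessService = publicAccessService;
}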
public override IEnumerable<IIndex> Create()
{
var index = new UmbracoContentIndex("MediaFileIndex",
CreateFileSystemLuceneDirectory("MediaFileIndex"),
new UmbracoFieldDefinitionCollection(),
new StandardAnalyzer(Version.LUCENE_30),
_profilingLogger,
_localizationService,
new ContentValueSetValidator(true, false, _publicAccessService, includeItemTypes: new string[] { "File" }));
return new[] { index };
}
}
Then I add the index and look it up to attach a TransformingIndexValues handler, where I read the text content from files such as PDF and DOCX. You can add many more file types, just be aware that it will also try to read the content of video and audio files if you don't limit it.
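One way to limit which files are read is an extension allow-list. The sketch below is only an illustration: `ExtractableFileSketch` and `IsExtractable` are hypothetical names, and the two extensions are simply the ones mentioned in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: decides whether a media file path should be handed
// to the text extractor, based on a case-insensitive extension allow-list.
public static class ExtractableFileSketch
{
    private static readonly HashSet<string> AllowedExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".pdf", ".docx" };

    public static bool IsExtractable(string fileVirtualPath)
    {
        if (string.IsNullOrWhiteSpace(fileVirtualPath)) return false;

        // Path.GetExtension returns the extension including the leading dot,
        // or an empty string when there is none.
        return AllowedExtensions.Contains(Path.GetExtension(fileVirtualPath));
    }
}
```

With this in place, video and audio paths are skipped before any extraction is attempted, and supporting another format means adding one entry to the set. It is probably safest to keep the call inside a try block, since unusual path strings may cause Path.GetExtension to throw on some runtimes.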
using System;
using System.Linq;
using System.Web.Hosting;
using Examine;
using Examine.Providers;
using Umbraco.Core.Composing;
using Umbraco.Core.Services;
using Umbraco.Examine;
using Umbraco.Core.Logging;
using TikaOnDotNet.TextExtraction;
public class IndexCreatorComponent : IComponent
{
private readonly IExamineManager _examineManager;
private readonly CustomIndexCreator _customIndexCreator;
private readonly ILogger _logger;
public IndexCreatorComponent(IExamineManager examineManager, CustomIndexCreator customIndexCreator, ILogger logger)
{
_examineManager = examineManager;
_customIndexCreator = customIndexCreator;
_logger = logger;
}
public void Initialize()
{
foreach (var index in _customIndexCreator.Create())
{
_examineManager.AddIndex(index);
}
if (_examineManager.TryGetIndex("MediaFileIndex", out IIndex customMediaIndex))
{
if (customMediaIndex is BaseIndexProvider indexProviderMedia)
{
indexProviderMedia.TransformingIndexValues += IndexProviderTransformingIndexValues;
}
}
}
private void IndexProviderTransformingIndexValues(object sender, IndexingItemEventArgs e)
{
if (e.ValueSet.Category != IndexTypes.Media) return;
// Media items without an umbracoFile value have nothing to extract.
if (!e.ValueSet.Values.TryGetValue("umbracoFile", out var fileValues)) return;
foreach (var value in fileValues)
{
if (value == null) continue;
try
{
var fileVirtualPath = value.ToString();
// Only extract text from known document types (case-insensitive).
if (fileVirtualPath.EndsWith(".docx", StringComparison.OrdinalIgnoreCase) || fileVirtualPath.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
{
var filePath = HostingEnvironment.MapPath(fileVirtualPath);
var textExtractor = new TextExtractor();
var textExtractionResult = textExtractor.Extract(filePath);
e.ValueSet.TryAdd("fileTextContent", textExtractionResult.Text);
}
}
catch (Exception exception)
{
_logger.Error<IndexCreatorComponent>(exception, "Error extracting text from file");
}
}
}
public void Terminate() { }
}
The last thing is to add CustomIndexCreator and IndexCreatorComponent to the startup.
public class InstallIndexCreatorComposer : IUserComposer
{
public void Compose(Composition composition)
{
composition.RegisterUnique<CustomIndexCreator>();
composition.Components().Append<IndexCreatorComponent>();
}
}
// Checking IncludeFields and ExcludeFields for exact key name or culture name.
foreach (var key in valueSet.Values.Keys.ToList())
{
if (IncludeFields != null && !IncludeFields.Any(x => x.Equals(key) || key.StartsWith($"{x}_")))
{
valueSet.Values.Remove(key); //remove any value with a key that doesn't match the inclusion list
baseValidateResult = ValueSetValidationResult.Filtered;
}
if (ExcludeFields != null && ExcludeFields.Any(x => x.Equals(key) || key.StartsWith($"{x}_")))
{
valueSet.Values.Remove(key); //remove any value with a key that matches the exclusion list
baseValidateResult = ValueSetValidationResult.Filtered;
}
}
var isFiltered = baseValidateResult == ValueSetValidationResult.Filtered;
return isFiltered ? ValueSetValidationResult.Filtered : ValueSetValidationResult.Valid;
How to create a custom Examine index with specified fields to include in Umbraco 8
Hi all.
Using Umbraco 8.0.1
How can I specify which fields to include, so that the index does not automatically take in all fields from the defined document aliases?
The Umbraco.Examine.ContentValueSetValidator always sets IncludeFields and ExcludeFields to null. And when I define my own ContentValueSetValidator, it ignores the fields I include in the IncludeFields array. https://github.com/umbraco/Umbraco-CMS/blob/v8/dev/src/Umbraco.Examine/ContentValueSetValidator.cs
The Umbraco.Examine.UmbracoFieldDefinitionCollection seems to add the fields, but when I define my own it breaks. https://github.com/umbraco/Umbraco-CMS/blob/v8/dev/src/Umbraco.Examine/UmbracoFieldDefinitionCollection.cs
Hi Bo,
I did some quick investigation, and maybe the following helps point you in the right direction.
When you implement your own ContentValueSetValidator you are able to exclude fields from being indexed in the Validate(ValueSet valueSet) method. With valueSet.Values.Remove(key) you can remove values from the valueset.
I don't think the UmbracoFieldDefinitionCollection determines which fields are actually included in the index. It is more of a mapping between Umbraco fields and Examine field types. The comment above this code says: https://github.com/umbraco/Umbraco-CMS/blob/853087a75044b814df458457dc9a1f778cc89749/src/Umbraco.Examine/UmbracoFieldDefinitionCollection.cs
To be continued..
Hi Corné
I got it working by making my own ContentValueSetValidator, but I don't know if I am happy with the way it is done.
Then I use it in a custom LuceneIndexCreator.
As bonus info, I added the searchablePath field in an IComponent.
Then it gives these results:
Next step is to figure out how to include PDF and WORD files.
Hi
I have just arrived at the requirement for search on a new build in U8.
This is all very different from U7!
I have got the regular indexes working - did you make any progress on Word/PDF indexing? I can't find anything out there presently.
Hi Jo Kendal.
No luck with the file indexing yet.
Hi
I did get it working. Sorry. I'm pretty snowed under just now but I intend to post when I can.
Hi again.
I also got the file indexing working now. I do not have the code with me right now, so I will post it next week.
I used TikaOnDotNet.TextExtraction to extract the text from files, you can find it here https://github.com/KevM/tikaondotnet
First I make the index.
Then I add the index and look it up to attach a TransformingIndexValues handler, where I read the text content from files such as PDF and DOCX. You can add many more file types, just be aware that it will also try to read the content of video and audio files if you don't limit it.
The last thing is to add CustomIndexCreator and IndexCreatorComponent to the startup.
When I add this to the Validate function, 0 documents are returned, where there were over 700 before:
ValueSetValidationResult baseValidateResult = ValueSetValidationResult.Valid;