How to create a custom Examine index with specified fields to include in Umbraco 8

Bo Jacobsen 610 posts 2409 karma points

Apr 11, 2019 @ 09:28

Hi all.

Using Umbraco 8.0.1

How can I specify which fields to include, so that the index does not automatically take in every field from the defined document type aliases?

The Umbraco.Examine.ContentValueSetValidator always sets IncludeFields and ExcludeFields to null. And when I define my own ContentValueSetValidator, it ignores the fields I pass in the IncludeFields array. https://github.com/umbraco/Umbraco-CMS/blob/v8/dev/src/Umbraco.Examine/ContentValueSetValidator.cs

The Umbraco.Examine.UmbracoFieldDefinitionCollection seems to add the fields, but when I define my own collection it breaks. https://github.com/umbraco/Umbraco-CMS/blob/v8/dev/src/Umbraco.Examine/UmbracoFieldDefinitionCollection.cs

public class ContentSearchIndexCreator : LuceneIndexCreator, IUmbracoIndexesCreator
{
    private readonly IProfilingLogger _profilingLogger;
    private readonly ILocalizationService _languageService;

    public ContentSearchIndexCreator(IProfilingLogger profilingLogger, ILocalizationService languageService)
    {
        _profilingLogger = profilingLogger;
        _languageService = languageService;
    }

    public override IEnumerable<IIndex> Create()
    {
        return new[]
        {
                CreateContentIndex(
                    "ContentSearchIndex",
                    "ContentSearch",
                    new UmbracoFieldDefinitionCollection(),
                    new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30),
                    new ContentValueSetValidator(true, true, null, null, new string[] { "TextPage", "NumberPage" }, null)
                 )
            };
    }

    private IIndex CreateContentIndex(
        string name,
        string folderName,
        FieldDefinitionCollection fieldDefinitionCollection,
        Lucene.Net.Analysis.Analyzer luceneAnalyzer,
        IContentValueSetValidator contentValueSetValidator)
    {
        var index = new UmbracoContentIndex(
        name,
        CreateFileSystemLuceneDirectory(folderName),
        fieldDefinitionCollection,
        luceneAnalyzer,
        _profilingLogger,
        _languageService,
        contentValueSetValidator);

        return index;
    }
}

Corné Strijkert 80 posts 456 karma points c-trib

Apr 11, 2019 @ 11:42

Hi Bo,

I did a quick investigation, and maybe the following will point you in the right direction.

When you implement your own ContentValueSetValidator you are able to exclude fields from being indexed in the Validate(ValueSet valueSet) method.

With valueSet.Values.Remove(key) you can remove values from the valueset.
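
A minimal sketch of that idea, with a plain dictionary standing in for valueSet.Values. The FieldFilter class and its KeepOnly method are hypothetical names for illustration and are not part of Examine or Umbraco:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Examine or Umbraco.
public static class FieldFilter
{
    // Removes every key that is not in allowedFields.
    // Returns true when at least one key was removed.
    // The dictionary stands in for valueSet.Values.
    public static bool KeepOnly(IDictionary<string, List<object>> values, IEnumerable<string> allowedFields)
    {
        var allowed = new HashSet<string>(allowedFields, StringComparer.Ordinal);
        var removedAny = false;

        // Copy the keys first: removing entries while enumerating the dictionary would throw.
        foreach (var key in values.Keys.ToList())
        {
            if (!allowed.Contains(key))
            {
                values.Remove(key);
                removedAny = true;
            }
        }

        return removedAny;
    }
}
```

Inside a real Validate override, the return value could decide between ValueSetValidationResult.Valid and ValueSetValidationResult.Filtered.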

I don't think the UmbracoFieldDefinitionCollection determines which fields actually end up in the index. It is more of a mapping between Umbraco fields and Examine field types. The comment above this code says:

A type that defines the type of index for each Umbraco field (non user defined fields)

https://github.com/umbraco/Umbraco-CMS/blob/853087a75044b814df458457dc9a1f778cc89749/src/Umbraco.Examine/UmbracoFieldDefinitionCollection.cs

To be continued..

Bo Jacobsen 610 posts 2409 karma points

Apr 12, 2019 @ 08:35

Hi Corné

I got it working by making my own ContentValueSetValidator, but I'm not sure I'm happy with this way of doing it.

public class ContentValueSetValidator : ValueSetValidator, IContentValueSetValidator
{
    private readonly IPublicAccessService _publicAccessService;

    private const string PathKey = "path";
    private static readonly IEnumerable<string> ValidCategories = new[] { IndexTypes.Content, IndexTypes.Media };
    protected override IEnumerable<string> ValidIndexCategories => ValidCategories;

    public bool PublishedValuesOnly { get; }
    public bool SupportProtectedContent { get; }
    public int? ParentId { get; }


    public ContentValueSetValidator(bool publishedValuesOnly, int? parentId = null, IEnumerable<string> includeItemTypes = null, IEnumerable<string> excludeItemTypes = null, IEnumerable<string> includeFields = null, IEnumerable<string> excludeFields = null)
        : this(publishedValuesOnly, true, null, parentId, includeItemTypes, excludeItemTypes, includeFields, excludeFields)
    {
    }

    public ContentValueSetValidator(bool publishedValuesOnly, bool supportProtectedContent, IPublicAccessService publicAccessService, int? parentId = null, IEnumerable<string> includeItemTypes = null, IEnumerable<string> excludeItemTypes = null, IEnumerable<string> includeFields = null, IEnumerable<string> excludeFields = null)
        : base(includeItemTypes, excludeItemTypes, includeFields, excludeFields)
    {
        PublishedValuesOnly = publishedValuesOnly;
        SupportProtectedContent = supportProtectedContent;
        ParentId = parentId;
        _publicAccessService = publicAccessService;
    }


    public bool ValidatePath(string path, string category)
    {
        //check if this document is a descendent of the parent
        if (ParentId.HasValue && ParentId.Value > 0)
        {
            // we cannot return FAILED here because we need the value set to get into the indexer and then deal with it from there
            // because we need to remove anything that doesn't pass by parent Id in the cases that umbraco data is moved to an illegal parent.
            if (!path.Contains(string.Concat(",", ParentId.Value, ",")))
                return false;
        }

        return true;
    }

    public bool ValidateRecycleBin(string path, string category)
    {
        var recycleBinId = category == IndexTypes.Content ? Constants.System.RecycleBinContent : Constants.System.RecycleBinMedia;

        //check for recycle bin
        if (PublishedValuesOnly)
        {
            if (path.Contains(string.Concat(",", recycleBinId, ",")))
                return false;
        }
        return true;
    }

    public bool ValidateProtectedContent(string path, string category)
    {
        if (category == IndexTypes.Content
            && !SupportProtectedContent
            // if the service is null we can't look this up so we'll return false
            && (_publicAccessService == null || _publicAccessService.IsProtected(path)))
        {
            return false;
        }

        return true;
    }

    public override ValueSetValidationResult Validate(ValueSet valueSet)
    {
        // Removed base.Validate(valueSet) in order to manipulate the valueSet.Values the way we want to.

        if (ValidIndexCategories != null && !ValidIndexCategories.InvariantContains(valueSet.Category))
        {
            return ValueSetValidationResult.Failed;
        }

        // check if this document is of a correct type of node type alias
        if (IncludeItemTypes != null && !IncludeItemTypes.InvariantContains(valueSet.ItemType))
        {
            return ValueSetValidationResult.Failed;
        }

        // if this node type is part of our exclusion list
        if (ExcludeItemTypes != null && ExcludeItemTypes.InvariantContains(valueSet.ItemType))
        {
            return ValueSetValidationResult.Failed;
        }

        ValueSetValidationResult baseValidateResult = ValueSetValidationResult.Valid;

        // Checking IncludeFields and ExcludeFields for exact key name or culture name.
        foreach (var key in valueSet.Values.Keys.ToList())
        {
            if (IncludeFields != null && !IncludeFields.Any(x => x.Equals(key) || key.StartsWith($"{x}_")))
            {
                valueSet.Values.Remove(key); //remove any value with a key that doesn't match the inclusion list
                baseValidateResult = ValueSetValidationResult.Filtered;
            }

            if (ExcludeFields != null && ExcludeFields.Any(x => x.Equals(key) || key.StartsWith($"{x}_")))
            {
                valueSet.Values.Remove(key); //remove any value with a key that matches the exclusion list
                baseValidateResult = ValueSetValidationResult.Filtered;
            }
        }

        var isFiltered = baseValidateResult == ValueSetValidationResult.Filtered;

        //check for published content
        if (valueSet.Category == IndexTypes.Content && PublishedValuesOnly)
        {
            if (!valueSet.Values.TryGetValue(UmbracoExamineIndex.PublishedFieldName, out var published))
                return ValueSetValidationResult.Failed;

            if (!published[0].Equals("y"))
                return ValueSetValidationResult.Failed;


            //deal with variants, if there are unpublished variants than we need to remove them from the value set
            if (valueSet.Values.TryGetValue(UmbracoContentIndex.VariesByCultureFieldName, out var variesByCulture)
                && variesByCulture.Count > 0 && variesByCulture[0].Equals("y"))
            {
                //so this valueset is for a content that varies by culture, now check for non-published cultures and remove those values
                foreach (var publishField in valueSet.Values.Where(x => x.Key.StartsWith($"{UmbracoExamineIndex.PublishedFieldName}_")).ToList())
                {
                    if (publishField.Value.Count <= 0 || !publishField.Value[0].Equals("y"))
                    {
                        //this culture is not published, so remove all of these culture values
                        var cultureSuffix = publishField.Key.Substring(publishField.Key.LastIndexOf('_'));
                        foreach (var cultureField in valueSet.Values.Where(x => x.Key.InvariantEndsWith(cultureSuffix)).ToList())
                        {
                            valueSet.Values.Remove(cultureField.Key);
                            isFiltered = true;
                        }
                    }
                }
            }
        }

        //must have a 'path'
        if (!valueSet.Values.TryGetValue(PathKey, out var pathValues)) return ValueSetValidationResult.Failed;
        if (pathValues.Count == 0) return ValueSetValidationResult.Failed;
        if (pathValues[0] == null) return ValueSetValidationResult.Failed;
        if (pathValues[0].ToString().IsNullOrWhiteSpace()) return ValueSetValidationResult.Failed;
        var path = pathValues[0].ToString();

        // We need to validate the path of the content based on ParentId, protected content and recycle bin rules.
        // We cannot return FAILED here because we need the value set to get into the indexer and then deal with it from there
        // because we need to remove anything that doesn't pass by protected content in the cases that umbraco data is moved to an illegal parent.
        if (!ValidatePath(path, valueSet.Category)
            || !ValidateRecycleBin(path, valueSet.Category)
            || !ValidateProtectedContent(path, valueSet.Category))
            return ValueSetValidationResult.Filtered;

        return isFiltered ? ValueSetValidationResult.Filtered : ValueSetValidationResult.Valid;
    }
}
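
To make the field matching rule in the loop above easier to reason about, here is a stand-alone copy of it. FieldMatch and Matches are hypothetical names, and this sketch uses an ordinal comparison, whereas the validator above calls StartsWith without a comparison argument:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-alone copy of the matching rule, for illustration only.
public static class FieldMatch
{
    // True when key equals a listed field, or starts with a listed field
    // followed by an underscore (for example a culture variant such as "nodeName_en-us").
    public static bool Matches(IEnumerable<string> fields, string key)
    {
        return fields.Any(x => x.Equals(key) || key.StartsWith(x + "_", StringComparison.Ordinal));
    }
}
```

With the list { "nodeName" }, the keys "nodeName" and "nodeName_en-us" match and "nodeNameExtra" does not. Note that the rule is a plain prefix test, so a list containing "path" would also match a key such as "path_custom", whether or not the suffix is a culture code.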

Then I use it in a custom LuceneIndexCreator.

public class ContentSearchIndexCreator : LuceneIndexCreator, IUmbracoIndexesCreator
{
    private readonly IProfilingLogger _profilingLogger;
    private readonly ILocalizationService _languageService;

    public ContentSearchIndexCreator(IProfilingLogger profilingLogger, ILocalizationService languageService)
    {
        _profilingLogger = profilingLogger;
        _languageService = languageService;
    }

    public override IEnumerable<IIndex> Create()
    {
        return new[]
        {
                CreateContentIndex(
                    "ContentSearchIndex",
                    "ContentSearch",
                    new UmbracoFieldDefinitionCollection(),
                    new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30),
                    new ValueSetValidators.ContentValueSetValidator(true, true, null, null, new string[] { "TextPage", "RedirectNode" }, null, new string[] { "__IndexType", "__Published", "__Key", "__Path", "__VariesByCulture", "__NodeId", "id", "path", "nodeName", "pageGrid", "searchTags", "searchablePath" }, null)
                 )
            };
    }

    private IIndex CreateContentIndex(
        string name,
        string folderName,
        FieldDefinitionCollection fieldDefinitionCollection,
        Lucene.Net.Analysis.Analyzer luceneAnalyzer,
        IContentValueSetValidator contentValueSetValidator)
    {
        var index = new UmbracoContentIndex(
        name,
        CreateFileSystemLuceneDirectory(folderName),
        fieldDefinitionCollection,
        luceneAnalyzer,
        _profilingLogger,
        _languageService,
        contentValueSetValidator);

        return index;
    }
}
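
One thing to watch with the include list: the Validate method above still reads the published flag, the varies-by-culture flag and "path" after the field filter has run, and it returns Failed when the published flag or "path" is missing. A small sketch of a guard for that, assuming the Umbraco constants resolve to the "__Published" and "__VariesByCulture" names used in the list above. IncludeFieldList and WithRequiredFields are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Umbraco or Examine.
public static class IncludeFieldList
{
    // Fields the validator reads after filtering. The literal names are assumptions
    // taken from the include list in this post; prefer the Umbraco constants where available.
    private static readonly string[] RequiredFields = { "__Published", "__VariesByCulture", "path" };

    // Returns the caller's fields plus the required ones, without duplicates.
    public static string[] WithRequiredFields(IEnumerable<string> fields)
    {
        return fields.Concat(RequiredFields).Distinct(StringComparer.Ordinal).ToArray();
    }
}
```

The result could be passed as the includeFields argument, for example IncludeFieldList.WithRequiredFields(new[] { "nodeName", "pageGrid" }).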

As bonus info, I added the searchablePath field through an IComponent.

public class ExamineLuceneComponent : IComponent
{
    private readonly IExamineManager _examineManager;
    private readonly ILogger _logger;

    public ExamineLuceneComponent(IExamineManager examineManager, ILogger logger)
    {
        _logger = logger;
        _examineManager = examineManager;
    }

    public void Initialize()
    {
        var externalIndex = _examineManager.Indexes.FirstOrDefault(x => x.Name == "ContentSearchIndex");
        if (externalIndex != null)
        {
            ((BaseIndexProvider)externalIndex).TransformingIndexValues += ExamineLuceneComponent_TransformingIndexValues;
        }
    }

    private void ExamineLuceneComponent_TransformingIndexValues(object sender, IndexingItemEventArgs e)
    {
        if (e.ValueSet.Category == IndexTypes.Content)
        {
            try
            {
                var value = e.ValueSet.Values.Where(x => x.Key == "path").Select(x => x.Value).FirstOrDefault();
                if (value != null && value.Any())
                {
                    var list = new List<object>();
                    var path = value.First().ToString().Replace(",", " ");
                    list.Add(path);

                    var searchablePath = e.ValueSet.Values.FirstOrDefault(x => x.Key == "searchablePath");
                    if (searchablePath.Key != null)
                    {
                        searchablePath.Value.Clear();
                        searchablePath.Value.Add(path); // add the path string itself, not the list wrapped as a single value
                    }
                    else
                    {
                        e.ValueSet.Values.Add("searchablePath", list);
                    }
                }
            }
            catch (Exception ex)
            {
                _logger.Error<ExamineLuceneComponent>(ex, "Error munging fields for {ValueSetId}", e.ValueSet.Id);
            }
        }
    }

    public void Terminate() { }
}
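
The transformation in the handler above boils down to one string replacement, presumably so that each ancestor id in the path becomes its own searchable term. A stand-alone sketch, where SearchablePath and FromPath are hypothetical names:

```csharp
using System;

// Hypothetical helper that isolates the string transformation used in the handler.
public static class SearchablePath
{
    // Turns a comma-separated Umbraco path into a space-separated one,
    // for example "-1,1056,1078" becomes "-1 1056 1078".
    public static string FromPath(string path)
    {
        if (path == null) throw new ArgumentNullException(nameof(path));
        return path.Replace(",", " ");
    }
}
```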

Then it gives these results: [screenshot]

The next step is to figure out how to include PDF and Word files.

Jo Kendal 32 posts 194 karma points

May 23, 2019 @ 12:53

Hi

I have just arrived at the requirement for search on a new build in U8.

This is all very different from U7!

I have got the regular indexes working - did you make any progress on Word/PDF indexing? I can't find anything out there presently.

Bo Jacobsen 610 posts 2409 karma points

Jun 13, 2019 @ 12:36

Hi Jo Kendal.

No luck with the file indexing yet.

Jo Kendal 32 posts 194 karma points

Jun 13, 2019 @ 13:40

Hi

I did get it working. Sorry. I'm pretty snowed under just now but I intend to post when I can.

Bo Jacobsen 610 posts 2409 karma points

Jul 19, 2019 @ 13:48

Hi again.

I also got the file indexing working now. I don't have the code at hand right now, so I will post it next week.

Bo Jacobsen 610 posts 2409 karma points

Jul 22, 2019 @ 10:38

I used TikaOnDotNet.TextExtraction to extract the text from files. You can find it here: https://github.com/KevM/tikaondotnet

First I create the index.

public class CustomIndexCreator : LuceneIndexCreator, IUmbracoIndexesCreator
{
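    private readonly IProfilingLogger _profilingLogger;
    private readonly ILocalizationService _localizationService;
    private readonly IPublicAccessService _publicAccessService;

    // _localizationService is used in Create(), so it has to be injected as well.
    public CustomIndexCreator(IProfilingLogger profilingLogger,
        ILocalizationService localizationService,
        IPublicAccessService publicAccessService)
    {
        _profilingLogger = profilingLogger;
        _localizationService = localizationService;
        _publicAccessService = publicAccessService;
    }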

    public override IEnumerable<IIndex> Create()
    {
        var index = new UmbracoContentIndex("MediaFileIndex",
            CreateFileSystemLuceneDirectory("MediaFileIndex"),
            new UmbracoFieldDefinitionCollection(),
            new StandardAnalyzer(Version.LUCENE_30),
            _profilingLogger,
            _localizationService,
            new ContentValueSetValidator(true, false, _publicAccessService, includeItemTypes: new string[] { "File" }));

        return new[] { index };
    }
}

Then I add the index to Examine and look it up again to attach a TransformingIndexValues handler, where I read the text content from the file types I care about, such as pdf and docx. You can add many more, just be aware that it will also try to read the content of video and audio files if you don't limit it.

using System;
using System.Linq;
using System.Web.Hosting;
using Examine;
using Examine.Providers;
using Umbraco.Core.Composing;
using Umbraco.Core.Services;
using Umbraco.Examine;
using Umbraco.Core.Logging;
using TikaOnDotNet.TextExtraction;

public class IndexCreatorComponent : IComponent
{
    private readonly IExamineManager _examineManager;
    private readonly CustomIndexCreator _customIndexCreator;
    private readonly ILogger _logger;

    public IndexCreatorComponent(IExamineManager examineManager, CustomIndexCreator customIndexCreator, ILogger logger)
    {
        _examineManager = examineManager;
        _customIndexCreator = customIndexCreator;
        _logger = logger;
    }

    public void Initialize()
    {
        foreach (var index in _customIndexCreator.Create())
        {
            _examineManager.AddIndex(index);
        }

        if (_examineManager.TryGetIndex("MediaFileIndex", out IIndex customMediaIndex))
        {
            if (customMediaIndex is BaseIndexProvider indexProviderMedia)
            {
                indexProviderMedia.TransformingIndexValues += IndexProviderTransformingIndexValues;
            }
        }
    }

    private void IndexProviderTransformingIndexValues(object sender, IndexingItemEventArgs e)
    {
        if (e.ValueSet.Category == IndexTypes.Media)
        {
            var field = e.ValueSet.Values.FirstOrDefault(x => x.Key.Equals("umbracoFile"));
            if (field.Value == null)
            {
                return; // this item has no umbracoFile value
            }

            foreach (var value in field.Value)
            {
                if (value != null)
                {
                    try
                    {
                        var fileVirtualPath = value.ToString();

                        // Only extract text from the file types we want to index.
                        if (fileVirtualPath.EndsWith(".docx") || fileVirtualPath.EndsWith(".pdf"))
                        {
                            var filePath = HostingEnvironment.MapPath(fileVirtualPath);

                            var textExtractor = new TextExtractor();
                            var textExtractionResult = textExtractor.Extract(filePath);

                            e.ValueSet.TryAdd("fileTextContent", textExtractionResult.Text);
                        }
                    }
                    catch (Exception exception)
                    {
                        _logger.Error<IndexCreatorComponent>(exception, "Error extracting text from file");
                    }
                }
            }
        }
    }

    public void Terminate() { }
}
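
The extension check in the handler above is case-sensitive, so a file named Report.PDF would be skipped. A minimal sketch of a case-insensitive whitelist that is also easier to extend. ExtractableFiles and IsExtractable are hypothetical names, and the list of extensions is only an example:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of Umbraco or TikaOnDotNet.
public static class ExtractableFiles
{
    // Example whitelist: extend it with the file types you actually want to index.
    private static readonly HashSet<string> AllowedExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".pdf", ".docx" };

    // True when the path ends in one of the allowed extensions, ignoring case.
    public static bool IsExtractable(string fileVirtualPath)
    {
        if (string.IsNullOrWhiteSpace(fileVirtualPath))
        {
            return false;
        }

        return AllowedExtensions.Contains(Path.GetExtension(fileVirtualPath));
    }
}
```

In the handler, the condition could then read if (ExtractableFiles.IsExtractable(fileVirtualPath)), which keeps video and audio files out unless they are added to the list on purpose.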

The last thing is to add CustomIndexCreator and IndexCreatorComponent to the startup.

public class InstallIndexCreatorComposer : IUserComposer
{
    public void Compose(Composition composition)
    {
        composition.RegisterUnique<CustomIndexCreator>();
        composition.Components().Append<IndexCreatorComponent>();
    }
}

michael farrell 2 posts 72 karma points

Jul 27, 2023 @ 12:50

When I add this to the Validate function, 0 documents are returned, where there were over 700 before:

ValueSetValidationResult baseValidateResult = ValueSetValidationResult.Valid;

        // Checking IncludeFields and ExcludeFields for exact key name or culture name.
        foreach (var key in valueSet.Values.Keys.ToList())
        {
            if (IncludeFields != null && !IncludeFields.Any(x => x.Equals(key) || key.StartsWith($"{x}_")))
            {
                valueSet.Values.Remove(key); //remove any value with a key that doesn't match the inclusion list
                baseValidateResult = ValueSetValidationResult.Filtered;
            }

            if (ExcludeFields != null && ExcludeFields.Any(x => x.Equals(key) || key.StartsWith($"{x}_")))
            {
                valueSet.Values.Remove(key); //remove any value with a key that matches the exclusion list
                baseValidateResult = ValueSetValidationResult.Filtered;
            }
        }

        var isFiltered = baseValidateResult == ValueSetValidationResult.Filtered;
        return isFiltered ? ValueSetValidationResult.Filtered : ValueSetValidationResult.Valid;
