examine using IndexingNodeDataEventArgs
Hey guys!
For the last few days I've been figuring out how to use Examine and Lucene, and now I'm ready to do some programming!
But I have an issue.
I need to add a field to the index, which I do with e.Fields.Add("Field", value);
My problem is that I need to get some information from my IndexingNodeDataEventArgs e parameter, and it seems like e contains every document each time, so I can't figure out how to read a field on the current document to get the information I want to put into the new field.
How do I differentiate to get the data I need when e contains the same thing each time an item gets indexed? Shouldn't e be specific to the document, so that I could get information from e and set it to be indexed as a field?
Hope you guys know what I want to do, otherwise please ask :)
Thanks in advance.
- Niclas Schumacher
Niclas,
You have e.Node, where Node is an XElement, so you can get the data from there, or you can take e.NodeId and new up a Document.
Regards
Ismail
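To illustrate the e.Node route, here is a minimal sketch that reads one property from the node XML. It assumes property values appear as child elements named after the property alias; that layout is an assumption and depends on the XML schema your Umbraco version uses. The NodeFieldReader class and GetPropertyValue method are hypothetical helper names, not part of Examine or Umbraco.

```csharp
using System;
using System.Xml.Linq;

public static class NodeFieldReader
{
    // Returns the trimmed text of the child element named by 'alias',
    // or null when the element is missing or contains only whitespace.
    // Assumption: property values are child elements of the node element.
    public static string GetPropertyValue(XElement node, string alias)
    {
        if (node == null) throw new ArgumentNullException("node");

        XElement property = node.Element(alias);
        if (property == null) return null;

        string value = property.Value.Trim();
        return value.Length == 0 ? null : value;
    }
}
```

Inside an indexing handler this could be called as NodeFieldReader.GetPropertyValue(e.Node, "borger"), since e.Node describes only the item currently being indexed.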
Thanks Ismail, that did the job.
I've got another problem too, sadly.
The information I need to fetch for indexing contains HTML, so at the moment the HTML is searchable, which isn't great.
So my question is: can I, with Lucene/Examine, skip the HTML and only index the data? Or is there a workaround? The data is generated by an external service which we don't have any control over, sadly.
My code:
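// Load the Umbraco document for the node currently being indexed.
var document = new Document(e.NodeId);
var property = document.getProperty("borger");
if (property != null && property.Value != null)
{
    // Value is an object, so test its string form instead of comparing references with "".
    var rawValue = property.Value.ToString();
    if (!string.IsNullOrEmpty(rawValue))
    {
        // The property holds XML whose ArticleId attribute points at the external article.
        XElement xmlProperty = XElement.Parse(rawValue);
        var currentNode = _repository.GetArticleById(Convert.ToInt32(xmlProperty.Attribute("ArticleId").Value));
        var content = _repository.SplitArticleContent(currentNode.Content);
        e.Fields.Add("Borger", content.MainContent);
    }
}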
Niclas,
Is the code above for another issue which is now fixed? And do you now have a different issue, where you are indexing custom data (not Umbraco data?), or is it Umbraco data but the content is entered into an Umbraco document from a third-party source? I just need the context of the problem, then hopefully I should be able to fix it.
During indexing, Examine should strip out HTML.
Regards
Ismail
Niclas,
OK, I just reread your code and now understand the problem. Can you take a look at your ExamineSettings.config file and check what the analyzer is set to for the indexer? I still think it should strip the HTML. However, what you could try is:
e.Fields.Add("Borger", umbraco.library.StripHtml(content.MainContent));
That should also strip it out, just giving you the content without the HTML.
Regards
Ismail
Hello Ismail,
The code is working at the moment, but the Borger field is getting all the HTML with it. So when I search for div, for instance, I get that document because there is a div in the content :) which I would like not to happen.
The id is stored in Umbraco; I fetch the id and use my repository to get the content from the external service, so I have no control over what the HTML looks like.
What I want to achieve is to get rid of the HTML at indexing time, if that is possible, or at least make sure the HTML tags and attributes aren't searchable.
Thanks for the help!
Niclas,
Try umbraco.library.StripHtml, which should get rid of it; it's part of the Umbraco core. If that doesn't work, you could load the HTML into an HtmlAgilityPack document and use that to strip the HTML. I think later Umbraco versions ship with HtmlAgilityPack; if not, you can download the library from http://htmlagilitypack.codeplex.com/
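For readers who want to see what stripping HTML before indexing can look like, here is a self-contained sketch using only the .NET base class library. It is a regex-based simplification, not the implementation of umbraco.library.StripHtml and not HtmlAgilityPack. The HtmlText class and ToPlainText method are hypothetical names. It is reasonable for feeding an index, but it will not cope with every piece of malformed markup.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Script and style blocks are dropped together with their contents.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Any remaining tag, including its attributes.
    private static readonly Regex Tag = new Regex(@"<[^>]+>", RegexOptions.Singleline);

    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        string text = ScriptOrStyle.Replace(html, " ");
        // Replace tags with a space so words in adjacent elements are not glued together.
        text = Tag.Replace(text, " ");
        // Turn entities such as &amp; back into the characters they stand for.
        text = WebUtility.HtmlDecode(text);
        // Collapse the runs of whitespace left behind by the removed tags.
        return Whitespace.Replace(text, " ").Trim();
    }
}
```

In the handler from earlier in the thread, the field could then be added as e.Fields.Add("Borger", HtmlText.ToPlainText(content.MainContent)), so a search for div no longer matches on the markup.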
I'll check it out if we can't get Examine to do it.
I'm just using the standard analyzer with my indexer.
Niclas,
Right. When Umbraco content is being indexed, either Umbraco or Examine strips out the HTML; I suspect it's Umbraco doing it. In your case it's external data, so Umbraco has no knowledge of it and the HTML is still there. So try umbraco.library.StripHtml, and failing that use HtmlAgilityPack.
Regards
Ismail
Okay, I'll give it a try.
And by the way, many thanks for being so active in blogging and on the forums about this topic; you keep showing up in more or less every search I make.
One last question: I've seen in a lot of places that the Lucene in Action book is the best way to learn about Lucene. Can you confirm that this is still the case? Or is there better reading elsewhere now that I'm using Lucene.Net and Examine?
Once again, thanks for the help!
Niclas,
I highly recommend Lucene in Action, 2nd edition, just to get a deeper understanding of Lucene. There is also my session from the UK festival (video and slides) at http://umbracoukfestival.co.uk/videos-photos/ and I did 12/13 posts going into a bit more detail on different indexing topics with Examine; see http://thecogworks.co.uk/blog/posts/2012/november/examiness-hints-and-tips-from-the-trenches-part-1/
Regards
Ismail
Okay, that was also my thought.
I've been through all of those, great stuff! :)