Indexing suggestion
Martijn
You could do what I do and use the PDF IFilter. It assumes the IFilter is installed on the server; most servers have it, and if not, it's a free download from Adobe. I am providing a link to the parser class I use, which you could incorporate into your updated umbSearch (see here). In my indexing code, where I get the content for the PDF, I have:
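res = getTextUsingIFilter(FullPathToFile);
// IFilter returned nothing, so fall back to PDFBox
if (res.Length == 0)
{
    // use PDFBox or whatever
}

// Extracts the text of a file through its installed IFilter.
// Returns an empty string if nothing could be extracted.
private string getTextUsingIFilter(string FullPathToFile)
{
    string docText = "";
    try
    {
        // Guard against a null result so the Length checks cannot throw
        docText = Seekafile.Server.IFilter.DefaultParser.Extract(FullPathToFile) ?? "";
        if (docText.Length != 0)
        {
            logMessage(FullPathToFile + " indexed using ifilter", umbraco.BusinessLogic.LogTypes.Debug);
        }
    }
    catch (Exception ex)
    {
        // A missing or failing IFilter is logged, and the caller sees an empty string
        logMessage(ex.Message, umbraco.BusinessLogic.LogTypes.Error);
    }
    return docText;
}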
Regards
Ismail
Thanks, I will definitely have a look at it. By the way, do you have a good MS Word parser?
I think Darren suggested one on twitter... Might wanna search google for dsofile.dll (Credits go to Darren!!)
Cheers,
/Dirk
Hello Dirk,
I've checked that one. Unfortunately, it only exposes the metadata of an MS Word document, such as the template used and the word count. It doesn't give access to the text content.
Love to hear another suggestion!
Hans
Hi,
Here's some more information on using IFilter to extract text from MS docs for use in Lucene:
http://stackoverflow.com/questions/465302/what-is-the-best-way-to-parse-microsoft-office-and-pdf-documents
Looks like it shouldn't be too tricky......
T
Hello everybody,
This morning I uploaded UmbSearch2 v1.1 to CodePlex. Thanks to some very helpful input from Ismail Mayat and Tim Saunders, I decided to use IFilters instead of PDFBox. The UmbSearch2 package immediately shrank from 8 MB to 57 KB! The only catch is that you need to have the right IFilters installed on your web server, but most servers already have these.
Download: umbsearch2.codeplex.com
New features:
- Document nodes are added to the search index on publishing and removed from the index when you unpublish a page.
- You can hide document nodes from indexing by using the property alias 'umbracoSearchHide'.
- You can easily determine which document properties should be searched.
- You can also determine the parent document from which UmbSearch2 starts searching.
Well, enjoy it!
Hi Hans,
I am desperately trying to include in the search index all PDF and Word documents that are uploaded via the 'Upload' data type on a document type. These uploaded files are stored in the /media folder but are not accessible through (nor listed in) the Media section, so they are not getting indexed.
At CodeGarden '09 both Tim Geyssens and Ismail Mayat gave me some very good pointers on how to include the uploaded files from the /media folder, but I must admit that I haven't succeeded in writing the needed code. So what would be more convenient than asking you whether this feature is already part of the new UmbSearch2 package or, if not, whether you would consider including it in the roadmap?
Basically I'm wondering if the inclusion of an 'Upload' property in the search automatically catches any uploaded pdf and/or word documents?
I'll appreciate any help in this matter.
Thanks!
@Soren
Unfortunately, this will not work. UmbSearch2 only indexes files that are stored as media items in the CMS. When you upload a file through the 'Media Section', the file is saved as a media item. When you use the UploadField control on a document type, the uploaded file is NOT saved as a media item. That explains why it's not working in your situation.
A solution would be to download the source code and add some functionality to index files in the media folder that are not media items. Just add that code to the UmbSearch2.Search.Indexer.ReIndex() method.
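A minimal sketch of the file-discovery half of that idea, assuming .NET 3.5 or later. MediaFolderScanner, FindIndexableFiles and the extension list are hypothetical names, not part of UmbSearch2. Working out which files already belong to media items, and wiring the result into ReIndex(), is left out because it depends on the UmbSearch2 source.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of UmbSearch2.
public static class MediaFolderScanner
{
    // Assumed set of extensions worth indexing; adjust to the IFilters installed.
    private static readonly string[] IndexableExtensions = { ".pdf", ".doc", ".docx" };

    // Returns the full paths of all files under mediaRoot (searched recursively)
    // whose extension is listed in IndexableExtensions.
    public static List<string> FindIndexableFiles(string mediaRoot)
    {
        List<string> result = new List<string>();
        if (!Directory.Exists(mediaRoot))
        {
            return result;
        }

        foreach (string path in Directory.GetFiles(mediaRoot, "*", SearchOption.AllDirectories))
        {
            string extension = Path.GetExtension(path);
            foreach (string candidate in IndexableExtensions)
            {
                // Compare case-insensitively so "REPORT.PDF" is picked up too.
                if (string.Equals(extension, candidate, StringComparison.OrdinalIgnoreCase))
                {
                    result.Add(path);
                    break;
                }
            }
        }
        return result;
    }
}
```

The physical path of the media folder would normally be resolved on the request thread, for example with Server.MapPath("~/media"), and each returned path could then be handed to something like Ismail's getTextUsingIFilter.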
If it's working, please let me know. Then I will include it in the next release!
Hans
Hi Hans,
Great work! I have a small piece of feedback that made my life a bit difficult when I tried to play with your lucene.net package.
If the index is somehow invalid then a reindex is started using a thread. Unfortunately the new thread does not have httpcontext etc. so the HttpContext.Current.Server.MapPath fails. The MapPath is used to find the directory where the index is stored so the reindex will fail - but it continues to keep rebuilding for a while.
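One common way around this is to resolve the path on the request thread, while HttpContext.Current is still available, and hand the resulting string to the worker thread. A minimal sketch, in which BackgroundReindexer and its members are hypothetical names rather than UmbSearch2 code, and the actual rebuild is stubbed out:

```csharp
using System;
using System.Threading;

// Hypothetical illustration, not UmbSearch2 code.
public static class BackgroundReindexer
{
    // Last index directory handed to the worker; exposed only so the sketch is observable.
    public static string LastIndexDirectory;

    // Call this from the request thread with an already-resolved physical path,
    // e.g. the result of HttpContext.Current.Server.MapPath(...).
    public static Thread Start(string indexDirectory)
    {
        // The lambda captures the plain string, so the worker never needs HttpContext.Current.
        Thread worker = new Thread(() => Rebuild(indexDirectory));
        worker.IsBackground = true;
        worker.Start();
        return worker;
    }

    private static void Rebuild(string indexDirectory)
    {
        // Placeholder for the real Lucene.Net rebuild.
        LastIndexDirectory = indexDirectory;
    }
}
```

Another option is System.Web.Hosting.HostingEnvironment.MapPath, which maps a virtual path without depending on the current request.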
Also a static variable could be used to make sure that the reindexer is only running in one thread.
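A minimal sketch of such a guard, using a static flag that is flipped atomically with Interlocked.CompareExchange. ReindexGuard and TryRun are hypothetical names, not part of UmbSearch2, and the rebuild itself is passed in as a delegate:

```csharp
using System;
using System.Threading;

// Hypothetical illustration, not UmbSearch2 code.
public static class ReindexGuard
{
    // 0 = idle, 1 = a reindex is in progress.
    private static int running;

    // Runs the given rebuild action unless one is already in progress.
    // Returns true if this call performed the rebuild, false if it was skipped.
    public static bool TryRun(Action rebuild)
    {
        // Atomically switch 0 -> 1; any other caller sees 1 and backs off.
        if (Interlocked.CompareExchange(ref running, 1, 0) != 0)
        {
            return false;
        }

        try
        {
            rebuild();
            return true;
        }
        finally
        {
            // Always release the flag, even if the rebuild throws.
            Interlocked.Exchange(ref running, 0);
        }
    }
}
```

A plain bool checked and set in two steps would leave a small window in which two threads could both start a rebuild, which is why the sketch uses an atomic compare-and-swap.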
Kind regards
Lars Buur
Chainbox ApS
@Lars,
Yep, you are right. I will change this as soon as possible. Probably tomorrow I will release v1.2. That one will also work with multi-lingual websites. So you have something else to look forward to :-)
Hans
@Lars,
@If the index is somehow invalid then a reindex is started using a thread. Unfortunately the new thread does not have httpcontext etc. so the HttpContext.Current.Server.MapPath fails. The MapPath is used to find the directory where the index is stored so the reindex will fail - but it continues to keep rebuilding for a while.
I fixed the problem that the HttpContext wasn't known. Unfortunately, the index will still not be built properly because, for some strange reason, it's no longer possible to get the properties of a Document. I really don't get what is happening here. When I reindex in the normal way (I use the page reindexsite.aspx), everything goes as expected: properties can be read and added to the index. When reindexing starts as a result of a broken index, and this reindexing is done on a different thread, the properties collection is empty ([0]).
Because of this I decided not to do the reindexing on a different thread anymore. A user who is unfortunate enough to perform a search when the index is broken will have to wait a little while before getting search results. I don't expect this to happen very often.
Hans
Hi Ismail, do you have a package of umbSearch that works for Umbraco 4?
If so, could you send it to my email, [email protected]?
Thanks