indexing suggestion

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Jul 14, 2009 @ 12:16

1

Indexing suggestion

Martijn

You could do what I do and use the PDF IFilter, although that assumes it is installed on the server. Most servers have it; if not, it's a free download from Adobe. I am providing a link to the parser class I use, which you could incorporate into your updated umbSearch (see here). In my indexing code, where I get the content for the PDF, I have:

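            // Try the IFilter first; an empty string means it extracted nothing.
            res = getTextUsingIFilter(FullPathToFile);

            // IFilter didn't work, so fall back to PDFBox
            if (res.Length == 0)
            {
                // use PDFBox or whatever
            }

        // Returns the text of the file, or "" if the IFilter fails or finds no text.
        private string getTextUsingIFilter(string FullPathToFile)
        {
            string docText = "";
            try
            {
                // Guard against a null result so the Length checks below are safe.
                docText = Seekafile.Server.IFilter.DefaultParser.Extract(FullPathToFile) ?? "";
                if (docText.Length != 0)
                {
                    logMessage(FullPathToFile + " indexed using ifilter", umbraco.BusinessLogic.LogTypes.Debug);
                }
            }
            catch (Exception ex)
            {
                // Log and swallow the error so the caller can try the fallback parser.
                logMessage(ex.Message, umbraco.BusinessLogic.LogTypes.Error);
            }
            return docText;
        }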

Regards

Ismail

Hans 23 posts 57 karma points

Jul 14, 2009 @ 16:59

0

Thanks, I will definitely have a look at it. By the way, do you have a good MS Word parser?

Dirk De Grave 4541 posts 6021 karma points MVP 3x admin c-trib

Jul 14, 2009 @ 17:09

0

I think Darren suggested one on Twitter... You might want to search Google for dsofile.dll (credits go to Darren!)

Cheers,

/Dirk

Hans 23 posts 57 karma points

Jul 16, 2009 @ 08:38

0

Hello Dirk,

I've checked that one. Unfortunately, it only exposes the metadata of an MS Word document, such as the template used and the word count. It doesn't give access to the text content.

Love to hear another suggestion!

Hans

Tim 225 posts 690 karma points

Jul 16, 2009 @ 12:23

1

Hi,

Here's some more information on using IFilter to extract text from MS Office documents for use in Lucene:

http://stackoverflow.com/questions/465302/what-is-the-best-way-to-parse-microsoft-office-and-pdf-documents

Looks like it shouldn't be too tricky......

T

Hans 23 posts 57 karma points

Jul 17, 2009 @ 10:35

2

Hello everybody,

This morning I uploaded UmbSearch2 v1.1 to CodePlex. Thanks to some very helpful input from Ismail Mayat and Tim Saunders, I decided to use IFilters instead of PDFBox. The UmbSearch2 package immediately shrank from 8 MB to 57 KB! The only catch is that you need to have the right IFilters installed on your web server, but most servers already have these.

Download: umbsearch2.codeplex.com

New features:

- Document nodes are added to the search index on publishing and removed from the index when you unpublish a page.

- You can hide document nodes from indexing by using the property alias 'umbracoSearchHide'.

- You can easily determine which document properties should be searched.

- You can also determine the parent document from which UmbSearch2 starts searching.

Well, enjoy it!

Søren Tidmand 129 posts 366 karma points

Jul 19, 2009 @ 00:39

0

Hi Hans,

I am desperately trying to include in the search index all PDF and Word documents that are uploaded via the 'Upload' data type on a document type. These uploaded files are stored in the /media folder but are not accessible through (nor listed in) the Media section. Therefore they are not getting indexed.

At CodeGarden '09 both Tim Geyssens and Ismail Mayat gave me some very good pointers on how to include the uploaded files from the /media folder, but I must admit that I haven't succeeded in writing the needed code. So what would be more convenient than asking you whether this feature is already part of the new UmbSearch2 package or, if not, whether you would consider including it in the roadmap?

Basically I'm wondering whether including an 'Upload' property in the search automatically catches any uploaded PDF and/or Word documents.

I'll appreciate any help in this matter.

Thanks!

Hans 23 posts 57 karma points

Jul 20, 2009 @ 09:11

0

@Søren

This will unfortunately not work. UmbSearch2 only indexes files that are uploaded through the CMS. When you upload a file through the 'Media Section', this file will be saved as a media item. When you use the UploadField control on a document type, the uploaded file will NOT be saved as a Media item. That explains why it's not working in your situation.

A solution would be to download the source code and add some functionality to index files in the media folder that are not media items. Just add that code to the UmbSearch2.Search.Indexer.ReIndex() method.
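As a rough starting point, the file discovery part could look something like the sketch below. It only finds candidate files; it does not touch the index. UploadedFileFinder, FindUnindexed and the extension list are made-up names and assumptions, not part of UmbSearch2, and the list of paths already known as media items is assumed to be supplied by the caller.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of UmbSearch2.
public static class UploadedFileFinder
{
    // Assumption: only these document types are worth indexing.
    private static readonly HashSet<string> Extensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".pdf", ".doc", ".docx" };

    // Returns the full paths of documents under mediaRoot that are not
    // in knownMediaItemPaths (the files that already exist as media items).
    public static List<string> FindUnindexed(string mediaRoot, IEnumerable<string> knownMediaItemPaths)
    {
        // Normalise the known paths so they compare reliably with the scan results.
        var known = new HashSet<string>(
            knownMediaItemPaths.Select(p => Path.GetFullPath(p)),
            StringComparer.OrdinalIgnoreCase);

        return Directory
            .EnumerateFiles(mediaRoot, "*", SearchOption.AllDirectories)
            .Select(f => Path.GetFullPath(f))
            .Where(f => Extensions.Contains(Path.GetExtension(f)) && !known.Contains(f))
            .OrderBy(f => f, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}
```

Each path it returns could then be passed through the same text extraction and indexing steps that the media items already go through.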

If it's working, please let me know. Then I will include it in the next release!

Hans

Lars Buur 28 posts 20 karma points

Jul 23, 2009 @ 09:39

0

Hi Hans,

Great work! I have a small piece of feedback about something that made my life a bit difficult when I tried to play with your Lucene.Net package.

If the index is somehow invalid, a reindex is started on a new thread. Unfortunately, the new thread has no HttpContext, so HttpContext.Current.Server.MapPath fails. MapPath is used to find the directory where the index is stored, so the reindex fails, but it keeps trying to rebuild for a while.

Also a static variable could be used to make sure that the reindexer is only running in one thread.
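A minimal sketch of both points, assuming the index directory is resolved on the request thread (where HttpContext is still available) and handed to the worker as a plain string. ReindexGuard, TryStart and the rebuild delegate are made-up names for illustration, not part of UmbSearch2:

```csharp
using System;
using System.Threading;

// Hypothetical helper, not part of UmbSearch2.
public static class ReindexGuard
{
    // 0 = idle, 1 = a reindex is running. Static, so shared by all threads.
    private static int running;

    // indexDirectory must already be a physical path, resolved by the caller
    // before the worker starts, so the worker never needs HttpContext.
    // Returns false if a reindex is already in progress.
    public static bool TryStart(string indexDirectory, Action<string> rebuild)
    {
        // Atomically flip 0 -> 1; any other caller sees 1 and backs off.
        if (Interlocked.CompareExchange(ref running, 1, 0) != 0)
            return false;

        var worker = new Thread(() =>
        {
            try
            {
                rebuild(indexDirectory);
            }
            catch (Exception ex)
            {
                // An unhandled exception on a worker thread would take the process down.
                Console.Error.WriteLine("Reindex failed: " + ex.Message);
            }
            finally
            {
                // Release the guard whether the rebuild succeeded or not.
                Interlocked.Exchange(ref running, 0);
            }
        });
        worker.IsBackground = true;
        worker.Start();
        return true;
    }
}
```

The caller would do the MapPath lookup first and pass the result in. System.Web.Hosting.HostingEnvironment.MapPath is another option, since it does not depend on the current request.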

Kind regards
Lars Buur
Chainbox ApS

Hans 23 posts 57 karma points

Jul 23, 2009 @ 17:08

0

@Lars,

Yep, you are right. I will change this as soon as possible. I will probably release v1.2 tomorrow. That one will also work with multilingual websites. So you have something else to look forward to :-)

Hans

Hans 23 posts 57 karma points

Jul 24, 2009 @ 13:47

0

@Lars,

"If the index is somehow invalid then a reindex is started using a thread. Unfortunately the new thread does not have httpcontext etc. so the HttpContext.Current.Server.MapPath fails. The MapPath is used to find the directory where the index is stored so the reindex will fail - but it continues to keep rebuilding for a while."

I fixed the problem of the HttpContext not being available. Unfortunately, the index is still not built properly, because for some strange reason it is no longer possible to get the properties of a Document. I really don't understand what is happening here. When I reindex in the normal way (I use the page reindexsite.aspx), everything works as expected: properties can be read and added to the index. When reindexing starts as a result of a broken index, and this reindexing is done on a different thread, the properties collection is empty.

Because of this I decided not to reindex on a different thread anymore. A user unlucky enough to search while the index is broken will have to wait a little while for the search results. I don't expect this to happen very often.

Hans

Vipul Patel 18 posts 37 karma points

Nov 03, 2009 @ 18:09

0

Hi Ismail, do you have a package of umbSearch that works with Umbraco 4?

If so, could you send it to me by email?

Thanks
