  • Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib
    Jul 14, 2009 @ 12:16
    Ismail Mayat
    1

    Indexing suggestion

    Martijn

    You could do what I do and use the PDF IFilter. It assumes the filter is installed on the server; most servers have it, and if not, it's a free download from Adobe. I am providing a link to the parser class I use, which you could incorporate into your updated umbSearch (see here). In my indexing code, where I get the content for the PDF, I have:

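        // Try the IFilter first; it yields an empty string when extraction fails.
        res = getTextUsingIFilter(FullPathToFile);

        // The IFilter didn't return anything, so fall back to PDFBox.
        if (res.Length == 0)
        {
            // use PDFBox or whatever
        }

        // Extracts text through the installed IFilter; returns "" on failure.
        private string getTextUsingIFilter(string FullPathToFile)
        {
            string docText = "";
            try
            {
                // Guard against a null result so the Length checks stay safe.
                docText = Seekafile.Server.IFilter.DefaultParser.Extract(FullPathToFile) ?? "";
                if (docText.Length != 0)
                {
                    logMessage(FullPathToFile + " indexed using ifilter", umbraco.BusinessLogic.LogTypes.Debug);
                }
            }
            catch (Exception ex)
            {
                // ex.Message is already a string; no ToString() needed.
                logMessage(ex.Message, umbraco.BusinessLogic.LogTypes.Error);
            }
            return docText;
        }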

     

    Regards

     

    Ismail

  • Hans 23 posts 57 karma points
    Jul 14, 2009 @ 16:59
    Hans
    0

    Thanks, I will definitely have a look at it. By the way, do you have a good MS Word parser?

  • Dirk De Grave 4541 posts 6021 karma points MVP 3x admin c-trib
    Jul 14, 2009 @ 17:09
    Dirk De Grave
    0

    I think Darren suggested one on twitter... Might wanna search google for dsofile.dll (Credits go to Darren!!)

     

    Cheers,

    /Dirk  

     

  • Hans 23 posts 57 karma points
    Jul 16, 2009 @ 08:38
    Hans
    0

    Hello Dirk,

    I've checked that one. Unfortunately it only exposes the metadata of an MS Word document, such as the template used and the word count. It doesn't give access to the text content.

    Love to hear another suggestion!

    Hans

  • Tim 225 posts 690 karma points
    Jul 16, 2009 @ 12:23
    Tim
    1

    Hi,

    Here's some more information on using IFilter to extract text from MS docs for use in Lucene:

    http://stackoverflow.com/questions/465302/what-is-the-best-way-to-parse-microsoft-office-and-pdf-documents 

    Looks like it shouldn't be too tricky......

    T

  • Hans 23 posts 57 karma points
    Jul 17, 2009 @ 10:35
    Hans
    2

    Hello everybody,

    This morning I uploaded UmbSearch2 v1.1 to CodePlex. Thanks to some very helpful input from Ismail Mayat and Tim Saunders, I decided to use IFilters instead of PDFBox. The UmbSearch2 package immediately shrank from 8 MB to 57 KB! The only catch is that you need to have the right IFilters installed on your web server, but most servers already have these.

    Download: umbsearch2.codeplex.com

    New features:

    - Document nodes are added to the search index on publishing and removed from the index when you unpublish a page.

    - You can hide document nodes from indexing by using the property alias 'umbracoSearchHide'.

    - You can easily determine the document properties that should be searched.

    - You can also set the parent document from which UmbSearch2 starts searching.

    Well, enjoy it!

  • Søren Tidmand 129 posts 366 karma points
    Jul 19, 2009 @ 00:39
    Søren Tidmand
    0

    Hi Hans,

    I am desperately trying to include in the search indexing all pdf and word documents that are being uploaded via the data type 'Upload' on a document type. These uploaded files are being stored in the /media folder but are not accessible through (nor listed in) the Media section. Therefore they are not getting indexed.

    At CodeGarden '09 both Tim Geyssens and Ismail Mayat gave me some very good pointers on how to include the uploaded files from the /media folder, but I must admit that I haven't succeeded in writing the needed code. So I'm asking you: is this feature already part of the new UmbSearch2 package, and if not, would you consider including it in the roadmap?

    Basically I'm wondering if the inclusion of an 'Upload' property in the search automatically catches any uploaded pdf and/or word documents?

    I'll appreciate any help in this matter.

    Thanks!

  • Hans 23 posts 57 karma points
    Jul 20, 2009 @ 09:11
    Hans
    0

    @Soren

    This will unfortunately not work. UmbSearch2 only indexes files that are uploaded through the CMS as media items. When you upload a file through the Media section, the file is saved as a media item. When you use the UploadField control on a document type, the uploaded file is NOT saved as a media item. That explains why it's not working in your situation.

    A solution would be to download the source code and add some functionality to index files in the media folder that are not media items. Just add that code to the UmbSearch2.Search.Indexer.ReIndex() method.
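
    A rough sketch of what that extra step could look like, in case it helps anyone get started. This is not UmbSearch2 code: the class name OrphanMediaFinder, the method FindUnindexedFiles and the extension list are all made up for illustration, and it assumes you can already supply the physical path of the /media folder plus the paths of the files that were indexed as media items. It only finds the files; feeding them to the indexer is left out.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of UmbSearch2.
public static class OrphanMediaFinder
{
    // Assumed set of file types worth sending to an IFilter.
    private static readonly HashSet<string> IndexableExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".pdf", ".doc", ".docx" };

    // mediaRoot: physical path of the /media folder.
    // knownMediaPaths: physical paths of files already indexed as media items.
    // Returns the indexable files under mediaRoot that are not in knownMediaPaths.
    public static List<string> FindUnindexedFiles(string mediaRoot, IEnumerable<string> knownMediaPaths)
    {
        // Normalise the known paths so the comparison is not thrown off by
        // relative segments or casing.
        var known = new HashSet<string>(
            knownMediaPaths.Select(p => Path.GetFullPath(p)),
            StringComparer.OrdinalIgnoreCase);

        return Directory
            .GetFiles(mediaRoot, "*", SearchOption.AllDirectories)
            .Where(f => IndexableExtensions.Contains(Path.GetExtension(f)))
            .Where(f => !known.Contains(Path.GetFullPath(f)))
            .ToList();
    }
}
```

    Each path in the returned list could then be passed to the same text extraction used for media items.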

    If it's working, please let me know. Then I will include it in the next release!

    Hans

     

  • Lars Buur 28 posts 20 karma points
    Jul 23, 2009 @ 09:39
    Lars Buur
    0

    Hi Hans,

    Great work! I have a small piece of feedback that made my life a bit difficult when I tried to play with your lucene.net package.

    If the index is somehow invalid, a reindex is started on a new thread. Unfortunately the new thread does not have an HttpContext, so HttpContext.Current.Server.MapPath fails. MapPath is used to find the directory where the index is stored, so the reindex fails, but it keeps trying to rebuild for a while.

    Also a static variable could be used to make sure that the reindexer is only running in one thread.
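
    To make both points concrete, here is a minimal sketch, not a patch against the actual UmbSearch2 source. ReindexGuard, TryStartReindex and the rebuildIndex callback are names invented for this example. The idea is to resolve the index directory on the request thread (where Server.MapPath still works), hand the resulting string to the worker, and use a static flag so only one reindex runs at a time.

```csharp
using System;
using System.Threading;

// Hypothetical sketch, not UmbSearch2's actual code.
public static class ReindexGuard
{
    // 0 = idle, 1 = a reindex is currently running.
    private static int _running;

    // indexDirectory must be resolved by the caller on the request thread,
    // because HttpContext.Current is null on a thread you start yourself.
    // Returns false when a reindex is already in progress.
    public static bool TryStartReindex(string indexDirectory, Action<string> rebuildIndex)
    {
        // Atomically flip 0 -> 1; any other previous value means someone else is running.
        if (Interlocked.CompareExchange(ref _running, 1, 0) != 0)
        {
            return false;
        }

        var worker = new Thread(() =>
        {
            try
            {
                rebuildIndex(indexDirectory);
            }
            catch (Exception)
            {
                // Swallowed here so a failed rebuild cannot take down the process;
                // a real implementation would log the exception.
            }
            finally
            {
                // Always release the flag so a later reindex can start.
                Interlocked.Exchange(ref _running, 0);
            }
        });
        worker.IsBackground = true;
        worker.Start();
        return true;
    }
}
```

    Because the flag is released in the finally block, a failed rebuild does not block later attempts, and callers that get false back can simply skip starting another rebuild.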

    Kind regards
    Lars Buur
    Chainbox ApS

  • Hans 23 posts 57 karma points
    Jul 23, 2009 @ 17:08
    Hans
    0

    @Lars,

    Yep, you are right. I will change this as soon as possible. Probably tomorrow I will release v1.2. That one will also work with multi-lingual websites. So you have something else to look forward to :-)

    Hans

  • Hans 23 posts 57 karma points
    Jul 24, 2009 @ 13:47
    Hans
    0

    @Lars,

    @If the index is somehow invalid then a reindex is started using a thread. Unfortunately the new thread does not have httpcontext etc. so the HttpContext.Current.Server.MapPath fails. The MapPath is used to find the directory where the index is stored so the reindex will fail - but it continues to keep rebuilding for a while.

    I fixed the problem that the HttpContext wasn't known. Unfortunately, the index will not be built properly because, for some strange reason, it's no longer possible to get the properties of a Document. I really don't get what is happening here. When I reindex in the normal way (I use the page reindexsite.aspx), everything goes as normal: properties can be read and added to the index. When reindexing starts as a result of a broken index, and this reindexing is done on a different thread, the properties collection is [0].

    Because of this I decided not to do the reindexing on a different thread anymore. A user who is unfortunate enough to perform a search when the index is broken will have to wait a little while before getting search results. I don't expect this to happen very often.

    Hans

     

  • Vipul Patel 18 posts 37 karma points
    Nov 03, 2009 @ 18:09
    Vipul Patel
    0

    Hi Ismail, do you have a package of umbSearch that works for Umbraco 4?

    If so, could you send it to me by email?

     

    Thanks
