  • Yaco Zaragoza 88 posts 362 karma points
    May 03, 2023 @ 17:29
    Yaco Zaragoza
    0

    Parsing a Word or PDF Document

    Not sure if this is possible or not but figured I would ask.

    I have a Word document that I would like to parse and save the content into specific fields.

    The Word document has 5 sections:

    • Shift
    • Description
    • Requirements
    • Location
    • Apply Info

    I would like that content to be mapped automatically to the respective fields when I upload the file, so that I do not have to copy and paste each one.

  • Lewis Smith 211 posts 620 karma points c-trib
    May 03, 2023 @ 19:07
    Lewis Smith
    0

    Hi Yaco,

    This isn't an easy one, as searching through a Word doc for specific areas or words is tricky.

    I have had a look, and one option that could work for you is the Office interop objects in .NET: https://learn.microsoft.com/en-us/dotnet/csharp/advanced-topics/interop/how-to-access-office-interop-objects (bear in mind that interop needs Word installed on the machine running the code).

    There are plenty of questions on Stack Overflow, which should show you what is needed.

    If this were me, I would try to ditch the Word format and move to something more consistent, such as Excel or a plain .txt file. However, I appreciate this might not be possible.

    I have found this post on Stack Overflow; the example searches for a specific word in a .doc/.docx file: https://stackoverflow.com/questions/44699445/extract-words-from-a-doc-docx-file-c-sharp
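
    As an alternative to interop, a .docx file (not the older .doc) is a zip archive whose main text lives in word/document.xml, so the paragraphs can be read with the standard library alone. The following is a minimal sketch of that idea, not the method from the linked post; the class name DocxText and the method ReadParagraphs are made up for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, named for this example only.
public static class DocxText
{
    // WordprocessingML namespace used by word/document.xml inside a .docx
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of each paragraph, in document order.
    public static List<string> ReadParagraphs(Stream docx)
    {
        using var archive = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true);
        var entry = archive.GetEntry("word/document.xml")
            ?? throw new InvalidDataException("Not a .docx file: word/document.xml is missing.");

        using var xml = entry.Open();
        var document = XDocument.Load(xml);

        return document
            .Descendants(W + "p")                 // every paragraph element
            .Select(p => string.Concat(
                p.Descendants(W + "t")            // text runs inside the paragraph
                 .Select(t => t.Value)))
            .ToList();
    }
}
```

    It could be called with something like `using var fs = File.OpenRead(path); var paragraphs = DocxText.ReadParagraphs(fs);`. Text boxes and other nested content may not come out cleanly with this approach, so it would need testing against the real documents.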

    There are some issues, of course. What if your target section names (Shift, Description, etc.) appear in the content you want as well as in the titles? Performance is also something to keep an eye on. Basically, if you go down this route, do a good amount of testing, perhaps with some unit tests as well.
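
    One way to reduce the "section name inside the content" problem is to treat a line as a heading only when the whole line is the section name. This is a rough sketch under that assumption (each heading sits on its own line, optionally followed by a colon); SectionParser and Parse are hypothetical names, and the heading list is taken from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, named for this example only.
public static class SectionParser
{
    // The five section names listed in the question.
    public static readonly string[] Headings =
        { "Shift", "Description", "Requirements", "Location", "Apply Info" };

    public static Dictionary<string, string> Parse(IEnumerable<string> lines)
    {
        var sections = new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);
        var current = ""; // empty until the first heading is seen

        foreach (var raw in lines)
        {
            var line = raw.Trim();

            // A heading must match the entire line, so "Night Shift, 10pm" is content.
            var heading = Headings.FirstOrDefault(
                h => string.Equals(h, line.TrimEnd(':'), StringComparison.OrdinalIgnoreCase));

            if (heading != null)
            {
                current = heading;
                if (!sections.ContainsKey(current))
                    sections[current] = new List<string>();
            }
            else if (current.Length > 0 && line.Length > 0)
            {
                sections[current].Add(line);
            }
        }

        // Join each section's lines back into a single string per field.
        return sections.ToDictionary(
            s => s.Key,
            s => string.Join(Environment.NewLine, s.Value),
            StringComparer.OrdinalIgnoreCase);
    }
}
```

    A content line that consists of nothing but a section name would still be misread as a heading, so this is worth covering in the unit tests.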

    Lewis

  • Lewis Smith 211 posts 620 karma points c-trib
    May 03, 2023 @ 19:07
    Lewis Smith
    1

    If I get some time tomorrow, I will throw something together and post it here.

    Lewis

  • Yaco Zaragoza 88 posts 362 karma points
    May 04, 2023 @ 15:05
    Yaco Zaragoza
    0

    Thank you Lewis, for the great information above.

    I am sure I can convert the Word file to a .txt file to make sure I do not run into issues with HTML/MS markup.

    Anything you put together will be greatly appreciated.

  • Lewis Smith 211 posts 620 karma points c-trib
    May 04, 2023 @ 15:34
    Lewis Smith
    0

    This is pseudocode, so you will need to edit it, and the format of your text file matters.

    But something like:

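    using System;
    using System.IO;

    var textFiles = Directory.GetFiles("path to all files");

    foreach (var file in textFiles)
    {
        // Read the whole file into one string
        var fileAsText = File.ReadAllText(file);

        // Find where the first section ends; skip the file if there is no marker
        var firstSectionEnd = fileAsText.IndexOf("--split here--", StringComparison.Ordinal);
        if (firstSectionEnd < 0) continue;

        var first = fileAsText.Substring(0, firstSectionEnd).Trim();

        // Do something with first, upload to Umbraco using Umbraco api
    }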
    

    The above assumes your .txt file looks something like this:

    section 1 content here --split here-- section 2 content here --split here--

    The variable first above would then contain 'section 1 content here'.

    You of course will need null checks etc.
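
    To go from the first section to all five, the same idea can be extended by splitting on the marker and checking the section count before mapping anything. This is only a sketch: it assumes the sections always appear in the order listed in the question, and MarkerSplitter, Split and the Fields array are hypothetical names. Saving the values into Umbraco is left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, named for this example only.
public static class MarkerSplitter
{
    public const string Marker = "--split here--";

    // Field names, in the order the sections are assumed to appear in the file.
    public static readonly string[] Fields =
        { "Shift", "Description", "Requirements", "Location", "Apply Info" };

    public static Dictionary<string, string> Split(string fileAsText)
    {
        if (string.IsNullOrWhiteSpace(fileAsText))
            throw new ArgumentException("The file is empty.", nameof(fileAsText));

        var parts = fileAsText.Split(new[] { Marker }, StringSplitOptions.None);

        // A trailing marker leaves one empty part at the end; ignore it.
        var count = parts.Length;
        if (parts[count - 1].Trim().Length == 0) count--;

        if (count != Fields.Length)
            throw new FormatException(
                $"Expected {Fields.Length} sections but found {count}.");

        // Pair each section with its field name, trimming surrounding whitespace.
        var result = new Dictionary<string, string>();
        for (var i = 0; i < Fields.Length; i++)
            result[Fields[i]] = parts[i].Trim();

        return result;
    }
}
```

    Failing loudly when the section count is wrong means a badly formatted file is rejected rather than saved with the content in the wrong fields.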

    Lewis

  • Yaco Zaragoza 88 posts 362 karma points
    May 04, 2023 @ 17:29
    Yaco Zaragoza
    0

    Where would I put this code? (I understand it is pseudo code and I still need to actually write it out)
