parsing a word or pdf document

Yaco Zaragoza 88 posts 363 karma points

May 03, 2023 @ 17:29
0

Parsing a Word or PDF Document
I am not sure whether this is possible, but I figured I would ask.

I have a Word document that I would like to parse and save the content into specific fields.

The Word document has five sections:
- Shift
- Description
- Requirements
- Location
- Apply Info
I would like that content to be mapped automatically to the corresponding fields when I upload the file, so that I do not have to copy and paste each section.

Lewis Smith 211 posts 620 karma points c-trib

May 03, 2023 @ 19:07

0

Hi Yaco,

This isn't an easy one, as searching through a Word document for specific areas or words is awkward.

I have had a look and found an approach that could work for you, Office interop from C#: https://learn.microsoft.com/en-us/dotnet/csharp/advanced-topics/interop/how-to-access-office-interop-objects

There are plenty of questions on Stack Overflow that should show you what is needed.

If this were me, I would try to ditch the Word format and move to something more consistent, such as Excel or a plain .txt file. However, I appreciate this might not be possible.

I have found this post on Stack Overflow; the example searches for a specific word in a .doc/.docx file: https://stackoverflow.com/questions/44699445/extract-words-from-a-doc-docx-file-c-sharp
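As an alternative to Office interop (not what the linked pages describe), a .docx file is a zip archive whose main text normally sits in word/document.xml, so the paragraphs can be read with the .NET base class library alone. This is a rough sketch only: `ReadParagraphs` is a made-up helper name, it assumes the standard part name word/document.xml, and it does not handle the older binary .doc format, tables, headers or footers.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: returns the text of each paragraph in a .docx stream.
static List<string> ReadParagraphs(Stream docx)
{
    // WordprocessingML namespace used by <w:p> (paragraph) and <w:t> (text run) elements.
    XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    using var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true);

    // Assumption: the main document part uses the conventional name.
    var entry = zip.GetEntry("word/document.xml")
        ?? throw new InvalidDataException("word/document.xml not found; is this a .docx file?");

    using var xml = entry.Open();
    var document = XDocument.Load(xml);

    // Join the text runs inside each paragraph into one string per paragraph.
    return document.Descendants(w + "p")
        .Select(p => string.Concat(p.Descendants(w + "t").Select(t => t.Value)))
        .ToList();
}

// Example call, with a placeholder file name:
// using var file = File.OpenRead("job-posting.docx");
// var paragraphs = ReadParagraphs(file);
```

Each entry in the returned list is one paragraph of plain text, which is usually enough to look for the section titles line by line.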

There are some issues, of course. What if your target section names (Shift, Description, etc.) also appear inside the content you want, as well as in the titles? Performance is also something to keep an eye on. Basically, if you go down this route, do a good amount of testing, perhaps with some unit tests as well.
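One way to reduce the risk of a section name matching ordinary content is to treat a line as a title only when the whole line is the section name. A minimal sketch, assuming the document has already been turned into a list of text lines; `ParseByHeadings` and the sample lines are invented for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: groups lines under the most recent heading line.
static Dictionary<string, string> ParseByHeadings(IEnumerable<string> lines, string[] headings)
{
    var sections = new Dictionary<string, List<string>>();
    var current = "";

    foreach (var raw in lines)
    {
        var line = raw.Trim();

        // A heading must be the entire line (an optional trailing colon is allowed).
        var heading = headings.FirstOrDefault(h =>
            string.Equals(h, line.TrimEnd(':'), StringComparison.OrdinalIgnoreCase));

        if (heading != null)
        {
            current = heading;
            if (!sections.ContainsKey(current)) sections[current] = new List<string>();
        }
        else if (current.Length > 0 && line.Length > 0)
        {
            // Any other non-empty line belongs to the section we are currently in.
            sections[current].Add(line);
        }
    }

    return sections.ToDictionary(s => s.Key, s => string.Join("\n", s.Value));
}

var sampleLines = new[]
{
    "Shift",
    "Day shift, Monday to Friday",
    "Description",
    "Covers the early shift when needed",
    "Location:",
    "Miami"
};
var sectionNames = new[] { "Shift", "Description", "Requirements", "Location", "Apply Info" };
var parsed = ParseByHeadings(sampleLines, sectionNames);
Console.WriteLine(parsed["Description"]); // prints "Covers the early shift when needed"
```

The word "shift" inside the description stays in the content because it is not alone on its line. Sections that never appear (Requirements and Apply Info in this sample) are simply absent from the result, so check for them before reading.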

Lewis

Lewis Smith 211 posts 620 karma points c-trib

May 03, 2023 @ 19:07

1

If I get some time tomorrow, I will throw something together and post it here.

Lewis

Yaco Zaragoza 88 posts 363 karma points

May 04, 2023 @ 15:05

0

Thank you, Lewis, for the great information above.

I am sure I can convert the Word file to a .txt file to make sure I do not run into issues with HTML/MS markup.

Anything you put together will be greatly appreciated.

Lewis Smith 211 posts 620 karma points c-trib

May 04, 2023 @ 15:34
0
This is pseudocode, so you will need to edit it, and the format of your text file matters.

But something like:
```
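using System;
using System.IO;

// Placeholder path: point this at the folder that holds the converted .txt files.
var textFiles = Directory.GetFiles("path to all files");

foreach (var file in textFiles)
{
    // GetFiles returns paths, so read the file contents explicitly.
    var fileAsText = File.ReadAllText(file);
    var firstSectionEnd = fileAsText.IndexOf("--split here--", StringComparison.Ordinal);
    if (firstSectionEnd < 0) continue; // delimiter not found, skip this file

    var first = fileAsText.Substring(0, firstSectionEnd).Trim();

    // Do something with first, e.g. upload it to Umbraco using the Umbraco API
}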
```
The above assumes your .txt file looks something like this:

section 1 content here --split here-- section 2 content here --split here--

The variable first above would then hold 'section 1 content here' (trim it to drop the trailing space).

You will of course need null checks, a check for a missing delimiter, and so on.
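To get every section rather than just the first, the same idea can be extended by splitting on the delimiter and pairing the chunks with field names in order. This is only a sketch: `MapSections` is a made-up helper, the field names are taken from the five sections in the question, and it assumes the sections always appear in that order in the file.

```csharp
using System;
using System.Collections.Generic;

// Assumed field names, in the order the sections appear in the file.
string[] fieldNames = { "Shift", "Description", "Requirements", "Location", "Apply Info" };

// Hypothetical helper: splits the text on the delimiter and pairs each chunk with a field name.
static Dictionary<string, string> MapSections(string text, string[] names, string delimiter)
{
    var parts = text.Split(new[] { delimiter }, StringSplitOptions.None);
    var fields = new Dictionary<string, string>();

    for (var i = 0; i < names.Length; i++)
    {
        // A missing section becomes an empty string instead of throwing.
        fields[names[i]] = i < parts.Length ? parts[i].Trim() : string.Empty;
    }

    return fields;
}

var sample = "Nights --split here-- Warehouse operative --split here-- Forklift licence";
var mapped = MapSections(sample, fieldNames, "--split here--");
Console.WriteLine(mapped["Description"]); // prints "Warehouse operative"
```

Each value in the dictionary can then be written to the matching property on the content item. Sections that are missing from the file come back as empty strings, so you can decide whether to reject the upload or leave the field blank.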

Lewis

Yaco Zaragoza 88 posts 363 karma points

May 04, 2023 @ 17:29

0

Where would I put this code? (I understand it is pseudo code and I still need to actually write it out)
