This is pseudo code, so you will need to edit it, and the format of your text file matters.
But something like:
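// Requires: using System; using System.IO;
var textFiles = Directory.GetFiles("path to all files");
foreach (var file in textFiles)
{
    // GetFiles returns file paths, so read each file's contents explicitly
    var fileAsText = File.ReadAllText(file);
    var firstSectionEnd = fileAsText.IndexOf("--split here--", StringComparison.Ordinal);
    if (firstSectionEnd < 0)
    {
        continue; // marker not found, skip this file
    }
    var first = fileAsText.Substring(0, firstSectionEnd).Trim();
    // Do something with first, e.g. upload to Umbraco using the Umbraco API
}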
The above assumes your .txt file looks something like this:
section 1 content here
--split here--
section 2 content here
--split here--
The variable first above would then contain 'section 1 content here'.
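To pull out every section rather than only the first, one option is to split the whole text on the marker. This is a minimal sketch, assuming the same '--split here--' marker as in the sample file; the SectionSplitter class and SplitSections method are hypothetical names used for illustration and are not part of Umbraco or any library.

```csharp
using System;
using System.Linq;

public static class SectionSplitter
{
    // Assumed marker, matching the sample .txt layout
    private const string Marker = "--split here--";

    // Returns the trimmed text of each non-empty section, in file order
    public static string[] SplitSections(string fileAsText)
    {
        if (string.IsNullOrWhiteSpace(fileAsText))
        {
            return Array.Empty<string>();
        }

        return fileAsText
            .Split(new[] { Marker }, StringSplitOptions.None)
            .Select(section => section.Trim())
            .Where(section => section.Length > 0)
            .ToArray();
    }
}
```

For example, SplitSections("section 1 content here\n--split here--\nsection 2 content here") returns two entries, 'section 1 content here' and 'section 2 content here', and each entry could then be written to its own field.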
Parsing a Word or PDF Document
Not sure if this is possible or not but figured I would ask.
I have a Word document that I would like to parse and save the content into specific fields.
The Word document has 5 sections, and I would like that content to be mapped automatically to the respective fields when I upload the file (so I do not have to copy and paste each one).
Hi Yaco,
Now this isn’t an easy one, as searching through a Word doc for specific areas or words isn’t straightforward.
I have had a look and have found a package that could work for you, Office interop: https://learn.microsoft.com/en-us/dotnet/csharp/advanced-topics/interop/how-to-access-office-interop-objects
There are plenty of questions on Stack Overflow which should show you what is needed.
If this were me, I would try to ditch the Word format and move to something more consistent, such as Excel or a plain .txt file. However, I appreciate this might not be possible.
I have found this post on Stack Overflow; the example searches for a specific word in a .doc/.docx file: https://stackoverflow.com/questions/44699445/extract-words-from-a-doc-docx-file-c-sharp
There are some issues, of course. What if your target section names (shift, description etc.) appear in the content you want as well as in the titles? Performance is also certainly something to keep an eye on. Basically, if you go down this route, do a good amount of testing, perhaps with some unit tests as well.
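One way to reduce the risk of a section name matching inside the body text is to treat a line as a title only when the whole line equals a known heading. This is a minimal sketch, assuming the document has already been converted to plain text with each title on its own line; the HeadingParser class, its Parse method and the heading names passed in are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class HeadingParser
{
    // Maps each known heading to the text that follows it.
    // Text before the first heading is ignored.
    public static Dictionary<string, string> Parse(string text, IEnumerable<string> headings)
    {
        var known = new HashSet<string>(headings, StringComparer.OrdinalIgnoreCase);
        var sections = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        string current = null;
        var buffer = new List<string>();

        foreach (var rawLine in (text ?? string.Empty).Split('\n'))
        {
            var line = rawLine.Trim();
            if (known.Contains(line))
            {
                // A whole-line match starts a new section
                if (current != null) sections[current] = string.Join("\n", buffer).Trim();
                current = line;
                buffer.Clear();
            }
            else if (current != null)
            {
                // Heading words that appear mid-sentence stay in the body text
                buffer.Add(line);
            }
        }

        if (current != null) sections[current] = string.Join("\n", buffer).Trim();
        return sections;
    }
}
```

For example, Parse("Shift\nNight shift cover\nDescription\nThe shift starts at 10pm", new[] { "Shift", "Description" }) maps Shift to 'Night shift cover' and Description to 'The shift starts at 10pm'; the word 'shift' inside the description does not start a new section. A case like this would also make a reasonable first unit test.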
Lewis
If I get some time tomorrow, I will throw something together and post it here.
Lewis
Thank you Lewis, for the great information above.
I am sure I can convert the Word file to a .txt file so that I do not run into issues with HTML/MS markup.
Anything you put together will be greatly appreciated.
This is pseudo code, so you will need to edit it, and the format of your text file matters.
But something like:
The above assumes your .txt file looks something like this:
section 1 content here
--split here--
section 2 content here
--split here--
The variable first above would get you 'section 1 content here'.
You will of course need null checks etc.
Lewis
Where would I put this code? (I understand it is pseudo code and I still need to actually write it out)