How to access a media file on Azure Blob Storage directly? (Not via http URL)
Hello friends,
I have some code which runs on a media file "Save" event that extracts the content from a PDF and stores it in a text field on the Media item for use in searching, etc. This is using TikaOnDotNet.TextExtraction.TextExtractor (see: https://github.com/KevM/tikaondotnet#usage)
When running locally, this code works great - it can access the PDF file via the /Media/ folder and read it.
The site is hosted on Umbraco Cloud, which uses Azure Blob Storage for media. If the file cannot be located in the Media folder (which is generally empty on Cloud sites), the code falls back to the URL of the media item and grabs it via new WebClient().DownloadData(uri). If the media file is added/saved on the Live environment, this works just fine, since the URI is publicly accessible. However, if it is added on a Development or Staging environment, it fails because those environments are protected via Basic Auth.
Can anyone recommend a way to read a media file from Azure Blob Storage on those protected environments?
Thanks to help from Nik Rimington and Anders Bjerner, I was able to find a solution utilizing Umbraco's IMediaFileSystem.
Stripped-down example using Dependency Injection:
using Umbraco.Core.IO;

private readonly IMediaFileSystem _mediaFileSystem;
...
// Open a stream for reading the file contents
using (var fs = _mediaFileSystem.OpenFile(mediaUmbracoFile))
{
    if (fs != null)
    {
        var fileTextContent = mediaParser.ParseMediaText(fs,
            out extractedMetaFromTika);
        ...
    }
    else
    {
        _iLogger.Error(typeof(RegisterEventsComponent),
            new Exception($"Unable to open PDF file {fileInfo.FullName}"),
            "Unable to Open PDF file");
    }
}
...
public string ParseMediaText(Stream SourceStream, out Dictionary<string, string> MetaData)
{
    var sb = new StringBuilder();
    var metaData = new Dictionary<string, string>();
    var textExtractor = new TextExtractor();
    try
    {
        // Buffer the stream in memory, since Extract() is given a byte array here
        using (var memoryStream = new MemoryStream())
        {
            SourceStream.CopyTo(memoryStream);
            var streamBytes = memoryStream.ToArray();
            var textExtractionResult = textExtractor.Extract(streamBytes);
            sb.Append(textExtractionResult.Text);
            metaData = (Dictionary<string, string>)textExtractionResult.Metadata;
        }
    }
    catch (Exception ex)
    {
        // Wrap the original exception so the caller can see where it came from
        var msg = "MediaParserService.ParseMediaText: Could not read media item provided by stream";
        throw new Exception(msg, ex);
    }
    MetaData = metaData;
    return sb.ToString();
}
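For anyone wondering where _mediaFileSystem comes from in the stripped-down example above, here is a rough sketch of how the wiring could look in Umbraco 8. The composer name, the handler name, the choice of the Saving event and the ".pdf" filter are my assumptions for illustration, not the exact code from the site. It also assumes the umbracoFile property holds a plain path string (the default upload field), not cropper JSON.

```csharp
using System;
using System.IO;
using Umbraco.Core;
using Umbraco.Core.Composing;
using Umbraco.Core.Events;
using Umbraco.Core.IO;
using Umbraco.Core.Models;
using Umbraco.Core.Services;
using Umbraco.Core.Services.Implement;

// Hypothetical composer name; it registers the component at startup.
public class RegisterEventsComposer : ComponentComposer<RegisterEventsComponent>, IUserComposer
{
}

public class RegisterEventsComponent : IComponent
{
    private readonly IMediaFileSystem _mediaFileSystem;

    // Umbraco's container supplies IMediaFileSystem. Locally it is backed by
    // the /Media/ folder, on Umbraco Cloud by Azure Blob Storage.
    public RegisterEventsComponent(IMediaFileSystem mediaFileSystem)
    {
        _mediaFileSystem = mediaFileSystem;
    }

    public void Initialize() => MediaService.Saving += OnMediaSaving;

    public void Terminate() => MediaService.Saving -= OnMediaSaving;

    private void OnMediaSaving(IMediaService sender, SaveEventArgs<IMedia> e)
    {
        foreach (var media in e.SavedEntities)
        {
            // Assumption: the property stores a plain path such as /media/.../file.pdf
            var mediaUmbracoFile = media.GetValue<string>(Constants.Conventions.Media.File);
            if (string.IsNullOrWhiteSpace(mediaUmbracoFile) ||
                !mediaUmbracoFile.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
            {
                continue;
            }

            // Reads through the file system abstraction, so no HTTP request
            // (and no Basic Auth prompt) is involved.
            using (Stream fs = _mediaFileSystem.OpenFile(mediaUmbracoFile))
            {
                if (fs == null)
                {
                    continue;
                }

                // Hand the stream to the text extraction step here,
                // e.g. ParseMediaText(fs, out var metaData).
            }
        }
    }
}
```

Because the read goes through the file system abstraction, the same code path should work locally and on the protected Cloud environments.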
Additionally, for a v9+ implementation, a good reference is https://github.com/umbraco/UmbracoExamine.PDF/blob/v11/dev/src/UmbracoExamine.PDF/PdfPigTextExtractor.cs
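As far as I can tell, IMediaFileSystem is no longer injected directly in v9+, and the usual route is MediaFileManager with its FileSystem property. Below is a minimal sketch of what that could look like in Umbraco 11. The class names, the choice of MediaSavingNotification and the ".pdf" filter are my assumptions, and I have not run this against a Cloud site, so treat it as a starting point only.

```csharp
using System;
using System.IO;
using Umbraco.Cms.Core;
using Umbraco.Cms.Core.Composing;
using Umbraco.Cms.Core.DependencyInjection;
using Umbraco.Cms.Core.Events;
using Umbraco.Cms.Core.IO;
using Umbraco.Cms.Core.Models;
using Umbraco.Cms.Core.Notifications;

// Hypothetical composer name; it registers the notification handler.
public class PdfMediaComposer : IComposer
{
    public void Compose(IUmbracoBuilder builder)
        => builder.AddNotificationHandler<MediaSavingNotification, PdfMediaSavingHandler>();
}

// Hypothetical handler name.
public class PdfMediaSavingHandler : INotificationHandler<MediaSavingNotification>
{
    private readonly MediaFileManager _mediaFileManager;

    public PdfMediaSavingHandler(MediaFileManager mediaFileManager)
    {
        _mediaFileManager = mediaFileManager;
    }

    public void Handle(MediaSavingNotification notification)
    {
        foreach (IMedia media in notification.SavedEntities)
        {
            // Assumption: the property stores a plain path string, which is
            // the case for the default upload field on File media.
            var path = media.GetValue<string>(Constants.Conventions.Media.File);
            if (string.IsNullOrWhiteSpace(path) ||
                !path.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
            {
                continue;
            }

            // FileSystem is whichever provider is configured for media,
            // e.g. Azure Blob Storage on Umbraco Cloud.
            using (Stream stream = _mediaFileManager.FileSystem.OpenFile(path))
            {
                if (stream == null)
                {
                    continue;
                }

                // Pass the stream to the PDF text extractor of your choice here.
            }
        }
    }
}
```

Note that this gives you a stream, not a physical path. With blob storage there is no local full path to the file, so working with the stream is generally the way to go.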
Hi,
private readonly IMediaFileSystem _mediaFileSystem is part of Umbraco v8, right? How can we get the same in Umbraco 11?
Basically, I need to get the full path of a media file that is stored in Azure Blob Storage.
Thanks,