Getting a list of all PDFs published on the website
I've been asked if I can come up with a list of all the PDFs published on the website. I can't see any obvious way of doing this. We are on Umbraco 6.1.6.
Can anyone help me out?
Many thanks in advance!
Hi Graeme
Umbraco doesn't have the concept of 'published/unpublished' for media: if a file is uploaded into the Media section of Umbraco then it is available ('published') on the site!
Media Pickers will store the ids of any image/file picked from the Media section, and inside the Rich Text Area the path to the image/file is stored, e.g. /media/3434/doc.pdf
So the best way to search for them in Umbraco's database or in the Examine indexes depends a little on how the PDFs are linked to in the site.
But if you're only concerned with which PDFs are 'linked to' and discoverable by visitors to the site, then a better strategy might be to use a link checker, e.g. Xenu Link Sleuth:
https://xenus-link-sleuth.en.softonic.com/
This will crawl your site and find all links (including links to the PDFs in the Media section). You can then export the results to a spreadsheet and filter away:
https://moz.com/blog/xenu-link-sleuth-more-than-just-a-broken-links-finder
regards
Marc
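For the Rich Text case Marc mentions, where the path itself (e.g. /media/3434/doc.pdf) is stored in the markup, here is a rough Python sketch of pulling the PDF paths out of a chunk of HTML. It assumes you have already got the markup out as plain strings (how you do that is not covered here), and the function name `find_pdf_paths` is made up for illustration. It won't help with Media Pickers, since those store ids rather than paths.

```python
import re

# Matches href/src attributes pointing at a .pdf under /media/ (case-insensitive).
# Assumption: links use the /media/... form shown in the thread.
PDF_LINK = re.compile(
    r'(?:href|src)\s*=\s*["\'](/media/[^"\']+?\.pdf)["\']',
    re.IGNORECASE,
)


def find_pdf_paths(html):
    """Return the distinct /media/...pdf paths found in html, in order of appearance."""
    found = []
    for match in PDF_LINK.finditer(html):
        path = match.group(1)
        if path not in found:
            found.append(path)
    return found


if __name__ == "__main__":
    sample = '<p><a href="/media/3434/doc.pdf">Doc</a> and <a href="/about">About</a></p>'
    print(find_pdf_paths(sample))
```

A regex is good enough for a one-off audit like this; for anything more robust an HTML parser would be the safer choice.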
Hi Graeme,
Following on from Marc's suggestions.
If you want a quick-and-dirty approach, and if you have access to the web-server itself and the PDFs are in the /Media folder, then try a DOS command (from the root of the site):
dir Media\*.pdf /s /b
Of course, this will give you the paths for the files, not the URLs ... but nothing a quick search-and-replace won't fix.
Hope this helps.
Cheers,
- Lee
Thanks for the replies
Marc - Unfortunately I don't think the link crawler would work, as some of the PDF links are hidden until logged in. You mentioned searching in the database tables or Examine indexes - can you point me in the right direction of how to do that? I've been looking through the tables but haven't managed to work out
Lee - I've used the command line to get the "quick and dirty" list of PDFs. Can you tell me what you meant by the find and replace? Is there a way to translate the paths into URLs?
Thanks again for your help
By that, I meant copy the text from the DOS command, and paste it into a text editor... then string replace the base/root paths (e.g. C:\path\to\umbraco\site) with the domain name (e.g. https://example.com) ... and replace any back-slashes with forward-slashes.
If this is a one-off job, then it's quick-n-dirty, but if you need to repeat it then it'll probably become a painful task.
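If the job does need repeating, the same search-and-replace could be scripted. Below is a minimal Python sketch, not a definitive tool: the function name `path_to_url` is hypothetical, and the site root and domain are just the placeholder values from the post above. It assumes each input line is one full path as printed by the dir command.

```python
from pathlib import PureWindowsPath
from urllib.parse import quote


def path_to_url(file_path, site_root, base_url):
    """Map a Windows file path under site_root to a URL under base_url."""
    # Strip the site root, leaving e.g. Media\3434\doc.pdf
    relative = PureWindowsPath(file_path).relative_to(PureWindowsPath(site_root))
    # as_posix() swaps back-slashes for forward-slashes; quote() escapes spaces etc.
    return base_url.rstrip("/") + "/" + quote(relative.as_posix())


if __name__ == "__main__":
    # Placeholder values taken from the example in the thread.
    site_root = r"C:\path\to\umbraco\site"
    base_url = "https://example.com"
    listing = [r"C:\path\to\umbraco\site\Media\3434\doc.pdf"]
    for line in listing:
        print(path_to_url(line, site_root, base_url))
        # prints https://example.com/Media/3434/doc.pdf
```

In practice you would read the lines from a text file holding the saved dir output instead of the hard-coded `listing`.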
Hi Graeme
Ahh I see, there are two things here: one is finding all PDFs that have been uploaded to the site (and are technically published, since as soon as a file is uploaded it's available on that URL), and the other is finding all PDFs that have been 'linked to' and are discoverable to the outside world.
E.g. if a PDF has been uploaded to the backoffice and never linked to, are you interested in that?
What I mean by Examine (and I'm hoping this is true as far back as v6.1.6 :-)) is that the InternalIndex contains a reference to all uploaded Media, and there is a property called umbracoExtension in the index. So if you go via the Developer Examine Dashboard, and use the InternalSearcher to do a 'Lucene search' with the following text:
umbracoExtension: pdf
it should return all the PDFs uploaded to Umbraco...
(but that won't tell you if they've been linked to...)
regards
Marc
Thanks for the responses Marc and Lee. I supplied both the link crawler results and the DOS command results in the end, and the user seems happy.