Getting a list of all PDFs published on the website


Graeme W 113 posts 289 karma points

Oct 22, 2020 @ 13:25

0


I've been asked if I can come up with a list of all the PDFs published on the website. I can't see any obvious way of doing this. We are on Umbraco 6.1.6.

Can anyone help me out?

Many thanks in advance!

Marc Goodson 2157 posts 14435 karma points MVP 9x c-trib

Oct 23, 2020 @ 06:15

0

Hi Graeme

Umbraco doesn't have the concept of 'published/unpublished' for media. If a file is uploaded into the Media section of Umbraco, then it is available ('published') on the site!

Media Pickers store the IDs of any image/file picked from the Media section, and inside the Rich Text Area the path to the image/file is stored, e.g. /media/3434/doc.pdf

So the best way to search for them in Umbraco's database or in the Examine indexes depends a little on how the PDFs are linked to in the site.
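As a rough illustration of the Rich Text case, here is a minimal Python sketch that pulls /media/... PDF paths out of a chunk of HTML. It assumes you have already got the rich text markup out of Umbraco somehow (how you export it is not covered here), and the function name and sample HTML are made up for the example.

```python
import re

# Matches href values such as /media/3434/doc.pdf (case-insensitive).
# Assumption: links are root-relative and carry no query string.
MEDIA_PDF_LINK = re.compile(r'href=["\'](/media/[^"\']+?\.pdf)["\']', re.IGNORECASE)


def find_media_pdf_links(html):
    """Return the distinct /media/... PDF paths linked from an HTML string."""
    return sorted(set(MEDIA_PDF_LINK.findall(html)))


# Hypothetical sample of rich text content.
sample = '<p><a href="/media/3434/doc.pdf">Report</a> <img src="/media/12/a.jpg"></p>'
print(find_media_pdf_links(sample))  # ['/media/3434/doc.pdf']
```

This only catches PDFs linked inside rich text; anything chosen through a Media Picker is stored as an ID and would not show up here.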

But if you're only concerned with which PDFs are 'linked to' and discoverable by visitors to the site, then a better strategy might be to use a link checker, e.g. Xenu Link Sleuth:

https://xenus-link-sleuth.en.softonic.com/

This will crawl your site and find all links (including links to the PDFs in the Media section). You can then export the results to a spreadsheet and filter away:

https://moz.com/blog/xenu-link-sleuth-more-than-just-a-broken-links-finder

regards

Marc

Lee Kelleher 4026 posts 15837 karma points MVP 13x admin c-trib

Oct 23, 2020 @ 10:24
1
Hi Graeme,

Following on from Marc's suggestions.

If you want a quick-and-dirty approach - and if you have access to the web-server itself, and the PDFs are in the /Media folder, then try a DOS command (from the root of the site) ...
```
dir Media\*.pdf /s /b
```
Of course, this will give you the paths for the files, not the URLs ... but nothing a quick search-and-replace won't fix.

Hope this helps.

Cheers,
- Lee
Graeme W 113 posts 289 karma points

Oct 26, 2020 @ 09:46

0

Thanks for the replies

Marc - Unfortunately I don't think the link crawler would work, as some of the PDF links are hidden until logged in. You mentioned searching in the database tables or Examine indexes - can you point me in the right direction of how to do that? I've been looking through the tables but haven't managed to work it out.

Lee - I've used the command line to get the "quick and dirty" list of PDFs. Can you tell me what you meant by the find and replace? Is there a way to translate the paths into URLs?

Thanks again for your help

Lee Kelleher 4026 posts 15837 karma points MVP 13x admin c-trib

Oct 26, 2020 @ 11:33

0

> Lee - I've used the command line to get the "quick and dirty" list of pdfs Can you tell me what you meant by the find and replace? Is there a way to translate the paths into urls?

By that, I meant copy the output from the DOS command and paste it into a text editor... then string replace the base/root path (e.g. C:\path\to\umbraco\site) with the domain name (e.g. https://example.com) ... and replace any back-slashes with forward-slashes.

If this is a one-off job, then it's quick-n-dirty, but if you need to repeat it then it'll probably become a painful task.
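If it does need repeating, the same search-and-replace could be scripted. Below is a minimal Python sketch, not a tested tool: it assumes the `dir` output has been saved or pasted as a list of absolute Windows paths, and the function name `paths_to_urls` is just an illustration. The root path and domain are the same placeholder values used above.

```python
from pathlib import PureWindowsPath
from urllib.parse import quote


def paths_to_urls(paths, site_root, base_url):
    """Turn absolute Windows file paths under site_root into URLs under base_url."""
    root = PureWindowsPath(site_root)
    urls = []
    for raw in paths:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines in the pasted output
        # relative_to raises ValueError if the path is not under site_root
        relative = PureWindowsPath(raw).relative_to(root)
        # as_posix() swaps back-slashes for forward-slashes;
        # quote() percent-encodes spaces and similar characters
        urls.append(base_url.rstrip("/") + "/" + quote(relative.as_posix()))
    return urls


# Placeholder values for illustration only.
listing = [r"C:\path\to\umbraco\site\Media\3434\doc.pdf"]
print(paths_to_urls(listing, r"C:\path\to\umbraco\site", "https://example.com"))
# ['https://example.com/Media/3434/doc.pdf']
```

Using `PureWindowsPath` means the script can run on any machine, not just the web server, since it only manipulates the path strings and never touches the file system.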

Marc Goodson 2157 posts 14435 karma points MVP 9x c-trib

Oct 26, 2020 @ 11:14
0
Hi Graeme

Ahh I see, there are two things here: one is finding all PDFs that have been uploaded to the site (which are technically published, since as soon as a file is uploaded it's available at that URL), and the other is finding all PDFs that have been 'linked to' and are discoverable to the outside world.

eg if a PDF has been uploaded to the backoffice and never linked to - are you interested in that?

What I mean by Examine (and I'm hoping this is true as far back as v6.1.6 :-)) is that the InternalIndex contains a reference to all uploaded Media, and there is a property called umbracoExtension in the index. So if you go via the Developer Examine Dashboard and use the InternalSearcher to do a 'Lucene search' with the following text:
```
umbracoExtension: pdf
```
That should return all the PDFs uploaded to Umbraco...

(but that won't tell you if they've been linked to...)
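One way to approximate the 'uploaded but never linked' set is to compare the list of uploaded PDF URLs against the URLs a link crawler found. A minimal Python sketch, assuming you already have both lists as plain URLs; the function names and sample URLs are hypothetical, and comparing on the lower-cased path assumes the site treats URLs case-insensitively:

```python
from urllib.parse import urlsplit


def normalise(url):
    """Reduce a URL to its lower-cased path so the two lists can be compared."""
    # Assumption: host, query string and letter case don't distinguish files.
    return urlsplit(url.strip()).path.lower()


def unlinked_pdfs(uploaded_urls, crawled_urls):
    """Return uploaded URLs whose path never appears in the crawler results."""
    linked = {normalise(u) for u in crawled_urls}
    return [u for u in uploaded_urls if normalise(u) not in linked]


# Hypothetical sample data.
uploaded = [
    "https://example.com/Media/3434/doc.pdf",
    "https://example.com/Media/3500/old.pdf",
]
crawled = ["https://example.com/media/3434/doc.pdf"]
print(unlinked_pdfs(uploaded, crawled))
# ['https://example.com/Media/3500/old.pdf']
```

Bear in mind that a crawler only sees what an anonymous visitor can reach, so PDFs linked only from pages behind a login would show up as 'unlinked' here.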

regards

Marc
Graeme W 113 posts 289 karma points

Nov 06, 2020 @ 15:17

0

Thanks for the responses, Marc and Lee. I supplied both the link crawler results and the DOS command results in the end, and the user seems happy.
