
  • Graeme W 113 posts 289 karma points
    Oct 22, 2020 @ 13:25
    Graeme W
    0

    Getting a list of all PDFs published on the website

    I've been asked if I can come up with a list of all the PDFs published on the website. I can't see any obvious way of doing this. We are on Umbraco 6.1.6.

    Can anyone help me out?

    Many thanks in advance!

  • Marc Goodson 2141 posts 14344 karma points MVP 8x c-trib
    Oct 23, 2020 @ 06:15
    Marc Goodson
    0

    Hi Graeme

    Umbraco doesn't have the concept of 'published/unpublished' for media. If a file is uploaded into the Media section of Umbraco, then it is available ('published') on the site!

    Media Pickers store the ids of any image/file picked from the Media section, while inside the Rich Text Area the path to the image/file is stored, e.g. /media/3434/doc.pdf

    So the best way to search for them in Umbraco's database or in the Examine indexes depends a little on how the PDFs are linked to in the site.
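
    For the Rich Text case, here is a minimal Python sketch of pulling PDF paths out of stored HTML. It assumes you have already got the HTML out of Umbraco as a string by some other means (not shown), and that the paths look like the /media/3434/doc.pdf example above. The names MEDIA_PDF and find_pdf_paths are made up for illustration and are not part of Umbraco.

```python
import re

# Assumption: links follow the shape of the example path, /media/<folder>/<name>.pdf
MEDIA_PDF = re.compile(r'/media/[^"\'\s<>]+?\.pdf\b', re.IGNORECASE)

def find_pdf_paths(html):
    """Return the unique media PDF paths found in a chunk of HTML, in first-seen order."""
    seen = []
    for path in MEDIA_PDF.findall(html):
        if path not in seen:
            seen.append(path)
    return seen

# Hypothetical rich text content: one PDF link and one image
sample = '<p><a href="/media/3434/doc.pdf">Doc</a> <img src="/media/12/pic.jpg" /></p>'
print(find_pdf_paths(sample))
```

    This only catches paths written directly into the HTML. PDFs chosen through a Media Picker are stored as ids, so they would not show up this way.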

    But if you're only concerned with PDFs that are 'linked to' and discoverable by visitors to the site, then a better strategy might be to use a link checker, e.g. Xenu Link Sleuth:

    https://xenus-link-sleuth.en.softonic.com/

    This will crawl your site and find all links (including links to the PDFs in the Media section). You can then export the results to a spreadsheet, and filter away:

    https://moz.com/blog/xenu-link-sleuth-more-than-just-a-broken-links-finder

    regards

    Marc

  • Lee Kelleher 4020 posts 15802 karma points MVP 13x admin c-trib
    Oct 23, 2020 @ 10:24
    Lee Kelleher
    1

    Hi Graeme,

    Following on from Marc's suggestions.

    If you want a quick-and-dirty approach - and if you have access to the web-server itself, and the PDFs are in the /Media folder, then try a DOS command (from the root of the site) ...

    dir Media\*.pdf /s /b
    

    Of course, this will give you the paths for the files, not the URLs ... but nothing a quick search-and-replace won't fix.

    Hope this helps.

    Cheers,
    - Lee

  • Graeme W 113 posts 289 karma points
    Oct 26, 2020 @ 09:46
    Graeme W
    0

    Thanks for the replies

    Marc - Unfortunately I don't think the link crawler would work, as some of the PDF links are hidden until you log in. You mentioned searching in the database tables or Examine indexes - can you point me in the right direction of how to do that? I've been looking through the tables but haven't managed to work it out.

    Lee - I've used the command line to get the "quick and dirty" list of PDFs. Can you tell me what you meant by the find and replace? Is there a way to translate the paths into URLs?

    Thanks again for your help

  • Lee Kelleher 4020 posts 15802 karma points MVP 13x admin c-trib
    Oct 26, 2020 @ 11:33
    Lee Kelleher
    0

    Lee - I've used the command line to get the "quick and dirty" list of PDFs. Can you tell me what you meant by the find and replace? Is there a way to translate the paths into URLs?

    By that, I meant copy the output from the DOS command and paste it into a text editor... then string-replace the base/root path (e.g. C:\path\to\umbraco\site) with the domain name (e.g. https://example.com) ... and replace any back-slashes with forward-slashes.

    If this is a one-off job, then it's quick-n-dirty, but if you need to repeat it then it'll probably become a painful task.
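
    If it does need repeating, the same search-and-replace could be scripted. A rough Python sketch, assuming the example site root and domain from above; path_to_url is just a name made up for this example:

```python
from pathlib import PureWindowsPath
from urllib.parse import quote

def path_to_url(file_path, site_root, base_url):
    """Swap the site root for the domain and flip back-slashes to forward-slashes."""
    # relative_to raises ValueError if file_path is not under site_root
    relative = PureWindowsPath(file_path).relative_to(PureWindowsPath(site_root))
    # quote() percent-encodes spaces and similar characters, leaving "/" alone
    return base_url.rstrip("/") + "/" + quote(relative.as_posix())

# Example values taken from the post above; swap in your own root and domain
print(path_to_url(r"C:\path\to\umbraco\site\Media\3434\doc.pdf",
                  r"C:\path\to\umbraco\site",
                  "https://example.com"))
```

    Feed it each line of the dir output. The URL keeps whatever casing the folder has on disk (Media rather than media), which normally doesn't matter on IIS.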

  • Marc Goodson 2141 posts 14344 karma points MVP 8x c-trib
    Oct 26, 2020 @ 11:14
    Marc Goodson
    0

    Hi Graeme

    Ahh I see, there are two things here: one is finding all PDFs that have been uploaded to the site (and are technically published, since as soon as a file is uploaded it's available on that URL), and the other is finding all PDFs that have been 'linked to' and are discoverable to the outside world.

    E.g. if a PDF has been uploaded to the backoffice and never linked to, are you interested in that?

    What I mean by Examine (and I'm hoping this is true as far back as v6.1.6 :-)) is that the InternalIndex contains a reference to all uploaded Media, and there is a property called umbracoExtension in the index. So if you go via the Developer Examine Dashboard and use the InternalSearcher to do a 'Lucene search' with the following text:

    umbracoExtension: pdf
    

    That should return all the PDFs uploaded to Umbraco...

    (but that won't tell you if they've been linked to...)

    regards

    Marc

  • Graeme W 113 posts 289 karma points
    Nov 06, 2020 @ 15:17
    Graeme W
    0

    Thanks for the responses, Marc and Lee. I supplied both the link crawler results and the DOS command results in the end, and the user seems happy.
