Media Published Status
This is really just a case of throwing an idea out for discussion at the moment but here is the scenario...
I have a client who was recently contacted by a journalist about an out-of-date child protection policy (a file) on an Umbraco site. The file was in the media section and wasn't linked from any documents, but it had already been indexed by search engines, so the client's response was "That’s quite worrying that a file in the media library has public access!". Ordinarily this is not a problem for most site owners, but in this case it has caused one.
I am looking to the community for ideas on how to address this in order to satisfy the client's requirement.
Anything I can think of would most likely introduce a performance overhead that would be unnecessary in 99% of cases. Any ideas? My initial thought is to ensure that any request for content in the Media section has to come via the current host's IP address.
Thanks, Simon
Hi Simon,
Couldn't you add a rule to robots.txt to disallow the media folder, so that its files are not indexable by search bots?
I know this would not resolve the problem immediately, but it would be a start towards keeping new media items out of the index.
Also a snippet from Google: "If you own the site, you'll need to make the changes to your website yourself and then request removal of the problematic page from Google's search results using the URL removal tool in Webmaster Tools"
http://www.google.com/support/webmasters/bin/answer.py?answer=164734
Thanks
Tom
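To make the robots.txt suggestion concrete, here is a minimal sketch in Python that uses the standard library's robots.txt parser to check what such a rule would and would not block. The rule text, the example.com URLs and the helper name is_crawlable are assumptions for illustration, not taken from the client's site. It only models well-behaved crawlers; robots.txt is advisory and does nothing to stop a direct request for the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: block every crawler from the /media/ folder.
ROBOTS_TXT = """\
User-agent: *
Disallow: /media/
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if a crawler that honours robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative URLs only; example.com stands in for the real site.
media_blocked = not is_crawlable(ROBOTS_TXT, "https://example.com/media/1234/policy.pdf")
page_allowed = is_crawlable(ROBOTS_TXT, "https://example.com/about-us/")
```

Pages outside /media/ stay crawlable, so the rule would not affect normal content indexing.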
Hi Simon,
The only way a search engine has indexed the media item is because someone has linked to it directly - making it crawlable.
As you know, the URL structure of the media folders isn't easy to "guess" (because they use the property ids from the database) - so Joe Bloggs isn't going to work it out.
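Quick wins...
Add "/media" to your robots.txt (for those that honour it)
Restrict access to the /media folder at server level (IIS) - as you say, via the host's IP address.
Move the /media folder outside the web root (somehow) and use a proxy script to handle the requests?
There are probably other ways...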
Cheers, Lee.
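One option raised in this thread is to move the /media folder outside the web root and serve files through a proxy script. The core of such a script is mapping a requested URL to a file under the new storage folder without letting the request escape that folder. The sketch below is in Python purely for illustration (an Umbraco site would do this in .NET); the function name resolve_media_path, the "/media/" prefix and the demo file names are all assumptions.

```python
import tempfile
from pathlib import Path, PurePosixPath
from typing import Optional


def resolve_media_path(storage_root: Path, url_path: str,
                       prefix: str = "/media/") -> Optional[Path]:
    """Map a /media/... URL path to a file under storage_root, or None if rejected."""
    if not url_path.startswith(prefix):
        return None
    relative = PurePosixPath(url_path[len(prefix):])
    # Reject empty paths, absolute paths and any ".." segment.
    if not relative.parts or relative.is_absolute() or ".." in relative.parts:
        return None
    root = storage_root.resolve()
    candidate = (root / Path(*relative.parts)).resolve()
    # After resolving symlinks, the file must still live under the storage root.
    if root not in candidate.parents:
        return None
    return candidate if candidate.is_file() else None


# Small demonstration against a temporary folder standing in for the moved media store.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    (store / "1234").mkdir()
    (store / "1234" / "policy.pdf").write_text("demo")
    found = resolve_media_path(store, "/media/1234/policy.pdf")
    found_name = found.name if found else None
    escaped = resolve_media_path(store, "/media/../secret.txt")
    missing = resolve_media_path(store, "/media/9999/none.pdf")
```

The proxy would then stream the returned file, or respond with 404 when the result is None. Any access rule (IP, login, published flag) could be checked in the same place, which is also where the per-request overhead would come from.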
Tom, in this instance the file has been removed. However, adding something to the robots.txt file kind of tells the world it is there anyway, and I think they are looking for something more secure.
Lee, it was previously linked from a document that has since been updated and the link removed.
I am thinking of perhaps adding a new property to their media items that lets them protect an item from being linked to directly from anywhere other than content on the same IP address. I would then need to check each request for "protected" media items, which I am sure will add overhead but would likely do the job.
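As a rough illustration of that per-request check, here is a minimal Python sketch. It swaps the "same IP address" test for a comparison of the Referer header's host with the site's own host name, which is an assumption on my part and not something settled in this thread. The function name, the set of protected paths and the example hosts are hypothetical. Note that the Referer header is supplied by the client and can be omitted or forged, so this discourages hotlinking and stale search results rather than providing real access control.

```python
from typing import Optional, Set
from urllib.parse import urlsplit


def is_media_request_allowed(path: str, referer: Optional[str],
                             site_host: str, protected_paths: Set[str]) -> bool:
    """Allow unprotected items always; allow protected items only from our own pages."""
    if path not in protected_paths:
        return True  # unprotected items skip the extra check entirely
    if not referer:
        return False  # direct hits (e.g. from a search result) carry no usable Referer
    # urlsplit().hostname is lower-cased, or None if the header is malformed.
    return urlsplit(referer).hostname == site_host.lower()


# Hypothetical data: one media item flagged as protected.
protected = {"/media/1234/policy.pdf"}
from_own_page = is_media_request_allowed(
    "/media/1234/policy.pdf", "https://www.example.com/policies/", "www.example.com", protected)
direct_hit = is_media_request_allowed(
    "/media/1234/policy.pdf", None, "www.example.com", protected)
from_other_site = is_media_request_allowed(
    "/media/1234/policy.pdf", "https://news.example.org/story", "www.example.com", protected)
unprotected = is_media_request_allowed(
    "/media/1301/logo.png", None, "www.example.com", protected)
```

Checking the protected set first keeps the cost close to a single lookup for the majority of items that are not flagged.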
The fundamental issue is a hard one - when a media item is updated or unlinked (that is, no longer used directly by a link in an RTE), should it be removed from the filesystem? In your case it seems the answer might be yes, but generally that would be overkill and even a problem. For instance, you might have a photo gallery macro that reads a folder in your media section - there are no direct links, but deleting the images would be a bad idea because it would break the photo gallery.
Even so, our friend Tim Gaunt had a blog post about this situation and a tool that might be helpful. http://blogs.thesitedoctor.co.uk/tim/2008/09/03/Clean+Out+Unused+Media+Items+From+Umbraco+Media+Folder.aspx
Keep the conversation going, this is an important topic!
cheers,
doug.
Doug, I think you may have touched on a potential solution here by looking at it from a different angle. They (and I) may have been looking at it from the wrong perspective. What they really need is to be able to identify orphaned files, in a similar way to Tim's solution, and then decide whether each media item stays or goes.
Since I have to do this for the client anyway I think I will package something up for this.
Having had another discussion with the client on this matter, they have agreed that the situation is an exception rather than the rule, and that a solution along the lines of what you recommended, Doug, to remove orphaned items from the media library, would be acceptable. So watch this space!
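For anyone following along, the orphan check could be sketched roughly as below. This is a Python illustration under stated assumptions, not the package mentioned above and not how Tim's tool works: it assumes you can get a list of media URLs and the rendered or stored HTML of each document, and every name in it (find_orphaned_media, keep_prefixes, the sample paths) is hypothetical. The keep_prefixes argument is there for Doug's caveat about folders that are read by a macro, such as a photo gallery, without any direct links.

```python
import re
from typing import Iterable, List

# Matches /media/... URLs up to the next quote, bracket or whitespace character.
MEDIA_URL = re.compile(r"""/media/[^\s"'<>)]+""")


def find_orphaned_media(media_paths: Iterable[str], documents: Iterable[str],
                        keep_prefixes: Iterable[str] = ()) -> List[str]:
    """Return media paths that no document links to and that are not in a kept folder."""
    referenced = set()
    for body in documents:
        referenced.update(MEDIA_URL.findall(body))
    keep = tuple(keep_prefixes)
    return sorted(
        path for path in media_paths
        if path not in referenced and not path.startswith(keep)
    )


# Hypothetical sample data: one linked image, one gallery folder, one stale policy file.
media = ["/media/1234/policy.pdf", "/media/1301/logo.png", "/media/2001/gallery-1.jpg"]
docs = ['<p><img src="/media/1301/logo.png" alt="Logo"></p>']
orphans = find_orphaned_media(media, docs, keep_prefixes=["/media/2001/"])
```

The result is a list for an editor to review rather than something to delete automatically, which matches the "stays or goes" decision described above.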