Ideas for automatic creating of Robot.txt rules

Chris Houston 535 posts 980 karma points MVP admin c-trib

Jul 22, 2009 @ 00:42

Ideas for automatic creating of Robot.txt rules.

Hi Lee,

My thoughts on this are based on a previous Umbraco 3 site that had issues where Umbraco kept getting its knickers in a twist and outputting pages with unfriendly URLs, i.e. www.mydomain.com/nodeid.aspx. These were obviously not URLs we wanted Google or any other search engine to index. However, Google did, as I found when I checked Google Webmaster Tools.

I added these bad URLs to the robots.txt, and the next time Google indexed our site the bad URLs were removed and the correct URLs appeared in the index.

It made me think that when users remove pages from the Umbraco content section, those URLs just suddenly disappear, so it would be really good if there was a way of doing the following:

a) Replacing the old page with a standard redirect document (where the user selects where the dead page should now redirect to). This should exist for X number of days and then automatically recycle.

b) Adding a rule to the robots.txt file to disallow the search engines from continuing to index the page.
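
Point (a) can be illustrated with a rough, language-neutral sketch in Python (not Umbraco code). The in-memory `redirects` store, the `resolve_redirect` helper and the 30-day lifetime are all hypothetical choices made for illustration only.

```python
from datetime import date, timedelta

# Hypothetical setting: the "X days" from point (a); 30 is an arbitrary choice.
REDIRECT_LIFETIME_DAYS = 30


def resolve_redirect(redirects, old_url, today):
    """Return the target URL if a live redirect exists for old_url, else None."""
    entry = redirects.get(old_url)
    if entry is None:
        return None
    target, created = entry
    # Once the redirect is older than the configured lifetime it is "recycled".
    if today - created > timedelta(days=REDIRECT_LIFETIME_DAYS):
        del redirects[old_url]
        return None
    return target


# Hypothetical store: old URL -> (new URL, date the redirect was created).
redirects = {"/old-page.aspx": ("/new-page.aspx", date(2009, 7, 1))}
print(resolve_redirect(redirects, "/old-page.aspx", date(2009, 7, 22)))  # within lifetime
print(resolve_redirect(redirects, "/old-page.aspx", date(2009, 9, 1)))   # past lifetime
```

In a real site the store would live in the database or on the redirect document itself, and the clean-up would more likely run as a scheduled task than on lookup.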

I'd be interested to hear your and others' thoughts on this.

Cheers,

Chris

Petr Snobelt 923 posts 1535 karma points

Jul 22, 2009 @ 09:13

You can add a canonical link to your pages:

http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html

rasb 162 posts 218 karma points

Jul 22, 2009 @ 10:49

That's a good idea, Petr!

In any case it would make sense to use it.

I have tried to create a macro using XSLT:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xsl:stylesheet [ <!ENTITY nbsp "&#x00A0;"> ]>
<xsl:stylesheet 
    version="1.0" 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
    xmlns:msxml="urn:schemas-microsoft-com:xslt"
    xmlns:umbraco.library="urn:umbraco.library" xmlns:Exslt.ExsltCommon="urn:Exslt.ExsltCommon" xmlns:Exslt.ExsltDatesAndTimes="urn:Exslt.ExsltDatesAndTimes" xmlns:Exslt.ExsltMath="urn:Exslt.ExsltMath" xmlns:Exslt.ExsltRegularExpressions="urn:Exslt.ExsltRegularExpressions" xmlns:Exslt.ExsltStrings="urn:Exslt.ExsltStrings" xmlns:Exslt.ExsltSets="urn:Exslt.ExsltSets" 
    exclude-result-prefixes="msxml umbraco.library Exslt.ExsltCommon Exslt.ExsltDatesAndTimes Exslt.ExsltMath Exslt.ExsltRegularExpressions Exslt.ExsltStrings Exslt.ExsltSets ">


<xsl:output method="xml" omit-xml-declaration="yes"/>
<xsl:param name="currentPage"/>
<xsl:template match="/">

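<!-- Build the absolute site root from the host name of the current request -->
<xsl:variable name="url" select="concat('http://',umbraco.library:RequestServerVariables('HTTP_HOST'))" />
<!-- Emit the canonical link using the current page's friendly URL -->
<link rel="canonical" href="{$url}{umbraco.library:NiceUrl($currentPage/@id)}" />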

</xsl:template>
</xsl:stylesheet>

/rasb

Lee Kelleher 4026 posts 15837 karma points MVP 13x admin c-trib

Jul 22, 2009 @ 23:54

Hi Chris,

You raise a lot of valid points, and I share your frustrations. To go through them point by point:

Re: Google indexing the "mydomain.com/nodeid.aspx" URLs

There was a discussion on my blog last week about robots.txt and what Google indexes. It was suggested that Google only indexes pages that are linked to. So is it possible that somewhere in your Umbraco 3 site there were links to content pages using the "mydomain.com/nodeid.aspx" style? I know that the old RSS package used that URL structure for the <guid> tag... that could be the cause?

Petr is right about the canonical link tag - it's the latest thing all the "kool kids" are using these days - and Google love it (cleans up their indexes big time!) ... and rasb's XSLT will do the trick nicely!

Re: Removing content pages from Umbraco

Here's an idea! (using Umbraco v4 Events)

When a page is deleted via the Umbraco back-end, the Delete events are triggered (BeforeDelete and AfterDelete) ... you could hook up some code that writes the URL (of the page being deleted) to the robots.txt file. So when Google comes around to re-indexing your site, it will see the "disallow" rule and remove that URL from its index.

I do have reservations about doing this, as the robots.txt is meant to be about exclusion - not about a list of pages that don't exist (404).
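
The Delete-event idea above could look roughly like this. It is a language-neutral Python sketch, not Umbraco event code: `add_disallow_rule` is a hypothetical helper that a delete handler might call with the current robots.txt text and the URL path of the deleted page, and it only handles the simple case of a single "User-agent: *" group.

```python
def add_disallow_rule(robots_txt, path):
    """Return robots_txt with 'Disallow: <path>' added to the 'User-agent: *' group."""
    rule = "Disallow: " + path
    lines = robots_txt.splitlines()
    if rule in (line.strip() for line in lines):
        return robots_txt  # rule already present, nothing to do
    for i, line in enumerate(lines):
        if line.strip().lower() == "user-agent: *":
            # Put the new rule directly under the wildcard group header.
            lines.insert(i + 1, rule)
            break
    else:
        # No wildcard group yet: start one at the end of the file.
        if lines and lines[-1].strip():
            lines.append("")
        lines += ["User-agent: *", rule]
    return "\n".join(lines) + "\n"


existing = "User-agent: *\nDisallow: /umbraco/\n"
print(add_disallow_rule(existing, "/1234.aspx"))
```

Calling it twice with the same path leaves the file unchanged, which matters if the event fires more than once for the same node.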

Which leads me on to using a 404 handler to deal with it.

There is an Umbraco book about "not found handlers": http://umbraco.org/documentation/books/not-found-handlers

If the standard 404 handler isn't enough, then you could look at putting together a custom 404 handler to check against a list of old (deleted) URLs and serve up something accordingly? (The list of old/deleted URLs could be populated via the BeforeDelete event, as mentioned above)
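
A minimal sketch of that custom handler logic, in Python rather than as a real Umbraco NotFoundHandler. The `deleted_urls` mapping and the `handle_not_found` function are hypothetical, and answering 410 Gone for pages with no replacement is my own assumption about what "serve up something accordingly" might mean.

```python
def handle_not_found(path, deleted_urls):
    """Return (status_code, redirect_target) for a request that matched no page."""
    key = path.lower().rstrip("/")
    if key in deleted_urls:
        target = deleted_urls[key]
        if target:
            return 301, target  # the content moved: permanent redirect
        return 410, None        # deleted on purpose, no replacement: Gone
    return 404, None            # this URL was never known to the site


# Hypothetical list, e.g. filled in from the BeforeDelete event.
deleted_urls = {
    "/about/old-team.aspx": "/about/team.aspx",  # replaced by another page
    "/news/2008-offer.aspx": None,               # removed with no replacement
}
print(handle_not_found("/about/old-team.aspx", deleted_urls))
print(handle_not_found("/news/2008-offer.aspx", deleted_urls))
print(handle_not_found("/never-existed.aspx", deleted_urls))
```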

Hope this helps in some way, let me know your thoughts - it's a good discussion!

Cheers,

- Lee

Lee Kelleher 4026 posts 15837 karma points MVP 13x admin c-trib

Jul 26, 2009 @ 01:02

I was thinking about what happens when you rename a page in Umbraco - any existing links to that page break (hence why the /nodeid.aspx URL isn't such a bad GUID/permalink).

Following on from my last post, one solution could be when a content document is updated, we hook into the Save event, check if the "page name" is different - if so, then we can insert/append the old page name to the "umbracoUrlAlias" field (if your doc-type has it).

This way your old URLs aren't broken.
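
Roughly, the Save-event logic could look like the sketch below. It is plain Python, not Umbraco API code: `append_url_alias` is a hypothetical helper, it assumes "umbracoUrlAlias" holds a comma-separated list of aliases, and the lower-case-and-hyphens slug is a simplification of how Umbraco really turns page names into URLs.

```python
def append_url_alias(existing_aliases, old_name, new_name):
    """Return the updated alias string after a rename from old_name to new_name."""
    if old_name == new_name:
        return existing_aliases  # the page name did not change, nothing to add
    aliases = [a.strip() for a in existing_aliases.split(",") if a.strip()]
    # Simplified slug: lower-case the old name and replace spaces with hyphens.
    old_alias = old_name.strip().lower().replace(" ", "-")
    if old_alias not in aliases:
        aliases.append(old_alias)
    return ",".join(aliases)


print(append_url_alias("", "About Us", "About the Team"))
print(append_url_alias("company", "About Us", "About the Team"))
```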

I haven't written any code for this (yet) ... but when I do, I'll post it here. Unless someone else likes the idea and writes the code?

Cheers,

- Lee

Chris Houston 535 posts 980 karma points MVP admin c-trib

Jul 27, 2009 @ 10:29

Hi Lee,

One thing that needs to be taken into account is that the user may well rename a page and then create a new page with the original name, so in this scenario the redirect would need to be cancelled. I think it should always be given to the user as an option when they rename / delete a page.

I might have a play with this later this week if you've not already got it sorted :)

Cheers,

Chris

Lee Kelleher 4026 posts 15837 karma points MVP 13x admin c-trib

Jul 27, 2009 @ 12:53

Hi Chris,

As far as I am aware, the "umbracoUrlAlias" is used by a Not Found Handler (by default it's the first in the list) - so the aliases will only work if the original page/URL is not found. So if the user/editor creates a new page - with the original page title/URL ... then it will be picked up first (and not by the "umbracoUrlAlias").
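
To make that lookup order concrete, here is a tiny Python sketch of the behaviour as I understand it (not how Umbraco is actually implemented). The `resolve` function, the two dictionaries and the node ids are all made up for illustration.

```python
def resolve(path, published_pages, alias_map):
    """Return the node id that should serve path, or None for a real 404."""
    # Normal routing runs first: a published page at this URL always wins.
    if path in published_pages:
        return published_pages[path]
    # Only when nothing was found does the alias lookup (the not-found handler) run.
    return alias_map.get(path)


# Node 1050 was renamed from /about-us.aspx to /team.aspx and kept the old name as an alias.
published_pages = {"/team.aspx": 1050}
alias_map = {"/about-us.aspx": 1050}
print(resolve("/about-us.aspx", published_pages, alias_map))  # served via the alias

# The editor then creates a brand-new page with the original name.
published_pages["/about-us.aspx"] = 1099
print(resolve("/about-us.aspx", published_pages, alias_map))  # the new page is picked up first
```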

Of course, to keep things clean, some code could be written to remove the "umbracoUrlAlias" from an old page ... but there would be some overhead with that (i.e. look-ups in the DB/XML cache, writing the property back to the database, etc).

Code-wise, I doubt I'll have time to write anything like that in the next few weeks ... a couple of client project deadlines are looming, etc.

But do let me know if you write anything, I'd be happy to help test, etc.

Cheers,

- Lee

Lee Kelleher 4026 posts 15837 karma points MVP 13x admin c-trib

Feb 15, 2010 @ 02:38

Hi guys,

Following up on an old topic (from when I released the Robots.txt Editor) ... finally got around to coming up with a solution for renaming/deleting old content pages. Behold the 301 Moved Permanently (NotFoundHandler)! (I wanted to call it Permanent Redirect, but Peter Gregory got there first!)

As an alternative to the canonical link, this package lets you add a new property to your document-type to include old/bad URLs. It works in the same way as the "umbracoUrlAlias" property alias - but instead redirects the user to the new content page/node/URL, along with a 301 HTTP status code.

Let me know if you use it... look forward to any feedback.

Cheers, Lee.

Qube 74 posts 116 karma points

May 17, 2010 @ 01:01

I've been trying to tackle this issue too. My approach was to write a wrapper for UrlRewriting.config. Every umbraco install has UrlRewriting built in, so I figured it was the most open and reliable way to handle it.

In a nutshell, you can add a "Url Manager" property to your document type, and it will list all the rules in UrlRewriting.config that apply to a piece of content. You can add new rules and remove old ones. Saving stores the changes in the database, not XML. Publishing commits the changes to XML, and if the primary page name has changed, a new rule pointing the old URL to the new one is created.

The rules themselves are structured in such a way that they perform a 301 redirect.

The extension needs some work before it's turned into a project (check for duplicates, better UI etc.), but it's already at work in our corporate website, and it works great so far.

Lee Kelleher 4026 posts 15837 karma points MVP 13x admin c-trib

May 17, 2010 @ 08:28

Hi Ben, good idea... just checking, have you taken a look at the 301 URL Tracker package yet?

Qube 74 posts 116 karma points

Jun 17, 2010 @ 06:26

No I haven't. Looks pretty much perfect :)

I've since abandoned my UrlRewriting wrapper, because it will never be able to support multiple domain setups in umbraco (limitation of UrlRewriting.net). Look forward to investigating UrlTracker!
