full path disallow
Great package, and just what I need... in a multi-site environment this is really useful.
It looks useful for the sitemap URLs, like in your example, but how would you handle full URLs for allow or disallow?
The robots.txt has relative paths like Disallow: /umbraco/.
But in the following scenario I would like to disallow everything requested on a specific URL.
Say you have a site for a company in Europe: you set up the corporate site and are now working on 3 country sites.
The Belgian, Dutch and Luxembourg sites are still in development (content being finalized and all).
Can this package handle URL exclusion?
The client prefers to publish everything already, but use a staging URL.
I know that's possible, but because it's all published we need to exclude those URLs via the robots.txt, otherwise Google will start indexing temporary pages, which is not really a good idea.
So, is there a way to say
Disallow: http://dev.mysite.be/
Best regards,
Sander Houttekier
You would do that like so:
User-agent: *
Disallow: /
But it's not exactly "safe"; what we usually do is put Windows authentication on the temporary site so that no links to the site can ever appear in Google.
Ah, but there is the problem: I can only use this technique if the staging sites are in a different Umbraco.
I knew this was possible; however, due to budget-related decisions the client has only one Umbraco
with multiple sites in it. Some of them already have live hostnames, others are still being implemented content-wise.
The one robots.txt cannot handle Disallow: http://mysite.com/,
and placing Disallow: / would disallow everything for every site in that Umbraco, including the ones that are live.
Ah I see, well in that case I should release the source so you can make your own hack in it to allow for this... :) I'll try to do so this evening!
That would be great!
I know the situation is not optimal; if the money was there we would have two Umbracos with a Courier connection in between for staging versus live.
As explained, that has been an issue with this project.
I've been thinking about adding URL rewriting for those specific URLs too, as another approach to the issue: redirecting towards another robots.txt with only Disallow: / in it. But so far that has been unsuccessful, which is why I went looking for your package :)
Well, it turns out the source for this is extremely simple, so what you'd need to do, I think, is check the HTTP_HOST and write only the disallow rule and nothing else. Not sure how you would make this configurable, but you could probably just hardcode the domain(s) in for now:
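using System.IO;
using System.Web;

namespace Cultiv.DynamicRobots
{
    public class RobotsTxt : IHttpHandler
    {
        public void ProcessRequest(HttpContext context)
        {
            context.Response.ContentType = "text/plain";

            // The robots.txt in the site root is used as a template
            var robotsTemplate = HttpContext.Current.Server.MapPath(VirtualPathUtility.ToAbsolute("~/robots.txt"));
            if (File.Exists(robotsTemplate))
            {
                // Replace the {HTTP_HOST} placeholder with the hostname of the current request
                using (var streamReader = File.OpenText(robotsTemplate))
                {
                    var input = streamReader.ReadToEnd();
                    context.Response.Write(input.Replace("{HTTP_HOST}", HttpContext.Current.Request.ServerVariables["HTTP_HOST"]));
                }
            }
            else
            {
                context.Response.Write("");
            }
        }

        public bool IsReusable
        {
            get { return true; }
        }
    }
}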
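The package doesn't do this out of the box, but as a rough sketch of the hack described above — assuming hypothetical staging hostnames such as dev.mysite.be, dev.mysite.nl and dev.mysite.lu — the handler could short-circuit for those hosts and serve only a blanket disallow:

using System;
using System.IO;
using System.Web;

namespace Cultiv.DynamicRobots
{
    public class RobotsTxt : IHttpHandler
    {
        // Hypothetical list of staging hostnames that should never be indexed
        private static readonly string[] StagingHosts = { "dev.mysite.be", "dev.mysite.nl", "dev.mysite.lu" };

        public void ProcessRequest(HttpContext context)
        {
            context.Response.ContentType = "text/plain";

            var host = context.Request.ServerVariables["HTTP_HOST"];

            // For staging hostnames, write only a blanket disallow and stop
            if (Array.Exists(StagingHosts, h => string.Equals(h, host, StringComparison.OrdinalIgnoreCase)))
            {
                context.Response.Write("User-agent: *\nDisallow: /");
                return;
            }

            // For every other hostname, fall through to the normal template behaviour
            var robotsTemplate = context.Server.MapPath(VirtualPathUtility.ToAbsolute("~/robots.txt"));
            if (File.Exists(robotsTemplate))
            {
                using (var streamReader = File.OpenText(robotsTemplate))
                {
                    context.Response.Write(streamReader.ReadToEnd().Replace("{HTTP_HOST}", host));
                }
            }
        }

        public bool IsReusable
        {
            get { return true; }
        }
    }
}

When a staging site goes live, removing its hostname from that hardcoded list (or, later, moving the list into a config setting) would be enough to switch it back to the normal robots.txt template.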