Robots.txt for a Multi-site Sitecore Instance

Sitecore Version: 8.1

Every website with even a modest audience needs a robots.txt. For Sitecore-based sites, you could simply place a static robots.txt file at the root of the site. However, that approach doesn't always work, because:

  • a multi-tenant Sitecore instance may require different rules per site
  • IT policies may not allow you to easily upload or update the robots.txt file on the server

I’ve seen some solutions for generating robots.txt dynamically, but they didn’t work for me. I wanted to handle robots.txt before any Sitecore-related processing was done, to keep the solution lean.  This means that I cannot use the httpRequestBegin pipeline to access the site context, etc.  (If you can use httpRequestBegin, see srinivas_r’s post.)

To get started, add a new multi-line text field called Robots to the template of your site’s root item.
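For example, the Robots field on a given site’s root item might contain something like this (the paths and sitemap URL are placeholders, not part of this solution):

    User-agent: *
    Disallow: /sitecore
    Disallow: /search
    Sitemap: https://www.example.com/sitemap.xml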

The code is commented well enough for you to follow along.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Web;
    using Sitecore.Data.Items;
    using Sitecore.Sites;
    using Sitecore.Web;

    /// <summary>
    /// Serves robots.txt per site.
    /// Requires: Robots multi-line field on the site's root item
    /// </summary>
    public class RobotsHandler : IHttpHandler
    {
        /// <summary>
        /// The default robots.txt content, served when no site-specific value is found
        /// </summary>
        private readonly string defaultRobots = "User-agent: *"
                                                + Environment.NewLine +
                                                "Disallow: /sitecore";

        /// <summary>
        /// You will need to configure this handler in the Web.config file of your
        /// web and register it with IIS before being able to use it. For more information
        /// see the following link: http://go.microsoft.com/?linkid=8101007
        /// </summary>
        /// <returns>true if the <see cref="T:System.Web.IHttpHandler" /> instance is reusable; otherwise, false.</returns>
        public bool IsReusable
        {
            // Return false if your managed handler cannot be reused for another
            // request, typically when it keeps per-request state.
            // This handler is stateless, so it can safely be reused.
            get { return true; }
        }

        /// <summary>
        /// Enables processing of HTTP Web requests by a custom HttpHandler that implements the <see cref="T:System.Web.IHttpHandler" /> interface.
        /// </summary>
        /// <param name="context">An <see cref="T:System.Web.HttpContext" /> object that provides references to the intrinsic server objects (for example, Request, Response, Session, and Server) used to service HTTP requests.</param>
        public void ProcessRequest(HttpContext context)
        {
            // Let's make sure we're only handling robots.txt
            if (context.Request.Url.AbsolutePath.IndexOf("robots.txt",
                StringComparison.OrdinalIgnoreCase) == -1)
            {
                return;
            }

            string robotsTxt = defaultRobots;

            // Get the list of available sites in Sitecore
            List<SiteInfo> sites = Sitecore.Configuration.Factory.GetSiteInfoList();
            string host = context.Request.Url.Host.ToLower();

            // Find the site based on the current URL. The hostName attribute of
            // a site definition may list several hosts separated by '|'.
            var siteInfo = (from site in sites
                            where !String.IsNullOrEmpty(site.HostName)
                            select new
                            {
                                SiteInfo = site,
                                Hosts = site.HostName.Split(new[] { '|' },
                                    StringSplitOptions.RemoveEmptyEntries)
                            } into hosts
                            where hosts.Hosts.Any(x => x.ToLower() == host)
                            select hosts.SiteInfo).FirstOrDefault();

            // If not found, let's try the default site definition, "website"
            if (siteInfo == null)
            {
                siteInfo = (from site in sites
                            where site.Name == "website"
                            select site).FirstOrDefault();
            }

            if (siteInfo != null)
            {
                // This requires a multi-line field called "Robots" on the
                // root item of the site
                var siteContext = new SiteContext(siteInfo);

                if (siteContext.Database != null)
                {
                    Item rootItem = siteContext.Database.GetItem(siteContext.RootPath);

                    if (rootItem != null && rootItem.Fields["Robots"] != null
                        && !String.IsNullOrEmpty(rootItem.Fields["Robots"].Value))
                    {
                        robotsTxt = rootItem.Fields["Robots"].Value;
                    }
                }
            }

            context.Response.ContentType = "text/plain";
            context.Response.Write(robotsTxt);
        }
    }

In short, we use Sitecore.Configuration.Factory.GetSiteInfoList() to get the list of configured sites and compare each site’s hostName against the host of the current request URL.  Once a match is found, we simply read the Robots field from that site’s root item.
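The match relies on the hostName attribute of each site definition, which can list several hosts separated by a pipe. A hypothetical definition (the names and paths are illustrative, not from this post; remaining attributes omitted) might look like:

    <site name="site-a" hostName="www.site-a.com|site-a.com"
          rootPath="/sitecore/content/SiteA" startItem="/home" database="web" />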

Lastly, we need to update our web.config so that requests for robots.txt are routed to our handler. Add the following to the /configuration/system.webServer/handlers section of the web.config:


    <add name="BoK.Extentions.Handlers.RobotsHandler" verb="*" path="robots.txt"
         type="BoK.Extentions.Handlers.RobotsHandler, BoK.Extentions" />

That’s it.
