Thread: Robots.txt/sitemap/crawling issue

  1. #1
    SkippyMcgee
    
    

    Robots.txt/sitemap/crawling issue

    Is there an easy way to disallow URLs with parameters in them? Anything after the (?)?

    To avoid massive duplication issues, we've disallowed the dynamic URLs that are created by our left-nav parametric filtering. So any URLs with attributes or a brand are blocked.

    The problem with something like Disallow: /*brand is that we have 800 URLs that have the word 'brand' in the URL, so not only will it disallow www.example.comaspx?brand=15, it will disallow

    And this is screwing us up with Google Products in a big way.

    I can't just put Disallow: /*? because DNSF's sitemap has question marks in it, and Google rejects the sitemap.

    Any thoughts (other than rewriting 800 product titles to change the URLs)?


  2. #2
    deanfp
    
    


    Tricky one that I am also looking for a solution to.

    This may be of some use:

    You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:

    User-agent: *
    Allow: /*?$
    Disallow: /*?

    The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).

    The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
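
    To see how those two rules interact, here is a rough Python sketch of the matching. It is not Googlebot's actual implementation. It assumes that * matches any run of characters, that a trailing $ anchors the end of the URL, and that the longest matching rule wins, with Allow winning a tie. The function names and the sample paths are made up for illustration. As far as I know, Python's built-in urllib.robotparser does not understand these wildcards, so the sketch converts each rule to a regex by hand.

```python
import re


def rule_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    # A trailing '$' means the rule must match through the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(path, rules):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, pattern) tuples, for example
    ("Disallow", "/*?"). Assumption: the longest matching pattern
    wins, and Allow beats Disallow when the lengths are equal.
    """
    best = None
    for directive, pattern in rules:
        if rule_to_regex(pattern).match(path):
            key = (len(pattern), directive == "Allow")
            if best is None or key > best:
                best = key
    # No matching rule at all means the path is crawlable.
    return True if best is None else best[1]


RULES = [("Allow", "/*?$"), ("Disallow", "/*?")]

# Hypothetical paths, only for illustration.
for sample in ("/c-12-widgets.aspx",
               "/c-12-widgets.aspx?brand=15",
               "/c-12-widgets.aspx?"):
    print(sample, is_allowed(sample, RULES))
```

    With these rules, only the path that carries something after the ? comes back as blocked. The same helper can be used to check a candidate pattern against a list of real product URLs before putting it in the live robots.txt.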

  3. #3
    SkippyMcgee
    
    


    Thanks deanfp! We're trying some new options, and that's one of them.

  4. #4
    fooster
    
    


    Sounds like you need to use the canonical link tag (rel="canonical") on these pages. This will fix the duplication without the risk of accidentally blocking pages by messing with your robots.txt file.

    Have a search for canonical on the forums; someone has implemented an xmlpackage which you will be able to adapt to your specific needs.
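
    As a rough illustration of the idea (this is not the xmlpackage from the forums), here is a small Python sketch that works out the canonical URL for a filtered page by dropping the filter parameters and then builds the link tag. The parameter names in FILTER_PARAMS, the function names, and the example URL are assumptions; swap in whatever your left-nav filtering actually puts in the query string.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names of the filter parameters; adjust to match the real site.
FILTER_PARAMS = {"brand", "attribute"}


def canonical_url(url):
    """Return `url` with the filter parameters (and any fragment) removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in FILTER_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )


def canonical_link_tag(url):
    """Build the <link rel="canonical"> element for the page at `url`."""
    href = escape(canonical_url(url), quote=True)
    return '<link rel="canonical" href="%s" />' % href


# Hypothetical filtered URL, only for illustration.
print(canonical_link_tag("http://www.example.com/c-12-widgets.aspx?brand=15"))
```

    Every filtered variation of a category page would then point back to the one unfiltered URL, so the filtered pages can stay crawlable without being treated as duplicates.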
