
Thread: Robots.txt/sitemap/crawling issue

  1. #1
     SkippyMcgee, Junior Member (Join Date: Jul 2011, Posts: 4)

     Robots.txt/sitemap/crawling issue

    Is there an easy way to disallow URLs with parameters in them, i.e. anything after the question mark (?)?

    To avoid massive duplication issues, we've disallowed the dynamic URLs created by our left-nav parametric filtering, so any URLs with attributes or a brand are blocked.

    The problem with something like Disallow: /*brand is that we have 800 URLs that contain the word 'brand', so it will disallow not only www.example.com/something.aspx?brand=15 but also www.example.com/p-1-brand-t-shirt.aspx.

    And this is screwing us up with Google Products in a big way.

    I can't just put Disallow: /*? because DNSF's sitemap contains URLs with question marks in them, and Google then rejects the sitemap because it lists blocked URLs.

    Any thoughts? (other than rewriting 800 product titles to change URLs?)

    Thanks!
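    One narrower pattern that addresses this (a sketch added for illustration, not from the original posts, and assuming Google-style wildcard support in robots.txt) is to anchor the rule to the query-string parameter rather than to the bare word 'brand':

    User-agent: *
    # blocks only URLs where 'brand' appears as a query-string parameter
    Disallow: /*?brand=
    Disallow: /*&brand=

    With this, www.example.com/p-1-brand-t-shirt.aspx is unaffected because it never contains '?brand=' or '&brand=', while filtered URLs such as www.example.com/something.aspx?brand=15 are still blocked.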

  2. #2
     deanfp, Senior Member (Join Date: May 2009, Location: Sweden, Posts: 556)

    Tricky one that I am also looking for a solution to.

    This may be of some use:

    You can use wildcard pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:

    User-agent: *
    Allow: /*?$
    Disallow: /*?

    The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).

    The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
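
    To see how those two rules interact, here is a small self-contained Python sketch (added for illustration, not part of the original post) that approximates Google's wildcard matching by translating each pattern into a regular expression; the example paths are hypothetical, and Google's real precedence rules are simplified to "longest matching pattern wins, allowed by default":

    import re

    # Translate a robots.txt path pattern (with * and $ wildcards) into a regex.
    def pattern_to_regex(pattern):
        regex = ""
        for ch in pattern:
            if ch == "*":
                regex += ".*"        # * matches any run of characters
            elif ch == "$":
                regex += "$"         # $ anchors the pattern to the end of the URL
            else:
                regex += re.escape(ch)
        return re.compile(regex)

    # The rules quoted above: allow URLs ending in '?', block other URLs containing '?'.
    RULES = [("allow", "/*?$"), ("disallow", "/*?")]

    def is_allowed(path):
        # Simplified Google-style resolution: the longest matching pattern wins;
        # if no rule matches, the URL is allowed by default.
        best = None
        for kind, pattern in RULES:
            if pattern_to_regex(pattern).match(path):
                if best is None or len(pattern) > len(best[1]):
                    best = (kind, pattern)
        return best is None or best[0] == "allow"

    # Hypothetical example paths, for illustration only.
    for path in ["/c-12-shirts.aspx?brand=15",  # filtered page: blocked
                 "/c-12-shirts.aspx?",          # ends in '?': allowed
                 "/p-1-brand-t-shirt.aspx"]:    # no '?': allowed
        print(path, "->", "allowed" if is_allowed(path) else "blocked")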

  3. #3
     SkippyMcgee, Junior Member (Join Date: Jul 2011, Posts: 4)

    Thanks deanfp! We're trying some new options, and that's one of them.

  4. #4
     fooster, Member (Join Date: Jan 2007, Posts: 98)

    Sounds like you need to use the canonical meta tag on these pages. This will fix the duplication issue without the risk of accidentally blocking pages by messing with your robots.txt file.

    Have a search for canonical on the forums; someone has implemented an XMLPackage which you will be able to adapt to your specific needs.
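
    For reference, the canonical tag itself is just a link element emitted in the page head that points at the preferred URL for the page. A minimal sketch (the URL is only an example; an XMLPackage like the one mentioned above would emit this per product or category page):

    <link rel="canonical" href="http://www.example.com/p-1-brand-t-shirt.aspx" />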

    Ben