
Thread: Robots.txt/sitemap/crawling issue

  1. #1
     SkippyMcgee, Junior Member (Join Date: Jul 2011, Posts: 4)

     Robots.txt/sitemap/crawling issue

    Is there an easy way to disallow URLs with parameters in them, i.e. anything after the question mark (?)?

    To avoid massive duplication issues, we've disallowed the dynamic URLs created by our left-nav parametric filtering, so any URLs with attributes or a brand are blocked.

    The problem with something like Disallow: /*brand is that we have 800 URLs that contain the word 'brand', so it will disallow not only www.example.com/something.aspx?brand=15 but also www.example.com/p-1-brand-t-shirt.aspx.

    And this is screwing us up with Google Products in a big way.

    I can't just put Disallow: /*? because DNSF's sitemap contains URLs with question marks in them, and Google then rejects the sitemap because it lists blocked URLs.

    Any thoughts? (other than rewriting 800 product titles to change URLs?)

    Thanks!
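    One narrower pattern that addresses this (a sketch added for illustration, not from the original posts, and assuming Google-style wildcard support in robots.txt) is to anchor the rule to the query-string parameter rather than to the bare word 'brand':

    User-agent: *
    # blocks only URLs where 'brand' appears as a query-string parameter
    Disallow: /*?brand=
    Disallow: /*&brand=

    With this, www.example.com/p-1-brand-t-shirt.aspx is unaffected because it never contains '?brand=' or '&brand=', while filtered URLs such as www.example.com/something.aspx?brand=15 are still blocked.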

  2. #2
     deanfp, Senior Member (Join Date: May 2009, Location: Sweden, Posts: 556)

    Tricky one that I am also looking for a solution to.

    This may be of some use:

    You can use wildcard pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:

    User-agent: *
    Allow: /*?$
    Disallow: /*?

    The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).

    The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
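
    To see how those two rules interact, here is a small self-contained Python sketch (added for illustration, not part of the original post) that approximates Google's wildcard matching by translating each pattern into a regular expression; the example paths are hypothetical, and Google's real precedence rules are simplified to "longest matching pattern wins, allowed by default":

    import re

    # Translate a robots.txt path pattern (with * and $ wildcards) into a regex.
    def pattern_to_regex(pattern):
        regex = ""
        for ch in pattern:
            if ch == "*":
                regex += ".*"        # * matches any run of characters
            elif ch == "$":
                regex += "$"         # $ anchors the pattern to the end of the URL
            else:
                regex += re.escape(ch)
        return re.compile(regex)

    # The rules quoted above: allow URLs ending in '?', block other URLs containing '?'.
    RULES = [("allow", "/*?$"), ("disallow", "/*?")]

    def is_allowed(path):
        # Simplified Google-style resolution: the longest matching pattern wins;
        # if no rule matches, the URL is allowed by default.
        best = None
        for kind, pattern in RULES:
            if pattern_to_regex(pattern).match(path):
                if best is None or len(pattern) > len(best[1]):
                    best = (kind, pattern)
        return best is None or best[0] == "allow"

    # Hypothetical example paths, for illustration only.
    for path in ["/c-12-shirts.aspx?brand=15",  # filtered page: blocked
                 "/c-12-shirts.aspx?",          # ends in '?': allowed
                 "/p-1-brand-t-shirt.aspx"]:    # no '?': allowed
        print(path, "->", "allowed" if is_allowed(path) else "blocked")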

  3. #3
     SkippyMcgee, Junior Member (Join Date: Jul 2011, Posts: 4)

    Thanks deanfp! We're trying some new options, and that's one of them.

  4. #4
     fooster, Member (Join Date: Jan 2007, Posts: 98)

    Sounds like you need to use the canonical meta tag on these pages. This will fix the duplication issue without the risk of accidentally blocking pages by messing with your robots.txt file.

    Have a search for canonical on the forums; someone has implemented an XMLPackage which you will be able to adapt to your specific needs.
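
    For reference, the canonical tag itself is just a link element emitted in the page head that points at the preferred URL for the page. A minimal sketch (the URL is only an example; an XMLPackage like the one mentioned above would emit this per product or category page):

    <link rel="canonical" href="http://www.example.com/p-1-brand-t-shirt.aspx" />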

    Ben