Could you please let me know how to block URLs like the following in robots.txt so that Googlebot stops indexing them?

http://www.example.com/+rt6s4ayv1e/d112587/ia0g64491218q

My website was hacked. It has since been recovered, but the hacker got 5000 URLs indexed in Google, and I now get 404 errors on randomly generated links like the one above, all starting with /+.

Is there a quick way to deal with these other than manually removing the URLs in Google Webmaster Tools?

Can robots.txt block URLs starting with a + sign?

migrated from serverfault.com 3 hours ago

This question came from our site for system and network administrators.

There is nothing special about + (plus) in the URL-path, it is just a character like any other. – w3dk 2 hours ago

My website was hacked which is now recovered but the hacker indexed 5000 URLs in Google and now I get error 404

A 404 is probably preferable to blocking with robots.txt if you want these URLs dropped from the search engines (i.e. Google). If you block crawling, the URLs could still remain indexed. (Note that robots.txt primarily blocks crawling, not indexing.)

If you want to "speed up" the de-indexing of these URLs then you could perhaps serve a "410 Gone" instead of the usual "404 Not Found". You could do something like the following with mod_rewrite (Apache) in your root .htaccess file:

RewriteEngine On
# Return "410 Gone" for any URL-path that starts with a literal +
RewriteRule ^\+ - [G]

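To see which paths a rule like this would catch, the match can be sketched in Python. This is only an illustration: Python's re module is not the regex engine Apache uses, and would_return_gone is a hypothetical helper name. It assumes the rule lives in the root .htaccess file, where mod_rewrite matches against the URL-path without its leading slash, and that the + in the pattern is escaped so it is read as a literal character rather than a quantifier.

```python
import re

# Escaped pattern: a literal "+" at the very start of the path.
PATTERN = re.compile(r"^\+")


def would_return_gone(url_path: str) -> bool:
    """Hypothetical helper: True if the rule would fire for this URL-path."""
    # In a root .htaccess file the leading slash is not part of the match input.
    relative = url_path[1:] if url_path.startswith("/") else url_path
    return PATTERN.search(relative) is not None


if __name__ == "__main__":
    for path in ("/+rt6s4ayv1e/d112587/ia0g64491218q", "/about+us", "/index.html"):
        print(path, would_return_gone(path))
```

Only the first path starts with a +, so it is the only one the rule would answer with 410; a + later in the path is left alone.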
User-agent: *
Disallow: /+

should do what you want. It tells robots not to request any URL starting with /+.
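As a rough check of these two lines, Python's standard urllib.robotparser can be asked whether a crawler may fetch a given URL. This is a sketch, not a statement about how Googlebot itself parses robots.txt; is_allowed is a hypothetical helper name, and the second URL is made up for comparison. It only tests crawl permission and says nothing about whether a URL stays in the index.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /+
"""


def is_allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Hypothetical helper: may `agent` fetch `url` under these rules?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    hacked = "http://www.example.com/+rt6s4ayv1e/d112587/ia0g64491218q"
    normal = "http://www.example.com/index.html"  # made-up comparison URL
    print(is_allowed(ROBOTS_TXT, hacked))
    print(is_allowed(ROBOTS_TXT, normal))
```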

If you really want to use robots.txt, this would be a simple answer to your question. I have also included a link to the robots.txt specification.

User-agent: *
Disallow: /+

Read about robots.txt specs

Another alternative is to use .htaccess (if you use Apache) with a rewrite rule that catches these URLs and either returns a more appropriate HTTP status code to Google or simply redirects the traffic to some other page.

There is no need for the * (asterisk) at the end of the URL-path. It should be removed for greatest spider-compatibility. robots.txt is already prefix matching, so /+* is the same as /+ for bots that support wildcards, and for bots that don't support wildcards then /+* will not match at all. – w3dk 2 hours ago
    
You are right; I just wrote that based on his question about Googlebot. I have edited it for better compatibility with multiple bots. – davidbl 1 hour ago
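The point about the trailing asterisk can be sketched with two toy matchers in Python. Both are simplified assumptions rather than any real crawler's implementation: literal_match stands in for a bot without wildcard support, and wildcard_match for one that treats * as any run of characters (other extensions such as an end-of-URL anchor are left out).

```python
import re


def literal_match(rule: str, path: str) -> bool:
    """Toy matcher without wildcard support: the rule is a plain prefix."""
    return path.startswith(rule)


def wildcard_match(rule: str, path: str) -> bool:
    """Toy matcher where * matches any run of characters; still prefix-based."""
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.match(pattern, path) is not None


if __name__ == "__main__":
    path = "/+rt6s4ayv1e/d112587/ia0g64491218q"
    for rule in ("/+", "/+*"):
        print(rule, literal_match(rule, path), wildcard_match(rule, path))
```

Under these assumptions, /+ matches in both matchers, while /+* matches only in the wildcard-aware one, which is why the shorter rule is the more compatible choice.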
