Googlebot Gets Candid: Now All That You Would Like To Know!

March 7th, 2008




Googlebot is like a dream: it knows us all <head>, <body>, and soul. In this interview from the Google Webmaster Central Blog, Maile Ohye plays the Website and Jeremy Lilley plays Googlebot, answering the questions you have always wanted to ask.

Website: Would you crawl with the same headers if the site were in the U.S., Asia or Europe? Do you ever use different headers?

Googlebot: Typically my headers are the same world-wide. I crawl to see what a page looks like for the site's default language and settings. At times the User-Agent is different. For example, AdSense fetches use "User-Agent: Mediapartners-Google", and image search uses "User-Agent: Googlebot-Image/1.0". Wireless fetches mostly have carrier-specific user agents, while Google Reader RSS fetches include extra info such as the number of subscribers.
However, I usually avoid cookies so that session-specific info doesn't affect the content. Further, if a server uses a dynamic URL rather than a cookie to carry a session id, I can usually identify it, so I avoid crawling the same page a million times with a million different session ids.
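
To make the session id point concrete, here is a minimal Python sketch of how a crawler might collapse URLs that differ only by a session id. It is an illustration under stated assumptions, not Googlebot's actual logic: the parameter names in SESSION_PARAMS and the helper names strip_session_ids and unique_pages are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of query parameters assumed to carry session state.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}

def strip_session_ids(url):
    """Return the URL with known session-id query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )

def unique_pages(urls):
    """Keep one canonical URL per page, preserving first-seen order."""
    seen = set()
    result = []
    for url in urls:
        canonical = strip_session_ids(url)
        if canonical not in seen:
            seen.add(canonical)
            result.append(canonical)
    return result
```

With this sketch, http://www.example.com/page?id=7&sid=abc and http://www.example.com/page?id=7&sid=xyz both reduce to the same URL, so the page would be fetched once rather than once per session.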

Website: Do you index all URLs or are certain file extensions automatically filtered?

Googlebot: When indexing for regular web search, I don't download links to MP3s and videos. Similarly, I treat a JPG link differently from an HTML or PDF link. If I'm looking for links as Google Scholar, I'm more interested in the PDF article than the JPG file. If I'm crawling for image search, I'm more interested in JPGs, and for news I'm more interested in HTML and images.

Website: How do you treat an unknown file extension, for example http://www.example.com/page1.LOL111?

Googlebot: I would treat it fairly. After I download a file, I use the Content-Type header to check whether it really is HTML, an image, text, or something else. If it happens to be a special data type like a PDF file, Word document, or Excel spreadsheet, I'll check that it has a valid format and extract the text content. So, when crawling http://www.example.com/page1.LOL111 with its unknown file extension, I would start by downloading it. If I can't figure out the content type from the header, or it's a format that we don't index (e.g. mp3), then it'll be put aside. Otherwise, we proceed with indexing the file.
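
The decision described here depends on the Content-Type header, not on the file extension. A rough Python sketch of that flow follows. The MIME type groupings and the function name decide are assumptions made for illustration; the interview names the categories but not the exact types Googlebot checks.

```python
# Hypothetical groupings, chosen only to mirror the categories in the interview.
TEXT_EXTRACTABLE = {
    "application/pdf",
    "application/msword",
    "application/vnd.ms-excel",
}
NOT_INDEXED = {"audio/mpeg"}  # e.g. mp3

def decide(content_type):
    """Map a Content-Type header value to 'index', 'extract_text' or 'set_aside'."""
    if not content_type:
        return "set_aside"  # content type could not be determined
    # Drop parameters such as "; charset=UTF-8" and normalize case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in NOT_INDEXED:
        return "set_aside"
    if mime in TEXT_EXTRACTABLE:
        return "extract_text"
    if mime.startswith("text/") or mime.startswith("image/"):
        return "index"
    return "set_aside"
```

Under these assumptions, page1.LOL111 served as "text/html; charset=UTF-8" would be indexed like any other HTML page, while the same URL served as "audio/mpeg" or with no Content-Type at all would be put aside.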

Website: Can you explain your header "Accept-Encoding: gzip,deflate"?

Googlebot: Gzip compression of content saves bandwidth, but whether to enable it doesn't have a simple answer. Both Apache and IIS have options to enable gzip and deflate compression, though there's a CPU cost involved for the bandwidth saved. Typically, it's only enabled for easily compressible text content such as HTML, CSS, and PHP output. And it only gets used if the user's browser or I (a search engine crawler) allow it. Personally, I prefer "gzip" over "deflate". Gzip is a slightly more robust encoding: there is consistently a checksum and a full header, giving me less guess-work than with deflate. Otherwise they're very similar compression algorithms. If you have some spare CPU on your servers, it might be worth experimenting with compression. But if you're serving dynamic content and your servers are already heavily CPU loaded, you might want to hold off.
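
The "guess-work" with deflate can be shown from the client side. The Python sketch below decodes a response body based on its Content-Encoding value. The function name decode_body is hypothetical, and the fallback reflects a common interoperability problem rather than anything stated in the interview: a body labelled "deflate" may arrive either zlib-wrapped or as a raw deflate stream, so a client has to try both.

```python
import gzip
import zlib

def decode_body(body, content_encoding):
    """Decompress a response body according to its Content-Encoding value."""
    encoding = content_encoding.strip().lower()
    if encoding == "gzip":
        # gzip always carries a header and a CRC32 trailer, so one attempt suffices.
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            # First guess: zlib-wrapped deflate data.
            return zlib.decompress(body)
        except zlib.error:
            # Second guess: a raw deflate stream with no zlib header.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    return body  # identity or unknown encoding: return unchanged
```

A gzip body always starts with the bytes 0x1f 0x8b, which is part of why it needs no guessing.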

Website: What do you have to say about an overprotective parent like robots.txt?

Googlebot: Well, I see plenty of them. Some are mere HTML error pages rather than valid robots.txt files. Others have infinite redirects to totally unrelated sites, while still others are just huge, with thousands of different URLs listed individually. The problem I face is that, after I see a restrictive robots.txt, I may have to start throwing away content from the index that I've already crawled. And then I have to recrawl a lot of content once I'm allowed to hit the site again. At least a 503 response code would've been temporary. Mostly I re-check robots.txt once a day. For webmasters, trying to control crawl rate by swapping robots.txt files usually backfires. It's better to set the rate to "slower" in Webmaster Tools.
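
As an illustration of keeping robots.txt small and valid, the sketch below uses Python's standard urllib.robotparser module to show that a single path-prefix rule can cover a whole directory, so there is no need to list thousands of URLs individually. The robots.txt content and the helper name allowed are made up for this example, and Python's parser is only a stand-in: it is not Googlebot's parser and may differ from it in details.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt: one prefix rule instead of thousands of individual URLs.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this file, every URL under http://www.example.com/private/ is blocked for all crawlers by one line, while the rest of the site stays crawlable.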

Googlebot: Hey, Website, thanks for all of your questions, but it's time to crawl away. You've been wonderful, but I'm going to have to say "FIN, my love."

Website: Thank you, Googlebot, for everything! Keep crawling!


