Googlebot Gets Candid: Now All That You Would Like To Know!
March 7th, 2008
Googlebot is like a dream that knows us all. In this interview from the Google Webmaster Central Blog, Maile Ohye plays the website and Jeremy Lilley plays Googlebot, and together they answer the questions you have always wanted to ask.
Website: Would you crawl with the same headers if the site were in the U.S., Asia or Europe? Do you ever use different headers?
Googlebot: Typically the headers are the same worldwide. I crawl to see what a page looks like with the default language and settings for the site. At times the User-Agent is different. For example, AdSense fetches use "User-Agent: Mediapartners-Google", and image search fetches use "User-Agent: Googlebot-Image/1.0". Wireless fetches mostly have carrier-specific user agents, while Google Reader RSS fetches include extra info such as the number of subscribers.
I usually avoid cookies, so that session-specific info doesn't affect the content. And if a server puts a session id in a dynamic URL rather than in a cookie, I can usually recognize it, which keeps me from crawling the same page a million times with a million different session ids.
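To make the session id point concrete, here is a minimal sketch of how a crawler might collapse URLs that differ only by a session id. It is not Googlebot's actual logic: the parameter names in SESSION_PARAMS and the helper strip_session_id are assumptions made up for this illustration, and real sites use many other names.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of session-like parameter names (an assumption,
# not a list published by any search engine).
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_id(url: str) -> str:
    """Return the URL with session-like query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# Two visits to the same page with different session ids.
urls = [
    "http://www.example.com/page?id=7&sid=abc123",
    "http://www.example.com/page?id=7&sid=zzz999",
]

# Both collapse to a single canonical URL, so the page is fetched once.
canonical = {strip_session_id(u) for u in urls}
print(canonical)
```

A crawler that keys its "already seen" set on the stripped URL would fetch the page once instead of once per session id.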
Website: Do you index all URLs or are certain file extensions automatically filtered?
Googlebot: When I'm indexing for regular web search, I would not download links to MP3s and videos. Similarly, I treat a JPG link differently than an HTML or PDF link. If I'm looking for links as Google Scholar, I'm more interested in the PDF article than the JPG file. If I'm crawling for image search, I'm more interested in JPGs, and for news I'm mostly looking at HTML and images.
Website: How do you treat an unknown file extension, for example http://www.example.com/page1.LOL111?
Googlebot: I would treat it fairly. After I download a file, I use the Content-Type header to check whether it really is HTML, an image, text, or something else. If it happens to be a special data type like a PDF file, Word document, or Excel spreadsheet, I make sure it's in a valid format and extract the text content. So for http://www.example.com/page1.LOL111, with its unknown file extension, I would start by downloading it. If I can't figure out the content type from the header, or if it's a format that we don't index (e.g. mp3), then it gets put aside. Otherwise, we proceed with indexing the file.
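The decision Googlebot describes depends on the Content-Type header, not the file extension. Here is a rough sketch of that idea. The function decide, the two sets of media types, and the returned labels are all assumptions for illustration, and treating an unrecognized type as "set aside" is a simplification of this sketch, not something the interview states.

```python
from typing import Optional

# Assumed buckets, for illustration only.
TEXT_EXTRACTABLE = {
    "application/pdf",
    "application/msword",
    "application/vnd.ms-excel",
}
NOT_INDEXED = {"audio/mpeg"}  # e.g. mp3


def decide(content_type: Optional[str]) -> str:
    """Choose what to do with a downloaded file based on its Content-Type."""
    if not content_type:
        # The header is missing, so the type cannot be determined.
        return "set aside"
    # Drop parameters such as "; charset=UTF-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in NOT_INDEXED:
        return "set aside"
    if media_type in ("text/html", "text/plain"):
        return "index"
    if media_type.startswith("image/"):
        return "index as image"
    if media_type in TEXT_EXTRACTABLE:
        return "extract text, then index"
    # Unrecognized type: set aside (a simplification in this sketch).
    return "set aside"


# The extension ".LOL111" plays no part; only the header matters.
print(decide("text/html; charset=UTF-8"))
print(decide("audio/mpeg"))
```

So a page served as http://www.example.com/page1.LOL111 with a text/html header would be handled like any other HTML page.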
Website: Can you explain your header "Accept-Encoding: gzip,deflate"?
Googlebot: Gzip compression of content saves bandwidth. Whether you should enable it doesn't have a simple answer. Both Apache and IIS have options to enable gzip and deflate compression, though there's a CPU cost for the bandwidth saved. Typically, it's only enabled for easily compressible text such as HTML, CSS, and PHP output. And it only gets used if the user's browser or I (a search engine crawler) allow it. Personally, I prefer "gzip" over "deflate". Gzip is a slightly more robust encoding: there is consistently a checksum and a full header, giving me less guesswork than with deflate. Otherwise they're very similar compression algorithms. If you have some spare CPU on your servers, it might be worth experimenting with compression. But if you're serving dynamic content and your servers are already heavily CPU loaded, you might want to hold off.
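One way to picture the "guesswork" with deflate: a gzip body always starts with the same two magic bytes, while a body labelled "deflate" may arrive either zlib-wrapped or as a raw stream, so a client may have to try both. The sketch below uses only Python's standard gzip and zlib modules. The sample body and the helper decode_deflate are assumptions for illustration, not a description of how Googlebot decodes responses.

```python
import gzip
import zlib

# A small, highly repetitive HTML body (made up for this example).
body = b"<html><body>" + b"hello googlebot " * 200 + b"</body></html>"

# gzip: fixed header with a magic number, plus a CRC-32 trailer.
gz = gzip.compress(body)

# "deflate" as zlib-wrapped data: short header plus an Adler-32 checksum.
zlib_wrapped = zlib.compress(body)

# "deflate" as a raw stream: no header and no checksum at all.
compressor = zlib.compressobj(wbits=-15)
raw_deflate = compressor.compress(body) + compressor.flush()


def decode_deflate(data: bytes) -> bytes:
    """Decode a 'deflate' body without knowing which variant was sent."""
    try:
        return zlib.decompress(data)  # try the zlib-wrapped form first
    except zlib.error:
        return zlib.decompress(data, wbits=-15)  # fall back to raw deflate


print(gz[:2] == b"\x1f\x8b")           # gzip is recognizable from its first two bytes
print(len(body), len(gz))              # original size vs. compressed size
print(decode_deflate(zlib_wrapped) == body)
print(decode_deflate(raw_deflate) == body)
```

With gzip there is only one format to expect, which is why a crawler can decode it without a fallback path.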
Website: What do you have to say about overprotective robots.txt files?
Googlebot: Well, there are plenty of problem cases. Some are mere HTML error pages rather than valid robots.txt files. Others have infinite redirects to totally unrelated sites, and still others are just huge, with thousands of different URLs listed individually. The problem I face is that after I see a restrictive robots.txt, I may have to start throwing away content I've already crawled from the index. Then I have to recrawl a lot of content once I'm allowed to hit the site again. At least a 503 response code would have been temporary. Mostly I re-check robots.txt once a day. For webmasters, trying to control crawl rate by swapping robots.txt files usually backfires. It's better to set the rate to "slower" in Webmaster Tools.
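To see why swapping in a restrictive robots.txt is such a blunt tool, here is a small sketch using Python's standard urllib.robotparser. The two robots.txt bodies and the helper allowed are made up for this example, and Python's parser is only a stand-in: Googlebot's own robots.txt handling may differ in its details.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# An everyday robots.txt that blocks a single directory.
normal = """User-agent: *
Disallow: /private/
"""

# A restrictive robots.txt swapped in to slow crawlers down.
restrictive = """User-agent: *
Disallow: /
"""

page = "http://www.example.com/index.html"

print(allowed(normal, page))       # the page is crawlable
print(allowed(restrictive, page))  # every URL on the site is now off limits
```

Under the restrictive file every URL on the site is disallowed at once, which is the situation Googlebot describes: content already crawled may be dropped and then has to be recrawled later.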
Googlebot: Hey, Website, thanks for all of your questions, but it's time to crawl away. You've been wonderful, but I'm going to have to say "FIN, my love."
Website: Thank you, Googlebot, for everything! Keep crawling!
2 Responses to “Googlebot Gets Candid: Now All That You Would Like To Know!”
March 7th, 2008 at 11:35
great article
March 10th, 2008 at 03:34
Navneet – Nice explanation on Google Bot.