Google Tends to Crawl & Index HTML Forms!

April 14th, 2008




On Friday, April 11, 2008, the Google Webmasters Blog revealed that Google had been testing a new search-related technology that enables its crawl agents to explore some HTML forms in an attempt to discover web pages and URLs that have not yet been found and indexed.

Google's crawling and indexing team explained, "In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn't find and index for users who search on Google. Specifically, when we encounter a <form> element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page."
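To make the idea concrete, here is a minimal Python sketch of how candidate query URLs could be generated from a form. It is only an illustration under stated assumptions, not Google's actual method: the function name `candidate_urls`, the `limit` parameter, the example.com address, and the field names and values are all hypothetical.

```python
from itertools import islice, product
from urllib.parse import urlencode, urljoin


def candidate_urls(page_url, action, fields, limit=5):
    """Yield up to `limit` GET URLs, one per combination of field values.

    `fields` maps each input name to the values chosen for it
    (hypothetical structure, not taken from the Google post).
    """
    names = list(fields)
    combos = product(*(fields[name] for name in names))
    base = urljoin(page_url, action)  # resolve the form's action against the page
    for combo in islice(combos, limit):  # only a small number of queries
        yield base + "?" + urlencode(list(zip(names, combo)))


# Hypothetical form with one text box and one select menu.
fields = {
    "q": ["crawler", "index"],      # words picked from the site's own text
    "category": ["news", "blogs"],  # values listed in the select menu's options
}
urls = list(candidate_urls("http://example.com/search/", "results", fields, limit=3))
for url in urls:
    print(url)
```

Capping the number of combinations mirrors the "small number of queries" mentioned in the quote. A real crawler would still have to fetch each URL and judge whether the resulting page is valid and new.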

While searching for these unknown pages, Google's crawling agent, known as Googlebot, still adheres strictly to robots.txt, nofollow, and noindex directives. In the same fashion, Googlebot does not retrieve forms that require any sort of user information: forms that have a password input, or that use terms commonly associated with personal information such as logins, user IDs, and contacts, are avoided.
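The two checks described above could be sketched as follows in Python, using only the standard library. This is an assumption-laden illustration, not Googlebot's real logic: the `PERSONAL_TERMS` list, the helper names, and the sample forms are hypothetical, and nofollow and noindex handling is not covered.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical term list; the real one used by Google is not public.
PERSONAL_TERMS = ("login", "userid", "user_id", "password", "contact")


class FormInspector(HTMLParser):
    """Flags a form that has a password input or a personal-looking field name."""

    def __init__(self):
        super().__init__()
        self.sensitive = False

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        input_type = (attrs.get("type") or "text").lower()
        name = (attrs.get("name") or "").lower()
        if input_type == "password" or any(term in name for term in PERSONAL_TERMS):
            self.sensitive = True


def form_is_sensitive(form_html):
    """Return True if the form should be skipped by the crawler."""
    inspector = FormInspector()
    inspector.feed(form_html)
    return inspector.sensitive


def allowed_by_robots(robots_lines, url, agent="Googlebot"):
    """Check a URL against robots.txt rules supplied as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)
```

For example, `form_is_sensitive('<form><input type="password" name="pw"></form>')` returns `True`, so a crawler following this sketch would leave that form alone.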

Crawling for these not-yet-discovered pages does not affect websites that are already part of the crawling process, so it carries no risk of a drop in PageRank. Pages hidden this deep behind forms are also referred to as the Deep Web, Hidden Web, or Invisible Web.



 

