Yahoo! Research Papers On Web Spam And Information Retrieval

January 15th, 2007

Yahoo! Research has published two important papers, one on the problems of distributed information retrieval and one on using web topology to detect web spam. The paper 'Challenges in Distributed Information Retrieval' is by Flavio Junqueira, Ricardo Baeza-Yates, Fabrizio Silvestri, Vassilis Plachouras and Carlos Castillo. With the web growing rapidly and more than 20 billion pages already indexed, the centralized systems behind search engines will not be able to handle that much data, and fully distributed search engines will be required. In this paper the researchers bring together recent research results and discuss the challenges that a distributed Web retrieval system faces.

The other paper, 'Know your Neighbors: Web Spam Detection using the Web Topology', is by Vanessa Murdock, Carlos Castillo, Fabrizio Silvestri, Debora Donato and Aristides Gionis. In it, the authors “present a spam detection system that uses the topology of the Web graph by exploiting the link dependencies among the Web pages, and the content of the pages themselves. We find that linked hosts tend to belong to the same class: either both are spam or both are non-spam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph, and assigning the label of all hosts in the cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam that can be applied in practice to large-scale Web data.”
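
To make method (ii) more concrete, here is a minimal Python sketch of the general idea: take the spam score a base classifier gives each host and blend it with the average score of the hosts it links to or is linked from. This is a simplified illustration, not the authors' implementation; the function names, the `weight` and `rounds` parameters, and the example hosts are all assumptions made up for this post.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Build undirected neighbor sets from (source_host, target_host) link pairs."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    return neighbors


def smooth_predictions(scores, edges, weight=0.5, rounds=1):
    """Blend each host's spam score with the mean score of its linked hosts.

    scores: dict mapping host -> spam probability from a base classifier
    edges:  iterable of (source_host, target_host) links
    weight: share of the final score taken from the neighbors (assumed value)
    rounds: number of smoothing passes (assumed value)
    """
    neighbors = build_neighbors(edges)
    current = dict(scores)
    for _ in range(rounds):
        updated = {}
        for host, own_score in current.items():
            # Only neighbors that have a score of their own contribute.
            linked = [current[n] for n in neighbors.get(host, ()) if n in current]
            if linked:
                neighbor_avg = sum(linked) / len(linked)
                updated[host] = (1 - weight) * own_score + weight * neighbor_avg
            else:
                # Isolated hosts keep the base classifier's score.
                updated[host] = own_score
        current = updated
    return current


# Hypothetical hosts and scores, for illustration only.
base_scores = {"a.example": 0.9, "b.example": 0.8, "c.example": 0.4, "d.example": 0.1}
links = [("a.example", "b.example"), ("a.example", "c.example"), ("b.example", "c.example")]
smoothed = smooth_predictions(base_scores, links)
```

In this made-up example, c.example starts with a borderline score of 0.4 but links with two hosts that look like spam, so its smoothed score rises to roughly 0.625, while the unlinked d.example keeps its original score. That reflects the paper's observation that linked hosts tend to belong to the same class.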
