
Distributed, Fault-tolerant Web Crawling with RasPi

People

Abstract

The project aims to build a distributed web crawling infrastructure that fetches publication metadata from the APICe online repository, given a set of search keywords.

As far as non-functional properties are concerned, the infrastructure should be:

  • distributed and open, that is, any number of web crawlers may be deployed on any number of networked machines, possibly even at run-time
  • fault-tolerant to disconnections and crashes, that is, both should be (i) detected as soon as possible and (ii) properly handled (e.g., by re-assigning the crawling tasks of the disconnected or crashed crawler)
  • resource-efficient, that is, able to run smoothly on resource-constrained devices (e.g., RasPi systems)

Using either (i) the TuCSoN middleware to coordinate crawlers or (ii) the JADE framework to program crawlers is mandatory. Using both is considered a plus.
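
The fault-tolerance requirement (detect a lost crawler, then re-assign its tasks) can be illustrated with a minimal, framework-agnostic sketch in plain Java. It deliberately uses neither the TuCSoN nor the JADE API. The class name `CrawlerMonitor`, its methods, and the heartbeat-with-timeout scheme are all assumptions made for illustration, not a prescribed design. Timestamps are passed in explicitly so that the logic can be exercised without real clocks or networking.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical coordinator-side bookkeeping; not part of TuCSoN or JADE. */
final class CrawlerMonitor {
    private final long timeoutMillis;
    // Last heartbeat time per crawler id.
    private final Map<String, Long> lastSeen = new LinkedHashMap<>();
    // Keyword tasks currently assigned to each live crawler.
    private final Map<String, List<String>> assigned = new LinkedHashMap<>();
    // Tasks waiting because no crawler is currently alive.
    private final List<String> pending = new ArrayList<>();

    CrawlerMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Registers a crawler (possibly at run-time) or refreshes its liveness timestamp. */
    void heartbeat(String crawlerId, long nowMillis) {
        lastSeen.put(crawlerId, nowMillis);
        assigned.computeIfAbsent(crawlerId, id -> new ArrayList<>());
        // A crawler is available now, so hand out any tasks that were waiting.
        List<String> waiting = new ArrayList<>(pending);
        pending.clear();
        for (String keyword : waiting) {
            assign(keyword);
        }
    }

    /** Gives the task to the least-loaded live crawler; returns its id, or null if the task was queued. */
    String assign(String keyword) {
        String best = null;
        for (String id : assigned.keySet()) {
            if (best == null || assigned.get(id).size() < assigned.get(best).size()) {
                best = id;
            }
        }
        if (best == null) {
            pending.add(keyword);
            return null;
        }
        assigned.get(best).add(keyword);
        return best;
    }

    /** Drops crawlers whose heartbeat is older than the timeout and re-assigns their tasks. */
    List<String> sweep(long nowMillis) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                failed.add(entry.getKey());
            }
        }
        List<String> orphaned = new ArrayList<>();
        for (String id : failed) {
            lastSeen.remove(id);
            orphaned.addAll(assigned.remove(id));
        }
        for (String keyword : orphaned) {
            assign(keyword); // queued in `pending` if no crawler is left
        }
        return failed;
    }

    /** Returns a copy of the tasks currently assigned to the given crawler. */
    List<String> tasksOf(String crawlerId) {
        return new ArrayList<>(assigned.getOrDefault(crawlerId, new ArrayList<>()));
    }
}
```

In this sketch a crawler joins simply by sending its first heartbeat, which covers deployment at run-time. A coordinator would call `sweep` periodically, and a shorter timeout detects failures sooner at the cost of more heartbeat traffic, which matters on RasPi-class hardware. In an actual submission the same bookkeeping would presumably be expressed through TuCSoN coordination primitives or JADE agent messaging.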

Reference Material