Distributed, Fault-tolerant Web Crawling with RasPi
Authors
Abstract
The project aims at building a distributed web-crawling infrastructure whose functionality is to fetch publication metadata from the APICe online repository, given a set of search keywords.
As far as non-functional properties are concerned, the infrastructure should be:
- distributed and open, that is, any number of web crawlers may be deployed on any number of networked machines, possibly even at run-time
- fault-tolerant to disconnections and crashes, that is, both disconnections and crashes should be (i) detected as soon as possible and (ii) properly managed, e.g., by re-assigning the crawling tasks of the disconnected/crashed crawler (a possible policy is sketched after this list)
- resource-efficient, that is, the infrastructure should be able to execute smoothly on resource-constrained devices, e.g., Raspberry Pi (RasPi) systems
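A possible realisation of the re-assignment policy is a shared bag of tasks where each assignment carries a deadline: if a crawler does not report back in time, it is presumed disconnected or crashed and its task returns to the bag. The Java sketch below is purely illustrative and assumes a centralised tracker; the names it introduces (TaskTracker, Task, crawlerId) are not part of TuCSoN or JADE, and the same bookkeeping could equally be kept in a TuCSoN tuple centre or by a JADE coordinator agent.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Illustrative sketch (not TuCSoN/JADE API): tasks handed to a crawler are kept
// as "pending" with a deadline; if no result arrives in time, the crawler is
// presumed disconnected/crashed and the task goes back into the shared bag.
public class TaskTracker {

    public static final class Task {
        final String keyword;
        public Task(String keyword) { this.keyword = keyword; }
    }

    private static final class Pending {
        final Task task;
        final long deadlineMillis;
        Pending(Task task, long deadlineMillis) {
            this.task = task;
            this.deadlineMillis = deadlineMillis;
        }
    }

    private final Queue<Task> bag = new ArrayDeque<>();           // unassigned tasks
    private final Map<String, Pending> pending = new HashMap<>(); // crawlerId -> task in progress
    private final long timeoutMillis;

    public TaskTracker(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    public synchronized void submit(Task task) { bag.add(task); }

    // A crawler asks for work: move one task from the bag to the pending set.
    public synchronized Task assign(String crawlerId) {
        Task task = bag.poll();
        if (task != null) {
            pending.put(crawlerId, new Pending(task, System.currentTimeMillis() + timeoutMillis));
        }
        return task;
    }

    // The crawler reported its result in time: drop the pending entry.
    public synchronized void complete(String crawlerId) { pending.remove(crawlerId); }

    // Called periodically: expired assignments go back into the bag,
    // so another (or a re-deployed) crawler can pick them up.
    public synchronized void reassignExpired() {
        long now = System.currentTimeMillis();
        pending.entrySet().removeIf(entry -> {
            boolean expired = entry.getValue().deadlineMillis <= now;
            if (expired) {
                bag.add(entry.getValue().task);
            }
            return expired;
        });
    }
}
```

A coordinator would invoke reassignExpired() periodically (e.g., from a JADE TickerBehaviour), so that tasks held by unresponsive crawlers automatically become available to the remaining ones.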
Usage of either (i) the TuCSoN middleware for coordinating crawlers or (ii) the JADE framework for programming crawlers is mandatory. Usage of both is considered a plus.
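As an illustration of the JADE side, a crawler could be a minimal agent that waits for REQUEST messages carrying a keyword and replies with the fetched metadata. This is only a sketch under stated assumptions: fetchMetadata is a hypothetical placeholder for the actual query to the APICe repository, whose design is left to the project.

```java
import jade.core.Agent;
import jade.core.behaviours.CyclicBehaviour;
import jade.lang.acl.ACLMessage;
import jade.lang.acl.MessageTemplate;

// Minimal JADE crawler sketch: one cyclic behaviour serving keyword-based tasks.
public class CrawlerAgent extends Agent {

    @Override
    protected void setup() {
        System.out.println("Crawler " + getLocalName() + " ready");
        addBehaviour(new CyclicBehaviour(this) {
            @Override
            public void action() {
                // Wait for a crawling task: a REQUEST whose content is a keyword.
                MessageTemplate template = MessageTemplate.MatchPerformative(ACLMessage.REQUEST);
                ACLMessage msg = myAgent.receive(template);
                if (msg != null) {
                    String keyword = msg.getContent();
                    ACLMessage reply = msg.createReply();
                    reply.setPerformative(ACLMessage.INFORM);
                    reply.setContent(fetchMetadata(keyword));
                    myAgent.send(reply);
                } else {
                    block(); // no task yet: suspend until the next message arrives
                }
            }
        });
    }

    // Hypothetical placeholder: a real implementation would query the APICe
    // repository (e.g., over HTTP) and return the matching publication metadata.
    private String fetchMetadata(String keyword) {
        return "metadata for keyword: " + keyword;
    }
}
```

Such an agent can be started at run-time on any networked machine (e.g., a RasPi) joining an existing JADE platform, for instance with: java -cp jade.jar:. jade.Boot -container -host <main-container-host> crawler1:CrawlerAgent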
References
- Laboratory lessons on TuCSoN and JADE
- TuCSoN4JADE integration library: https://bitbucket.org/smariani/tucson4jade/wiki/Home
- JADE
  - home: http://jade.tilab.com
  - add-ons page: http://jade.tilab.com/download/add-ons/
  - 3rd party contributions page: http://jade.tilab.com/download/third-party-contributions/
Outcomes