Design
Sneakpeek has 6 core components:
* Scrapers storage - stores the list of scrapers and their metadata.
* Tasks queue - populated by the scheduler or the user and consumed by the queue consumers.
* Lease storage - stores the lease (a global lock) for the scheduler, to make sure there's only 1 active scheduler at all times.
* Scheduler - schedules periodic tasks using the scrapers in the storage.
* Consumer - consumes the tasks queue and executes task logic (e.g. scraper logic).
* API - provides a JsonRPC API for interacting with the system.
All of the components are run by the SneakpeekServer.
Scrapers Storage
The scraper storage interface is defined in sneakpeek.scraper.model.ScraperStorageABC.
Currently there are 2 storage implementations:
* InMemoryScraperStorage - in-memory storage. Should only be used in a development environment, or when the list of scrapers is static and won't change.
* RedisScraperStorage - redis storage.
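To illustrate the shape of such a storage, here is a minimal in-memory sketch. The `Scraper` fields and method names below are assumptions for illustration, not the actual `ScraperStorageABC` signature:

```python
import abc
from dataclasses import dataclass


@dataclass
class Scraper:
    # Hypothetical scraper metadata record
    id: int
    name: str
    schedule: str  # e.g. "every hour"


class ScraperStorage(abc.ABC):
    """Simplified stand-in for a scraper storage interface."""

    @abc.abstractmethod
    def create_scraper(self, scraper: Scraper) -> Scraper: ...

    @abc.abstractmethod
    def get_scrapers(self) -> list[Scraper]: ...

    @abc.abstractmethod
    def delete_scraper(self, id: int) -> None: ...


class InMemoryStorage(ScraperStorage):
    """Dict-backed implementation, suitable for development only:
    all scrapers are lost when the process exits."""

    def __init__(self) -> None:
        self._scrapers: dict[int, Scraper] = {}

    def create_scraper(self, scraper: Scraper) -> Scraper:
        self._scrapers[scraper.id] = scraper
        return scraper

    def get_scrapers(self) -> list[Scraper]:
        return list(self._scrapers.values())

    def delete_scraper(self, id: int) -> None:
        self._scrapers.pop(id, None)
```

A Redis-backed implementation would expose the same interface but persist each scraper under a Redis key, which is what makes the two interchangeable.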
Tasks queue
The tasks queue consists of three components:
* Storage - tasks storage
* Queue - queue implementation
* Consumer - queue consumer implementation
Currently there are 2 storage implementations:
* InMemoryQueueStorage - in-memory storage. Should only be used in a development environment.
* RedisQueueStorage - redis storage.
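The core behaviour the queue provides is FIFO enqueue/dequeue of jobs. A minimal sketch, assuming a hypothetical API (the real queue additionally tracks job status, priorities and heartbeats):

```python
from collections import deque
from typing import Any, Optional


class InMemoryQueue:
    """FIFO task queue sketch. Method names are illustrative,
    not sneakpeek's actual queue API."""

    def __init__(self) -> None:
        self._tasks: deque = deque()

    def enqueue(self, task: Any) -> None:
        # The scheduler (or a user, via the API) pushes jobs here
        self._tasks.append(task)

    def dequeue(self) -> Optional[Any]:
        # Consumers pull jobs off the front; None means "queue empty"
        return self._tasks.popleft() if self._tasks else None
```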
Lease storage
Lease storage is used by the scheduler to ensure that at any point in time there is no more than 1 active scheduler instance that can enqueue scraper jobs. This prevents concurrent execution of the same scraper.
The lease storage interface is defined in LeaseStorageABC.
Currently there are 2 storage implementations:
* InMemoryLeaseStorage - in-memory storage. Should only be used in a development environment.
* RedisLeaseStorage - redis storage.
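A lease is essentially a global lock with an expiry: at most one owner holds it at a time, and it lapses unless the owner keeps renewing it, so a crashed scheduler does not block its replacement forever. A minimal sketch with assumed names and semantics:

```python
import time
from typing import Optional


class InMemoryLease:
    """Global-lock sketch: one owner at a time, with a TTL.
    (Illustrative only; not sneakpeek's LeaseStorageABC API.)"""

    def __init__(self) -> None:
        self._owner: Optional[str] = None
        self._expires_at: float = 0.0

    def acquire(self, owner: str, ttl_seconds: float) -> bool:
        now = time.monotonic()
        # The lease can be taken if it is free, has expired,
        # or is already held by the same owner (renewal).
        if self._owner is None or now >= self._expires_at or self._owner == owner:
            self._owner = owner
            self._expires_at = now + ttl_seconds
            return True
        return False
```

With this in place, each scheduler instance periodically calls `acquire`; only the one that succeeds enqueues jobs, while the others stay on standby.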
Scheduler
The scheduler is responsible for:
* scheduling scrapers based on their configuration
* finding scraper jobs that haven't sent a heartbeat for a while and marking them as dead
* cleaning up old historical scraper jobs from the jobs queue
* exporting metrics on the number of pending jobs in the queue
As of now there is only one implementation, Scheduler, which uses APScheduler.
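The first responsibility boils down to deciding, on each tick, which scrapers are due to run. A self-contained sketch of that decision (a real scheduler built on APScheduler would instead register one job per scraper and let the library fire callbacks; names here are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class ScheduledScraper:
    name: str
    interval_seconds: float  # how often the scraper should run
    last_run_at: float       # timestamp of the last enqueued job


def find_due_scrapers(scrapers: list, now: float) -> list:
    """Return the names of scrapers whose interval has elapsed,
    i.e. the ones the scheduler should enqueue jobs for."""
    return [s.name for s in scrapers if now - s.last_run_at >= s.interval_seconds]
```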
Queue consumer
The consumer constantly tries to dequeue a job and executes any jobs it dequeues.
As of now there is only one implementation, Consumer.
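The consumer loop can be sketched as follows. This is an assumption-laden simplification: a real consumer runs indefinitely, sends heartbeats while a job executes, and records each job's status back into the tasks storage:

```python
from collections import deque
from typing import Any, Callable


def consume(queue: deque, handler: Callable[[Any], None], max_iterations: int) -> int:
    """Drain the queue, running the handler (e.g. scraper logic)
    for each dequeued job. Returns the number of jobs processed."""
    processed = 0
    for _ in range(max_iterations):
        if not queue:
            break  # nothing to do; a real consumer would sleep and retry
        job = queue.popleft()
        handler(job)
        processed += 1
    return processed
```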
API
Sneakpeek implements:
* A JsonRPC API to programmatically interact with the system (available at /api/v1/jsonrpc). It exposes the following methods:
  * CRUD methods to add, modify and delete scrapers
  * getting the list of a scraper's jobs
  * enqueueing scraper jobs
* A UI that allows you to interact with the system
* Swagger documentation (available at /api)
* A copy of this documentation (available at /docs)
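For reference, a JsonRPC call is a POST of a JSON-RPC 2.0 envelope to the endpoint above. The envelope fields follow the JSON-RPC 2.0 specification, but the method name used here is a guess for illustration; consult the Swagger docs at /api for the actual method names:

```python
import json

# Hypothetical request: list all registered scrapers
request = {
    "jsonrpc": "2.0",          # protocol version (fixed by the spec)
    "method": "get_scrapers",  # assumed method name, for illustration
    "params": {},
    "id": 1,                   # client-chosen id, echoed in the response
}
body = json.dumps(request)
# POST `body` to /api/v1/jsonrpc on a running SneakpeekServer
```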