Design

Sneakpeek has 6 core components:

  • Scrapers storage - stores the list of scrapers and their metadata.

  • Tasks queue - populated by the scheduler or by the user and consumed by the queue consumers.

  • Lease storage - stores the lease (global lock) for the scheduler, to make sure there is only one active scheduler at any time.

  • Scheduler - schedules periodic tasks for the scrapers in the storage.

  • Consumer - consumes the tasks queue and executes task logic (e.g. scraper logic).

  • API - provides a JsonRPC API for interacting with the system.

All of the components are run by the SneakpeekServer.
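
For orientation, here is a toy sketch of the "one server runs every component" idea: a single asyncio program running a scheduler loop and a consumer loop side by side. It is a conceptual illustration only and does not reflect the actual SneakpeekServer internals.

    # Toy illustration only: one process running a scheduler loop and a
    # consumer loop concurrently. This is not the SneakpeekServer code.
    import asyncio

    async def scheduler_loop(queue: asyncio.Queue) -> None:
        scraper_id = 0
        while True:
            await queue.put(scraper_id)    # periodically enqueue a scraper job
            scraper_id += 1
            await asyncio.sleep(5)         # stand-in for a real schedule

    async def consumer_loop(queue: asyncio.Queue) -> None:
        while True:
            scraper_id = await queue.get()
            print(f"running scraper {scraper_id}")  # stand-in for scraper logic

    async def main() -> None:
        queue: asyncio.Queue = asyncio.Queue()      # the tasks queue
        await asyncio.gather(scheduler_loop(queue), consumer_loop(queue))

    if __name__ == "__main__":
        asyncio.run(main())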

Scrapers Storage

The scraper storage interface is defined in sneakpeek.scraper.model.ScraperStorageABC. Currently there are two implementations (a conceptual sketch of the storage's role follows the list):

  • InMemoryScraperStorage - in-memory storage. Should be used either in a development environment or when the list of scrapers is static and won't change.

  • RedisScraperStorage - Redis storage.
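
To illustrate what this component is responsible for, here is a conceptual in-memory storage keyed by scraper ID. The class, the Scraper fields and the method names are hypothetical and are not taken from ScraperStorageABC.

    # Conceptual sketch of a scraper storage: keep scrapers and their
    # metadata keyed by ID. All names here are hypothetical and do not
    # mirror ScraperStorageABC.
    from dataclasses import dataclass, field

    @dataclass
    class Scraper:
        id: int
        name: str
        schedule: str                      # e.g. a crontab-like expression
        config: dict = field(default_factory=dict)

    class DictScraperStorage:
        def __init__(self) -> None:
            self._scrapers: dict[int, Scraper] = {}

        async def create_scraper(self, scraper: Scraper) -> Scraper:
            self._scrapers[scraper.id] = scraper
            return scraper

        async def get_scrapers(self) -> list[Scraper]:
            return list(self._scrapers.values())

        async def delete_scraper(self, id: int) -> None:
            self._scrapers.pop(id, None)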

Tasks queue

The tasks queue consists of three components:

  • Storage - tasks storage

  • Queue - queue implementation

  • Consumer - queue consumer implementation

Currently there are two storage implementations (a minimal sketch of the queue's enqueue/dequeue semantics follows the list):

  • InMemoryQueueStorage - in-memory storage. Should only be used in a development environment.

  • RedisQueueStorage - Redis storage.
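
The sketch below shows the enqueue/dequeue semantics the queue layer provides on top of such a storage. The Task fields and method names are hypothetical and are not taken from the actual queue interface.

    # Hypothetical sketch of the queue layer: tasks are enqueued by the
    # scheduler (or the user) and dequeued by consumers. Names do not
    # mirror the actual Sneakpeek queue interface.
    import asyncio
    from dataclasses import dataclass, field

    @dataclass
    class Task:
        id: int
        scraper_id: int
        payload: dict = field(default_factory=dict)

    class InMemoryQueue:
        def __init__(self) -> None:
            self._queue: asyncio.Queue[Task] = asyncio.Queue()

        async def enqueue(self, task: Task) -> None:
            await self._queue.put(task)

        async def dequeue(self) -> Task | None:
            # Returns None instead of blocking when the queue is empty.
            try:
                return self._queue.get_nowait()
            except asyncio.QueueEmpty:
                return None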

Lease storage

Lease storage is used by the scheduler to ensure that at any point in time there is no more than one active scheduler instance that can enqueue scraper jobs. This prevents concurrent execution of the scraper.

The lease storage interface is defined in LeaseStorageABC.

Currently there are two storage implementations (a generic Redis-based sketch of the lease idea follows the list):

  • InMemoryLeaseStorage - in-memory storage. Should only be used in a development environment.

  • RedisLeaseStorage - Redis storage.
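
A common way to implement such a lease on top of Redis is SET with the NX flag and a TTL, as sketched below. This illustrates the general idea only and is not necessarily how RedisLeaseStorage is implemented.

    # Generic lease (global lock) sketch built on Redis SET NX + TTL.
    # Not necessarily how RedisLeaseStorage works internally.
    from redis.asyncio import Redis

    async def try_acquire_lease(redis: Redis, name: str, owner_id: str, ttl_seconds: int) -> bool:
        # SET key value NX EX ttl succeeds only if the key does not exist,
        # so at most one owner holds the lease until it expires or is renewed.
        return bool(await redis.set(f"lease:{name}", owner_id, nx=True, ex=ttl_seconds))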

Scheduler

The scheduler is responsible for:

  • scheduling scrapers based on their configuration

  • finding scraper jobs that haven't sent a heartbeat for a while and marking them as dead

  • cleaning up old historical scraper jobs from the jobs queue

  • exporting metrics on the number of pending jobs in the queue

As of now there is only one implementation, Scheduler, which uses APScheduler.
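
The snippet below shows the general APScheduler pattern for periodically triggering a coroutine. The enqueue_scraper_job function is a placeholder; the real Scheduler derives its jobs from the scrapers kept in the scraper storage.

    # General APScheduler pattern for periodic scheduling. enqueue_scraper_job
    # is a placeholder; the real Scheduler builds its jobs from the scrapers
    # stored in the scraper storage.
    import asyncio
    from apscheduler.schedulers.asyncio import AsyncIOScheduler

    async def enqueue_scraper_job(scraper_id: int) -> None:
        print(f"enqueueing job for scraper {scraper_id}")

    async def main() -> None:
        scheduler = AsyncIOScheduler()
        # Trigger the placeholder every 60 seconds (interval trigger).
        scheduler.add_job(enqueue_scraper_job, "interval", seconds=60, args=[1])
        scheduler.start()
        await asyncio.Event().wait()  # keep the event loop alive

    if __name__ == "__main__":
        asyncio.run(main())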

Queue consumer

The consumer constantly tries to dequeue a job and executes dequeued jobs. As of now there is only one implementation, Consumer.
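
Conceptually, the consumer is a loop like the one below, which pairs with the hypothetical queue sketch above; the names do not mirror the actual Consumer implementation.

    # Conceptual consumer loop: keep dequeuing tasks and running their logic,
    # sleeping briefly when the queue is empty. Names are hypothetical.
    import asyncio

    async def consume_forever(queue, handler, idle_delay: float = 1.0) -> None:
        while True:
            task = await queue.dequeue()     # returns None when the queue is empty
            if task is None:
                await asyncio.sleep(idle_delay)
                continue
            try:
                await handler(task)          # execute the scraper logic
            except Exception as exc:
                print(f"task {task.id} failed: {exc}")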

API

Sneakpeek implements:

  • A JsonRPC API to programmatically interact with the system. It exposes the following methods (available at /api/v1/jsonrpc):

      • CRUD methods to add, modify and delete scrapers

      • Get the list of a scraper's jobs

      • Enqueue scraper jobs

  • A UI that allows you to interact with the system

  • Swagger documentation (available at /api)

  • A copy of this documentation (available at /docs)
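
A JsonRPC 2.0 call against the endpoint has the shape below. The host/port and the get_scrapers method name are assumptions for illustration; check the Swagger documentation at /api for the actual method names.

    # Shape of a JsonRPC 2.0 request to the API. The method name and the
    # host/port are assumptions; see the Swagger docs at /api for the real ones.
    import requests

    response = requests.post(
        "http://localhost:8080/api/v1/jsonrpc",
        json={
            "jsonrpc": "2.0",
            "id": 1,
            "method": "get_scrapers",   # hypothetical method name
            "params": {},
        },
    )
    print(response.json())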