Robots.txt

Robots.txt middleware logs and can optionally block requests that are disallowed by the website's robots.txt. If robots.txt is unavailable (e.g. the request returns a 5xx code), all requests are allowed.
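
To make the check concrete, here is a minimal sketch of the kind of decision the middleware makes, using only Python's standard urllib.robotparser (an illustration of the mechanism, not sneakpeek's actual implementation; the user agent string is arbitrary):

from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that disallows /private/ for every user agent
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# The middleware effectively asks this question for each outgoing request
print(parser.can_fetch("sneakpeek", "https://example.com/private/page"))  # False
print(parser.can_fetch("sneakpeek", "https://example.com/public/page"))   # True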

Configuration of the middleware is defined in RobotsTxtMiddlewareConfig.
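
As the strategy names suggest, THROW blocks a disallowed request with an error, while LOG only records the violation and lets the request proceed. A minimal sketch of both modes (assuming RobotsTxtViolationStrategy is exported by the same module, as the snippets below imply):

from sneakpeek.middleware.robots_txt_middleware import RobotsTxtMiddlewareConfig, RobotsTxtViolationStrategy

# THROW: a disallowed request is blocked with an error
strict_config = RobotsTxtMiddlewareConfig(
    violation_strategy=RobotsTxtViolationStrategy.THROW,
)

# LOG: the violation is only logged and the request goes through
lenient_config = RobotsTxtMiddlewareConfig(
    violation_strategy=RobotsTxtViolationStrategy.LOG,
)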

How to configure the middleware for the SneakpeekServer (it will be applied globally to all requests):

from sneakpeek.middleware.robots_txt_middleware import RobotsTxtMiddleware, RobotsTxtMiddlewareConfig, RobotsTxtViolationStrategy

server = SneakpeekServer.create(
    ...
    middleware=[
        RobotsTxtMiddleware(
            RobotsTxtMiddlewareConfig(
                violation_strategy=RobotsTxtViolationStrategy.THROW,
            )
        )
    ],
)

How to override middleware settings for a given scraper:

from sneakpeek.middleware.robots_txt_middleware import RobotsTxtMiddlewareConfig, RobotsTxtViolationStrategy

scraper = Scraper(
    ...
    config=ScraperConfig(
        ...
        middleware={
            "robots_txt": ProxyMiddlewareConfig(
                violation_strategy = RobotsTxtViolationStrategy.LOG,
            )
        }
    ),
)