Block the URLFrontier queue on Retry-After via the queue stream#1973
Block the URLFrontier queue on Retry-After via the queue stream#1973dpol1 wants to merge 7 commits into
Conversation
…867, #784) On an HTTP 429 or 503 response, FetcherBolt parses the Retry-After header (delta-seconds or IMF-fixdate), caps it via the new fetcher.max.retry.after setting (default 24h) and emits (key, blockUntil) on the new "hostinfo" stream. The key is derived with URLPartitioner from partition.url.mode so it matches the URLFrontier queue key, which may differ from the fetcher's internal queue mode. The new HostBlockBolt (external/urlfrontier) consumes the stream and calls blockQueueUntil, so the frontier stops handing out URLs for the host until the requested time. The fetch pipeline itself never delays or holds tuples.
|
Thanks @dpol1, 2 very quick comments for now
What do you think? |
|
Taking the second point first, since it changes the answer to the first one.
Agreed on the sequencing. I had a look at branch 990 and have one question about the stream schema: it emits just Would you be open to
If we go the route above, this PR stops touching FetcherBolt entirely, so the two things become independent. I agree the class needs breaking down, but that feels like its own issue rather than a blocker here. Happy to help with it separately. On branch 990: do you plan to open a PR for it yourself, or should I pick the core parts (the stream constant, the declaration in |
It totally makes sense to have key + metadata. The code in the 990 branch was a rather minimal starting point. Whatever bolt receives the stream can choose to use metadata or not, based on what it needs to do.
100%, breaking down the FetcherBolt is a separate task
I'll have a go at writing a PR for 990 today |
Moves the parsing logic out of FetcherBolt so that any bolt can reuse it; behaviour and test coverage unchanged.
Superseded by the generic queue stream (#1974): the Retry-After header already reaches the status updater in the status metadata, so the back-off signal no longer needs to be produced by the FetcherBolt. FetcherBolt, Constants and crawler-default.yaml go back to their main versions.
…tatus updater Mirrors the OpenSearch StatusUpdaterBolt behaviour introduced in #1974 so that queue/host-aware bolts can be wired in URLFrontier topologies too.
HostBlockBolt now consumes the (key, metadata) tuples of the generic queue stream: on a 429/503 whose metadata carries a valid Retry-After header it calls blockQueueUntil on the frontier for that queue key. The honoured delay is capped by urlfrontier.max.retry.after (default 24h, -1 to disable). Closes the loop described in #867 / #784 without touching the FetcherBolt.
|
Ready for review now - Fancy leave a comment on Stage 2 in #867 discussion? Just like Sebastian did. Even just a thumbs-up is fine :-)
|
|
thanks @dpol1, I like this very much. Instead of having a new bolt, just for handling block events, what about a generic QueueBolt, just like the one in the OpenSearch module but with a pluggable mechanism to handle the blocking logic within it? URLFrontier handles the queues itself and does not need to be told about them like OpenSearch, so we could do without the QueueBolt but we probably don't want the duplication of HostBlocks implementations for every single type of backend. Instead we would have a single implementation in core and let each QueueBolt convert it into the appropriate action. Make things a bit more complex now, but easier to handle later. |
Maybe I am making it more complicated than necessary. We can have some duplication initially, learn from it and make things more generic afterwards. |
yes, I'd wait too. |
Closes #784.
On an HTTP 429 or 503 response, the
Retry-Afterheader (delta-seconds or IMF-fixdate) already reaches the status stream: the protocol layer stores response headers in the status metadata, along withfetch.statusCode. The URLFrontierStatusUpdaterBoltnow forwards(key, metadata)on the genericqueuestream introduced in #1974, mirroring the OpenSearch updater. A newHostBlockBoltconsumes that stream, parses the header via the standaloneRetryAfterParser, caps the delay with the newurlfrontier.max.retry.aftersetting (default 24h,-1for no cap) and callsblockQueueUntil, so the frontier stops handing out URLs for that host until the requested time.FetcherBoltis untouched.This follows the conclusion of the discussion in #1944: the fetch pipeline itself never delays or holds tuples; enforcement lives with the queue owner. It implements the rate-limit case of #867 on top of the queue stream. The header parsing and the cap reuse the groundwork @rzo1 laid in #1944; what changes is where the signal is read and enforced.
Wiring:
A few design decisions worth calling out for review:
One scope question: should this solves #867 as well, or should the issue stay open for a future robots-delay signal on the same stream?
For all changes
For code changes