Block the URLFrontier queue on Retry-After via the queue stream by dpol1 · Pull Request #1973 · apache/stormcrawler

dpol1 · 2026-07-02T07:21:39Z

Closes #784.

On an HTTP 429 or 503 response, the Retry-After header (delta-seconds or IMF-fixdate) already reaches the status stream: the protocol layer stores response headers in the status metadata, along with fetch.statusCode. The URLFrontier StatusUpdaterBolt now forwards (key, metadata) on the generic queue stream introduced in #1974, mirroring the OpenSearch updater. A new HostBlockBolt consumes that stream, parses the header via the standalone RetryAfterParser, caps the delay with the new urlfrontier.max.retry.after setting (default 24h, -1 for no cap) and calls blockQueueUntil, so the frontier stops handing out URLs for that host until the requested time. FetcherBolt is untouched.

This follows the conclusion of the discussion in #1944: the fetch pipeline itself never delays or holds tuples; enforcement lives with the queue owner. It implements the rate-limit case of #867 on top of the queue stream. The header parsing and the cap reuse the groundwork @rzo1 laid in #1944; what changes is where the signal is read and enforced.

Wiring:

  - id: "hostblock"
    className: "org.apache.stormcrawler.urlfrontier.HostBlockBolt"
    parallelism: 1

  - from: "status"
    to: "hostblock"
    grouping:
      type: FIELDS
      args: ["key"]
      streamId: "queue"

A few design decisions worth calling out for review:

The queue key is the one the status updater computes with URLPartitioner from partition.url.mode, which is the same setting the frontier keys its queues by. A block on a differently derived key would be a silent no-op.
Only 429 and 503 are considered, where Retry-After has back-off semantics; Retry-After on a 3xx is deliberately ignored. A configurable list of status codes belongs to the adaptive follow-up.
The HTTP-date form accepts IMF-fixdate only, which is what servers produce in practice; the obsolete RFC 850/asctime forms are rejected.
The gRPC call is fire-and-forget: a missed block costs one extra fetch and the next 429 re-emits the signal.
Topologies that don't wire a consumer are unaffected: tuples emitted on a stream without subscribers are simply discarded.
URLs already sitting in the fetcher's internal queues for the blocked host are still fetched; the frontier block only stops new hand-outs. Discussed with @sebastian-nagel on Send host info on a specific stream #867: they fail on their own and reschedule via the status stream, since they never received a 429 themselves and no Status encodes a purge.

One scope question: should this solves #867 as well, or should the issue stay open for a future robots-delay signal on the same stream?

For all changes

Is there a issue associated with this PR? Is it referenced in the commit message?
Does your PR title start with #XXXX where XXXX is the issue number you are trying to resolve?
Has your PR been rebased against the latest commit within the target branch (typically main)?
Is your initial contribution a single, squashed commit? (reworked after review, history kept for the discussion)
Is the code properly formatted with mvn git-code-format:format-code -Dgcf.globPattern="**/*" -Dskip.format.code=false?

For code changes

Have you ensured that the full suite of tests is executed via mvn clean verify?
Have you written or updated unit tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0? (no new dependencies)
If applicable, have you updated the LICENSE file, including the main LICENSE file? (n/a)
If applicable, have you updated the NOTICE file, including the main NOTICE file? (n/a)

…867, #784) On an HTTP 429 or 503 response, FetcherBolt parses the Retry-After header (delta-seconds or IMF-fixdate), caps it via the new fetcher.max.retry.after setting (default 24h) and emits (key, blockUntil) on the new "hostinfo" stream. The key is derived with URLPartitioner from partition.url.mode so it matches the URLFrontier queue key, which may differ from the fetcher's internal queue mode. The new HostBlockBolt (external/urlfrontier) consumes the stream and calls blockQueueUntil, so the frontier stops handing out URLs for the host until the requested time. The fetch pipeline itself never delays or holds tuples.

jnioche · 2026-07-02T08:57:53Z

Thanks @dpol1, 2 very quick comments for now

This PR adds more content to FetcherBolt, which is already pretty massive as a class and I think we should break it down to make it more manageable. Should that be done first?
We should probably first add a basic mechanism for handling host info as part of the StatusUpdaterBolt (i.e. add hosts if they don't already exist). Branch 990 has code to that effect. We can then move on to more advanced uses of the queue / host stream like this one.

What do you think?

dpol1 · 2026-07-02T09:28:59Z

Taking the second point first, since it changes the answer to the first one.

We should probably first add a basic mechanism for handling host info as part of the StatusUpdaterBolt (i.e. add hosts if they don't already exist). Branch 990 has code to that effect. We can then move on to more advanced uses of the queue / host stream like this one.

Agreed on the sequencing. I had a look at branch 990 and have one question about the stream schema: it emits just (key). For a use case like this one the consumer needs a bit more context. With the okhttp protocol the response headers already end up in the status metadata (protocol.retry-after), and fetch.statusCode is there too. If the stream carried (key, metadata) instead, the whole Retry-After handling could move out of FetcherBolt: the URLFrontier-side bolt would read status code and header from the metadata and call BlockQueueUntil, with the parsing in a small standalone util class (the tests for it are already written). The stream being per-URL is not a problem here, since the consumer only acts on a 429/503 with a Retry-After present.

Would you be open to (key, metadata), or would you rather keep the basic stream minimal and add the extra info in a follow-up?

This PR adds more content to FetcherBolt, which is already pretty massive as a class and I think we should break it down to make it more manageable. Should that be done first?

If we go the route above, this PR stops touching FetcherBolt entirely, so the two things become independent. I agree the class needs breaking down, but that feels like its own issue rather than a blocker here. Happy to help with it separately.

On branch 990: do you plan to open a PR for it yourself, or should I pick the core parts (the stream constant, the declaration in AbstractStatusUpdaterBolt, the OpenSearch emit and QueueBolt) onto a fresh branch off main and open a PR from there? I'd then rework this one on top once the basic mechanism is in.

jnioche · 2026-07-02T10:27:55Z

Would you be open to (key, metadata), or would you rather keep the basic stream minimal and add the extra info in a follow-up?

It totally makes sense to have key + metadata. The code in the 990 branch was a rather minimal starting point. Whatever bolt receives the stream can choose to use metadata or not, based on what it needs to do.

This PR adds more content to FetcherBolt, which is already pretty massive as a class and I think we should break it down to make it more manageable. Should that be done first?

If we go the route above, this PR stops touching FetcherBolt entirely, so the two things become independent. I agree the class needs breaking down, but that feels like its own issue rather than a blocker here. Happy to help with it separately.

100%, breaking down the FetcherBolt is a separate task

On branch 990: do you plan to open a PR for it yourself, or should I pick the core parts (the stream constant, the declaration in AbstractStatusUpdaterBolt, the OpenSearch emit and QueueBolt) onto a fresh branch off main and open a PR from there? I'd then rework this one on top once the basic mechanism is in.

I'll have a go at writing a PR for 990 today

jnioche · 2026-07-02T12:45:04Z

@dpol1 see #1974

Moves the parsing logic out of FetcherBolt so that any bolt can reuse it; behaviour and test coverage unchanged.

Superseded by the generic queue stream (#1974): the Retry-After header already reaches the status updater in the status metadata, so the back-off signal no longer needs to be produced by the FetcherBolt. FetcherBolt, Constants and crawler-default.yaml go back to their main versions.

…tatus updater Mirrors the OpenSearch StatusUpdaterBolt behaviour introduced in #1974 so that queue/host-aware bolts can be wired in URLFrontier topologies too.

HostBlockBolt now consumes the (key, metadata) tuples of the generic queue stream: on a 429/503 whose metadata carries a valid Retry-After header it calls blockQueueUntil on the frontier for that queue key. The honoured delay is capped by urlfrontier.max.retry.after (default 24h, -1 to disable). Closes the loop described in #867 / #784 without touching the FetcherBolt.

dpol1 · 2026-07-03T12:45:50Z

Ready for review now - Fancy leave a comment on Stage 2 in #867 discussion? Just like Sebastian did. Even just a thumbs-up is fine :-)

One scope question: should this solves #867 as well, or should the issue stay open for a future robots-delay signal on the same stream?

~~and on this as well thx~~ Never mind, I don't think so

jnioche · 2026-07-03T13:56:00Z

thanks @dpol1, I like this very much.

Instead of having a new bolt, just for handling block events, what about a generic QueueBolt, just like the one in the OpenSearch module but with a pluggable mechanism to handle the blocking logic within it?

URLFrontier handles the queues itself and does not need to be told about them like OpenSearch, so we could do without the QueueBolt but we probably don't want the duplication of HostBlocks implementations for every single type of backend. Instead we would have a single implementation in core and let each QueueBolt convert it into the appropriate action.

Make things a bit more complex now, but easier to handle later.

jnioche · 2026-07-03T15:42:21Z

thanks @dpol1, I like this very much.

Instead of having a new bolt, just for handling block events, what about a generic QueueBolt, just like the one in the OpenSearch module but with a pluggable mechanism to handle the blocking logic within it?

URLFrontier handles the queues itself and does not need to be told about them like OpenSearch, so we could do without the QueueBolt but we probably don't want the duplication of HostBlocks implementations for every single type of backend. Instead we would have a single implementation in core and let each QueueBolt convert it into the appropriate action.

Make things a bit more complex now, but easier to handle later.

Maybe I am making it more complicated than necessary. We can have some duplication initially, learn from it and make things more generic afterwards.

dpol1 · 2026-07-03T17:19:54Z

Maybe I am making it more complicated than necessary. We can have some duplication initially, learn from it and make things more generic afterwards.

yes, I'd wait too. URLFrontier is the only backend with a real blocking action today (blockQueueUntil); for OpenSearch or Solr I wouldn't even know what the action should be yet, so a shared base class now would end up designed around a single case. The pieces that would move to core are already isolated anyway: the parsing is in RetryAfterParser, and the 429/503 decision in HostBlockBolt is a static function with no gRPC in it. Whenever a second backend shows up it's a mechanical lift.

dpol1 mentioned this pull request Jul 2, 2026

Send host info on a specific stream #867

Open

dpol1 added the help wanted label Jul 2, 2026

dpol1 requested review from jnioche, rzo1 and sebastian-nagel July 2, 2026 07:34

dpol1 marked this pull request as draft July 2, 2026 09:29

jnioche mentioned this pull request Jul 2, 2026

Queue stream #1974

Merged

dpol1 added 6 commits July 3, 2026 13:36

Merge branch 'main' into feat/867-host-info-backoff

3048a03

refactor: extract Retry-After parsing into RetryAfterParser util

7dd94f4

Moves the parsing logic out of FetcherBolt so that any bolt can reuse it; behaviour and test coverage unchanged.

feat: emit (key, metadata) on the queue stream from the URLFrontier s…

8288458

…tatus updater Mirrors the OpenSearch StatusUpdaterBolt behaviour introduced in #1974 so that queue/host-aware bolts can be wired in URLFrontier topologies too.

Reformat

2a350d1

dpol1 changed the title ~~Emit Retry-After host back-off and block the URLFrontier queue~~ Block the URLFrontier queue on Retry-After via the queue stream Jul 3, 2026

dpol1 marked this pull request as ready for review July 3, 2026 12:43

dpol1 added this to the 3.7.0 milestone Jul 3, 2026

dpol1 added the enhancement label Jul 3, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Block the URLFrontier queue on Retry-After via the queue stream#1973

Block the URLFrontier queue on Retry-After via the queue stream#1973
dpol1 wants to merge 7 commits into
mainfrom
feat/867-host-info-backoff

dpol1 commented Jul 2, 2026 •

edited

Loading

Uh oh!

jnioche commented Jul 2, 2026

Uh oh!

dpol1 commented Jul 2, 2026

Uh oh!

jnioche commented Jul 2, 2026

Uh oh!

jnioche commented Jul 2, 2026

Uh oh!

dpol1 commented Jul 3, 2026 •

edited

Loading

Uh oh!

jnioche commented Jul 3, 2026

Uh oh!

jnioche commented Jul 3, 2026

Uh oh!

dpol1 commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

dpol1 commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jnioche commented Jul 2, 2026

Uh oh!

dpol1 commented Jul 2, 2026

Uh oh!

jnioche commented Jul 2, 2026

Uh oh!

jnioche commented Jul 2, 2026

Uh oh!

dpol1 commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jnioche commented Jul 3, 2026

Uh oh!

jnioche commented Jul 3, 2026

Uh oh!

dpol1 commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dpol1 commented Jul 2, 2026 •

edited

Loading

dpol1 commented Jul 3, 2026 •

edited

Loading