Skip to content

Block the URLFrontier queue on Retry-After via the queue stream#1973

Open
dpol1 wants to merge 7 commits into
mainfrom
feat/867-host-info-backoff
Open

Block the URLFrontier queue on Retry-After via the queue stream#1973
dpol1 wants to merge 7 commits into
mainfrom
feat/867-host-info-backoff

Conversation

@dpol1

@dpol1 dpol1 commented Jul 2, 2026

Copy link
Copy Markdown
Member

Closes #784.

On an HTTP 429 or 503 response, the Retry-After header (delta-seconds or IMF-fixdate) already reaches the status stream: the protocol layer stores response headers in the status metadata, along with fetch.statusCode. The URLFrontier StatusUpdaterBolt now forwards (key, metadata) on the generic queue stream introduced in #1974, mirroring the OpenSearch updater. A new HostBlockBolt consumes that stream, parses the header via the standalone RetryAfterParser, caps the delay with the new urlfrontier.max.retry.after setting (default 24h, -1 for no cap) and calls blockQueueUntil, so the frontier stops handing out URLs for that host until the requested time. FetcherBolt is untouched.

This follows the conclusion of the discussion in #1944: the fetch pipeline itself never delays or holds tuples; enforcement lives with the queue owner. It implements the rate-limit case of #867 on top of the queue stream. The header parsing and the cap reuse the groundwork @rzo1 laid in #1944; what changes is where the signal is read and enforced.

Wiring:

  - id: "hostblock"
    className: "org.apache.stormcrawler.urlfrontier.HostBlockBolt"
    parallelism: 1

  - from: "status"
    to: "hostblock"
    grouping:
      type: FIELDS
      args: ["key"]
      streamId: "queue"

A few design decisions worth calling out for review:

  • The queue key is the one the status updater computes with URLPartitioner from partition.url.mode, which is the same setting the frontier keys its queues by. A block on a differently derived key would be a silent no-op.
  • Only 429 and 503 are considered, where Retry-After has back-off semantics; Retry-After on a 3xx is deliberately ignored. A configurable list of status codes belongs to the adaptive follow-up.
  • The HTTP-date form accepts IMF-fixdate only, which is what servers produce in practice; the obsolete RFC 850/asctime forms are rejected.
  • The gRPC call is fire-and-forget: a missed block costs one extra fetch and the next 429 re-emits the signal.
  • Topologies that don't wire a consumer are unaffected: tuples emitted on a stream without subscribers are simply discarded.
  • URLs already sitting in the fetcher's internal queues for the blocked host are still fetched; the frontier block only stops new hand-outs. Discussed with @sebastian-nagel on Send host info on a specific stream  #867: they fail on their own and reschedule via the status stream, since they never received a 429 themselves and no Status encodes a purge.

One scope question: should this solves #867 as well, or should the issue stay open for a future robots-delay signal on the same stream?

For all changes

  • Is there a issue associated with this PR? Is it referenced in the commit message?
  • Does your PR title start with #XXXX where XXXX is the issue number you are trying to resolve?
  • Has your PR been rebased against the latest commit within the target branch (typically main)?
  • Is your initial contribution a single, squashed commit? (reworked after review, history kept for the discussion)
  • Is the code properly formatted with mvn git-code-format:format-code -Dgcf.globPattern="**/*" -Dskip.format.code=false?

For code changes

  • Have you ensured that the full suite of tests is executed via mvn clean verify?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0? (no new dependencies)
  • If applicable, have you updated the LICENSE file, including the main LICENSE file? (n/a)
  • If applicable, have you updated the NOTICE file, including the main NOTICE file? (n/a)

…867, #784)

On an HTTP 429 or 503 response, FetcherBolt parses the Retry-After
header (delta-seconds or IMF-fixdate), caps it via the new
fetcher.max.retry.after setting (default 24h) and emits
(key, blockUntil) on the new "hostinfo" stream. The key is derived
with URLPartitioner from partition.url.mode so it matches the
URLFrontier queue key, which may differ from the fetcher's internal
queue mode.

The new HostBlockBolt (external/urlfrontier) consumes the stream and
calls blockQueueUntil, so the frontier stops handing out URLs for the
host until the requested time. The fetch pipeline itself never delays
or holds tuples.
@jnioche

jnioche commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Thanks @dpol1, 2 very quick comments for now

  • This PR adds more content to FetcherBolt, which is already pretty massive as a class and I think we should break it down to make it more manageable. Should that be done first?

  • We should probably first add a basic mechanism for handling host info as part of the StatusUpdaterBolt (i.e. add hosts if they don't already exist). Branch 990 has code to that effect. We can then move on to more advanced uses of the queue / host stream like this one.

What do you think?

@dpol1

dpol1 commented Jul 2, 2026

Copy link
Copy Markdown
Member Author

Taking the second point first, since it changes the answer to the first one.

We should probably first add a basic mechanism for handling host info as part of the StatusUpdaterBolt (i.e. add hosts if they don't already exist). Branch 990 has code to that effect. We can then move on to more advanced uses of the queue / host stream like this one.

Agreed on the sequencing. I had a look at branch 990 and have one question about the stream schema: it emits just (key). For a use case like this one the consumer needs a bit more context. With the okhttp protocol the response headers already end up in the status metadata (protocol.retry-after), and fetch.statusCode is there too. If the stream carried (key, metadata) instead, the whole Retry-After handling could move out of FetcherBolt: the URLFrontier-side bolt would read status code and header from the metadata and call BlockQueueUntil, with the parsing in a small standalone util class (the tests for it are already written). The stream being per-URL is not a problem here, since the consumer only acts on a 429/503 with a Retry-After present.

Would you be open to (key, metadata), or would you rather keep the basic stream minimal and add the extra info in a follow-up?

This PR adds more content to FetcherBolt, which is already pretty massive as a class and I think we should break it down to make it more manageable. Should that be done first?

If we go the route above, this PR stops touching FetcherBolt entirely, so the two things become independent. I agree the class needs breaking down, but that feels like its own issue rather than a blocker here. Happy to help with it separately.

On branch 990: do you plan to open a PR for it yourself, or should I pick the core parts (the stream constant, the declaration in AbstractStatusUpdaterBolt, the OpenSearch emit and QueueBolt) onto a fresh branch off main and open a PR from there? I'd then rework this one on top once the basic mechanism is in.

@dpol1 dpol1 marked this pull request as draft July 2, 2026 09:29
@jnioche

jnioche commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Would you be open to (key, metadata), or would you rather keep the basic stream minimal and add the extra info in a follow-up?

It totally makes sense to have key + metadata. The code in the 990 branch was a rather minimal starting point. Whatever bolt receives the stream can choose to use metadata or not, based on what it needs to do.

This PR adds more content to FetcherBolt, which is already pretty massive as a class and I think we should break it down to make it more manageable. Should that be done first?

If we go the route above, this PR stops touching FetcherBolt entirely, so the two things become independent. I agree the class needs breaking down, but that feels like its own issue rather than a blocker here. Happy to help with it separately.

100%, breaking down the FetcherBolt is a separate task

On branch 990: do you plan to open a PR for it yourself, or should I pick the core parts (the stream constant, the declaration in AbstractStatusUpdaterBolt, the OpenSearch emit and QueueBolt) onto a fresh branch off main and open a PR from there? I'd then rework this one on top once the basic mechanism is in.

I'll have a go at writing a PR for 990 today

@jnioche jnioche mentioned this pull request Jul 2, 2026
@jnioche

jnioche commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

@dpol1 see #1974

dpol1 added 6 commits July 3, 2026 13:36
Moves the parsing logic out of FetcherBolt so that any bolt can reuse it;
behaviour and test coverage unchanged.
Superseded by the generic queue stream (#1974): the Retry-After header
already reaches the status updater in the status metadata, so the
back-off signal no longer needs to be produced by the FetcherBolt.
FetcherBolt, Constants and crawler-default.yaml go back to their main
versions.
…tatus updater

Mirrors the OpenSearch StatusUpdaterBolt behaviour introduced in #1974 so
that queue/host-aware bolts can be wired in URLFrontier topologies too.
HostBlockBolt now consumes the (key, metadata) tuples of the generic
queue stream: on a 429/503 whose metadata carries a valid Retry-After
header it calls blockQueueUntil on the frontier for that queue key. The
honoured delay is capped by urlfrontier.max.retry.after (default 24h,
-1 to disable). Closes the loop described in #867 / #784 without
touching the FetcherBolt.
@dpol1 dpol1 changed the title Emit Retry-After host back-off and block the URLFrontier queue Block the URLFrontier queue on Retry-After via the queue stream Jul 3, 2026
@dpol1 dpol1 marked this pull request as ready for review July 3, 2026 12:43
@dpol1 dpol1 added this to the 3.7.0 milestone Jul 3, 2026
@dpol1

dpol1 commented Jul 3, 2026

Copy link
Copy Markdown
Member Author

Ready for review now - Fancy leave a comment on Stage 2 in #867 discussion? Just like Sebastian did. Even just a thumbs-up is fine :-)

One scope question: should this solves #867 as well, or should the issue stay open for a future robots-delay signal on the same stream?

and on this as well thx Never mind, I don't think so

@jnioche

jnioche commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

thanks @dpol1, I like this very much.

Instead of having a new bolt, just for handling block events, what about a generic QueueBolt, just like the one in the OpenSearch module but with a pluggable mechanism to handle the blocking logic within it?

URLFrontier handles the queues itself and does not need to be told about them like OpenSearch, so we could do without the QueueBolt but we probably don't want the duplication of HostBlocks implementations for every single type of backend. Instead we would have a single implementation in core and let each QueueBolt convert it into the appropriate action.

Make things a bit more complex now, but easier to handle later.

@jnioche

jnioche commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

thanks @dpol1, I like this very much.

Instead of having a new bolt, just for handling block events, what about a generic QueueBolt, just like the one in the OpenSearch module but with a pluggable mechanism to handle the blocking logic within it?

URLFrontier handles the queues itself and does not need to be told about them like OpenSearch, so we could do without the QueueBolt but we probably don't want the duplication of HostBlocks implementations for every single type of backend. Instead we would have a single implementation in core and let each QueueBolt convert it into the appropriate action.

Make things a bit more complex now, but easier to handle later.

Maybe I am making it more complicated than necessary. We can have some duplication initially, learn from it and make things more generic afterwards.

@dpol1

dpol1 commented Jul 3, 2026

Copy link
Copy Markdown
Member Author

Maybe I am making it more complicated than necessary. We can have some duplication initially, learn from it and make things more generic afterwards.

yes, I'd wait too. URLFrontier is the only backend with a real blocking action today (blockQueueUntil); for OpenSearch or Solr I wouldn't even know what the action should be yet, so a shared base class now would end up designed around a single case. The pieces that would move to core are already isolated anyway: the parsing is in RetryAfterParser, and the 429/503 decision in HostBlockBolt is a static function with no gRPC in it. Whenever a second backend shows up it's a mechanical lift.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

support retry-after in FetcherBolt

2 participants