Skip to content

Only splice when needed#14

Open
amackillop wants to merge 2 commits intoaustin_defer-usability-checkfrom
austin_only-splice-when-needed
Open

Only splice when needed#14
amackillop wants to merge 2 commits intoaustin_defer-usability-checkfrom
austin_only-splice-when-needed

Conversation

@amackillop
Copy link
Copy Markdown

Problem

Every time a peer reconnects with pending HTLCs, the LSP triggers a splice that turns out to be unnecessary.

The sequence: LSP intercepts HTLC for offline peer, sends webhook, peer reconnects. peer_connected fires before channel_reestablish completes, so all channels are is_usable=false. calculate_htlc_actions_for_peer builds its capacity map from usable channels (empty), finds zero capacity, falls through to the splice filter which matches on is_channel_ready (finds the mid-reestablish channel), and emits a splice. ~7 seconds later the splice completes. Meanwhile reestablish finished in ~100-500ms and the HTLC was forwarded through the existing channel that had plenty of capacity all along.

The wasted splice also sets a 30s liquidity cooldown, blocking all subsequent liquidity actions for that peer.

Fix

Two commits, split because the first is a prerequisite for the second.

1. Persist HTLCs before action calculation

htlc_intercepted only persisted HTLCs when a liquidity action was needed. This prevents calculate_htlc_actions_for_peer from ever returning "do nothing" (empty actions) because the HTLC would be lost — not in the store for the timer, not forwarded.

Move the insert before action calculation. The cost is two extra S3 round-trips (write + delete) per HTLC on the forward path, where previously there were none. This hits every payment. Unavoidable because we don't know the action result before calculating it.

Also replaces the unwrap() on insert with explicit error handling: best-effort for forwards, fail-back for liquidity actions.

2. Prevent spurious splice on reconnect

Two changes in calculate_htlc_actions_for_peer:

  • Early return when channels exist but none are usable. The capacity map is empty so any decision (splice, open, forward) would be wrong. The HTLC stays in the store and the 1Hz timer retries once reestablish completes.

  • Splice candidate filter changed from is_channel_ready to is_usable. A mid-reestablish channel may already have sufficient capacity that just isn't visible yet. Splicing adds unnecessary on-chain cost for capacity that was never actually insufficient.

Verification

  • cargo test -p lightning-liquidity --lib — 65/65 pass
  • After deploying: logs should show "deferring decision" instead of "Splicing into channel" on reconnect
  • Payment latency should drop from ~10s (3s startup + 7s splice) to ~4s (3s startup + ~1s timer tick)

htlc_intercepted only persisted HTLCs when a liquidity action
(splice/open) was needed. This prevents calculate_htlc_actions
from returning empty actions (i.e. "do nothing, let the timer
handle it") because the HTLC would never make it into the store
and the timer would never see it.

Move the insert before calculate_htlc_actions_for_peer so every
intercepted HTLC is in the store before we decide what to do
with it. execute_htlc_actions already removes the HTLC on
successful forward, so the store entry is short-lived on the
happy path.

The cost is two extra KV store round-trips (S3 write + delete)
per HTLC on the forward-only path, where previously there were
none. This hits every payment since LSPS4 intercepts all of
them. Unfortunately unavoidable: the insert must happen before
action calculation because we don't know yet whether the result
will be "forward now" or "do nothing and defer to the timer."
If we wait until after and the result is "do nothing," we've
already lost the HTLC.

Replace the unconditional unwrap on insert with explicit error
handling. On persistence failure:

- Forward-only: proceed. If the forward succeeds the store
  entry was redundant. If persistence failed and the channel
  becomes unusable between calculate and execute, the HTLC is
  orphaned (not in store for timer retry, not forwarded) until
  LDK's CLTV timeout cleans it up. Requires both S3 failure
  and a channel state change between two adjacent calls.

- Liquidity action needed: fail the HTLC back to sender. The
  splice/open is async and the timer needs the stored HTLC to
  forward once the channel is ready. Without it, we'd spend
  on-chain fees on a splice that never results in a forward.
When the LSP intercepts an HTLC for an offline peer, the peer
reconnects and peer_connected fires before channel_reestablish
completes. All channels report is_usable=false at this point.

calculate_htlc_actions_for_peer builds its capacity map from
is_usable channels (empty set), finds zero capacity, then falls
through to the splice candidate filter. That filter used
is_channel_ready, which matches mid-reestablish channels, so it
emits a splice. Once reestablish finishes,
the HTLC gets forwarded through the existing
channel. The splice was wasted, and the 30s liquidity cooldown
it sets blocks all subsequent liquidity actions for that peer.

Two fixes in calculate_htlc_actions_for_peer:

1. Early return when channels exist but none are usable. The
   capacity map is empty so any decision would be wrong. The
   HTLC stays in the store (persisted by the prior commit) and
   the 1Hz timer retries once channels become usable.

2. Splice candidate filter changed from is_channel_ready to
   is_usable. A mid-reestablish channel maye already have sufficient
   capacity but it's just not visible yet. Splicing into it adds
   unnecessary on-chain cost and a 30s cooldown for capacity
   that was never actually insufficient. splice_channel() would
   also fail if the channel does become usable in time.
@amackillop amackillop changed the title Austin only splice when needed Only splice when needed Apr 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant