Payments, Billing & Subscriptions for an LMS: Idempotent Webhooks, Entitlements & Reconciliation (Part 7)

Payments, Billing and Subscriptions, Part 7 of 10 — a consistency problem wearing a money hat.

For six parts Scholr has been a platform — most recently gaining search and recommendations in Part 6. In Part 7 it becomes a business — which means it has to take money, and taking money is the part founders bolt on last and regret first. Billing looks like a feature (“add a Stripe button”) and is actually a distributed-consistency problem wearing a money hat: two independent systems — your database and the payment processor — must agree, exactly once, about who paid for what. When they disagree, you either double-charge a paying customer or grant a course to someone who never paid, and both failures cost you trust you can’t easily win back. This part builds the payments layer the way the rest of the series builds everything: as a system with one hard property decided up front and proven.

Scholr’s billing crisis was almost a rite of passage. A learner upgraded to the annual plan; Stripe processed the charge and fired a payment.succeeded webhook; Scholr’s handler granted the entitlement and emailed a receipt; then the handler was slow to return its 200, so Stripe — doing exactly what it promises, delivering at least once — retried the webhook. The naive handler ran again: a second entitlement grant, a second receipt email, and in a worse version of the same bug a second charge. The customer got two invoices and a support ticket. Nobody had written a bug in the obvious sense; the handler worked perfectly, twice. The fix — idempotent webhook processing backed by reconciliation — is the spine of this entire part.

Pricing models: decide what you’re selling before you model it

Before any schema, a product decision: how does an LMS charge? The options are not interchangeable, and each implies a different billing model underneath.

| Model | Fits | Billing complexity |
| --- | --- | --- |
| One-time course purchase | Marketplaces, individual courses | Low: a payment grants a permanent entitlement |
| Learner subscription | “All-access” consumer plans | Medium: recurring, with renewal and churn |
| Per-seat / org plan | B2B, the tenants from Part 2 | High: seat counts, proration, admin assignment |
| Usage / credits | Cohorts, coaching, certifications | High: metering and balances |
| Bundles & free trials | Acquisition, upsell | Cross-cutting: modifies any of the above |

Scholr’s reference implementation models the two that cover most ground — a learner subscription and a one-time purchase — through a single Plan abstraction with a billing interval, because the data model should accommodate the pricing model rather than hard-code one. A subtle but important detail lives in that Plan: money is stored as integer minor units (cents), never a floating-point dollar amount. Float arithmetic on money rounds in ways that leave you a cent off and unable to reconcile, and “unable to reconcile” is the one thing a billing system must never be.
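To make the minor-units point concrete, here is a minimal sketch. The names `PlanSketch`, `BillingInterval`, and `MoneyDemo` are illustrative assumptions, not the repository's actual `Plan` class, and the display helper assumes a two-decimal currency.

```java
import java.math.BigDecimal;

// Hypothetical names for illustration; the repository's Plan class may differ.
enum BillingInterval { ONE_TIME, MONTHLY, ANNUAL }

record PlanSketch(String name, long priceMinorUnits, String currency, BillingInterval interval) {
    PlanSketch {
        if (priceMinorUnits < 0) throw new IllegalArgumentException("price must not be negative");
    }

    // Formatting is for display only; arithmetic stays in integer minor units.
    // Assumes a two-decimal currency such as USD.
    String displayPrice() {
        return BigDecimal.valueOf(priceMinorUnits, 2).toPlainString() + " " + currency;
    }
}

class MoneyDemo {
    // Ten 10-cent line items summed as doubles drift away from 1.00.
    static double sumAsDoubles() {
        double total = 0.0;
        for (int i = 0; i < 10; i++) total += 0.10;
        return total;
    }

    // The same sum in integer cents is exact.
    static long sumAsCents() {
        long total = 0;
        for (int i = 0; i < 10; i++) total += 10;
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumAsDoubles() == 1.00);   // false
        System.out.println(sumAsCents() == 100);      // true
        System.out.println(new PlanSketch("Annual", 9900, "USD", BillingInterval.ANNUAL).displayPrice()); // 99.00 USD
    }
}
```

The float sum is off by a fraction of a cent, which is exactly the kind of discrepancy that makes a ledger impossible to reconcile against the processor's integer amounts.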

The per-seat and org plans deserve a special note because they connect directly to Part 2’s multi-tenancy. In a B2B sale, the customer is the organization (a tenant), not the individual learner: the org buys N seats, an admin assigns them, and entitlements are granted per assigned learner while the invoice goes to the org. This is why entitlements being a separate, per-learner record matters even more in B2B — the billing relationship lives at the tenant level, but access is granted and revoked at the learner level as seats are assigned and reclaimed. The same separation that keeps a consumer access check fast is what makes seat management tractable.
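The seat mechanics can be sketched in a few lines. This is an in-memory illustration under assumed names (`SeatPoolSketch` and its methods are hypothetical); a real implementation would persist seats per tenant and enforce the limit in the database.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Hypothetical sketch: one org (tenant) owns N seats; access is tracked per learner.
class SeatPoolSketch {
    private final int purchasedSeats;
    private final Set<UUID> assignedLearners = new HashSet<>();

    SeatPoolSketch(int purchasedSeats) { this.purchasedSeats = purchasedSeats; }

    // Assigning a seat is what grants the learner's entitlement.
    boolean assign(UUID learnerId) {
        if (assignedLearners.contains(learnerId)) return true;        // already seated: idempotent
        if (assignedLearners.size() >= purchasedSeats) return false;  // no free seats left
        return assignedLearners.add(learnerId);
    }

    // Reclaiming a seat revokes that learner's access and frees the seat.
    boolean reclaim(UUID learnerId) { return assignedLearners.remove(learnerId); }

    boolean hasSeat(UUID learnerId) { return assignedLearners.contains(learnerId); }

    int freeSeats() { return purchasedSeats - assignedLearners.size(); }
}
```

The invoice never appears here: the org pays for the pool, and the per-learner check only asks whether a seat is currently assigned.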

One more upstream decision: build the billing engine or use a processor-managed one? Almost always, use the processor’s billing product (Stripe Billing, Paddle, Chargebee) for the recurring-billing machinery — subscriptions, invoicing, dunning, tax — and keep in your own system only what is genuinely yours: the mapping from a billing outcome to an entitlement.

| Concern | Build it yourself | Processor-managed (recommended) |
| --- | --- | --- |
| Recurring charges, proration, invoicing | Months of work; easy to get wrong | Done, battle-tested |
| Dunning & card retries | You build the retry/email engine | Built in, tuned on huge data |
| Tax (VAT/GST/sales) | A legal liability to hand-code | A managed tax product |
| Entitlement mapping | Yours: this is the part to build | Not their job; only you know your access model |

The billing flow for an LMS: a learner starts a hosted checkout on the payment processor so card data never touches the app; the processor charges the card and sends an at-least-once webhook; the idempotent webhook handler dedupes by the processor event id, drives the subscription state machine, and syncs the entitlement; a periodic reconciliation job diffs the database against the processor and repairs any drift; access checks read the entitlement on every request.

The billing data model: subscriptions, entitlements, and a state machine

The model has four tables, and the most important design decision is the separation between two of them. A subscription is the billing relationship — it has a lifecycle and is the source of truth about whether someone is paying. An entitlement is the access record — “this learner may access this thing” — and it is read on every single course-access request. These are deliberately separate:

// entitlement: the small, hot record the access check reads — independent of billing internals
public boolean hasAccess(UUID learnerId, String entitlementKey) {
    return entitlements.findByLearnerIdAndEntitlementKey(learnerId, entitlementKey)
        .map(Entitlement::isActive).orElse(false);   // one indexed row, cheap
}

Why split them? Because the access check must be fast and must not depend on the tangle of billing state, and because a single entitlement might be granted by different sources over its life — a subscription now, a one-time purchase or a manual comp later — without the access check caring which. Billing events flip the entitlement’s active flag; access control only ever reads it. The subscription drives; the entitlement grants.

The subscription itself is a state machine, and modeling it as one — with guarded transitions — is what keeps an out-of-order or nonsensical webhook from corrupting it. You cannot reactivate a canceled subscription with a stray event; the model refuses:

public enum SubscriptionStatus { TRIALING, ACTIVE, PAST_DUE, CANCELED }

public void activate() {                  // payment succeeded (initial or dunning recovery)
    if (status != CANCELED) status = ACTIVE;   // a CANCELED sub can't be silently revived
}
public void markPastDue() {               // a payment failed → grace/dunning window
    if (status == ACTIVE || status == TRIALING) status = PAST_DUE;
}
public void cancel() { status = CANCELED; }   // one-way, terminal

A @Version optimistic lock on the subscription means two concurrent webhook deliveries can’t both transition it — the same concurrency discipline as the seat invariant in Part 2 and exam submission in Part 4, now guarding money.
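For readers who have not used optimistic locking, the following plain-Java stand-in shows the check a version column performs. It is a simulation under assumed names (`VersionedStoreSketch`, `Row`), not the repository's JPA entity: a write succeeds only if nothing else has written since the row was read.

```java
import java.util.concurrent.atomic.AtomicReference;

// A plain-Java stand-in for what a JPA @Version column does; not the repository's code.
class VersionedStoreSketch {
    record Row(String status, long version) {}

    private final AtomicReference<Row> row = new AtomicReference<>(new Row("ACTIVE", 0));

    Row read() { return row.get(); }

    // Succeeds only if the stored row is still the exact one the caller read,
    // mirroring the effect of UPDATE ... WHERE version = ?.
    // The caller must pass the Row instance it obtained from read().
    boolean save(Row expected, String newStatus) {
        return row.compareAndSet(expected, new Row(newStatus, expected.version() + 1));
    }
}
```

If two webhook deliveries both read version 0, the first save bumps the version to 1 and the second save fails instead of silently overwriting it; in a JPA setup the losing transaction would typically surface an optimistic-lock exception and be retried or redelivered.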

The subscription state machine: TRIALING moves to ACTIVE on first payment; ACTIVE moves to PAST_DUE on a failed payment but retains access during the dunning grace window; PAST_DUE returns to ACTIVE if dunning recovers the payment or moves to CANCELED if dunning is exhausted; ACTIVE moves to CANCELED on customer cancel; CANCELED is terminal and revokes access. Guarded transitions prevent an out-of-order webhook from reviving a canceled subscription.

| From → To | Trigger | Access? |
| --- | --- | --- |
| TRIALING → ACTIVE | First payment succeeds | Granted throughout |
| ACTIVE → PAST_DUE | A renewal payment fails | Retained (grace window) |
| PAST_DUE → ACTIVE | Dunning recovers payment | Granted |
| PAST_DUE → CANCELED | Dunning exhausted | Revoked |
| ACTIVE → CANCELED | Customer cancels | Revoked (at period end) |
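The guards are easiest to appreciate by replaying an out-of-order delivery. The sketch below repeats the three transitions from the article inside a self-contained class (`StateMachineSketch` is a hypothetical wrapper name) so it can be run on its own.

```java
// Self-contained copy of the guarded transitions, wrapped for a standalone run.
class StateMachineSketch {
    enum Status { TRIALING, ACTIVE, PAST_DUE, CANCELED }

    private Status status = Status.TRIALING;

    Status status() { return status; }

    void activate()    { if (status != Status.CANCELED) status = Status.ACTIVE; }
    void markPastDue() { if (status == Status.ACTIVE || status == Status.TRIALING) status = Status.PAST_DUE; }
    void cancel()      { status = Status.CANCELED; }

    public static void main(String[] args) {
        StateMachineSketch sub = new StateMachineSketch();
        sub.activate();      // first payment
        sub.cancel();        // customer cancels
        sub.activate();      // a stale payment.succeeded arrives late
        System.out.println(sub.status());   // CANCELED
    }
}
```

The late event is absorbed rather than rejected with an error, which matters because the webhook handler should still record it as processed.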

Integrating a processor without coupling to it

Scholr uses Stripe, and you almost certainly should use a processor rather than build payments yourself — but the domain must not know it uses Stripe. Hard-wiring a payment SDK into your business logic is the same mistake as hard-wiring an LLM vendor or a specific broker: it couples your core to one company’s API surface and makes the processor un-swappable and untestable. Scholr puts a thin port between them, exactly as it did for the broker in Part 4 and the event publisher in Part 5:

public interface PaymentGateway {
    String createCheckoutSession(String planProviderRef, UUID learnerId);
    Optional<SubscriptionStatus> fetchStatus(String providerRef); // the authority reconciliation diffs
}

The domain depends on the intent (“start a checkout”, “what does the processor think this subscription’s status is?”), never on a Stripe class. A real Stripe adapter implements the port in production; an in-memory fake implements it in tests. Swapping processors, or testing the entire billing flow without a network call, changes only the adapter. And it keeps card data out of your system entirely — checkout happens on the processor’s hosted page, which is the foundation of PCI-scope minimization below.
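An in-memory fake for this port might look like the sketch below. It is an assumption about shape only: the repository's `FakePaymentGateway` may differ, `setStatus` is a hypothetical test hook, and the returned session string is an opaque placeholder because the article does not say what the real adapter returns.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

class GatewaySketch {
    enum Status { TRIALING, ACTIVE, PAST_DUE, CANCELED }

    // Same shape as the port above, repeated so the example compiles on its own.
    interface PaymentGateway {
        String createCheckoutSession(String planProviderRef, UUID learnerId);
        Optional<Status> fetchStatus(String providerRef);
    }

    // In-memory fake: no network calls, fully deterministic.
    static class InMemoryGateway implements PaymentGateway {
        private final Map<String, Status> statuses = new HashMap<>();
        private int sessions = 0;

        @Override
        public String createCheckoutSession(String planProviderRef, UUID learnerId) {
            return "fake-session-" + (++sessions);   // opaque placeholder value
        }

        @Override
        public Optional<Status> fetchStatus(String providerRef) {
            return Optional.ofNullable(statuses.get(providerRef));
        }

        // Test hook: pretend the processor now reports this status.
        void setStatus(String providerRef, Status status) { statuses.put(providerRef, status); }
    }
}
```

A test can then drive reconciliation by calling `setStatus` to simulate the processor disagreeing with the database, with no HTTP involved.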

The checkout flow has one trap that catches nearly everyone, and it is worth stating plainly because getting it wrong is a security hole, not just a bug: grant access on the webhook, never on the browser redirect. The tempting shortcut is to mark the learner as paid when their browser returns to your success URL after checkout. But that redirect is client-controlled — a user can navigate directly to the success URL without paying, the redirect can be lost if they close the tab, and it tells you nothing authoritative about whether the charge actually settled. The only trustworthy signal that money moved is the server-to-server webhook from the processor. So the success page says a reassuring “thanks, setting up your access,” and the entitlement is granted only when the payment.succeeded webhook arrives and is verified. Treat the redirect as a UX convenience and the webhook as the truth, and a whole category of “I got access without paying” and “I paid but have no access” bugs simply never happens.
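"Verified" here means checking that the webhook really came from the processor. The sketch below shows the general idea with a generic HMAC-SHA256 signature over the raw request body. It is not Stripe's exact header format: real processors usually sign a timestamp together with the payload and ship an SDK helper for verification, which is preferable to hand-rolled code. The class and method names are hypothetical.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

// Generic HMAC check for illustration; use the processor's SDK in production.
class WebhookSignatureSketch {
    // HMAC-SHA256 over the raw request body, hex-encoded.
    static String sign(String secret, String rawBody) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            return HexFormat.of().formatHex(mac.doFinal(rawBody.getBytes(StandardCharsets.UTF_8)));
        } catch (java.security.GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }

    // Constant-time comparison, so response timing does not leak how much of the signature matched.
    static boolean verify(String secret, String rawBody, String signatureHex) {
        byte[] expected = sign(secret, rawBody).getBytes(StandardCharsets.UTF_8);
        byte[] actual = signatureHex.getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(expected, actual);
    }
}
```

The signature must be computed over the raw bytes as received; re-serializing parsed JSON before verifying is a common way to make valid webhooks fail the check.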

Webhooks done right: the idempotency that saves the bill

Now the heart of the part, and the fix for Scholr’s double-charge. A payment processor communicates asynchronously through webhooks, and it guarantees at-least-once delivery: you will receive every event, but you may receive some more than once, whether as a retry after a slow response or a replay after a network hiccup. A handler that acts on every delivery double-grants, double-emails, and double-charges. The fix is the same one Part 5 used for its event consumer, and the same idempotency principle the production engineering playbook insists on for any retried operation: make the handler idempotent by checking each event’s id against a dedup table before acting on it, and recording that id in the same transaction as the effects. The key is the processor’s own immutable event id, so a duplicate is a cheap no-op:

@Transactional
public boolean handleWebhook(WebhookEvent event) {
    if (processedWebhooks.existsById(event.providerEventId())) {
        return false;                       // already handled — at-least-once in, exactly-once effect
    }
    Subscription sub = subscriptions.findByProviderRef(event.subscriptionRef()).orElse(null);
    if (sub != null) {
        switch (event.type()) {
            case "payment.succeeded"      -> sub.activate();
            case "payment.failed"         -> sub.markPastDue();
            case "subscription.canceled"  -> sub.cancel();
            default -> { /* an event we don't act on; still record it as processed */ }
        }
        subscriptions.save(sub);
        plans.findById(sub.planId()).ifPresent(plan -> syncEntitlement(sub, plan));
    }
    processedWebhooks.save(new ProcessedWebhook(event.providerEventId(), Instant.now(clock)));
    return true;
}

The dedup write and the state change commit in one transaction, so a crash mid-handle rolls both back and the processor’s redelivery re-applies the event cleanly. A retried payment.succeeded now grants the entitlement exactly once; Scholr’s double-charge is structurally impossible.
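Stripped of the persistence layer, the dedup idea fits in a dozen lines. The following in-memory sketch uses assumed names (`IdempotentHandlerSketch`, a `grants` counter standing in for entitlement grants and receipts) and a `synchronized` method in place of the database transaction; in a real schema, a primary key or unique constraint on the event id plays the role of the set.

```java
import java.util.HashSet;
import java.util.Set;

// In-memory illustration of webhook dedup; not the repository's BillingService.
class IdempotentHandlerSketch {
    private final Set<String> processedEventIds = new HashSet<>();
    private int grants = 0;   // stands in for entitlement grants and receipt emails

    // Returns true if the event was applied, false if it was a duplicate.
    synchronized boolean handle(String providerEventId, String type) {
        if (!processedEventIds.add(providerEventId)) {
            return false;     // id already recorded: no side effects
        }
        if ("payment.succeeded".equals(type)) {
            grants++;
        }
        return true;
    }

    synchronized int grants() { return grants; }
}
```

Delivering the same event id twice leaves `grants()` at 1, which is the "at-least-once in, exactly-once effect" property stated above.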

But idempotency alone is not enough, because webhooks can also be lost or arrive out of order — a processor’s delivery is reliable, not perfect, and a brief outage on your side can drop one entirely. So the real source of truth is not the webhook stream but reconciliation: a periodic job that diffs your database against the processor and repairs any divergence. The webhook path is the fast, optimistic update; reconciliation is the slow, authoritative backstop that heals whatever the webhooks missed.

@Transactional
public boolean reconcile(UUID subscriptionId) {
    Subscription sub = subscriptions.findById(subscriptionId).orElseThrow(...);
    SubscriptionStatus truth = gateway.fetchStatus(sub.providerRef()).orElse(null);
    if (truth == null || truth == sub.status()) return false;  // processor has no answer, or both sides already agree
    switch (truth) {                          // drive local state to the processor's authority
        case ACTIVE -> sub.activate();
        case PAST_DUE -> sub.markPastDue();
        case CANCELED -> sub.cancel();
        case TRIALING -> { }
    }
    subscriptions.save(sub);
    plans.findById(sub.planId()).ifPresent(plan -> syncEntitlement(sub, plan));
    return true;                              // a drift was found and repaired
}

| Webhook failure | Naive handler | Idempotent + reconciled |
| --- | --- | --- |
| Retried after slow response | Double grant / double receipt | No-op via the dedup record |
| Lost entirely | Entitlement never updated | Reconciliation repairs it |
| Arrives out of order | State corrupted | Guarded transitions + reconciliation |
| Processor and DB disagree | Undetected forever | Reconciliation diffs and heals |
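The periodic job is essentially a loop over every local subscription that applies this diff. The sketch below is a simplified in-memory version under assumed names (`ReconciliationSweepSketch`, `sweep`); unlike the article's `reconcile`, it overwrites the local status directly instead of routing through the guarded transitions, purely to keep the example short.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Simplified sweep for illustration; a real job would page through the database.
class ReconciliationSweepSketch {
    enum Status { TRIALING, ACTIVE, PAST_DUE, CANCELED }

    // Local view: providerRef -> status as our database believes it.
    final Map<String, Status> local = new LinkedHashMap<>();

    // Asks the processor about every local subscription and overwrites the local
    // value wherever the two differ. Returns how many rows had drifted.
    int sweep(Function<String, Optional<Status>> fetchStatus) {
        int repaired = 0;
        for (Map.Entry<String, Status> entry : local.entrySet()) {
            Optional<Status> truth = fetchStatus.apply(entry.getKey());
            if (truth.isPresent() && truth.get() != entry.getValue()) {
                entry.setValue(truth.get());
                repaired++;
            }
        }
        return repaired;
    }
}
```

A useful property to check is that a second sweep immediately after the first repairs nothing; a sweep that keeps finding drift on the same rows points to a handler bug rather than a lost webhook.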

Entitlements and access control: tying the gate to billing state

With subscriptions driving entitlements, course access becomes a single cheap check: hasAccess(learnerId, key). The nuance worth getting right is what happens on a failed payment. The naive instinct — revoke access the instant a renewal fails — is hostile and wrong: a learner whose card simply expired loses access mid-lesson while a perfectly recoverable problem resolves itself. So PAST_DUE retains access during a grace window while dunning retries the payment, and only an exhausted dunning sequence (which transitions the subscription to CANCELED) actually revokes. This is also an eventual-consistency window by nature: there is a short lag between a billing event at the processor and the entitlement flipping in your database, which is fine for access — a few seconds of stale “granted” harms no one — precisely because reconciliation guarantees it converges.
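The article calls `syncEntitlement` without showing it, so the following is only a guess at its core rule, written as a self-contained sketch with hypothetical names: the entitlement is active in every subscription state except CANCELED, which is what gives PAST_DUE its grace window.

```java
// Assumed mapping from subscription status to the entitlement's active flag.
class EntitlementSyncSketch {
    enum Status { TRIALING, ACTIVE, PAST_DUE, CANCELED }

    // Access survives a failed renewal (PAST_DUE) and ends only at CANCELED.
    static boolean shouldBeActive(Status status) {
        return switch (status) {
            case TRIALING, ACTIVE, PAST_DUE -> true;
            case CANCELED -> false;
        };
    }

    // The small record the access check reads.
    static final class Entitlement {
        private boolean active;

        boolean isActive() { return active; }

        // Billing writes the flag; access control only reads it.
        void sync(Status subscriptionStatus) { active = shouldBeActive(subscriptionStatus); }
    }
}
```

Keeping the rule in one function means that changing the grace policy later touches one place rather than every access check. This sketch also ignores the "cancel at period end" nuance, which would need the period end date in addition to the status.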

Lifecycle operations: proration, plan changes, and dunning

Subscriptions are not static, and the operations that change them are where billing gets fiddly. Upgrades and downgrades mid-cycle require proration — crediting the unused portion of the old plan and charging the prorated remainder of the new one — which is exactly the kind of money arithmetic you let the processor compute rather than reimplement (and get subtly wrong). Cancellations almost always mean “cancel at period end,” not “revoke now,” because the customer paid through the period. And the single highest-leverage piece of billing operations is dunning: the automated sequence of retries and reminder emails that recovers a failed payment. A meaningful fraction of all churn is involuntary — an expired or maxed-out card, not a customer who wanted to leave — and good dunning recovers a large share of it. Dunning is why PAST_DUE exists as a distinct, access-retaining state rather than an instant cancel: it is the platform giving a recoverable payment time to recover.

A good dunning sequence is itself a small system with real design choices. The retry schedule matters: retries spaced over days (e.g. day 1, 3, 5, 7) catch a card that gets topped up or a transient bank decline, where a burst of immediate retries just earns more hard declines and looks like fraud to the bank. The messaging matters: a clear “your card was declined, update it here” email recovers payments that no automatic retry ever will, because the fix is in the customer’s hands. And the grace-window length is a deliberate trade-off: too short and you revoke access from customers who would have paid, souring a recoverable relationship; too long and you give away service indefinitely to genuinely lapsed accounts. The right length is a business decision informed by your recovery-rate data, which is exactly why tracking the dunning recovery rate feeds back into setting it. None of this is exotic, and the processor’s billing product implements most of it for you, but you must understand it, because it directly governs how much revenue silently leaks out through involuntary churn.
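As a small illustration of the schedule, the sketch below turns the example spacing into concrete retry dates. It assumes the days are offsets counted from the first failed charge and that the grace window ends after the last retry; both are assumptions for illustration, and the processor's billing product normally owns this schedule.

```java
import java.time.LocalDate;
import java.util.List;

// Hypothetical helper; real schedules are configured in the processor's dashboard or API.
class DunningScheduleSketch {
    // Days after the first failure on which to retry (the example spacing from the text).
    static final List<Integer> RETRY_OFFSETS_DAYS = List.of(1, 3, 5, 7);

    static List<LocalDate> retryDates(LocalDate firstFailure) {
        return RETRY_OFFSETS_DAYS.stream()
            .map(days -> firstFailure.plusDays(days))
            .toList();
    }

    // The grace window is over once the final retry date has passed without recovery.
    static boolean graceExpired(LocalDate firstFailure, LocalDate today) {
        int lastOffset = RETRY_OFFSETS_DAYS.get(RETRY_OFFSETS_DAYS.size() - 1);
        return today.isAfter(firstFailure.plusDays(lastOffset));
    }
}
```

For a failure on 2024-03-01 this yields retries on March 2, 4, 6, and 8, with access retained through March 8.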

| Operation | The right behavior | The trap |
| --- | --- | --- |
| Upgrade / downgrade | Prorate; let the processor compute it | Hand-rolling proration math |
| Cancellation | Access until period end (already paid) | Revoking immediately |
| Failed payment | Dunning + grace window | Instant revoke → recoverable churn lost |
| Refund | Reverse entitlement deliberately | Forgetting to revoke access |

Two adversarial cases round out the lifecycle, and both end in a revoked entitlement. A refund is a deliberate reversal — the customer is given their money back, and the entitlement must be revoked as part of the same operation, because a refund that leaves access intact is just a free course. The trickier one is a chargeback (a dispute): the customer’s bank forcibly reverses the charge, often weeks later, and tells the processor, which fires a webhook. A chargeback is not just lost revenue — it carries a fee and, if your dispute rate climbs, jeopardizes your account with the processor — so it must be handled promptly and automatically: revoke the entitlement, flag the account, and (where the dispute is illegitimate) submit evidence through the processor. The lesson is that “revoke access” has several distinct triggers — voluntary cancel, dunning exhaustion, refund, chargeback — and modeling them as events that all converge on the same entitlement flip keeps the access model simple while the billing reasons stay distinct.
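One way to keep the triggers distinct while the access model stays simple is to record the reason next to the single flip. The sketch below is illustrative only; `RevocationSketch`, `Reason`, and the review flag are hypothetical names, not the repository's model.

```java
import java.util.ArrayList;
import java.util.List;

// Many reasons, one entitlement flip. Hypothetical names throughout.
class RevocationSketch {
    enum Reason { VOLUNTARY_CANCEL, DUNNING_EXHAUSTED, REFUND, CHARGEBACK }

    record Revocation(String learnerId, Reason reason) {}

    private boolean active = true;
    private boolean flaggedForReview = false;
    private final List<Revocation> audit = new ArrayList<>();

    // Every trigger flips the same flag; only the recorded reason differs.
    void revoke(String learnerId, Reason reason) {
        active = false;
        audit.add(new Revocation(learnerId, reason));
        if (reason == Reason.CHARGEBACK) {
            flaggedForReview = true;   // disputes additionally flag the account
        }
    }

    boolean isActive() { return active; }
    boolean isFlaggedForReview() { return flaggedForReview; }
    List<Revocation> audit() { return List.copyOf(audit); }
}
```

The access check never sees the reason; support and analytics do, which is where the distinction between a refund and a chargeback actually matters.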

Tax, invoicing, and minimizing PCI scope

Two compliance realities shape the payments layer. First, tax: VAT, GST, and US sales tax depend on where the buyer is, what they bought, and thresholds that change, and getting it wrong is a legal liability rather than a bug. This is squarely in “use a specialized service” territory — the processor’s tax product or a dedicated tax engine computes and files it; you should not be hand-coding VAT rules. Proper invoices (with the legally required fields, sequential numbering, and retention) follow from the same systems.

Second, and non-negotiable: minimize PCI scope by never touching card data. The moment raw card numbers flow through your servers, you inherit the full weight of PCI-DSS compliance. The entire industry’s answer is to never let that happen — use the processor’s hosted checkout or tokenized fields so the card data goes from the learner’s browser straight to the processor, and your system only ever sees a token and a status. Scholr’s PaymentGateway port reflects this: it starts a hosted checkout and reads back status, and there is no method anywhere that accepts a card number, because there is no place in the system that should ever hold one.

Revenue analytics: closing the loop with the event pipeline

The same billing events that drive entitlements are the raw material for the metrics that tell you whether the business works — MRR and ARR (monthly and annual recurring revenue), churn (and the crucial split between voluntary and involuntary), LTV, and trial-conversion rate. The right way to compute them is the one Part 5 already built: emit billing events to the event stream and project them into a revenue warehouse, so MRR is a recomputable view over an immutable event log rather than a fragile number maintained by hand. When the definition of “active revenue” inevitably gets refined, you change a query and replay — the same operational superpower the analytics pipeline gave learning metrics, now applied to money.
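To show what "MRR as a recomputable view" means, here is a deliberately small projection over an in-memory event list. The event shape (`BillingEvent`, `Kind`, a pre-normalized `monthlyCents`) is an assumption for illustration and much simpler than a real warehouse schema, which would also handle plan changes, trials, and currencies.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy projection: replay the log, sum the monthly value of whatever is still active.
class MrrProjectionSketch {
    enum Kind { STARTED, CANCELED }

    // monthlyCents is the plan price normalized to one month
    // (for example, an annual plan at 9900 cents contributes 825).
    record BillingEvent(String subscriptionRef, Kind kind, long monthlyCents) {}

    // Replays the log in order and returns MRR in cents.
    static long mrrCents(List<BillingEvent> log) {
        Map<String, Long> active = new HashMap<>();
        for (BillingEvent e : log) {
            switch (e.kind()) {
                case STARTED -> active.put(e.subscriptionRef(), e.monthlyCents());
                case CANCELED -> active.remove(e.subscriptionRef());
            }
        }
        return active.values().stream().mapToLong(Long::longValue).sum();
    }
}
```

Because the result is a pure function of the log, refining the definition of "active revenue" means editing `mrrCents` and replaying, not patching a stored number.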

The split that pays for itself to track is voluntary versus involuntary churn, because they demand opposite responses. Voluntary churn — a customer chose to leave — is a product and value problem you address with the experience. Involuntary churn — a payment failed for a recoverable reason — is an operations problem you address with better dunning, card-update prompts, and retry timing, and it is often a startlingly large fraction of total churn. A platform that lumps the two together optimizes the wrong lever; one that separates them can recover real revenue from involuntary churn with engineering rather than discounts. This is why the PAST_DUE state and the dunning grace window aren’t just billing hygiene — they are directly a revenue-retention feature, and measuring their recovery rate tells you how well that feature is working.

A final discipline ties revenue analytics back to correctness: your reported MRR should be reconcilable against the payment processor’s own revenue reports. If your warehouse says one number and the processor says another, one of them is wrong, and in billing “we’re not sure which revenue number is right” is an emergency. The same reconciliation mindset that keeps entitlements honest — diff your view against the processor’s authority, repair the difference — applies to the money totals, not just the per-subscription state.

The war story, resolved — and what we’d do differently

Scholr’s double-charge was a textbook at-least-once webhook handled as if it were exactly-once. The customer’s upgrade fired a payment.succeeded, the handler was slow to ack, Stripe retried, and a non-idempotent handler ran the side effects twice. The fix was idempotent processing keyed on the processor’s event id — a duplicate is now a no-op — plus reconciliation as the real source of truth, so even a lost or out-of-order webhook can’t leave the database and the processor in lasting disagreement. After it shipped, retried webhooks became invisible and the “the system thinks I didn’t pay” tickets stopped.

What would we do differently? We would have made the webhook handler idempotent from the very first event, because at-least-once is not an edge case — it is the documented, normal behavior of every payment processor. We would have built reconciliation alongside the webhooks rather than treating it as a someday-cleanup, because webhooks will be lost and the only question is whether you find out from a reconciliation job or from an angry customer. And we would have separated entitlements from subscriptions from the start, because retrofitting that split once access checks are scattered across the codebase is painful. The thread, one final time for this layer: payments are a consistency problem, so decide the consistency property up front, encode it once, and prove it.

Get the code and run it

Everything above is in the companion repository, evolving the same codebase the series has built since Part 1. Each part has its own branch frozen at that lesson’s checkpoint, and main always holds the latest cumulative code.

# this part's exact code:
git clone https://github.com/muasif80/tutorial-lms-platform.git
cd tutorial-lms-platform
git checkout part-7

# the latest cumulative build is always on main:
git checkout main

Verify it the way the build does — the idempotent webhook handler, the subscription state machine, entitlement grant/revoke, reconciliation repairing drift, and tenant isolation all run under one command:

mvn verify   # green = idempotent webhooks + state machine + reconciliation all hold

Where each idea in this article lives in the code:

  • Idempotent webhook processing: billing/BillingService.handleWebhook + billing/domain/ProcessedWebhook.java.
  • The subscription state machine: billing/domain/Subscription.java + SubscriptionStatus.java (with @Version).
  • Entitlements, separate from subscriptions: billing/domain/Entitlement.java + the hasAccess check.
  • The processor-agnostic port: billing/PaymentGateway.java (Stripe plugs in; FakePaymentGateway is the test default).
  • Reconciliation as source of truth: billing/BillingService.reconcile.
  • Tenant isolation + RLS for billing tables: db/migration/V5__billing.sql.
  • The proof: BillingTest.java asserts webhook idempotency, the state machine, grace-window access, reconciliation, and tenant isolation.

Frequently asked questions

How do I stop Stripe webhook retries from double-charging or double-granting?

Make the webhook handler idempotent. Payment processors deliver at least once, so the same event can arrive more than once. Record each event’s processor-assigned id in a dedup table before acting on it, and skip any event id you’ve already processed — the dedup write and the side effect commit in one transaction. A retried webhook then becomes a cheap no-op instead of a second grant, receipt, or charge. Back it with reconciliation for events that are lost rather than duplicated.

Should I store entitlements separately from subscriptions?

Yes. The entitlement is the access record read on every protected request, so it should be a small, fast, single-row lookup that doesn’t depend on billing internals; the subscription is the source of truth that drives it. Separating them keeps access checks cheap, lets a single entitlement be granted by different sources over time (a subscription, a one-time purchase, a manual comp), and means billing events simply flip the entitlement’s active flag while access control only ever reads it.

How do I handle proration and plan changes?

Let the payment processor compute proration rather than reimplementing the money math yourself — on an upgrade or downgrade it credits the unused portion of the old plan and charges the prorated remainder of the new one. Treat cancellations as “cancel at period end” so the customer keeps the access they paid for, and model a failed payment as a distinct past-due state that retains access during a dunning grace window rather than revoking immediately, since much failed-payment churn is an expired card, not a departing customer.

How do I keep PCI scope minimal?

Never let raw card data touch your servers. Use the processor’s hosted checkout or tokenized payment fields so card numbers go straight from the learner’s browser to the processor, and your system only ever sees a token and a status. If you never store, process, or transmit card data, you stay in the smallest PCI-DSS scope. A clean sign of this done right: there is no method anywhere in your code that accepts a card number.

Conclusion

Billing is where an LMS becomes a business, and where a careless design quietly costs you money and trust. We chose a pricing-flexible data model with integer money; separated the subscription state machine (the source of truth) from the entitlement (the access record); kept the payment processor behind a port so the domain never couples to a vendor; made webhook processing idempotent so retries can’t double-charge; added reconciliation as the real source of truth for lost or out-of-order events; retained access through a dunning grace window; and minimized PCI scope by never touching card data. Scholr’s double-charge — and the whole class of “the system disagrees with the processor about who paid” bugs — is now designed out.

The full, tested implementation — the idempotent webhook handler, the subscription state machine, and reconciliation, all verified by a build that proves them — is on the part-7 branch of the companion repository. ⭐ Star it to follow the build. Next, in Part 8, we make Scholr a good citizen of the wider learning ecosystem: interoperability with LTI, SCORM, and xAPI — the standards that let your LMS plug into the tools institutions already use.
