Adapter Layer

An adapter layer is a small, trainable module inserted inside a frozen transformer so the model can be adapted to a new task by training a tiny fraction of its parameters. Introduced for NLP by Houlsby and colleagues in 2019, adapters were among the first parameter-efficient fine-tuning techniques and remain a reference point against which newer methods are measured.

The bottleneck architecture

Each adapter is a two-layer feed-forward network with a bottleneck. It first projects the layer's hidden representation down to a much smaller dimension, applies a non-linearity, then projects it back up to the original size, with a residual (skip) connection around the block so the frozen model's output passes through unchanged when the adapter contributes nothing. For a hidden size d and a bottleneck size m, one adapter adds roughly 2md weights plus d + m biases, so with m much smaller than d the parameter count is tiny relative to the full network. In the original work, adapters were placed twice per transformer block, once after the multi-head attention sub-layer and once after the feed-forward sub-layer, and were trained together with the layer-normalization parameters and the task head, so a large model could be specialized by learning only these small inserts.
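The down-project, non-linearity, up-project, skip pattern described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the original implementation: the names init_adapter and adapter_forward, the ReLU non-linearity, the sizes 768 and 64, and the choice to start the up-projection at zero (one simple way to make the adapter begin as an identity function) are all assumptions made for the example.

```python
import numpy as np

def init_adapter(d_model, bottleneck, rng):
    """Create the trainable adapter parameters (hypothetical helper).

    W_up starts at zero so the adapter initially returns its input
    unchanged; this is an assumed initialization for the sketch.
    """
    return {
        "W_down": rng.normal(0.0, 0.01, size=(d_model, bottleneck)),
        "b_down": np.zeros(bottleneck),
        "W_up": np.zeros((bottleneck, d_model)),
        "b_up": np.zeros(d_model),
    }

def adapter_forward(h, p):
    """Bottleneck adapter: down-project, ReLU, up-project, add skip."""
    z = np.maximum(h @ p["W_down"] + p["b_down"], 0.0)  # narrow activation
    return h + z @ p["W_up"] + p["b_up"]                # residual connection

rng = np.random.default_rng(0)
params = init_adapter(d_model=768, bottleneck=64, rng=rng)
h = rng.normal(size=(4, 768))          # four token vectors from a frozen layer
out = adapter_forward(h, params)       # same shape as h
```

Because the up-projection is zero at the start, out equals h until training moves the adapter away from the identity, which is what keeps the frozen model's behaviour intact at step zero.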

Why it matters

On the GLUE benchmark, the adapters of Houlsby et al. came within 0.4 percent of full fine-tuning performance while adding only about 3.6 percent extra parameters per task, and the frozen base weights can be shared across many tasks, each with its own small adapter swapped in as needed. For a self-hoster, this is the difference between needing a full copy of a large model per task and keeping one base model plus a folder of lightweight adapters. It also makes fine-tuning feasible on more modest hardware, since gradients and optimizer state only have to be kept for the small set of parameters that actually change, which cuts a large share of the memory a training run demands.
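To get a feel for the scale, the arithmetic can be worked through directly. The figures below are illustrative assumptions, not numbers from the original paper: a base model of roughly 110 million parameters with 12 layers and hidden size 768, two adapters per layer, and a bottleneck of 64. The helper names are hypothetical.

```python
def adapter_params(d_model, bottleneck):
    """Weights and biases of one bottleneck adapter: 2*d*m + d + m."""
    return 2 * d_model * bottleneck + d_model + bottleneck

def added_fraction(base_params, n_layers, d_model, bottleneck, per_layer=2):
    """Total adapter parameters and their size relative to the frozen base."""
    added = n_layers * per_layer * adapter_params(d_model, bottleneck)
    return added, added / base_params

# Assumed, illustrative model shape (not taken from the source).
added, frac = added_fraction(110_000_000, n_layers=12, d_model=768, bottleneck=64)
print(f"{added:,} added parameters, {frac:.2%} of the base")
# prints: 2,379,264 added parameters, 2.16% of the base
```

Shrinking or widening the bottleneck moves this fraction almost linearly, which is the main knob for trading adapter capacity against size.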

Serial adapters and their descendants

The original adapters are sometimes called serial adapters because they sit in the main data path and every token must pass through them. That framing helps explain the family tree that followed: parallel adapters run alongside the frozen layer instead of in series, prefix and prompt tuning prepend trainable vectors rather than inserting modules, and low-rank methods express the update as a product of two small matrices that can be merged into the frozen weights after training. Each variation trades a different slice of flexibility, memory, and inference cost, but all share the founding idea of freezing the base and training a sliver.
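The serial versus parallel distinction comes down to which tensor the trainable branch reads. The sketch below is a simplified illustration under assumed names: frozen_sublayer is a toy stand-in for a frozen attention or feed-forward sub-layer, and bottleneck_delta is the trainable down-and-up projection without its skip connection.

```python
import numpy as np

def frozen_sublayer(x, W):
    # Toy stand-in for a frozen attention or feed-forward sub-layer.
    return np.tanh(x @ W)

def bottleneck_delta(x, W_down, W_up):
    # Trainable branch: down-project, ReLU, up-project (no residual here).
    return np.maximum(x @ W_down, 0.0) @ W_up

def serial_adapter(x, W, W_down, W_up):
    # Serial: the adapter reads the sub-layer's OUTPUT, so it runs after it.
    h = frozen_sublayer(x, W)
    return h + bottleneck_delta(h, W_down, W_up)

def parallel_adapter(x, W, W_down, W_up):
    # Parallel: the adapter reads the sub-layer's INPUT, alongside it.
    return frozen_sublayer(x, W) + bottleneck_delta(x, W_down, W_up)

rng = np.random.default_rng(0)
d, m = 32, 4                                   # assumed toy sizes
W = rng.normal(size=(d, d)) / np.sqrt(d)       # frozen weight
W_down = rng.normal(size=(d, m))
W_up = rng.normal(size=(m, d))
x = rng.normal(size=(5, d))

serial_out = serial_adapter(x, W, W_down, W_up)
parallel_out = parallel_adapter(x, W, W_down, W_up)
```

In the parallel form the two branches do not depend on each other, so they can be computed side by side, whereas the serial form has to wait for the frozen sub-layer to finish.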

The trade-off worth naming

The main drawback compared with merge-friendly methods is that adapter modules stay in the forward pass, adding a little inference latency because the extra layers must run on every token. Later approaches such as LoRA were designed specifically to avoid that cost by expressing the update as a low-rank change that can be merged back into the base weights, leaving inference speed untouched. Adapters remain valuable where you want tasks kept strictly separate and hot-swappable rather than merged, which is exactly the situation for someone running several specialized behaviours off one base model at home.
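The reason a low-rank update can be merged while a bottleneck adapter cannot is that the low-rank path is purely linear. The following sketch checks this numerically; the shapes and variable names are assumptions for illustration, and the scaling factor that LoRA implementations usually apply to the update is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2                       # assumed toy sizes: hidden width and rank
W = rng.normal(size=(d, d))        # frozen base weight
A = rng.normal(size=(r, d))        # trained low-rank factor (down)
B = rng.normal(size=(d, r))        # trained low-rank factor (up)
x = rng.normal(size=(3, d))        # three input vectors

# Unmerged: base path plus a separate low-rank path on every forward pass.
y_unmerged = x @ W.T + (x @ A.T) @ B.T

# Merged: fold the update into the weight once, then run a single matmul.
W_merged = W + B @ A
y_merged = x @ W_merged.T

print(np.allclose(y_unmerged, y_merged))  # prints: True
```

A bottleneck adapter has a non-linearity between its two projections, so no single matrix reproduces it, and the extra layers have to stay in the forward pass.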

Adapters also carry a governance advantage that appeals to sovereign builders. Because each adapter is a small, self-contained file that rides on top of unchanged base weights, it can be shared, versioned, and audited independently of the multi-gigabyte model it modifies. You can distribute a specialized behaviour as a few megabytes, inspect exactly what was trained, and roll it back cleanly by simply not loading it, all without touching or redistributing the base model. That modularity fits a world of open-weight foundation models perfectly: one carefully chosen base can be kept stable while a library of small, purpose-built adapters is developed, swapped, and improved around it. For someone running AI on their own hardware, that is a practical way to keep a fleet of specialized behaviours manageable without hoarding a full model copy for every task.
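As a rough sketch of what treating an adapter as a small, auditable file can look like, the example below saves a set of adapter weights to disk, records a checksum, and loads them back. Everything here is hypothetical: the file name summarize-v1.npz, the helper names, the NumPy .npz container, and the tensor shapes are stand-ins chosen for the illustration, not a format any particular toolkit uses.

```python
import hashlib
import tempfile
from pathlib import Path

import numpy as np

def save_adapter(path, params):
    # Store every adapter tensor under its own name in one .npz file.
    np.savez(path, **params)

def load_adapter(path):
    # Read the tensors back into a plain dict.
    with np.load(path) as data:
        return {name: data[name] for name in data.files}

def file_digest(path):
    # SHA-256 of the file, usable as a version or audit identifier.
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

rng = np.random.default_rng(0)
adapter = {  # assumed shapes: hidden size 768, bottleneck 64
    "W_down": rng.normal(size=(768, 64)).astype(np.float32),
    "W_up": np.zeros((64, 768), dtype=np.float32),
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "summarize-v1.npz"   # hypothetical adapter name
    save_adapter(path, adapter)
    size_kib = path.stat().st_size / 1024   # a few hundred KiB, not gigabytes
    digest = file_digest(path)
    restored = load_adapter(path)
```

Rolling back in this setup means loading a different file, or none at all, while the base model's weights are never rewritten.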

Adapters are the conceptual ancestor of much of the modern fine-tuning toolkit. For related approaches that adapt a frozen model without inserting new modules, see prefix tuning and prompt tuning in our glossary.