Definition
Test-time compute, also called inference-time scaling, is the practice of allocating more computation when a model answers a question rather than when it is trained. Instead of making the model bigger, you let it think longer or try more approaches at the moment of inference. The release of OpenAI's o1 made this a mainstream scaling axis, described as the most significant shift since pre-training scaling laws, because optimally spending compute at inference can beat simply adding parameters.
Common ways to spend the budget
Several techniques fall under this umbrella. The model can generate a longer chain of reasoning before answering. It can sample many candidate answers and select among them, for example by majority vote or by scoring. It can run search procedures such as tree traversal or Monte Carlo Tree Search over reasoning steps. Or it can iterate: draft, critique, and revise. What unites them is a deliberate trade, more compute now in exchange for a better answer, tunable per query based on how hard the problem is.
Why it matters for self-hosting
Inference-time scaling lets a modest, locally run model punch above its weight on hard problems by thinking harder, which is attractive when you cannot or will not run a frontier-scale model. The flip side is real cost and latency, and research has questioned whether every o1-style model scales as cleanly as claimed, so the budget should be matched to the task rather than applied blindly.
This budget powers reasoning models and underlies iterative methods like Self-Refine and selection by a verifier model.
In Simple Terms
Test-time compute, also called inference-time scaling, is the practice of allocating more computation when a model answers a question rather than when it is trained.…
