Definition
An inference endpoint is the network-addressable interface through which a deployed model receives input and returns predictions. In practice it is usually a REST or gRPC API exposed by the model serving layer, secured with authentication and access control, and designed for predictable latency and throughput.
Anatomy of an endpoint
A request to an inference endpoint carries the input payload (a prompt, an image, or a feature vector), and the endpoint returns the model's output. Behind that simple contract, the endpoint typically validates the request against an expected schema, routes it to a model instance (possibly one of many replicas behind a load balancer), and may batch concurrent requests together to use the GPU efficiently. Endpoints can be real-time, answering each request as it arrives with low latency, or asynchronous, accepting large batch jobs and returning results later.
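The validate, route, and batch steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular serving framework works: the three-feature schema, the `StubReplica` class (whose "model" just sums its inputs), the `Endpoint` class, and the round-robin routing policy are all hypothetical names and choices made for the example.

```python
from itertools import cycle
from typing import List

EXPECTED_FEATURES = 3  # hypothetical schema: exactly three numeric features


def validate(payload: object) -> List[float]:
    """Check a payload against the expected schema and return its features."""
    if not isinstance(payload, dict):
        raise ValueError("payload must be a JSON object")
    features = payload.get("features")
    if not isinstance(features, list) or len(features) != EXPECTED_FEATURES:
        raise ValueError(f"'features' must be a list of {EXPECTED_FEATURES} numbers")
    for value in features:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("'features' must contain only numbers")
    return [float(value) for value in features]


class StubReplica:
    """Stand-in for one model instance; the 'model' just sums each row."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.batches_served = 0

    def predict(self, batch: List[List[float]]) -> List[float]:
        self.batches_served += 1
        return [sum(row) for row in batch]


class Endpoint:
    """Validates requests, groups them into batches, and routes round-robin."""

    def __init__(self, replicas: List[StubReplica], max_batch_size: int = 4) -> None:
        self._next_replica = cycle(replicas)  # simple round-robin load balancing
        self._max_batch_size = max_batch_size

    def handle(self, payloads: List[object]) -> List[float]:
        # Validate everything first so a bad request never reaches a replica.
        rows = [validate(payload) for payload in payloads]
        outputs: List[float] = []
        for start in range(0, len(rows), self._max_batch_size):
            batch = rows[start:start + self._max_batch_size]
            outputs.extend(next(self._next_replica).predict(batch))
        return outputs


if __name__ == "__main__":
    endpoint = Endpoint([StubReplica("a"), StubReplica("b")], max_batch_size=2)
    requests = [{"features": [1, 2, 3]}, {"features": [0, 0, 1]}, {"features": [4, 5, 6]}]
    print(endpoint.handle(requests))  # prints [6.0, 1.0, 15.0]
```

A production endpoint would usually batch requests that arrive within a short time window rather than a list handed over in one call, but the contract is the same: outputs come back in the order the inputs arrived, regardless of which replica served them.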
Why the endpoint is the trust boundary
The endpoint is where your AI service meets the outside world, which makes it the natural place to enforce security and governance. Rate limiting, authentication tokens, input sanitization, and request logging all live here. For a self-hosted, sovereignty-minded deployment, owning the endpoint means owning that trust boundary outright — you decide what data leaves your machine and what is logged, rather than trusting a third-party API.
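As a rough sketch of what enforcing the boundary can look like, the Python below combines an API-key check, a per-client token-bucket rate limiter, and request logging in one guard function. Everything named here is an assumption for illustration: the `API_KEYS` table, the `TokenBucket` and `guard` names, the limits, and the HTTP-style status codes. A real deployment would load keys from a secret store and usually delegate some of this to a gateway or reverse proxy.

```python
import hmac
import logging
import time
from typing import Callable, Dict

logger = logging.getLogger("endpoint")

# Hypothetical key table; real keys belong in a secret store, not in source.
API_KEYS: Dict[str, str] = {"client-a": "example-key-not-for-production"}


class TokenBucket:
    """Allows short bursts up to `capacity`, refilled at a steady rate."""

    def __init__(self, capacity: int, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Add tokens for the time elapsed, never exceeding capacity.
        self._tokens = min(float(self.capacity),
                           self._tokens + elapsed * self.refill_per_second)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


def authorize(client_id: str, presented_key: str) -> bool:
    expected = API_KEYS.get(client_id)
    if expected is None:
        return False
    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest(expected.encode(), presented_key.encode())


def guard(client_id: str, presented_key: str, buckets: Dict[str, TokenBucket]) -> int:
    """Return an HTTP-style status: 200 (allowed), 401 (bad key), 429 (too many)."""
    if not authorize(client_id, presented_key):
        logger.warning("rejected client=%s reason=bad_credentials", client_id)
        return 401
    if client_id not in buckets:
        buckets[client_id] = TokenBucket(capacity=2, refill_per_second=1.0)
    if not buckets[client_id].allow():
        logger.warning("rejected client=%s reason=rate_limited", client_id)
        return 429
    # Log metadata only; whether prompt content is logged is the operator's call.
    logger.info("accepted client=%s", client_id)
    return 200
```

The clock is injectable so the limiter can be tested without sleeping, and the log lines record only the client and the decision, which reflects the point above that the operator decides what gets logged.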
Running your own endpoint on your own hardware keeps prompts and outputs on infrastructure you control. For the surrounding lifecycle, see MLOps; for how new versions are rolled out, see canary deployment.
