Intermediate AI

Inference

Inference is the process of running a trained AI model on new inputs to generate predictions or outputs, as opposed to training the model on data.

Published March 17, 2026

What Is Inference?

Inference is the process of running a trained AI model on a new input to generate a prediction or output. It is the operational, production mode of AI: when a user submits a prompt to an LLM and receives a response, that exchange is inference. Training - the process of teaching the model using large datasets - happens once (or periodically), while inference happens continuously, at scale, every time the model is used. For startups building AI products, inference is the primary ongoing cost driver and the system component most directly tied to user experience through latency and reliability.

Training vs. Inference: Two Different Problems

Dimension | Training | Inference
When it happens | Once (or periodically) | Continuously, on every request
Compute required | Massive (thousands of GPUs, weeks) | Moderate (scales with request volume)
Cost structure | Fixed, large upfront cost | Variable, per-request
Who does it | Foundation model labs (mostly) | Every startup deploying AI
Optimization goal | Minimize training loss | Minimize latency + cost per token

For early-stage startups, training from scratch is irrelevant - inference is the only compute cost that matters. Even fine-tuning is a minor training cost relative to months of production inference at scale.

Key Inference Metrics

Time-to-first-token (TTFT)

The time between submitting a request and receiving the first token of the response. Critical for streaming interfaces where the user sees text appear word by word. Users perceive responses as “fast” when TTFT is under 400ms; over 2 seconds feels slow.

Tokens per second (TPS) / throughput

How fast the model generates output tokens once generation starts. GPT-4o generates roughly 80–120 tokens per second; smaller models can exceed 200 TPS. Higher throughput means shorter total response times for long outputs.
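
Both metrics can be measured directly from a streaming response. The sketch below assumes the OpenAI Python SDK and an API key in the environment; the model name and the ~4-characters-per-token estimate are illustrative, not exact.

```python
# Minimal sketch: measure TTFT and tokens/second from a streaming chat completion.
import time
from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
ttft = None
pieces = []

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Explain inference in two sentences."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue  # some chunks (e.g. a final usage-only chunk) carry no content
    delta = chunk.choices[0].delta.content
    if delta:
        if ttft is None:
            ttft = time.perf_counter() - start  # time-to-first-token
        pieces.append(delta)

total = time.perf_counter() - start
text = "".join(pieces)
approx_tokens = max(1, len(text) // 4)  # rough ~4 chars/token; use a tokenizer for exact counts
tps = approx_tokens / max(total - ttft, 1e-6)
print(f"TTFT: {ttft:.2f}s  total: {total:.2f}s  ~{tps:.0f} tokens/s")
```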

Requests per second (RPS) / concurrent users

How many simultaneous inference requests the system can handle. This is a function of GPU memory, batching efficiency, and model size. Production inference systems use dynamic batching - grouping multiple requests together - to maximize GPU utilization.

Cost per million tokens

The unit economics of inference. Lower is better, but usually trades off against capability. GPT-4o mini at $0.15 per million input tokens is roughly 17x cheaper than GPT-4o at $2.50 per million, but delivers lower capability on complex tasks.
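
These unit economics are straightforward to model. The sketch below is plain arithmetic using the illustrative prices quoted in this article; swap in the rates for whichever models you actually use.

```python
# Back-of-the-envelope cost-per-million-token math.
def monthly_cost(requests_per_day, input_tokens, output_tokens,
                 price_in_per_m, price_out_per_m, days=30):
    """Estimate monthly API spend in dollars."""
    tokens_in = requests_per_day * input_tokens * days
    tokens_out = requests_per_day * output_tokens * days
    return tokens_in / 1e6 * price_in_per_m + tokens_out / 1e6 * price_out_per_m

# 100,000 requests/day, 500 input + 300 output tokens each:
print(monthly_cost(100_000, 500, 300, 0.15, 0.60))   # GPT-4o mini: ~$765/month
print(monthly_cost(100_000, 500, 300, 2.50, 10.00))  # GPT-4o: ~$12,750/month
```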

Inference Architecture Options for Startups

API providers (managed inference)

The default for most startups. OpenAI, Anthropic, Google, Cohere, and aggregators like Together AI, Fireworks, and Groq provide inference as a service. No infrastructure to manage; pay per token.

Best for: Products under ~500,000 requests/day, teams without ML infrastructure expertise.

Dedicated GPU instances (self-hosted)

Rent GPU instances (A100s, H100s) from AWS, GCP, Azure, or Lambda Labs and run inference on open-source models (Llama 3, Mistral) using serving frameworks like vLLM or TGI (Text Generation Inference).
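
As a rough sketch, serving an open-source model with vLLM can be as simple as the snippet below. The model name and sampling settings are illustrative; a 70B model needs multiple GPUs or aggressive quantization, while an 8B model fits on a single large card.

```python
# Minimal sketch of self-hosted inference with vLLM (model and settings are illustrative).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Summarize what inference means in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

Recent vLLM releases also ship an OpenAI-compatible HTTP server (started with a command like vllm serve <model>), so client code written against a managed API can point at your own GPUs with minimal changes.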

Best for: High-volume products where API costs exceed $10,000–$30,000/month, or products with strict data residency requirements.

Example cost: An A100 80GB instance on Lambda Labs costs ~$2/hour. Running Llama 3 70B (quantized to fit in 80GB of memory), it can generate ~100 tokens/second. Running the instance for roughly half of each month costs ~$720 and yields on the order of 130M output tokens - output that would cost roughly $1,300 at GPT-4o's $10/M output rate. Around that volume and above, self-hosting wins on cost.
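
The same arithmetic can be written out explicitly. The numbers below are the illustrative figures from this section, not quotes from any provider.

```python
# Rough break-even sketch: a dedicated GPU vs. per-token API pricing.
GPU_HOURLY = 2.00           # ~$2/hour for an A100 80GB on demand
TOKENS_PER_SEC = 100        # sustained output throughput for a 70B-class model
HOURS_RUNNING = 360         # roughly half the month

gpu_cost = GPU_HOURLY * HOURS_RUNNING                   # ~$720
tokens_served = TOKENS_PER_SEC * 3600 * HOURS_RUNNING   # ~130M output tokens

API_OUTPUT_PRICE_PER_M = 10.00                           # GPT-4o output rate, for comparison
api_cost = tokens_served / 1e6 * API_OUTPUT_PRICE_PER_M  # ~$1,300

print(f"Self-hosted: ${gpu_cost:,.0f} for ~{tokens_served / 1e6:.0f}M tokens")
print(f"Same volume via GPT-4o output pricing: ~${api_cost:,.0f}")
```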

Inference-optimized providers

Groq (custom LPU hardware), Cerebras, and Fireworks AI offer extremely low latency (often 10–30x faster than OpenAI) for open-source models. Groq’s LPU can serve Llama 3 70B at over 300 tokens/second - faster than most humans can read.

Best for: Latency-sensitive applications like real-time voice, code completion, or gaming.

Inference Optimization Techniques

Quantization: Reducing model weight precision from 16- or 32-bit floats to 8-bit or 4-bit integers. This cuts memory by roughly 2–8x with minimal quality loss, enabling larger models to run on smaller hardware.
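
The idea is easy to see in miniature. The sketch below applies naive symmetric int8 quantization to a random weight matrix with NumPy; production systems use per-channel or group-wise schemes (GPTQ, AWQ, bitsandbytes) for better accuracy.

```python
import numpy as np

# A 4096 x 4096 weight matrix, ~67 MB in fp32.
weights = np.random.randn(4096, 4096).astype(np.float32)

# Symmetric int8 quantization with a single scale for the whole tensor.
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)  # ~17 MB

# Dequantize and measure how much information was lost.
dequantized = q.astype(np.float32) * scale
mean_error = np.abs(weights - dequantized).mean()

print(f"fp32: {weights.nbytes / 1e6:.0f} MB  int8: {q.nbytes / 1e6:.0f} MB  mean abs error: {mean_error:.5f}")
```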

KV cache: Transformer models cache the attention keys and values for tokens they have already processed, so each new token attends to cached state instead of recomputing attention over the full sequence. Providers build on this with prompt caching: long, repeated system prompts are cached across requests, reducing latency and cost.
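
The sketch below shows the mechanism with Hugging Face transformers, using gpt2 as a small stand-in model: the prompt is processed once, and every later step feeds only the newest token plus the cached keys and values.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tokenizer("The key advantage of KV caching is", return_tensors="pt").input_ids

with torch.no_grad():
    # The first forward pass processes the whole prompt and returns the key/value cache.
    out = model(prompt_ids, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    generated = [next_token]
    for _ in range(20):
        # Each later pass feeds only the newest token plus the cache,
        # so attention over earlier tokens is never recomputed.
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```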

Speculative decoding: A small draft model proposes candidate tokens; the large model verifies them in parallel. Produces 2–3x speedup with no quality loss.
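
The sketch below is a toy greedy version of the idea, using distilgpt2 as the draft model and gpt2 as the target (they share a tokenizer). Production implementations add probabilistic acceptance, batching, and KV caching.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # distilgpt2 shares this tokenizer
draft = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()
target = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tokenizer("Speculative decoding works because", return_tensors="pt").input_ids
K = 4  # tokens the draft model proposes per round

with torch.no_grad():
    for _ in range(8):  # generation rounds
        # 1. The small draft model proposes K tokens greedily, one at a time (cheap).
        draft_ids = ids
        for _ in range(K):
            tok = draft(draft_ids).logits[:, -1].argmax(-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, tok], dim=-1)
        proposed = draft_ids[:, ids.shape[1]:]

        # 2. The large target model scores the prompt plus ALL proposals in one forward pass.
        logits = target(draft_ids).logits
        target_choice = logits[:, ids.shape[1] - 1 : -1].argmax(-1)  # target's greedy pick per slot

        # 3. Accept the longest prefix of proposals the target agrees with,
        #    then take the target's own token at the first disagreement "for free".
        matches = (proposed == target_choice)[0].long()
        n_accept = int(matches.cumprod(0).sum())
        bonus = logits[:, ids.shape[1] - 1 + n_accept].argmax(-1, keepdim=True)
        ids = torch.cat([ids, proposed[:, :n_accept], bonus], dim=-1)

print(tokenizer.decode(ids[0]))
```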

Batching: Combining multiple user requests into a single GPU forward pass. Inference providers handle this automatically; self-hosted deployments must implement dynamic batching explicitly.
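
As a minimal illustration, the toy micro-batcher below groups requests that arrive within a short window and answers them with one simulated forward pass. Real serving frameworks (vLLM, TGI) implement far more sophisticated continuous batching; the names and timings here are placeholders.

```python
import asyncio

MAX_BATCH = 8
MAX_WAIT_S = 0.02  # wait up to 20 ms to fill a batch

async def run_model(prompts):
    # Stand-in for one batched GPU forward pass over all prompts at once.
    await asyncio.sleep(0.05)
    return [f"response to: {p}" for p in prompts]

async def batcher(queue):
    while True:
        batch = [await queue.get()]                 # block until at least one request arrives
        loop = asyncio.get_running_loop()
        deadline = loop.time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH and (remaining := deadline - loop.time()) > 0:
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        results = await run_model([prompt for prompt, _ in batch])
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)

async def infer(queue, prompt):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))
    return await fut

async def main():
    queue = asyncio.Queue()
    task = asyncio.create_task(batcher(queue))
    answers = await asyncio.gather(*(infer(queue, f"request {i}") for i in range(20)))
    print(len(answers), "responses;", answers[0])
    task.cancel()

asyncio.run(main())
```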

Key Takeaway

Inference is the operational heartbeat of every AI product - it runs on every user interaction and determines both user experience (through latency) and unit economics (through cost per token). For startups, the optimal inference strategy evolves: start with managed API providers to move fast, measure actual costs at scale, and migrate to self-hosted open-source models when the economics justify the infrastructure investment. The break-even point is typically reached between $10,000 and $50,000/month in API spend.

Frequently Asked Questions

What is inference in AI?
Inference is the process of running a trained AI model on a new input to produce an output - a text response, image classification, translation, or prediction. It is the 'production' mode of AI, as opposed to training, which is the process of teaching the model using data. Every time a user submits a prompt to ChatGPT or an AI feature generates a response in your app, that is inference.
What is the difference between AI training and inference?
Training is a one-time (or periodic) process that adjusts a model's weights using large datasets - it is computationally intensive, slow, and expensive, but happens infrequently. Inference is the ongoing process of using the trained model to generate outputs for new inputs - it is faster and cheaper per operation, but happens millions or billions of times in production. For most startups, inference costs dwarf training costs because inference runs continuously at scale.
What is the difference between inference latency and throughput?
Latency is the time from when a request is submitted to when the first token (or full response) is returned - typically measured in milliseconds for time-to-first-token (TTFT) and seconds for complete response time. Throughput is the number of requests a system can process per second. High-latency, high-throughput configurations (batch processing) and low-latency, lower-throughput configurations (real-time chat) require different infrastructure optimizations. Most user-facing AI features require latency under 2–3 seconds for a good user experience.
How much does LLM inference cost?
Inference is priced per token via APIs. As of early 2026: GPT-4o costs roughly $2.50 per million input tokens and $10 per million output tokens; Claude 3.5 Sonnet is ~$3/$15; GPT-4o mini is ~$0.15/$0.60. A startup processing 100,000 user queries per day, each averaging 500 input tokens and 300 output tokens, would spend roughly $800 per month on GPT-4o mini and $13,000–$18,000 per month on GPT-4o or Claude 3.5 Sonnet. Self-hosting open-source models on dedicated GPUs can reduce this by 50–90% at sufficient scale.
What is speculative decoding and why does it speed up inference?
Speculative decoding is an inference optimization where a small, fast 'draft' model generates several candidate tokens, and the large target model verifies them in parallel rather than generating tokens one by one. Because the large model can accept correct draft tokens without full computation, the effective output speed increases 2–3x with no loss in quality. It is used in production by inference providers like Together AI and Groq to serve faster responses at lower cost.
