KV cache
The KV cache stores the key (K) and value (V) projections computed for previous tokens so they can be reused instead of recomputed at every step, saving time and compute. It is an important optimization during inference. The caching applies to the K and V matrices in scaled dot-product attention:
$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
Without caching, every decoding step would have to recompute K and V for all previous tokens, making the attention matrix multiplication $O(n^2)$ per step. With the KV cache, each step only computes K and V for the newest token and attends against the cached entries, bringing the per-step cost down to $O(n)$. The cache is filled during prefill and then reused throughout decoding, which is where the speedup shows up.
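As a rough sketch of the mechanics (single attention head, NumPy, with hypothetical projection matrices `W_q`, `W_k`, `W_v` standing in for a real model's weights), a cached decoding step projects only the newest token, appends its key and value to the cache, and runs one query against all cached keys, which is the $O(n)$ per-step cost described above:

```python
import numpy as np

def attention(q, K, V):
    """Scaled dot-product attention for a single query vector.

    q: (d_k,)    query for the newest token
    K: (t, d_k)  keys for all t tokens seen so far
    V: (t, d_v)  values for all t tokens seen so far
    """
    scores = K @ q / np.sqrt(K.shape[-1])   # one query vs. t keys: O(t)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over past tokens
    return weights @ V                      # (d_v,)

def decode_step(x_new, W_q, W_k, W_v, k_cache, v_cache):
    """One decoding step: project only the new token, reuse cached K/V."""
    k_cache.append(x_new @ W_k)             # constant new projection work per step
    v_cache.append(x_new @ W_v)
    q = x_new @ W_q
    return attention(q, np.stack(k_cache), np.stack(v_cache))
```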
Prefill
The prefill stage is compute-heavy: the whole prompt is processed at once, which involves large matrix multiplications over all prompt tokens. This is why the time to first token (TTFT) is comparatively long, even though the work parallelizes well across the prompt.
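Continuing the same sketch, prefill computes K and V for every prompt token in one batched matrix multiplication and seeds the cache. These large matmuls are the expensive part, but they run in parallel across all prompt positions:

```python
def prefill(X_prompt, W_q, W_k, W_v):
    """Process the whole prompt at once and populate the KV cache.

    X_prompt: (n, d_model) embeddings of all n prompt tokens.
    The (n, d_model) @ (d_model, d_k) matmuls below are what make
    prefill compute-heavy, but they parallelize across tokens.
    """
    K = X_prompt @ W_k                      # (n, d_k), all tokens at once
    V = X_prompt @ W_v                      # (n, d_v)
    k_cache, v_cache = list(K), list(V)     # seed the cache for decoding
    return k_cache, v_cache
```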
Decoding
After the prefill phase comes the decoding phase, which no longer needs such large matrix multiplications: each step processes a single new token against the cached keys and values. Unlike prefill, decoding cannot be parallelized across positions, because tokens are predicted auto-regressively and each new token depends on all the tokens before it.
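Putting the two phases together (again a sketch, with hypothetical `embed` and `next_token` helpers standing in for the embedding lookup and the rest of the model), the decode loop is inherently sequential because each step consumes the token produced by the previous one:

```python
def generate(X_prompt, n_new, W_q, W_k, W_v, embed, next_token):
    """Auto-regressive generation: prefill once, then decode token by token."""
    k_cache, v_cache = prefill(X_prompt, W_q, W_k, W_v)
    # The first new token comes from the last prompt position's output.
    q_last = X_prompt[-1] @ W_q
    out = attention(q_last, np.stack(k_cache), np.stack(v_cache))
    tokens = [next_token(out)]
    for _ in range(n_new - 1):
        x = embed(tokens[-1])               # embed the most recent token
        out = decode_step(x, W_q, W_k, W_v, k_cache, v_cache)
        tokens.append(next_token(out))      # each step waits on the previous one
    return tokens
```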