ML Wiki
Tag: inference-efficiency
10 items with this tag.
Apr 10, 2026 · FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (source)
  Tags: flash-attention, attention, systems, inference-efficiency, gpu

Apr 10, 2026 · GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (source)
  Tags: gqa, grouped-query-attention, multi-query-attention, inference-efficiency, kv-cache, attention

Apr 10, 2026 · Mixtral of Experts (source)
  Tags: mixtral, mixture-of-experts, moe, sparse-moe, inference-efficiency, open-weights

Apr 05, 2026 · FlashAttention (concept)
  Tags: inference-efficiency, systems, attention

Apr 05, 2026 · KV Cache (concept)
  Tags: inference-efficiency

Apr 05, 2026 · Speculative Decoding (concept)
  Tags: inference-efficiency, serving

Apr 05, 2026 · FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (source)
  Tags: inference-efficiency, attention, systems

Apr 05, 2026 · Mamba: Linear-Time Sequence Modeling with Selective State Spaces (source)
  Tags: architecture, ssm, mamba, inference-efficiency

Apr 05, 2026 · Efficient Memory Management for Large Language Model Serving with PagedAttention (source)
  Tags: inference-efficiency, serving, kv-cache, systems

Apr 05, 2026 · Fast Inference from Transformers via Speculative Decoding (source)
  Tags: inference-efficiency, speculative-decoding, serving