ML Wiki

Tag: inference-efficiency

19 items with this tag.

May 09, 2026
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
May 09, 2026
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
May 04, 2026
Quantization
May 04, 2026
AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration
Apr 24, 2026
Mixture of Depths: Dynamically Allocating Compute in Transformer LLMs
Apr 20, 2026
GQA: Grouped-Query Attention — How Modern LLMs Got 5x Faster Without Losing Quality
Apr 18, 2026
Grouped Query Attention (GQA)
Apr 18, 2026
Sliding Window Attention (SWA)
Apr 18, 2026
Mistral 7B
Apr 17, 2026
Memory Efficiency
Apr 17, 2026
QLoRA: Efficient Finetuning of Quantized LLMs
Apr 10, 2026
Mixtral of Experts
Apr 05, 2026
FlashAttention
Apr 05, 2026
KV Cache
Tags: concept, inference-efficiency
Apr 05, 2026
Speculative Decoding
Apr 05, 2026
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Apr 05, 2026
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Apr 05, 2026
Efficient Memory Management for Large Language Model Serving with PagedAttention
Apr 05, 2026
Fast Inference from Transformers via Speculative Decoding