ML Wiki

Tag: inference-efficiency

10 items with this tag.

  • Apr 10, 2026

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    • source
    • flash-attention
    • attention
    • systems
    • inference-efficiency
    • gpu
  • Apr 10, 2026

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    • source
    • gqa
    • grouped-query-attention
    • multi-query-attention
    • inference-efficiency
    • kv-cache
    • attention
  • Apr 10, 2026

    Mixtral of Experts

    • source
    • mixtral
    • mixture-of-experts
    • moe
    • sparse-moe
    • inference-efficiency
    • open-weights
  • Apr 05, 2026

    FlashAttention

    • concept
    • inference-efficiency
    • systems
    • attention
  • Apr 05, 2026

    KV Cache

    • concept
    • inference-efficiency
  • Apr 05, 2026

    Speculative Decoding

    • concept
    • inference-efficiency
    • serving
  • Apr 05, 2026

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    • source
    • inference-efficiency
    • attention
    • systems
  • Apr 05, 2026

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    • source
    • architecture
    • ssm
    • mamba
    • inference-efficiency
  • Apr 05, 2026

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    • source
    • inference-efficiency
    • serving
    • kv-cache
    • systems
  • Apr 05, 2026

    Fast Inference from Transformers via Speculative Decoding

    • source
    • inference-efficiency
    • speculative-decoding
    • serving