ML Wiki

Tag: rlhf

12 items with this tag.

  • May 05, 2026

    Self-Rewarding Language Models

    • source
    • alignment
    • rlhf
    • dpo
    • reward-model
    • sft
    • ai-feedback
  • Apr 29, 2026

    ORPO: Monolithic Preference Optimization without Reference Model

    • source
    • alignment
    • dpo
    • rlhf
    • sft
    • training
  • Apr 28, 2026

    What's the right alignment stack post-RLHF?

    • thread
    • alignment
    • rlhf
    • dpo
  • Apr 25, 2026

    Learning to Summarize from Human Feedback

    • source
    • rlhf
    • alignment
    • reward-model
    • ppo
    • sft
    • fine-tuning
  • Apr 22, 2026

    KTO: Model Alignment as Prospect Theoretic Optimization

    • source
    • alignment
    • rlhf
    • dpo
    • reward-model
    • training
  • Apr 21, 2026

    GPT-4 Technical Report

    • source
    • llm
    • scaling
    • alignment
    • multimodal
    • rlhf
  • Apr 17, 2026

    Constitutional AI: Teaching Models to Self-Correct

    • papers
    • alignment
    • rlhf
    • safety
  • Apr 17, 2026

    KTO: Model Alignment as Prospect Theoretic Optimization

    • source
    • alignment
    • dpo
    • rlhf
    • preference-learning
  • Apr 17, 2026

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    • source
    • pre-training
    • rlhf
    • sft
    • gqa
    • alignment
    • llm
    • meta-ai
  • Apr 17, 2026

    Proximal Policy Optimization Algorithms

    • source
    • reinforcement-learning
    • ppo
    • policy-gradient
    • rlhf
    • training
  • Apr 10, 2026

    Reward Model

    • concept
    • alignment
    • training
    • rlhf
  • Apr 10, 2026

    Training language models to follow instructions with human feedback (InstructGPT)

    • source
    • alignment
    • rlhf
    • llm
    • fine-tuning
    • safety