Ingest Log
Entries are appended chronologically as sources are ingested.
[2026-04-11] ingest | An Image is Worth 16x16 Words (ViT)
- Source: an-image-is-worth-16x16-words
- Key concepts: patch-embeddings, vision-transformer, inductive-bias, transfer-learning, attention, classification-token
- One-line takeaway: Treating image patches as word tokens and running a standard Transformer on them matches or beats state-of-the-art CNNs at scale, showing that with ~300M pre-training images, learned representations trump built-in architectural priors.
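A minimal sketch of the patch-embedding step, with toy sizes (32x32 image, embed dim 8 — not the paper's 224x224 / 768):

```python
import numpy as np

# Patches become the transformer's input "words"; all dimensions here are toy.
rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))     # H x W x C
P, D = 16, 8                                 # patch size, embedding dim

# Cut into non-overlapping P x P patches, then flatten each patch.
patches = image.reshape(32 // P, P, 32 // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * 3)     # (num_patches, P*P*C) = (4, 768)

W = rng.standard_normal((P * P * 3, D))      # learned linear projection
tokens = patches @ W                         # (4, D)

# Prepend the learnable [class] token, whose output embedding drives classification.
cls = rng.standard_normal((1, D))
sequence = np.concatenate([cls, tokens])     # (5, D)
```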
[2026-04-10] ingest | Training language models to follow instructions with human feedback (InstructGPT)
- Source: training-language-models-to-follow-instructions-with-human-feedback
- Key concepts: rlhf, reward-model, alignment, ppo, sft
- One-line takeaway: A 3-stage RLHF pipeline (SFT → reward model → PPO) makes a 1.3B model's outputs preferred by human raters over those of the 175B GPT-3, showing that alignment via human feedback can beat a 100x larger unaligned model.
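Stage 2 in miniature: the reward model is trained on human preference pairs with a Bradley-Terry-style loss, -log sigmoid(r_chosen - r_rejected). The scores below are stand-ins for reward-model outputs, not real data:

```python
import numpy as np

def preference_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    # -log sigmoid(margin) = log(1 + exp(-margin)), written stably via logaddexp
    return float(np.mean(np.logaddexp(0.0, -(r_chosen - r_rejected))))

r_chosen = np.array([2.0, 1.5, 0.3])    # reward scores on preferred completions
r_rejected = np.array([0.5, 1.0, 0.4])  # scores on dispreferred completions
loss = preference_loss(r_chosen, r_rejected)

# Loss shrinks as chosen completions are scored further above rejected ones.
assert loss > preference_loss(r_chosen + 1.0, r_rejected)
```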
[2026-04-09] ingest | CLIP: Learning Transferable Visual Models From Natural Language Supervision
- Source: clip-learning-transferable-visual-models
- Key concepts: contrastive-learning, zero-shot-transfer, multimodal-embeddings, vision-transformer
- One-line takeaway: Training on 400M internet image-caption pairs with a simple contrastive loss yields a vision model that can classify any concept you can describe in English, zero-shot.
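The symmetric contrastive objective, sketched with random vectors standing in for encoder outputs (the 0.07 temperature is from the paper; everything else is toy):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 4, 16
img = rng.standard_normal((N, D))
txt = img + 0.1 * rng.standard_normal((N, D))  # matched pairs are similar by construction

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cross_entropy(logits, targets):
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

# Cosine-similarity logits between every image and every caption in the batch.
logits = normalize(img) @ normalize(txt).T / 0.07

# The i-th image matches the i-th caption; average the image->text
# and text->image cross-entropies (the symmetric InfoNCE loss).
labels = np.arange(N)
loss = 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```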
[2026-04-08] ingest | Emergent Abilities of Large Language Models
- Source: emergent-abilities-of-large-language-models
- Key concepts: emergent-behavior, scaling-laws, phase-transition, in-context-learning
- One-line takeaway: Some LLM capabilities don’t scale smoothly — they’re absent, then suddenly present at a threshold, and you can’t predict which ones or when from smaller-scale experiments.
[2026-04-06] ingest | RoPE: Enhanced Transformer with Rotary Position Embedding
- Source: rope-rotary-position-embedding
- ArXiv: https://arxiv.org/abs/2104.09864
- Key concepts: attention, positional-encoding
- One-line takeaway: Rotating Q and K by an angle proportional to absolute position makes their dot product depend only on relative distance. Used in LLaMA, Mistral, Falcon, and GPT-NeoX.
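The relative-distance property in two lines of math, checked on a single 2-D feature pair (real RoPE applies a different frequency per pair of channels; theta here is arbitrary):

```python
import numpy as np

def rope(x: np.ndarray, pos: int, theta: float = 0.5) -> np.ndarray:
    # Rotate a 2-D feature by angle theta * position.
    c, s = np.cos(theta * pos), np.sin(theta * pos)
    return np.array([[c, -s], [s, c]]) @ x

q = np.array([1.0, 0.0])
k = np.array([0.5, 0.5])

# R(a)q . R(b)k = q^T R(b - a) k: same offset => same attention score,
# regardless of absolute position.
s1 = rope(q, 7) @ rope(k, 4)    # offset -3
s2 = rope(q, 10) @ rope(k, 7)   # offset -3
s3 = rope(q, 8) @ rope(k, 4)    # offset -4
```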
[2026-04-04] ingest | Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- Source: direct-preference-optimization-your-language-model-is-secretly-a-reward-model
- ArXiv: https://arxiv.org/abs/2305.18290
- Key concepts: rlhf, dpo, sft, distillation
- One-line takeaway: DPO reformulates RLHF as a supervised classification problem, eliminating the need for a separate reward model.
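The DPO loss in one function: given per-sequence log-probs from the policy and the frozen reference model on (chosen, rejected) pairs, it is a logistic loss on the implicit reward margin — no reward model, no PPO rollouts. Log-prob values below are toy:

```python
import numpy as np

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Implicit reward of a completion is beta * (log pi - log pi_ref);
    # the loss is -log sigmoid of the chosen-vs-rejected margin.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return float(np.mean(np.logaddexp(0.0, -margin)))

# Toy log-probs where the policy already slightly prefers the chosen answers.
loss = dpo_loss(
    pi_chosen=np.array([-10.0, -8.0]),
    pi_rejected=np.array([-12.0, -9.0]),
    ref_chosen=np.array([-11.0, -8.5]),
    ref_rejected=np.array([-11.5, -8.7]),
)
```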
[2026-04-05] ingest | Attention Is All You Need
- Source: attention-is-all-you-need
- ArXiv: https://arxiv.org/abs/1706.03762
- Key concepts: transformer, attention, positional-encoding
- Entities: google-brain, noam-shazeer, ashish-vaswani
- One-line takeaway: Introduces the Transformer — the attention-only architecture that became the foundation of all modern LLMs.
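The paper's core operation, scaled dot-product attention, fits in a few lines (single head, no mask; shapes are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # (n_q, n_k), scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)         # row-wise softmax
    return weights @ V                                # (n_q, d_v)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((3, 4)) for _ in range(3))
out = attention(Q, K, V)
```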
[2026-04-05] ingest | LoRA: Low-Rank Adaptation of Large Language Models
- Source: lora-low-rank-adaptation
- ArXiv: https://arxiv.org/abs/2106.09685
- Key concepts: lora, sft, transformer
- Entities: microsoft-research
- One-line takeaway: LoRA enables efficient fine-tuning of large models by injecting trainable low-rank matrices, reducing trainable parameters by up to 10,000x with no added inference latency.
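The mechanism in a few lines: freeze W0 and learn a rank-r update B @ A, so the adapted layer computes x @ (W0 + B @ A).T; at inference B @ A can be merged into W0, which is why there is no extra latency. Dimensions are toy:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4

W0 = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init (the paper's choice,
                                             # so the update starts at exactly zero)

x = rng.standard_normal(d_in)
y_adapted = x @ (W0 + B @ A).T

full = d_out * d_in            # 4096 params to fully fine-tune this layer
lora = r * (d_in + d_out)      # 512 trainable params with rank-4 LoRA
```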
[2026-04-05] ingest | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- Source: flash-attention-fast-and-memory-efficient-exact-attention
- ArXiv: https://arxiv.org/abs/2205.14135
- Key concepts: flash-attention, attention, inference-efficiency
- Entities: tri-dao, stanford-hazy-research
- One-line takeaway: An IO-aware tiled attention algorithm that avoids materializing the O(N²) attention matrix in HBM, enabling 2-3x faster training and long-context models.
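The key trick, online softmax over K/V tiles, sketched in numpy for one query: keep only a running max, denominator, and output, never the full score row (the real kernel also tiles queries and fuses everything into SRAM):

```python
import numpy as np

def tiled_attention(q, K, V, block=2):
    d = K.shape[-1]
    m, denom, out = -np.inf, 0.0, np.zeros(V.shape[-1])
    for i in range(0, len(K), block):
        s = q @ K[i:i + block].T / np.sqrt(d)   # scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)               # rescale old accumulators to new max
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum()
        out = out * scale + p @ V[i:i + block]
        m = m_new
    return out / denom

rng = np.random.default_rng(0)
q, K, V = rng.standard_normal(4), rng.standard_normal((6, 4)), rng.standard_normal((6, 4))

# Matches vanilla softmax attention exactly — FlashAttention is exact, not approximate.
s = q @ K.T / 2.0
ref = np.exp(s - s.max()) / np.exp(s - s.max()).sum() @ V
assert np.allclose(tiled_attention(q, K, V), ref)
```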
[2026-04-05] ingest | Efficient Memory Management for Large Language Model Serving with PagedAttention
- Source: pagedattention-vllm
- ArXiv: https://arxiv.org/abs/2309.06180
- Key concepts: kv-cache, inference-efficiency
- Entities: uc-berkeley-sky-lab
- One-line takeaway: PagedAttention applies OS-style paging to KV cache memory, enabling 2-4x higher serving throughput via vLLM by eliminating memory fragmentation.
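The bookkeeping idea as a toy allocator (block size and pool size are arbitrary): each sequence's KV cache is a list of fixed-size physical blocks, and a block table maps logical position to physical block, so sequences grow without contiguous memory:

```python
BLOCK_SIZE = 16
free_blocks = list(range(8))          # pool of physical KV-cache blocks
block_tables = {}                     # seq_id -> list of physical block ids

def append_token(seq_id: str, num_tokens_so_far: int) -> int:
    """Return the physical block that will hold the new token's KV entry."""
    table = block_tables.setdefault(seq_id, [])
    if num_tokens_so_far % BLOCK_SIZE == 0:   # current block full: grab a new one
        table.append(free_blocks.pop(0))
    return table[-1]

for t in range(20):                   # a 20-token sequence needs ceil(20/16) = 2 blocks
    append_token("seq-a", t)
```

Because unused slots exist only in each sequence's final block, per-sequence waste is bounded by one block, which is where the fragmentation win comes from.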
[2026-04-05] ingest | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- Source: chain-of-thought-prompting
- ArXiv: https://arxiv.org/abs/2201.11903
- Key concepts: chain-of-thought, in-context-learning, emergent-abilities
- Entities: google-brain, jason-wei
- One-line takeaway: Few-shot prompting with reasoning chains elicits strong reasoning in 100B+ models without fine-tuning, establishing chain-of-thought as a foundational prompting technique.
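What such a prompt looks like: each exemplar shows its reasoning chain before the answer, so the model imitates the chain at test time. The first exemplar paraphrases the paper's canonical tennis-ball example; the test question is a toy addition:

```python
prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can \
has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis \
balls. 5 + 6 = 11. The answer is 11.

Q: A baker has 23 muffins and sells 7. How many are left?
A:"""
```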
[2026-04-05] ingest | Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
- Source: grokking-generalization-beyond-overfitting
- ArXiv: https://arxiv.org/abs/2201.02177
- Key concepts: grokking
- Entities: openai
- One-line takeaway: Neural networks can achieve perfect generalization long after overfitting via a phase transition called grokking, opening new questions about training dynamics and memorization.
[2026-04-05] ingest | A Systematic Empirical Study of Grokking: Depth, Architecture, Activation, and Regularization
- Source: grokking-systematic-empirical-study
- ArXiv: https://arxiv.org/abs/2603.25009
- Key concepts: grokking
- One-line takeaway: Grokking is governed by optimization-regularization interactions, not architecture; weight decay is the dominant control parameter in a narrow Goldilocks regime.
[2026-04-05] ingest | Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- Source: mamba-linear-time-sequence-modeling
- ArXiv: https://arxiv.org/abs/2312.00752
- Key concepts: ssm-mamba, inference-efficiency
- Entities: tri-dao, albert-gu
- One-line takeaway: Selective SSMs with input-dependent parameters enable linear-time sequence modeling that matches Transformer quality while achieving 5x higher inference throughput.
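The selectivity idea for a scalar state, with a toy parameterization (not the paper's): unlike a time-invariant SSM, the decay and write strength depend on the input, which lets the model choose what to remember:

```python
import numpy as np

def selective_scan(x, decay_w, input_w):
    h, ys = 0.0, []
    for x_t in x:
        a_t = np.exp(-np.abs(decay_w * x_t))  # input-dependent decay in (0, 1]
        b_t = input_w * x_t                   # input-dependent write strength
        h = a_t * h + b_t                     # linear-time recurrence, no attention
        ys.append(h)
    return np.array(ys)

x = np.array([1.0, 0.0, 0.0, 2.0])
ys = selective_scan(x, decay_w=0.5, input_w=1.0)
# Zero inputs give a_t = 1, b_t = 0: the state is carried forward unchanged,
# while informative inputs overwrite it.
```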
[2026-04-05] ingest | Fast Inference from Transformers via Speculative Decoding
- Source: speculative-decoding
- ArXiv: https://arxiv.org/abs/2211.17192
- Key concepts: speculative-decoding, inference-efficiency
- Entities: google-research
- One-line takeaway: A small draft model proposes tokens that the large target model verifies in parallel, achieving 2-3x speedup with provably identical output distribution.
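The verification rule for one drafted token, which is what makes the output distribution provably identical to the target model's: accept with probability min(1, p_target/p_draft); on rejection, resample from the normalized residual max(p_target - p_draft, 0). Distributions below are toy:

```python
import numpy as np

def verify_token(token, p_draft, p_target, rng):
    if rng.random() < min(1.0, p_target[token] / p_draft[token]):
        return token                               # accept the drafted token
    residual = np.maximum(p_target - p_draft, 0)   # resample where target exceeds draft
    residual /= residual.sum()
    return rng.choice(len(p_target), p=residual)

rng = np.random.default_rng(0)
p_draft = np.array([0.7, 0.2, 0.1])
p_target = np.array([0.4, 0.4, 0.2])

# Empirically, the accepted-or-resampled tokens follow p_target, not p_draft.
samples = [verify_token(rng.choice(3, p=p_draft), p_draft, p_target, rng)
           for _ in range(20000)]
freq = np.bincount(samples, minlength=3) / len(samples)
```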
[2026-04-05] ingest | Falcon Perception: Early-Fusion Transformer for Open-Vocabulary Grounding and Segmentation
- Source: falcon-perception-vlm
- URL: https://huggingface.co/blog/tiiuae/falcon-perception
- Key concepts: early-fusion, visual-grounding, open-vocabulary-segmentation, distillation, inference-efficiency
- Entities: tii-uae
- One-line takeaway: A single 0.6B early-fusion Transformer with a hybrid attention mask and Chain-of-Perception interface outperforms SAM 3 on open-vocabulary segmentation (68.0 vs 62.3 Macro-F1), with the largest gains on compositional prompts requiring OCR, spatial, and relational reasoning.