Ingest Log
[2026-05-09] ingest | LLaVA-1.5: Improved Baselines with Visual Instruction Tuning
- Source: llava-1-5-improved-baselines-with-visual-instruction-tuning
- ArXiv: https://arxiv.org/abs/2310.03744
- Key concepts: vision-language-models, multimodal-instruction-tuning, multimodal-embeddings, patch-embeddings
- One-line takeaway: Three minimal changes to LLaVA — MLP connector, CLIP-ViT-L-336px, and academic VQA data with response-format prompts — push a 13B model to SOTA on 11 benchmarks using only 1.2M public examples and one day on 8 A100s, making LLaVA-1.5 the de facto open VLM baseline.
[2026-05-09] ingest | CodeAct: Executable Code Actions Elicit Better LLM Agents
- Source: codeact-executable-code-actions-llm-agents
- ArXiv: https://arxiv.org/abs/2402.01030
- Key concepts: tool-use-agents, code-generation, chain-of-thought, in-context-learning
- One-line takeaway: Replace JSON tool-call schemas with raw Python code as the agent’s action format; a Python interpreter executes and returns observations, errors become recoverable, and on API-Bank CodeAct beats text/JSON tool-calling by up to 20 points across 17 LLMs.
[2026-05-09] ingest | SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery
- Source: satmae-pretraining-transformers-temporal-multispectral-satellite-imagery
- ArXiv: https://arxiv.org/abs/2207.08051
- Key concepts: self-supervised-learning, vision-transformer, patch-embeddings, geospatial-foundation-models, masked-image-modeling
- One-line takeaway: Adapt MAE to satellite imagery with temporal embeddings and per-band-group spectral position encodings; on land-cover and segmentation tasks SatMAE beats supervised ImageNet-pretrained ViTs by 7-14 points and becomes the default geospatial foundation-model recipe.
[2026-05-09] ingest | t2vec: Deep Representation Learning for Trajectory Similarity Computation
- Source: t2vec-deep-representation-learning-trajectory-similarity
- Reference: ICDE 2018 (https://ieeexplore.ieee.org/document/8509283)
- Key concepts: trajectory-embeddings, encoder-decoder, self-supervised-learning, contrastive-learning
- One-line takeaway: Tokenize a noisy GPS trajectory as S2-like grid cells and train a denoising seq2seq RNN to map noisy/sparse traces to a fixed-length vector; trajectory similarity becomes a 5ms cosine-distance lookup instead of an O(n*m) DTW computation, with mean-rank dropping from ~87 (DTW) to ~3 (t2vec).
[2026-05-09] ingest | Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
- Source: sentence-bert-siamese-bert-networks
- ArXiv: https://arxiv.org/abs/1908.10084
- Key concepts: sentence-embeddings, siamese-networks, contrastive-learning, bi-encoder, semantic-similarity
- One-line takeaway: Run BERT independently on each sentence (siamese encoder), mean-pool to a fixed vector, and fine-tune on NLI sentence pairs with a classification objective; finding the most similar pair in 10K sentences drops from 65 hours (cross-encoder BERT) to 5 seconds (SBERT) with negligible STS accuracy loss, making vector-index retrieval the standard pattern.
[2026-05-09] ingest | ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction
- Source: colbert-late-interaction-retrieval
- ArXiv: https://arxiv.org/abs/2004.12832
- Key concepts: sentence-embeddings, bi-encoder, late-interaction, semantic-similarity
- One-line takeaway: Keep one 128-dim vector per query token and per document token, index document tokens offline, and at query time apply MaxSim per query token over all document tokens summed across the query; this “late interaction” matches cross-encoder accuracy (MRR@10 36.0 vs 36.5) at bi-encoder latency (60ms vs 7000ms).
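The MaxSim scoring described in this entry can be sketched in a few lines. This is a minimal illustration, assuming plain Python lists as token vectors; the function name `maxsim_score` and the toy 2-dim vectors are hypothetical and not taken from the ColBERT codebase.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query token, take the best dot
    product against any document token, then sum over query tokens."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy 2-dim token vectors (illustrative values, not real embeddings).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has an exact match per query token
doc_b = [[0.5, 0.5]]                          # only a partial match

print(maxsim_score(query, doc_a), maxsim_score(query, doc_b))
```

Because the document token vectors do not depend on the query, they can be computed and indexed offline, which is what keeps query-time cost low.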
[2026-05-09] ingest | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- Source: gptq-accurate-post-training-quantization
- ArXiv: https://arxiv.org/abs/2210.17323
- Key concepts: quantization, inference-efficiency, memory-efficiency, compression
- One-line takeaway: Quantize OPT-175B to 3-4 bits per weight in 4 GPU hours by walking columns left-to-right with Cholesky-stabilized lazy compensation from a 128-sequence calibration set; first method to fit a 175B model in a single 80GB GPU at near-zero perplexity loss, founding the LLM weight-quantization era.
[2026-05-09] ingest | MTEB: Massive Text Embedding Benchmark
- Source: mteb-massive-text-embedding-benchmark
- ArXiv: https://arxiv.org/abs/2210.07316
- Key concepts: sentence-embeddings, evaluation, benchmark, bi-encoder
- One-line takeaway: Unify 58 embedding-evaluation datasets across 8 task categories and 112 languages under one harness; the deflationary headline finding — no single embedding model dominates all categories — becomes the canonical leaderboard that drives BGE, GTE, E5, and every successor open-embedding family.
[2026-05-09] ingest | Word2Vec: Efficient Estimation of Word Representations in Vector Space
- Source: word2vec-efficient-estimation-word-representations
- ArXiv: https://arxiv.org/abs/1301.3781
- Key concepts: word-embeddings, self-supervised-learning, distributional-hypothesis, negative-sampling
- One-line takeaway: Drop the nonlinear hidden layer and train two embedding tables to predict surrounding words from a center word (Skip-gram) with hierarchical softmax (negative sampling was introduced in the follow-up paper); 1.6B words train in a day, the resulting vectors satisfy king - man + woman ≈ queen, and the recipe (predict context, dense vectors fall out) becomes the foundation of all modern representation learning.
[2026-05-09] ingest | C-Pack / BGE: Packed Resources for General Chinese Embeddings
- Source: bge-c-pack-general-chinese-embeddings
- ArXiv: https://arxiv.org/abs/2309.07597
- Key concepts: sentence-embeddings, contrastive-learning, bi-encoder, self-supervised-learning
- One-line takeaway: Scale the bi-encoder recipe with three stages (RetroMAE pretraining, ~100M-pair contrastive pretraining, task-specific fine-tuning) and large in-batch negatives plus hard mining; BGE-large takes MTEB English SOTA (64.2) and BGE-large-zh dominates Chinese (64.5 vs ada-002’s 53.0), becoming the open-source default for retrieval and dense embeddings.
[2026-05-09] ingest | Phi-3 Technical Report
- Source: phi-3-technical-report
- ArXiv: https://arxiv.org/abs/2404.14219
- Key concepts: data-quality, scaling-laws, compute-optimal-training, pre-training
- Entities: microsoft-research
- One-line takeaway: A 3.8B model trained on 3.3T tokens of heavily-curated web plus GPT-4-generated synthetic textbooks matches Mixtral 8x7B on MMLU and runs on a phone, demonstrating that data quality is a separate scaling axis from parameter count.
[2026-05-09] ingest | PyTorch FSDP: Fully Sharded Data Parallel
- Source: pytorch-fsdp-fully-sharded-data-parallel
- ArXiv: https://arxiv.org/abs/2304.11277
- Key concepts: distributed-training, data-parallel, memory-efficiency, model-parallel
- Entities: meta-ai-fair
- One-line takeaway: PyTorch’s native ZeRO-3 implementation, co-designed with the autograd dispatcher and CUDA caching allocator, exposes HYBRID_SHARD (intra-node sharding plus cross-node replication) as the optimal default for multi-node training where cross-node bandwidth is the bottleneck.
[2026-05-09] ingest | Megatron-LM: Training Multi-Billion Parameter Language Models
- Source: megatron-lm-training-multi-billion-parameter-language-models
- ArXiv: https://arxiv.org/abs/1909.08053
- Key concepts: model-parallel, tensor-parallel, distributed-training, memory-efficiency
- One-line takeaway: Tensor parallelism splits each layer’s weight matrices across GPUs (column-then-row pattern for MLPs, head-wise for attention) with only 2 all-reduces per transformer layer in the forward pass and 2 in the backward pass, training 8.3B parameter transformers at 76% scaling efficiency on 512 V100s and establishing pre-norm as the default for stable scaling.
[2026-05-09] ingest | Orca: A Distributed Serving System for Transformer-Based Generative Models
- Source: orca-distributed-serving-transformer-generative-models
- Reference: OSDI 2022 (https://www.usenix.org/conference/osdi22/presentation/yu)
- Key concepts: continuous-batching, inference-efficiency, kv-cache
- One-line takeaway: Iteration-level scheduling (instead of request-level) plus selective batching (apply batching only to position-independent ops) eliminates the bottleneck where short requests wait for long batchmates, delivering 36.9x throughput improvement at the same latency over FasterTransformer on GPT-3 175B.
[2026-05-09] ingest | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- Source: flash-attention-2
- ArXiv: https://arxiv.org/abs/2307.08691
- Key concepts: flash-attention, attention, inference-efficiency, memory-efficiency
- Entities: tri-dao, stanford-hazy-research
- One-line takeaway: Once FlashAttention-1 fixed the HBM bottleneck, the new constraints became GPU occupancy and warp coordination; FA2 reduces non-matmul FLOPs, parallelizes over sequence-length tiles to saturate SMs, and partitions warps over output rows to eliminate cross-warp shared-memory traffic, doubling end-to-end throughput to 225 TFLOPs/s on A100 (72% MFU).
[2026-05-09] ingest | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- Source: zero-memory-optimizations-trillion-parameter-models
- ArXiv: https://arxiv.org/abs/1910.02054
- Key concepts: distributed-training, data-parallel, model-parallel, memory-efficiency, mixed-precision-training
- Entities: microsoft-research
- One-line takeaway: Partitioning optimizer states (ZeRO-1), gradients (ZeRO-2), and parameters (ZeRO-3) across the data-parallel group eliminates the redundancy of standard DDP without changing communication volume materially, enabling 100B+ models on commodity hardware and turning data parallelism into a path to trillion-parameter training.
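The three ZeRO stages can be turned into a rough per-GPU memory estimate. The sketch below assumes mixed-precision Adam at 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states) and ignores activations and buffers; the function name is hypothetical.

```python
def zero_bytes_per_gpu(n_params, n_gpus, stage):
    """Approximate model-state bytes per GPU under ZeRO stage 0-3.
    Assumes mixed-precision Adam; activations are not counted."""
    params = 2 * n_params   # fp16 weights
    grads = 2 * n_params    # fp16 gradients
    opt = 12 * n_params     # fp32 weights copy + Adam momentum + variance
    if stage >= 1:
        opt /= n_gpus       # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= n_gpus     # ZeRO-2: also shard gradients
    if stage >= 3:
        params /= n_gpus    # ZeRO-3: also shard parameters
    return params + grads + opt

for stage in range(4):
    gb = zero_bytes_per_gpu(7_500_000_000, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

With 7.5B parameters on 64 GPUs, the estimate falls from 120 GB at stage 0 to under 2 GB at stage 3, which shows why sharding parameters matters most at large scale.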
[2026-05-09] ingest | On Calibration of Modern Neural Networks
- Source: on-calibration-of-modern-neural-networks
- ArXiv: https://arxiv.org/abs/1706.04599
- Key concepts: calibration, temperature-scaling, uncertainty-estimation, isotonic-regression, expected-calibration-error
- One-line takeaway: Modern deep networks (ResNet, DenseNet) are systematically overconfident — saying 99% when right 71% — and a single learned scalar T applied to logits before softmax (temperature scaling) reduces ECE by 5-10x without changing accuracy on virtually every architecture-dataset pair tested.
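Temperature scaling is small enough to sketch directly. The example below assumes a scalar T has already been fitted on held-out data (fitting is not shown); the logits are made-up values and the function name is hypothetical.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Divide logits by T before softmax. T > 1 softens the distribution;
    the argmax, and therefore accuracy, is unchanged."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.0]          # illustrative values only
print(softmax_with_temperature(logits, T=1.0))
print(softmax_with_temperature(logits, T=2.0))
```

Raising T lowers the top-class confidence without reordering the classes, so calibration can improve while accuracy stays fixed.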
[2026-05-09] ingest | Hidden Markov Map Matching Through Noise and Sparseness
- Source: hidden-markov-map-matching-noise-sparseness
- Reference: ACM SIGSPATIAL 2009 (Microsoft Research)
- Key concepts: hidden-markov-models, map-matching, viterbi
- Entities: microsoft-research
- One-line takeaway: Modeling a GPS trace plus road network as an HMM (Gaussian emission in perpendicular distance, exponential transition in |d_GPS - d_road|) and decoding with Viterbi is the canonical map-matching algorithm — robust to sparse sampling and 50m+ noise — and underlies every modern snap-to-road system.
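The two HMM probabilities named in this entry can be written as log-probability functions that a Viterbi decoder would sum along a path. This is a sketch only: the default `sigma_m` and `beta_m` values are placeholders rather than the paper's fitted parameters, and the function names are hypothetical.

```python
import math

def emission_logp(dist_to_road_m, sigma_m=10.0):
    """Gaussian log-likelihood of a GPS point given a candidate road,
    as a function of the point-to-road distance in meters."""
    return (-0.5 * (dist_to_road_m / sigma_m) ** 2
            - math.log(math.sqrt(2.0 * math.pi) * sigma_m))

def transition_logp(d_gps_m, d_road_m, beta_m=5.0):
    """Exponential log-likelihood of moving between two candidates:
    penalizes mismatch between the straight-line GPS distance and the
    routed road-network distance."""
    return -abs(d_gps_m - d_road_m) / beta_m - math.log(beta_m)

# A candidate 5 m from the fix is far more likely than one 50 m away.
print(emission_logp(5.0), emission_logp(50.0))
# A 105 m route for a 100 m GPS step is plausible; a 400 m detour is not.
print(transition_logp(100.0, 105.0), transition_logp(100.0, 400.0))
```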
[2026-05-09] ingest | SAM 2: Segment Anything in Images and Videos
- Source: sam-2-segment-anything-in-images-and-videos
- ArXiv: https://arxiv.org/abs/2408.00714
- Key concepts: promptable-segmentation, foundation-models, vision-transformer, zero-shot-transfer, kv-cache, long-context
- Entities: meta-ai-fair
- One-line takeaway: Adding a streaming memory bank (a small KV-cache-style store of recent and prompted-frame embeddings that each new frame cross-attends against) converts SAM’s image-only promptable segmentation into video segmentation that needs 3x fewer user interactions than prior approaches, while also running 6x faster than the original SAM on still images.
[2026-05-09] ingest | Qwen2.5-VL Technical Report
- Source: qwen2-5-vl-technical-report
- ArXiv: https://arxiv.org/abs/2502.13923
- Key concepts: vision-language-models, multimodal-embeddings, vision-transformer, patch-embeddings, visual-grounding, long-context
- One-line takeaway: Training a from-scratch dynamic-resolution ViT (with window attention to keep cost tractable) plus an absolute-time M-RoPE for video lets a 72B VLM match GPT-4o on document and chart understanding, emit pixel-accurate bounding boxes as native output, and serve as the open baseline for GUI-agent applications.
[2026-05-08] ingest | Flamingo: A Visual Language Model for Few-Shot Learning
- Source: flamingo-visual-language-model-few-shot-learning
- ArXiv: https://arxiv.org/abs/2204.14198
- Key concepts: multimodal-embeddings, vision-language-models, cross-attention, in-context-learning, perceiver-resampler
- Entities: google-deepmind
- One-line takeaway: Inserting gated cross-attention layers into a frozen 70B Chinchilla LM and training on 43M interleaved web pages produces the first model to do GPT-3-style few-shot prompting with images — outperforming fine-tuned models on 6 of 16 vision benchmarks using only 32 examples.
[2026-05-07] ingest | Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
- Source: self-rag-learning-to-retrieve-generate-critique
- ArXiv: https://arxiv.org/abs/2310.11511
- Key concepts: rag, in-context-learning, long-context, ai-feedback
- One-line takeaway: Training a single LM to emit four reflection tokens (Retrieve / IsRel / IsSup / IsUse) — via a GPT-4-annotated critic distilled offline — produces adaptive retrieval-augmented generation that outperforms retrieval-augmented ChatGPT on open-domain QA and achieves dramatically higher citation accuracy on long-form generation, with no separate critic at inference.
[2026-05-06] ingest | RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP
- Source: rag-retrieval-augmented-generation
- ArXiv: https://arxiv.org/abs/2005.11401
- Key concepts: rag, encoder-decoder, in-context-learning, long-context, pre-training
- One-line takeaway: Pairing a pretrained BART generator with a DPR dense retriever over 21M Wikipedia passages — trained end-to-end on just (question, answer) pairs — creates a system that outperforms T5-11B on open-domain QA despite being 30× smaller, and whose knowledge can be updated by swapping the index without any retraining.
[2026-05-05] ingest | Self-Rewarding Language Models
- Source: self-rewarding-language-models
- ArXiv: https://arxiv.org/abs/2401.10020
- Key concepts: rlhf, dpo, reward-model, ai-feedback, alignment
- One-line takeaway: Merging the reward model into the LLM itself via iterative DPO — where the model scores its own outputs via LLM-as-a-Judge — lets both instruction-following and reward-modeling ability improve together, with LLaMA 2 70B M₃ outperforming Claude 2 and Gemini Pro on AlpacaEval 2.0 using only 3,200 seed examples.
[2026-05-04] ingest | Training Compute-Optimal Large Language Models (Chinchilla)
- Source: training-compute-optimal-large-language-models
- ArXiv: https://arxiv.org/abs/2203.15556
- Key concepts: scaling-laws, compute-optimal-training, pre-training, inference-efficiency
- Entities: google-deepmind
- One-line takeaway: Training over 400 models with IsoFLOP profiles reveals that compute-optimal training requires equal scaling of model size and tokens (~20 tokens per parameter), overturning Kaplan’s parameter-heavy allocation and showing that Chinchilla 70B outperforms Gopher 280B at identical compute.
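The ~20 tokens-per-parameter rule gives a quick budget calculation. The sketch below pairs it with the common 6·N·D approximation for training FLOPs, which is an assumption added here and is not stated in the entry; both function names are hypothetical.

```python
def compute_optimal_tokens(n_params, tokens_per_param=20):
    """Rule-of-thumb token budget for a compute-optimal model."""
    return n_params * tokens_per_param

def approx_train_flops(n_params, n_tokens):
    """Rough training cost using the common 6 * N * D approximation."""
    return 6 * n_params * n_tokens

n = 70_000_000_000                    # a 70B-parameter model
d = compute_optimal_tokens(n)         # about 1.4 trillion tokens
print(d, f"{approx_train_flops(n, d):.2e}")
```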
[2026-05-04] ingest | AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration
- Source: awq-activation-aware-weight-quantization
- ArXiv: https://arxiv.org/abs/2306.00978
- Key concepts: quantization, inference-efficiency, memory-efficiency
- One-line takeaway: Looking at activation magnitudes (not weight magnitudes) to identify the 1% of weight channels that matter, then protecting them via an equivalent scaling transform instead of mixed-precision, recovers nearly all of INT3’s accuracy loss while staying hardware-friendly — and ships in vLLM, TensorRT-LLM, and llama.cpp.
[2026-05-03] ingest | Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task (Othello-GPT)
- Source: emergent-world-representations-othello-gpt
- ArXiv: https://arxiv.org/abs/2210.13382
- Key concepts: mechanistic-interpretability, probing, emergent-behavior, transformer
- One-line takeaway: A GPT trained only on Othello move sequences develops a nonlinear internal model of the board state — one that can be surgically edited via activation interventions to change the model’s legal-move predictions, proving sequence models can build genuine world models.
[2026-05-02] ingest | Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
- Source: alibi-train-short-test-long
- ArXiv: https://arxiv.org/abs/2108.12409
- Key concepts: positional-encoding, attention, long-context, inductive-bias
- One-line takeaway: Replacing positional embeddings with a fixed per-head distance penalty on attention scores lets a 1.3B model trained on 1K-token sequences match a sinusoidal model trained on 2K tokens — 11% faster, 11% less memory, with clean extrapolation built in.
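The per-head distance penalty can be sketched as a bias matrix added to attention scores before softmax. This is a minimal illustration for power-of-two head counts; the causal `-inf` entries assume a decoder-only setup, and the function names are hypothetical.

```python
def alibi_slopes(n_heads):
    """Geometric sequence of head slopes, e.g. 1/2, 1/4, ..., 1/256
    for 8 heads (power-of-two head counts only in this sketch)."""
    start = 2 ** (-8 / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    """bias[i][j] = -slope * (i - j) for keys at or before the query;
    future keys are masked with -inf (assumed causal decoder)."""
    return [[-slope * (i - j) if j <= i else float("-inf")
             for j in range(seq_len)]
            for i in range(seq_len)]

print(alibi_slopes(8))
print(alibi_bias(3, 0.5))
```

Since the penalty depends only on the distance i - j, the same formula applies unchanged to sequences longer than those seen in training.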
[2026-04-30] ingest | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- Source: blip-2-bootstrapping-language-image-pretraining
- ArXiv: https://arxiv.org/abs/2301.12597
- Key concepts: multimodal-embeddings, vision-language-models, contrastive-learning, cross-attention, vision-transformer, pre-training, zero-shot-transfer
- One-line takeaway: Training only a 188M Q-Former bridge between a frozen ViT and frozen LLM beats 80B-parameter Flamingo on zero-shot VQA by 8.7% — proving modular VLM training works.
[2026-04-29] ingest | ORPO: Monolithic Preference Optimization without Reference Model
- Source: orpo-monolithic-preference-optimization
- ArXiv: https://arxiv.org/abs/2403.07691
- Key concepts: alignment, dpo, sft, rlhf, reward-model
- One-line takeaway: Adding an odds ratio penalty directly to SFT loss is sufficient for alignment — no reference model, no separate preference phase, beats DPO+SFT at 7B.
Entries are appended chronologically as sources are ingested.
[2026-04-28] ingest | Highly Accurate Protein Structure Prediction with AlphaFold
- Source: alphafold-2-protein-structure-prediction
- Reference: Nature 2021 (https://www.nature.com/articles/s41586-021-03819-2)
- Key concepts: attention, transformer, protein-structure, evoformer, self-supervised-learning
- Entities: google-deepmind
- One-line takeaway: Evoformer blocks alternating row and column attention over MSAs, plus an equivariant structure module, achieve a median GDT_TS of 92.4 on CASP14, a result widely described as solving the 50-year-old protein folding problem.
[2026-04-27] ingest | Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context
- Source: gemini-1-5-multimodal-long-context
- Key concepts: mixture-of-experts | long-context | in-context-learning | multimodal-embeddings
- One-line takeaway: A sparse MoE architecture unlocks near-perfect retrieval at 1M–10M tokens; when given a grammar book for Kalamang (fewer than 200 speakers), the model learns to translate the language from context alone.
[2026-04-26] ingest | Neural Machine Translation of Rare Words with Subword Units (BPE)
- Source: bpe-neural-machine-translation-subword-units
- Key concepts: tokenization | subword-units | vocabulary | compression
- One-line takeaway: Iteratively merging the most frequent character pairs (BPE) gives every word a representation from known subword pieces — eliminating unknown tokens without the slowness of character-level models, producing the tokenizer every major LLM uses today.
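One BPE merge step can be sketched as "count adjacent symbol pairs, then fuse the most frequent one". The toy word-frequency table and function names below are illustrative assumptions; real tokenizers add end-of-word markers and repeat the step thousands of times.

```python
from collections import Counter

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)  # ties go to the first pair seen

def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with one fused symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word (as a tuple of characters) -> frequency.
vocab = {tuple("low"): 5, tuple("lower"): 2,
         tuple("newest"): 6, tuple("widest"): 3}
pair = most_frequent_pair(vocab)
vocab_after = merge_pair(vocab, pair)
print(pair, vocab_after)
```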
[2026-04-25] ingest | Learning to Summarize from Human Feedback
- Source: learning-to-summarize-human-feedback
- Key concepts: rlhf | reward-model | ppo | alignment | sft
- One-line takeaway: Train a reward model on human pairwise comparisons, then use PPO to optimize the language model against it — this three-stage pipeline produces summaries humans prefer over SFT baselines and over models 30× larger.
[2026-04-24] ingest | Mixture of Depths: Dynamically Allocating Compute in Transformer-Based Language Models
- Source: mixture-of-depths-dynamic-compute-allocation
- Key concepts: dynamic-computation | mixture-of-experts | inference-efficiency | transformer
- One-line takeaway: Routing 87.5% of tokens around transformer blocks via a learned top-k scalar — allocating compute across depth rather than uniformly — produces models that step 60% faster at equal loss with a static compute graph that GPUs can actually exploit.
[2026-04-24] ingest | Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR
- Source: consensus-entropy-multi-vlm-agreement-ocr
- Key concepts: ensemble-methods | uncertainty-estimation | self-consistency | vision-language-models
- One-line takeaway: When multiple independent VLMs agree on an OCR output the answer is almost certainly correct; when they diverge, routing to a stronger model or flagging for review — all without labels or retraining — boosts verification F1 by 42% and OCR accuracy by 8%.
[2026-04-23] ingest | DINOv2: Learning Robust Visual Features without Supervision
- Source: dinov2-learning-robust-visual-features
- Key concepts: self-supervised-learning | vision-transformer | distillation | zero-shot-transfer | patch-embeddings
- One-line takeaway: Data curation was the missing piece — bootstrapping 142M curated images with existing SSL features, then combining DINO + iBOT training, produces frozen ViT features that match CLIP on ImageNet and beat it on robustness benchmarks, without any text supervision.
[2026-04-22] ingest | Distilling the Knowledge in a Neural Network
- Source: knowledge-distillation-hinton
- Key concepts: distillation | temperature-scaling | compression | ensemble-methods
- One-line takeaway: Training a small student on a teacher’s soft probability outputs — not hard labels — transfers the structured similarity knowledge that hard labels discard, letting a small model match ensemble accuracy at single-model cost.
[2026-04-22] ingest | KTO: Model Alignment as Prospect Theoretic Optimization
- Source: kto-model-alignment-as-prospect-theoretic-optimization
- Key concepts: alignment | dpo | rlhf | reward-model
- One-line takeaway: You don’t need paired preferences to align a language model — a thumbs up or thumbs down on individual outputs, framed through prospect theory, matches or beats DPO while requiring half the annotation effort.
[2026-04-21] ingest | GPT-4 Technical Report
- Source: gpt-4-technical-report
- Key concepts: scaling-laws | rlhf | alignment | multimodal-embeddings | emergent-behavior
- One-line takeaway: GPT-4 achieves human-level performance on professional exams via RLHF and predictable scaling — with the key finding that capabilities can be forecast from models trained with 1/10,000th the compute.
[2026-04-21] ingest | Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- Source: batch-normalization-accelerating-deep-network-training
- Key concepts: batch-normalization | optimization | vanishing-gradients | stochastic-gradient-descent
- One-line takeaway: Normalizing each layer’s inputs to zero mean and unit variance per mini-batch, then rescaling with learned parameters, eliminated internal covariate shift and made training 14x faster — later enabling 150+ layer networks via ResNet.
[2026-04-20] ingest | BART: Denoising Sequence-to-Sequence Pre-training
- Source: bart-denoising-sequence-to-sequence-pre-training
- Key concepts: encoder-decoder | pre-training | denoising | masked-language-model | fine-tuning
- One-line takeaway: Corrupting documents with span masking, token deletion, sentence permutation, and document rotation — then training a seq2seq model to reconstruct the original — produces a pre-trained model that excels at generation, summarization, and translation.
[2026-04-20] ingest | GQA: Grouped-Query Attention
- Source: gqa-grouped-query-attention
- Key concepts: attention | kv-cache | inference-efficiency | gqa
- One-line takeaway: Grouping 32 query heads into 8 groups that each share one KV head cuts memory bandwidth 4x and inference latency 5x with only 0.1 quality loss; now standard in Llama 2/3, Mistral, and virtually every production LLM.
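The head grouping amounts to an index mapping from query heads to shared KV heads. The sketch below uses the 32-to-8 configuration from the entry and assumes contiguous grouping; the function name is hypothetical.

```python
def kv_head_for_query_head(q_head, n_q_heads=32, n_kv_heads=8):
    """Map a query-head index to the KV head its group shares
    (contiguous groups assumed)."""
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

mapping = [kv_head_for_query_head(h) for h in range(32)]
print(mapping)
print("KV-cache reduction factor:", 32 // 8)
```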
[2026-04-19] ingest | Deep Residual Learning for Image Recognition
- Source: deep-residual-learning-for-image-recognition
- Key concepts: residual-connections | vanishing-gradients | batch-normalization
- One-line takeaway: Skip connections that add each layer’s input to its output — F(x) + x — solved the degradation problem and made 152-layer networks trainable, winning all five ILSVRC/COCO 2015 competitions.
[2026-04-18] ingest | Evaluating Large Language Models Trained on Code (Codex)
- Source: codex-evaluating-large-language-models-trained-on-code
- Key concepts: pre-training | fine-tuning | sampling | scaling-laws | code-generation
- One-line takeaway: Fine-tuning GPT-3 on code from 54M GitHub repositories produces Codex-12B, which solves 28.8% of HumanEval Python problems with a single sample and 70.2% when 100 completions are sampled per problem at temperature 0.8 (pass@100), introducing the benchmark that defined LLM coding evaluation.
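The pass@k metric referenced here is usually computed with an unbiased estimator: draw n samples, count the c that pass the unit tests, and estimate the chance that at least one of k randomly chosen samples is correct. A minimal sketch, with a hypothetical function name:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k),
    for n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```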
[2026-04-18] ingest | Scalable Diffusion Models with Transformers (DiT)
- Source: dit-scalable-diffusion-models-with-transformers
- Key concepts: diffusion-models | vision-transformer | scaling-laws | latent-space | patch-embeddings
- One-line takeaway: Replacing U-Net with a Vision Transformer in latent diffusion — treating image patches as tokens and conditioning on class labels via adaptive layer norm — produces a model whose FID scales cleanly with compute, following transformer power laws.
[2026-04-18] ingest | Mistral 7B
- Source: mistral-7b
- Key concepts: gqa | sliding-window-attention | inference-efficiency | kv-cache
- One-line takeaway: Grouped-query attention plus sliding window attention (4096-token local context + flash for long sequences) makes Mistral 7B outperform LLaMA 2 13B on every benchmark at half the inference cost.
[2026-04-17] ingest | LIMA: Less Is More for Alignment
- Source: lima-less-is-more-for-alignment
- Key concepts: sft | alignment | instruction-following | data-quality
- One-line takeaway: Fine-tuning LLaMA-65B on just 1,000 carefully curated instruction-response pairs yields responses judged equivalent or preferred to GPT-4 in 43% of head-to-head comparisons, supporting the Superficial Alignment Hypothesis: alignment is mostly style, not knowledge.
[2026-04-17] ingest | Tree of Thoughts: Deliberate Problem Solving with Large Language Models
- Source: tree-of-thoughts-deliberate-problem-solving
- Key concepts: chain-of-thought | in-context-learning | reasoning-rl | tool-use-agents | sampling
- One-line takeaway: Generalizing CoT from a chain to a tree — where the LLM proposes, evaluates, and backtracks across candidate partial solutions via BFS/DFS — solves Game of 24 at 74% vs CoT’s 4%.
[2026-04-17] ingest | Self-Consistency Improves Chain of Thought Reasoning in Language Models
- Source: self-consistency-chain-of-thought-reasoning
- Key concepts: chain-of-thought | in-context-learning | sampling | reasoning-rl
- One-line takeaway: Sample 20-40 diverse reasoning paths from a single CoT prompt and majority-vote the final answer — no training, no verifier — improving GSM8K from 56% to 74% with PaLM.
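The voting step reduces to a majority count over the final answers extracted from each sampled reasoning path. The sketch below assumes the answers are already parsed into strings; the sample values and function name are made up for illustration.

```python
from collections import Counter

def self_consistent_answer(final_answers):
    """Return the most common final answer across sampled paths."""
    return Counter(final_answers).most_common(1)[0][0]

# Hypothetical final answers parsed from five sampled CoT completions.
samples = ["18", "18", "26", "18", "24"]
print(self_consistent_answer(samples))
```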
[2026-04-17] ingest | QLoRA: Efficient Finetuning of Quantized LLMs
- Source: qlora-efficient-finetuning-quantized-llms
- Key concepts: lora | quantization | fine-tuning | memory-efficiency | inference-efficiency
- One-line takeaway: 4-bit NormalFloat quantization of the frozen backbone plus LoRA adapters in BF16 reduces a 65B model’s fine-tuning footprint to a single 48GB GPU — democratizing RLHF-scale experiments.
[2026-04-17] ingest | Emerging Properties in Self-Supervised Vision Transformers (DINO)
- Source: dino-self-supervised-vision-transformers
- Key concepts: self-supervised-learning | vision-transformer | distillation | contrastive-learning | zero-shot-transfer
- One-line takeaway: Self-distillation with no labels — a student ViT matches a momentum teacher’s output distribution — produces attention maps that segment objects without segmentation supervision, and k-NN classifiers that rival supervised ViTs.
[2026-04-17] ingest | Masked Autoencoders Are Scalable Vision Learners
- Source: mae-masked-autoencoders-scalable-vision-learners
- Key concepts: masked-language-model | vision-transformer | self-supervised-learning | patch-embeddings | pre-training
- One-line takeaway: Masking 75% of ViT patches at random and reconstructing raw pixels, with the encoder running only on the visible patches and a lightweight decoder reconstructing the masked ones, yields richer representations than contrastive methods while being 3x faster to train.
[2026-04-17] ingest | A Simple Framework for Contrastive Learning of Visual Representations
- Source: simclr-contrastive-learning-visual-representations
- Key concepts: contrastive-learning | self-supervised-learning | data-augmentation | transfer-learning
- One-line takeaway: Two augmented views of the same image are pulled together and pushed apart from all other images in a batch; with a nonlinear projection head and large batches, SimCLR reaches 76.5% top-1 linear-probe accuracy on ImageNet without labels, a 7% relative improvement over prior self-supervised methods that matches a supervised ResNet-50.
[2026-04-17] ingest | Proximal Policy Optimization Algorithms
- Source: proximal-policy-optimization
- Key concepts: ppo | policy-gradient | reinforcement-learning | rlhf
- One-line takeaway: Clipping the probability ratio in the surrogate objective to [1-ε, 1+ε] prevents destructively large policy updates — giving PPO TRPO-level sample efficiency at a fraction of the implementation complexity. The workhorse behind RLHF.
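The clipped surrogate for a single sample can be sketched as below. The default `eps=0.2` is a commonly used value chosen here for illustration, and the function name is hypothetical.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO per-sample objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    where r is the new-to-old policy probability ratio."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the gain is capped once the ratio exceeds 1 + eps.
print(clipped_surrogate(1.5, 1.0))
# Negative advantage: moving the ratio further up is still fully penalized.
print(clipped_surrogate(1.5, -1.0))
```

Taking the minimum makes the objective a pessimistic bound: the policy gets no extra reward for moving far from the old policy, but is still penalized when a large move makes things worse.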
[2026-04-17] ingest | High-Resolution Image Synthesis with Latent Diffusion Models
- Source: latent-diffusion-models-high-resolution-image-synthesis
- Key concepts: diffusion-models | latent-space | vae | cross-attention
- One-line takeaway: Moving diffusion from pixel space to the latent space of a pre-trained VQ-VAE slashes compute 4-8x while enabling high-resolution synthesis guided by text, class labels, or images via cross-attention — the architecture behind Stable Diffusion.
[2026-04-17] ingest | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)
- Source: t5-exploring-the-limits-of-transfer-learning
- Key concepts: transfer-learning | pre-training | encoder-decoder | fine-tuning | scaling-laws
- One-line takeaway: Casting every NLP task as text-to-text and systematically sweeping architectures, objectives, datasets, and scale reveals that a unified encoder-decoder with C4 pretraining sets SOTA across GLUE, SuperGLUE, SQuAD, and translation.
[2026-04-17] ingest | Language Models are Unsupervised Multitask Learners
- Source: language-models-are-unsupervised-multitask-learners
- Key concepts: pre-training | zero-shot-transfer | in-context-learning | scaling-laws
- One-line takeaway: Training a 1.5B parameter language model on 8M web pages (WebText) without any task labels (just next-token prediction) produces a model that zero-shot transfers to translation, summarization, QA, and reading comprehension.
[2026-04-17] ingest | Constitutional AI: Harmlessness from AI Feedback
- Source: constitutional-ai-harmlessness-from-ai-feedback
- Key concepts: rlhf | alignment | constitutional-ai | ai-feedback
- One-line takeaway: AI can self-correct toward harmlessness using written principles, eliminating the need for humans to label harmful outputs.
[2026-04-16] ingest | Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
- Source: whisper-robust-speech-recognition
- Key concepts: foundation-models | pre-training | transfer-learning
- One-line takeaway: Training on 680K hours of weakly-labeled internet audio with a multi-task objective (transcription, translation, language ID, timestamps) produces a zero-shot ASR model that matches domain-adapted systems without any fine-tuning.
[2026-04-16] ingest | Toolformer: Language Models Can Teach Themselves to Use Tools
- Source: toolformer-language-models-teach-themselves-tool-use
- Key concepts: tool-use-agents | in-context-learning | fine-tuning
- One-line takeaway: Self-supervised bootstrapping — generate candidate API calls, keep only those that reduce next-token loss, fine-tune on survivors — teaches a model when tool use actually helps, without human annotation.
[2026-04-16] ingest | Switch Transformers: Scaling to Trillion Parameter Models with Sparse MoE
- Source: switch-transformer-sparse-mixture-of-experts
- Key concepts: mixture-of-experts | transformer | scaling-laws
- One-line takeaway: Top-1 routing in MoE layers decouples parameter count from FLOPs per token: by activating only one expert per token, Switch Transformer pretrains up to 7x faster than T5-Base and 4x faster than T5-XXL at equal compute.
[2026-04-16] ingest | ReAct: Synergizing Reasoning and Acting in Language Models
- Source: react-reasoning-and-acting
- Key concepts: tool-use-agents | chain-of-thought | in-context-learning
- One-line takeaway: Interleaving reasoning traces with external tool calls in a single context — each observation feeding the next thought — produces grounded, interpretable agents that outperform CoT-only and Act-only baselines.
[2026-04-16] ingest | RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP
- Source: rag-retrieval-augmented-generation
- Key concepts: rag | in-context-learning | foundation-models
- One-line takeaway: Grounding a seq2seq generator in passages retrieved from a dense Wikipedia index improves factual accuracy and enables knowledge updates without retraining the model.
[2026-04-16] ingest | LLaVA: Visual Instruction Tuning
- Source: llava-visual-instruction-tuning
- Key concepts: multimodal-instruction-tuning | contrastive-learning | sft
- One-line takeaway: A single linear projection connecting CLIP’s frozen image encoder to Vicuna, plus instruction-tuning on GPT-4-generated image dialogues, produces a capable open-source multimodal model trained in under a day.
[2026-04-16] ingest | GRPO: Group Relative Policy Optimization (DeepSeekMath)
- Source: grpo-deepseekmath-group-relative-policy-optimization
- Key concepts: grpo | reasoning-rl | rlhf
- One-line takeaway: Replacing PPO’s critic model with a group-sampled reward baseline removes a policy-sized value network, substantially cutting memory use while preserving training stability, which makes RL on reasoning tasks tractable at scale.
[2026-04-16] ingest | DeepSeek-R1: Incentivizing Reasoning via Reinforcement Learning
- Source: deepseek-r1-reasoning-via-reinforcement-learning
- Key concepts: reasoning-rl | grpo | chain-of-thought
- One-line takeaway: Pure RL with GRPO and no SFT cold-start (DeepSeek-R1-Zero) teaches a model to generate long reasoning chains and self-verify; adding a small cold-start SFT stage (DeepSeek-R1) reaches performance comparable to OpenAI o1 on math and coding with open weights.
[2026-04-16] ingest | Denoising Diffusion Probabilistic Models
- Source: ddpm-denoising-diffusion-probabilistic-models
- Key concepts: diffusion-models | transformer
- One-line takeaway: Training a U-Net to reverse a fixed Gaussian noising process by predicting the noise added at each step produces a generative model whose sample quality rivals GANs (state-of-the-art FID on CIFAR-10 at the time) without adversarial instability.
[2026-04-16] ingest | Constitutional AI: Harmlessness from AI Feedback
- Source: constitutional-ai-harmlessness-from-ai-feedback
- Key concepts: alignment | rlhf | sft
- One-line takeaway: 16 written principles plus an AI feedback loop (RLAIF) replace human harmlessness labels: the model self-critiques and revises its own harmful outputs, and a preference model is then trained on AI-generated comparisons.
[2026-04-16] ingest | Segment Anything
- Source: segment-anything
- Key concepts: foundation-models | promptable-segmentation | zero-shot-transfer | open-vocabulary-segmentation | vision-transformer
- One-line takeaway: SAM separates “where to look” (your prompt) from “how to segment” (the model) — encoding the image once then answering any spatial query in 50ms, zero-shot, across any domain.
[2026-04-15] ingest | Scaling Laws for Neural Language Models
- Source: scaling-laws-for-neural-language-models
- Key concepts: scaling-laws | compute-optimal-training | emergent-behavior | power-laws
- One-line takeaway: Loss follows clean power laws in N, D, and C — meaning you can predict large-scale performance from small experiments and optimize your compute budget analytically.
[2026-04-14] ingest | Adam: A Method for Stochastic Optimization
- Source: adam-a-method-for-stochastic-optimization
- Key concepts: optimization | stochastic-gradient-descent | adaptive-learning-rate | momentum | bias-correction
- One-line takeaway: Adam combines adaptive per-parameter learning rates with momentum, plus a mathematically clean bias correction for early training — making it the default optimizer for most deep learning.
[2026-04-13] ingest | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Source: bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding
- Key concepts: attention, pre-training, masked-language-model, fine-tuning
- One-line takeaway: Masking 15% of tokens and predicting them bidirectionally unlocks transfer learning for NLP.
[2026-04-11] ingest | An Image is Worth 16x16 Words (ViT)
- Source: an-image-is-worth-16x16-words
- Key concepts: patch-embeddings, vision-transformer, inductive-bias, transfer-learning, attention, classification-token
- One-line takeaway: Treating image patches as word tokens and running a standard transformer on them matches or beats state-of-the-art CNNs at scale, showing that with 300M pre-training images, learned representations trump built-in architectural priors.
[2026-04-10] ingest | Training language models to follow instructions with human feedback (InstructGPT)
- Source: training-language-models-to-follow-instructions-with-human-feedback
- Key concepts: rlhf, reward-model, alignment, ppo, sft
- One-line takeaway: A 3-stage RLHF pipeline (SFT → reward model → PPO) makes a 1.3B model preferred over GPT-3 175B, showing alignment via human feedback outperforms raw scaling by orders of magnitude.
[2026-04-09] ingest | CLIP: Learning Transferable Visual Models From Natural Language Supervision
- Source: clip-learning-transferable-visual-models
- Key concepts: contrastive-learning, zero-shot-transfer, multimodal-embeddings, vision-transformer
- One-line takeaway: Match 400M internet image-caption pairs with a simple contrastive loss → a vision model that classifies any concept you can describe in English, zero-shot.
[2026-04-08] ingest | Emergent Abilities of Large Language Models
- Source: emergent-abilities-of-large-language-models
- Key concepts: emergent-behavior, scaling-laws, phase-transition, in-context-learning
- One-line takeaway: Some LLM capabilities don’t scale smoothly — they’re absent, then suddenly present at a threshold, and you can’t predict which ones or when from smaller-scale experiments.
[2026-04-06] ingest | RoPE: Enhanced Transformer with Rotary Position Embedding
- Source: rope-rotary-position-embedding
- ArXiv: https://arxiv.org/abs/2104.09864
- Key concepts: attention, positional-encoding
- One-line takeaway: Rotate Q and K by an angle proportional to position, so their dot product depends only on relative distance. Used in LLaMA, Mistral, Falcon, GPT-NeoX.
[2026-04-04] ingest | Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- Source: direct-preference-optimization-your-language-model-is-secretly-a-reward-model
- ArXiv: https://arxiv.org/abs/2305.18290
- Key concepts: rlhf, dpo, sft, distillation
- One-line takeaway: DPO reformulates RLHF as a supervised classification problem, eliminating the need for a separate reward model.
[2026-04-05] ingest | Attention Is All You Need
- Source: attention-is-all-you-need
- ArXiv: https://arxiv.org/abs/1706.03762
- Key concepts: transformer, attention, positional-encoding
- Entities: google-brain, noam-shazeer, ashish-vaswani
- One-line takeaway: Introduces the Transformer, the attention-only architecture that became the foundation of nearly all modern LLMs.
[2026-04-05] ingest | LoRA: Low-Rank Adaptation of Large Language Models
- Source: lora-low-rank-adaptation
- ArXiv: https://arxiv.org/abs/2106.09685
- Key concepts: lora, sft, transformer
- Entities: microsoft-research
- One-line takeaway: LoRA enables efficient fine-tuning of large models by injecting trainable low-rank matrices, reducing trainable parameters by up to 10,000x with no added inference latency.
[2026-04-05] ingest | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- Source: flash-attention-fast-and-memory-efficient-exact-attention
- ArXiv: https://arxiv.org/abs/2205.14135
- Key concepts: flash-attention, attention, inference-efficiency
- Entities: tri-dao, stanford-hazy-research
- One-line takeaway: IO-aware tiled attention algorithm that eliminates the O(N²) HBM bottleneck, enabling 2-3x faster training and long-context models.
[2026-04-05] ingest | Efficient Memory Management for Large Language Model Serving with PagedAttention
- Source: pagedattention-vllm
- ArXiv: https://arxiv.org/abs/2309.06180
- Key concepts: kv-cache, inference-efficiency
- Entities: uc-berkeley-sky-lab
- One-line takeaway: PagedAttention applies OS-style paging to KV cache memory, enabling 2-4x higher serving throughput via vLLM by eliminating memory fragmentation.
[2026-04-05] ingest | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- Source: chain-of-thought-prompting
- ArXiv: https://arxiv.org/abs/2201.11903
- Key concepts: chain-of-thought, in-context-learning, emergent-behavior
- Entities: google-brain, jason-wei
- One-line takeaway: Few-shot prompting with reasoning chains elicits strong reasoning in 100B+ models without fine-tuning, establishing chain-of-thought as a foundational prompting technique.
[2026-04-05] ingest | Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
- Source: grokking-generalization-beyond-overfitting
- ArXiv: https://arxiv.org/abs/2201.02177
- Key concepts: grokking
- Entities: openai
- One-line takeaway: Neural networks can achieve perfect generalization long after overfitting via a phase transition called grokking, opening new questions about training dynamics and memorization.
[2026-04-05] ingest | A Systematic Empirical Study of Grokking: Depth, Architecture, Activation, and Regularization
- Source: grokking-systematic-empirical-study
- ArXiv: https://arxiv.org/abs/2603.25009
- Key concepts: grokking
- One-line takeaway: Grokking is governed by optimization-regularization interactions, not architecture; weight decay is the dominant control parameter in a narrow Goldilocks regime.
[2026-04-05] ingest | Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- Source: mamba-linear-time-sequence-modeling
- ArXiv: https://arxiv.org/abs/2312.00752
- Key concepts: ssm-mamba, inference-efficiency
- Entities: tri-dao, albert-gu
- One-line takeaway: Selective SSMs with input-dependent parameters enable linear-time sequence modeling that matches Transformer quality while achieving 5x higher inference throughput.
[2026-04-05] ingest | Fast Inference from Transformers via Speculative Decoding
- Source: speculative-decoding
- ArXiv: https://arxiv.org/abs/2211.17192
- Key concepts: speculative-decoding, inference-efficiency
- Entities: google-research
- One-line takeaway: A small draft model proposes tokens that the large target model verifies in parallel, achieving 2-3x speedup with provably identical output distribution.
[2026-04-05] ingest | Falcon Perception: Early-Fusion Transformer for Open-Vocabulary Grounding and Segmentation
- Source: falcon-perception-vlm
- URL: https://huggingface.co/blog/tiiuae/falcon-perception
- Key concepts: early-fusion, visual-grounding, open-vocabulary-segmentation, distillation, inference-efficiency
- Entities: tii-uae
- One-line takeaway: A single 0.6B early-fusion Transformer with a hybrid attention mask and Chain-of-Perception interface outperforms SAM 3 on open-vocabulary segmentation (68.0 vs 62.3 Macro-F1), with the largest gains on compositional prompts requiring OCR, spatial, and relational reasoning.
[2026-04-12] ingest | Splitwise: Efficient Generative LLM Inference Using Phase Splitting
- Source: splitwise-llm-inference-phase-splitting
- ArXiv: https://arxiv.org/abs/2311.18677
- Key concepts: kv-cache, inference-efficiency, continuous-batching
- One-line takeaway: Splitting LLM inference’s compute-hungry prompt phase and memory-hungry decode phase onto separate purpose-optimized machines yields 1.4× more throughput at 20% lower cost.
[2026-04-17] stub | High-Resolution Image Synthesis with Latent Diffusion Models
- Source: latent-diffusion-models-high-resolution-image-synthesis
- ArXiv: https://arxiv.org/abs/2112.10752
- One-line takeaway: Running diffusion in the latent space of a pretrained autoencoder (KL- or VQ-regularized) enables high-resolution image synthesis at a fraction of pixel-space compute; this is the architecture underlying Stable Diffusion.
[2026-04-17] stub | Scalable Diffusion Models with Transformers (DiT)
- Source: dit-scalable-diffusion-models-with-transformers
- ArXiv: https://arxiv.org/abs/2212.09748
- One-line takeaway: Replacing the U-Net backbone with a Vision Transformer in latent diffusion yields predictable FID scaling with compute — the backbone of Sora and next-generation video models.
[2026-04-17] stub | Mistral 7B
- Source: mistral-7b
- ArXiv: https://arxiv.org/abs/2310.06825
- One-line takeaway: Combining grouped-query attention and sliding window attention produces a 7B model that outperforms LLaMA 2 13B with lower inference cost.
[2026-04-17] stub | Distilling the Knowledge in a Neural Network
- Source: knowledge-distillation-hinton
- ArXiv: https://arxiv.org/abs/1503.02531
- One-line takeaway: Training a small student on the soft probability outputs of a large teacher transfers more information than training on hard labels, consistently outperforming same-data baselines.
[2026-04-17] stub | QLoRA: Efficient Finetuning of Quantized LLMs
- Source: qlora-efficient-finetuning-quantized-llms
- ArXiv: https://arxiv.org/abs/2305.14314
- One-line takeaway: Loading a 4-bit quantized base model with trainable LoRA adapters enables fine-tuning 65B models on a single 48GB GPU with no quality loss.
[2026-04-17] stub | Mixture of Depths: Dynamically Allocating Compute in Transformer LLMs
- Source: mixture-of-depths-dynamic-compute-allocation
- ArXiv: https://arxiv.org/abs/2404.02258
- One-line takeaway: Per-token, per-layer routing that lets most tokens skip a block (as few as 12.5% of tokens are processed in routed blocks) matches dense baseline quality while using a fraction of the FLOPs per forward pass.
[2026-04-17] stub | LIMA: Less Is More for Alignment
- Source: lima-less-is-more-for-alignment
- ArXiv: https://arxiv.org/abs/2305.11206
- One-line takeaway: Fine-tuning on 1,000 carefully selected examples outperforms models trained on 52K+ examples, showing that alignment is about style selection from a capable base.
[2026-04-17] stub | Learning to Summarize from Human Feedback
- Source: learning-to-summarize-human-feedback
- ArXiv: https://arxiv.org/abs/2009.01325
- One-line takeaway: An early large-scale application of RLHF to language models: reward modeling on human preference comparisons plus PPO fine-tuning produces summaries strongly preferred over SFT baselines.
[2026-04-17] stub | Proximal Policy Optimization Algorithms
- Source: proximal-policy-optimization
- ArXiv: https://arxiv.org/abs/1707.06347
- One-line takeaway: Clipping the policy probability ratio to [1−ε, 1+ε] prevents destructively large RL updates, providing the stability that makes RLHF practical.
[2026-04-17] stub | KTO: Model Alignment as Prospect Theoretic Optimization
- Source: kto-model-alignment-prospect-theoretic-optimization
- ArXiv: https://arxiv.org/abs/2402.01306
- One-line takeaway: Prospect-theoretic loss on binary good/bad labels matches DPO quality without requiring paired preference comparisons, enabling use of existing labeled datasets.
[2026-04-17] stub | Llama 2: Open Foundation and Fine-Tuned Chat Models
- Source: llama-2-open-foundation-fine-tuned-chat-models
- ArXiv: https://arxiv.org/abs/2307.09288
- One-line takeaway: Open-weight models with full alignment pipeline documentation — SFT, iterative RLHF, safety fine-tuning — competitive with proprietary systems.
[2026-04-17] stub | GPT-4 Technical Report
- Source: gpt-4-technical-report
- ArXiv: https://arxiv.org/abs/2303.08774
- One-line takeaway: A large-scale multimodal model, pretrained and then aligned with RLHF, reaches human-level performance on many professional and academic exams; aspects of its performance were predictable from much smaller training runs.
[2026-04-17] stub | Self-Consistency Improves Chain of Thought Reasoning
- Source: self-consistency-chain-of-thought-reasoning
- ArXiv: https://arxiv.org/abs/2203.11171
- One-line takeaway: Majority voting over K sampled chain-of-thought traces improves accuracy by roughly 4 to 18 percentage points over single-sample chain-of-thought (e.g. +17.9 on GSM8K) with no training changes.
[2026-04-17] stub | Tree of Thoughts: Deliberate Problem Solving with LLMs
- Source: tree-of-thoughts-deliberate-problem-solving
- ArXiv: https://arxiv.org/abs/2305.10601
- One-line takeaway: Framing reasoning as a tree search with lookahead and backtracking dramatically improves performance on planning tasks where greedy generation fails.
[2026-04-17] stub | Deep Residual Learning for Image Recognition
- Source: deep-residual-learning-for-image-recognition
- ArXiv: https://arxiv.org/abs/1512.03385
- One-line takeaway: Residual connections solve training degradation in very deep networks — ResNet-152 won ILSVRC 2015 and defined computer vision backbones for a decade.
[2026-04-17] stub | Masked Autoencoders Are Scalable Vision Learners
- Source: mae-masked-autoencoders-scalable-vision-learners
- ArXiv: https://arxiv.org/abs/2111.06377
- One-line takeaway: Masking 75% of image patches and reconstructing them produces visual representations competitive with supervised pretraining, with no labeled data.
[2026-04-17] stub | A Simple Framework for Contrastive Learning of Visual Representations
- Source: simclr-contrastive-learning-visual-representations
- ArXiv: https://arxiv.org/abs/2002.05709
- One-line takeaway: Contrastive learning with large batches, strong augmentation, and a nonlinear projection head produces visual representations approaching supervised ImageNet baselines.
[2026-04-17] stub | Emerging Properties in Self-Supervised Vision Transformers (DINO)
- Source: dino-self-supervised-vision-transformers
- ArXiv: https://arxiv.org/abs/2104.14294
- One-line takeaway: Self-distillation without negative pairs produces ViT features that spontaneously segment objects — an emergent property absent from supervised or CNN-based alternatives.
[2026-04-17] stub | DINOv2: Learning Robust Visual Features without Supervision
- Source: dinov2-learning-robust-visual-features
- ArXiv: https://arxiv.org/abs/2304.07193
- One-line takeaway: Scaling DINO with curated data and combined self-distillation and masked image modeling produces a universal frozen visual backbone competitive across depth, segmentation, and classification.