Summary
Zhuge et al. (2026) propose Neural Computers (NCs) as a new machine form that aims to unify computation, memory, and I/O inside a learned runtime state. The core distinction from prior work: conventional computers execute explicit programs, agents act over external environments, and world models predict environment dynamics — but an NC makes the model itself the running computer. The long-term target is a Completely Neural Computer (CNC) that satisfies four conditions simultaneously: Turing completeness, universal programmability (capabilities installable and callable later), behavioral consistency unless explicitly reprogrammed, and machine-native semantics rather than neural imitation of conventional stack idioms.
As an initial empirical step, the paper instantiates early NC primitives as video models (based on Wan2.1, a diffusion transformer) trained on I/O traces from CLI and GUI environments without access to instrumented program state. Three prototype systems are studied: CLIGen General (terminal rendering from ~1,100 hours of noisy video), CLIGen Clean (REPL-style state continuation from scripted traces), and GUIWorld (action-conditioned GUI rollout on Ubuntu 22.04 with four injection architectures). The strongest GUI model injects actions via cross-attention inside every transformer block (Model 4 / “Internal”), trained on ~110 hours of goal-directed Claude CUA traces rather than ~1,400 hours of random data.
The paper is as much a conceptual roadmap as an empirical contribution. It argues that agents bottleneck on capability retention and workflow reuse, world models bottleneck on closing the execution loop, and conventional computers exhibit structural friction on open-ended tasks, making NC a convergent direction. The authors estimate a real Neural Computer is ~3 years away from the paper's April 2026 writing, and a CNC further still.
Key Claims
- With ~1,100 hours of noisy terminal video, Wan2.1 fine-tuned as CLIGen can render stable terminal scenes (colors, cursor, TUI, progress bars) convincingly enough to pass quick visual inspection — surprising given how text-dense and motion-poor such scenes are.
- CLIGen Clean learns basic operating regularities (pwd, date, whoami, echo $HOME, enter-echo-output cycle) from scripted Docker traces; simple arithmetic (two-digit addition in Python REPL) starts to appear but remains unstable, suggesting symbolic reasoning may be the wrong capability to expect from DiT-based video models.
- For GUI action injection, Model 4 (cross-attention inside each block) outperforms shallower injection schemes (input-side modulation, token merging, residual side-branch); 110 hours of supervised goal-directed data beats ~1,400 hours of random mouse data; explicit visual cursor supervision beats coordinate-only supervision.
- Against the four CNC conditions, the prototype is at the edge of Turing completeness, barely touches universal programmability, achieves behavioral consistency only in controlled local settings, and has not yet formed machine-native semantics.
- The paper proposes that a mature CNC substrate would look less like today's 1B–10T-parameter dense/MoE models and more like a 10T–1000T-parameter sparse, addressable, circuit-like machine whose parts can be locally inspected and routed.
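The "enter-echo-output cycle" that CLIGen Clean is said to learn can be made concrete with a toy trace generator. This is an illustrative stand-in, not the paper's pipeline: the state values below are fabricated placeholders, whereas the paper's traces come from real scripted Docker runs.

```python
# Toy generator of REPL-style I/O traces: each step emits the prompt with the
# echoed keystrokes, then the command's output -- the "enter-echo-output cycle"
# that a video model must reproduce frame by frame.
# FAKE_STATE is a hardcoded stand-in for a real shell environment.
FAKE_STATE = {"pwd": "/root", "whoami": "root", "echo $HOME": "/root"}

def trace(commands):
    lines = []
    for cmd in commands:
        lines.append(f"$ {cmd}")               # prompt + echoed keystrokes
        lines.append(FAKE_STATE.get(cmd, ""))  # deterministic output
    return lines

frames = trace(["pwd", "whoami", "echo $HOME"])
print("\n".join(frames))
```

The point of the toy: the target regularity is purely structural (command in, echo, output out), which is exactly the kind of low-motion, text-dense pattern the Key Claims flag as surprisingly learnable, while the content of the output (e.g. arithmetic results) is where the prototypes remain unstable.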
Methods
The prototype pipeline treats NC instantiation as a video generation problem. For CLI, Wan2.1 is fine-tuned on terminal recordings (first on ~1,100h of general terminal video, then on cleaner scripted traces generated via Docker). For GUI, the team built a full recording rig on Ubuntu 22.04 / XFCE4 at 1024×768 / 15 FPS capturing mouse, keyboard, and screen state. Four action-injection architectures were trained in parallel: Model 1 injects actions as input-side latent modulation (shallow baseline); Model 2 merges action tokens into the main sequence (à la WHAM); Model 3 adds actions via a residual side branch (à la ControlNet); Model 4 injects actions via cross-attention inside each DiT block (à la Matrix-Game 2.0). Training data is split across ~1,000h random-slow, ~400h random-fast, and ~110h Claude-CUA goal-directed trajectories. Evaluation is qualitative (visual comparison against ground-truth screen recordings) plus action-accuracy metrics on simple command sequences.
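The Model 4 idea can be sketched in a few lines of NumPy: video-latent tokens act as queries against embedded action tokens (keys/values) through a residual cross-attention read inside every block. All names, shapes, and the single-head, unbatched form below are illustrative assumptions, not the paper's implementation; the self-attention/MLP part of the block is stubbed out.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, actions, Wq, Wk, Wv):
    """Latent tokens (queries) read from action tokens (keys/values)."""
    Q = latents @ Wq                                 # (T_lat, d)
    K = actions @ Wk                                 # (T_act, d)
    V = actions @ Wv                                 # (T_act, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (T_lat, T_act)
    return attn @ V                                  # (T_lat, d)

def dit_block(latents, actions, params):
    # Ordinary per-block computation over latents, stubbed as a tiny MLP.
    h = latents + np.tanh(latents @ params["W_mlp"])
    # Model-4-style injection: residual cross-attention into the action stream,
    # repeated in every block rather than only at the input (Model 1) or as a
    # separate side branch (Model 3).
    h = h + cross_attend(h, actions, params["Wq"], params["Wk"], params["Wv"])
    return h

rng = np.random.default_rng(0)
d = 8
params = {k: 0.1 * rng.standard_normal((d, d)) for k in ["W_mlp", "Wq", "Wk", "Wv"]}
latents = rng.standard_normal((16, d))  # 16 video-latent tokens
actions = rng.standard_normal((4, d))   # 4 embedded action tokens (mouse/key events)
out = dit_block(latents, actions, params)
print(out.shape)  # (16, 8)
```

The design contrast the comparison probes is where conditioning enters: because the cross-attention read recurs in every block, action information can steer computation at all depths, which is one plausible reading of why the "Internal" variant outperforms the shallower schemes.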
The conceptual framework defines a CNC via four formal conditions and a machine-form taxonomy contrasting NC against conventional computers (organized around explicit programs), agents (tasks), and world models (environments). The paper does not report loss curves or standard benchmark numbers — it is deliberately presented as a position paper + prototype demonstration.
Failure modes
- Symbolic reasoning (arithmetic, logic) remains weak in video-model NC prototypes; stable two-digit addition in a REPL is not yet achieved, and the authors acknowledge this may be a fundamental mismatch between DiT architectures and symbolic computation.
- None of the four CNC conditions are meaningfully satisfied by the current prototype — the paper is explicit that the prototype is a “transitional container,” not a working NC.
- Routine reuse, controlled updates across sessions, and symbolic stability are listed as open problems in the abstract itself.
- The ~110h supervised dataset enables better GUI control than ~1,400h of random data, but the evaluation is qualitative and the capability does not generalize to interface layouts unseen in training.
- The “3 years away” timeline for a real NC (and further for CNC) is the authors' informal estimate with no formal grounding, making the roadmap speculative.
Connections
- video-generation — the primary model class used as a prototype container for NC primitives
- tool-use-agents — the agent paradigm that NC is positioned beyond
- transformer — architecture being adapted; DiT variant used throughout
- diffusion-models — Wan2.1 is a diffusion transformer; action injection architectures build on ControlNet-style ideas
- emergent-behavior — NC frames capability accumulation in runtime as an emergent-behavior target
- metis-hdpo-meta-cognitive-tool-use — related work on meta-cognitive agent capabilities that NC is positioned to subsume
Citation
@misc{zhuge2026neuralcomputers,
  title         = {Neural Computers},
  author        = {Mingchen Zhuge and Changsheng Zhao and Haozhe Liu and Zijian Zhou and Shuming Liu and Wenyi Wang and Ernie Chang and Gael Le Lan and Junjie Fei and Wenxuan Zhang and Yasheng Sun and Zhipeng Cai and Zechun Liu and Yunyang Xiong and Yining Yang and Yuandong Tian and Yangyang Shi and Vikas Chandra and J{\"u}rgen Schmidhuber},
  year          = {2026},
  eprint        = {2604.06425},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2604.06425}
}