[Paper Notes] From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence

10 minute read

Published: February 23, 2026

This post supports English / 中文 switching via the site language toggle.

TL;DR

This paper argues that classical information notions (Shannon entropy, Kolmogorov complexity) miss a key ingredient for modern ML:

the observer is computationally bounded

To address this, the authors introduce epiplexity:

a measure of structural information that a computationally bounded learner can extract from data

paired with:

time-bounded entropy: the residual random/unpredictable part for that bounded learner

The paper’s core message is highly relevant to data-centric ML:

data quality is not only about total bits or likelihood
it is also about how much reusable structure a bounded model can absorb
this may better explain OOD transfer and pretraining data value

Paper Info

Title: From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence
Authors: Marc Finzi, Shikai Qiu, Yiding Jiang, Pavel Izmailov, J. Zico Kolter, Andrew Gordon Wilson
Source: arXiv (2026, based on 2601.03220)

1. What problem is this paper solving?

The paper asks a fundamental question:

How should we define and measure the information value of data for learning systems with limited compute?

Why this matters:

In modern pretraining, data choice/curation often matters as much as architecture.
Classical information theory is excellent for communication and coding, but often does not capture:
- structure vs randomness (from the model’s perspective)
- compute constraints of the learner
- why some data formats/orderings improve transfer without improving in-distribution loss much

The authors position epiplexity as a step toward a theory of data selection, not just model selection.

2. The three “paradoxes” motivating the paper

The paper frames the gap between theory and practice using three apparent paradoxes:

Paradox 1: “Information cannot be increased by deterministic transformations”

Classical view:

deterministic transforms should not create new information

But practice suggests otherwise:

synthetic data can improve models
self-play systems (e.g., AlphaZero-style examples) extract useful knowledge from simple rules
dynamical systems generate emergent structures that models can learn

Paradox 2: “Information is independent of factorization/order”

Classical view:

total information content should not depend on ordering/factorization

But in practice:

left-to-right vs reverse text/chess orderings can change what models learn
one direction can be easy to predict while the reverse is much harder

Paradox 3: “Likelihood modeling is just distribution matching”

Classical intuition:

maximizing likelihood only matches the data-generating distribution
models should not learn “more structure” than what is in the generating process

But in practice:

predictive models often need to learn extra internal structure (induction, emergent shortcuts) to make tractable predictions under limited compute

The paper’s thesis is that these paradoxes weaken once we explicitly model a computationally bounded observer.

3. Core concept: Epiplexity + Time-Bounded Entropy

The paper defines an MDL-style optimization under a runtime constraint T:

find the best time-bounded probabilistic program P* minimizing:
- model description length + expected coding loss

Then decompose information into:

Epiplexity S_T(X) = size of the optimal time-bounded program (|P*|)
Time-bounded entropy H_T(X) = residual expected unpredictability under that program

Interpretation:

S_T(X) = structural information visible to the bounded learner
H_T(X) = random/unpredictable information (to that learner)

This decomposition is the main conceptual contribution.

Intuition

Two datasets can have similar training loss / entropy-like behavior but differ in:

how much structure the model had to internalize to get that loss

The paper argues that this “missing component” is often what matters for OOD transfer.

4. Why this is different from Shannon entropy / Kolmogorov complexity

The paper is not merely proposing another complexity metric. It changes the viewpoint:

information becomes observer-dependent
the observer is constrained by computation

Consequences:

CSPRNG outputs can look random to polynomial-time observers (high time-bounded entropy, low epiplexity)
deterministic dynamics can produce data with high structural content for bounded learners
ordering/factorization can matter because forward and inverse computations can have different computational difficulty

This is why the framework naturally connects to cryptography, emergence, and ML training dynamics.

5. How they estimate epiplexity in practice (important)

The theory is abstract, but the paper gives practical estimators using neural network training.

5.1 Prequential coding (fast heuristic)

Key idea:

estimate model information from the area under the loss curve above the final loss

If loss drops slowly and substantially:

the model may be absorbing more structural information from data

Pros:

simple
can reuse standard training runs

Cons:

not fully rigorous as an explicit time-bounded code for the learned model

5.2 Requential coding (more rigorous, slower)

Key idea:

explicitly code the learned model via a teacher-student training process
use cumulative KL terms (teacher vs student) to estimate model code length

The paper presents this as a more principled way to estimate epiplexity, but with extra computational overhead (they note it is typically slower than prequential coding).

Practical recommendation (paper’s stance)

use prequential for quick ranking / rough comparisons
use requential when more accurate estimates are needed

This is a strong part of the paper: it does not stop at theory, it gives a measurement pipeline.

6. Key experiments / evidence (what the paper shows)

6.1 Deterministic computation can create useful information (for bounded observers)

Using cellular automata (ECA rules), the paper shows different deterministic rules produce very different decompositions:

mostly simple/predictable
mostly random (high time-bounded entropy, low epiplexity)
mixed random + structured (higher epiplexity)

This supports the idea that deterministic processes can create learnable structure from the perspective of a bounded learner.

6.2 Factorization/order matters

The paper shows order-dependent effects in:

one-way-function-like setups
chess data formatting/ordering

A particularly interesting result:

a reverse chess ordering can yield higher epiplexity and better OOD transfer on some downstream chess tasks, even if the in-distribution setup is not obviously better by standard metrics

This is a very practical takeaway for data formatting and curriculum design.

6.3 Likelihood modeling can require learning more than the generating process

Through induction and emergence examples (e.g., masked latent information, ECA dynamics), the paper shows that a bounded predictive model may need to learn:

inversion strategies
latent inference procedures
emergent pattern abstractions

that are not explicitly present in the short data-generating program.

This is one of the most important conceptual contributions in my view.

6.4 Epiplexity and natural data / pretraining

The paper estimates epiplexity across natural modalities and data formats and argues:

language tends to exhibit higher epiplexity than images under their estimation setup
tokenization/representation choices (e.g., VQ tokenization) can change epiplexity
epiplexity may explain part of why some pretraining data sources transfer better than others

They also connect this to Adaptive Data Optimization (ADO) and show evidence that data selection strategies that improve downstream/OOD performance can coincide with higher estimated epiplexity.

7. Why this matters for data-centric ML (my take)

This paper gives a useful lens for questions like:

Why does data ordering matter?
Why can synthetic data help even when “no new Shannon information” is added?
Why do some corpora transfer better than others despite similar perplexity?
Why can training loss fail to predict downstream transfer?

The key distinction is:

loss / entropy-like terms capture residual unpredictability
epiplexity captures how much reusable structure the model had to build

That decomposition is not a full theory of transfer, but it is much closer to the phenomena we actually observe in pretraining.

8. Strengths

Ambitious and original conceptual framing
Connects ML practice with cryptography / complexity / MDL in a non-trivial way
Provides both theory and practical estimation procedures
Explains several empirically familiar phenomena (ordering, emergence, synthetic data usefulness) in a unified language
Opens a path toward data selection theory

9. Limitations / Cautions

Estimation is still approximation-heavy (especially prequential proxy)
Epiplexity is not a direct guarantee of OOD performance on a specific task (the paper explicitly emphasizes this)
The framework depends on the choice of observer/model class/time budget
Practical estimation can be expensive (requential coding)
Some claims are currently more explanatory/interpretive than directly actionable at large scale

10. Practical Takeaways for Research

If you work on pretraining, data curation, or synthetic data:

Evaluate datasets beyond held-out perplexity / in-domain loss
Test data ordering / factorization / formatting as a first-class variable
Consider whether an intervention increases learnable structure vs just reducing noise
Use epiplexity-like proxies (even crude ones) to rank dataset variants or curricula
Treat “more data” and “more useful structure” as different objectives

11. One-sentence summary

This paper introduces epiplexity as a compute-aware measure of structural information in data, offering a new lens for understanding synthetic data, emergence, pretraining data value, and OOD generalization beyond classical entropy and likelihood.

Share on

Bluesky Facebook LinkedIn Mastodon X (formerly Twitter)

Lixin Xu

TL;DR

Paper Info

1. What problem is this paper solving?

2. The three “paradoxes” motivating the paper

Paradox 1: “Information cannot be increased by deterministic transformations”

Paradox 2: “Information is independent of factorization/order”

Paradox 3: “Likelihood modeling is just distribution matching”

3. Core concept: Epiplexity + Time-Bounded Entropy

Intuition

4. Why this is different from Shannon entropy / Kolmogorov complexity

5. How they estimate epiplexity in practice (important)

5.1 Prequential coding (fast heuristic)

5.2 Requential coding (more rigorous, slower)

Practical recommendation (paper’s stance)

6. Key experiments / evidence (what the paper shows)

6.1 Deterministic computation can create useful information (for bounded observers)

6.2 Factorization/order matters

6.3 Likelihood modeling can require learning more than the generating process

6.4 Epiplexity and natural data / pretraining

7. Why this matters for data-centric ML (my take)

8. Strengths

9. Limitations / Cautions

10. Practical Takeaways for Research

11. One-sentence summary

TL;DR

论文信息

1. 这篇论文在解决什么问题？

2. 三个“信息悖论”（论文动机）

悖论 1：确定性变换不能增加信息

悖论 2：信息与因子分解/顺序无关

悖论 3：似然建模只是分布匹配

3. 核心概念：Epiplexity + Time-Bounded Entropy

直觉理解

4. 它和 Shannon 熵 / Kolmogorov 复杂度有什么本质不同？

5. 实际上怎么估计 Epiplexity（很关键）

5.1 Prequential coding（快速近似）

5.2 Requential coding（更严格、更慢）

论文给出的实践建议

6. 关键实验与证据（论文展示了什么）

6.1 确定性计算也能为受限学习者“创造”有用信息

6.2 因子分解/顺序会影响信息（对学习者而言）

6.3 似然建模并不只是“复刻生成过程”

6.4 Epiplexity 与自然数据 / 预训练数据价值

7. 为什么这对 Data-Centric ML 很重要（我的理解）

8. 优点

9. 局限性 / 使用时的注意点

10. 对研究实践的启发

11. 一句话总结

Share on

You May Also Enjoy

[Paper Notes] Cross-Hand Latent Representation for Vision-Language-Action Models

[Paper Notes] Visual Dexterity: In-Hand Reorientation of Novel and Complex Object Shapes

[Paper Notes] TiPToP: A Modular Open-Vocabulary Planning System for Robotic Manipulation

The Singularity is Near