[Paper Notes] From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence
TL;DR
This paper argues that classical information notions (Shannon entropy, Kolmogorov complexity) miss a key ingredient for modern ML:
- the observer is computationally bounded
To address this, the authors introduce epiplexity:
- a measure of structural information that a computationally bounded learner can extract from data
paired with:
- time-bounded entropy: the residual random/unpredictable part for that bounded learner
The paper’s core message is highly relevant to data-centric ML:
- data quality is not only about total bits or likelihood
- it is also about how much reusable structure a bounded model can absorb
- this may better explain OOD transfer and pretraining data value
Paper Info
- Title: From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence
- Authors: Marc Finzi, Shikai Qiu, Yiding Jiang, Pavel Izmailov, J. Zico Kolter, Andrew Gordon Wilson
- Source: arXiv (2026), arXiv:2601.03220
1. What problem is this paper solving?
The paper asks a fundamental question:
How should we define and measure the information value of data for learning systems with limited compute?
Why this matters:
- In modern pretraining, data choice/curation often matters as much as architecture.
- Classical information theory is excellent for communication and coding, but often does not capture:
- structure vs randomness (from the model’s perspective)
- compute constraints of the learner
- why some data formats/orderings improve transfer without improving in-distribution loss much
The authors position epiplexity as a step toward a theory of data selection, not just model selection.
2. The three “paradoxes” motivating the paper
The paper frames the gap between theory and practice using three apparent paradoxes:
Paradox 1: “Information cannot be increased by deterministic transformations”
Classical view:
- deterministic transforms should not create new information
But practice suggests otherwise:
- synthetic data can improve models
- self-play systems (e.g., AlphaZero-style examples) extract useful knowledge from simple rules
- dynamical systems generate emergent structures that models can learn
Paradox 2: “Information is independent of factorization/order”
Classical view:
- total information content should not depend on ordering/factorization
But in practice:
- left-to-right vs reverse text/chess orderings can change what models learn
- one direction can be easy to predict while the reverse is much harder
Paradox 3: “Likelihood modeling is just distribution matching”
Classical intuition:
- maximizing likelihood only matches the data-generating distribution
- models should not learn “more structure” than what is in the generating process
But in practice:
- predictive models often need to learn extra internal structure (induction, emergent shortcuts) to make tractable predictions under limited compute
The paper’s thesis is that these paradoxes weaken once we explicitly model a computationally bounded observer.
3. Core concept: Epiplexity + Time-Bounded Entropy
The paper defines an MDL-style optimization under a runtime constraint T:
- find the best time-bounded probabilistic program P* minimizing the sum of model description length and expected coding loss
Then decompose information into:
- Epiplexity S_T(X): the size of the optimal time-bounded program (|P*|)
- Time-bounded entropy H_T(X): the residual expected unpredictability under that program
Interpretation:
- S_T(X) = structural information visible to the bounded learner
- H_T(X) = random/unpredictable information (to that learner)
This decomposition is the main conceptual contribution.
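Written out, the decomposition above looks roughly like this (my paraphrase from the bullets; the paper's exact formalization and notation may differ in details):

```latex
% My paraphrase of the MDL-style objective: pick the best probabilistic
% program under runtime budget T, then split the total code length.
P^{*}_{T} \;=\; \operatorname*{arg\,min}_{P \,:\, \mathrm{runtime}(P) \le T}
    \Big[\, |P| \;+\; \mathbb{E}\big[-\log P(X)\big] \,\Big]
\qquad
S_T(X) := |P^{*}_{T}|,
\qquad
H_T(X) := \mathbb{E}\big[-\log P^{*}_{T}(X)\big]
```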
Intuition
Two datasets can have similar training loss / entropy-like behavior but differ in:
- how much structure the model had to internalize to get that loss
The paper argues that this “missing component” is often what matters for OOD transfer.
4. Why this is different from Shannon entropy / Kolmogorov complexity
The paper is not merely proposing another complexity metric. It changes the viewpoint:
- information becomes observer-dependent
- the observer is constrained by computation
Consequences:
- CSPRNG (cryptographically secure pseudorandom generator) outputs can look random to polynomial-time observers (high time-bounded entropy, low epiplexity)
- deterministic dynamics can produce data with high structural content for bounded learners
- ordering/factorization can matter because forward and inverse computations can have different computational difficulty
This is why the framework naturally connects to cryptography, emergence, and ML training dynamics.
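A tiny illustration of the observer-dependence point: data generated by a short deterministic program (a seeded PRNG) has near-zero Kolmogorov complexity, yet to any observer running only cheap statistical tests it is indistinguishable from noise. The sketch below uses empirical byte entropy as a stand-in for such a bounded observer (the function names are mine, not the paper's):

```python
import math
import random

def byte_entropy(data: bytes) -> float:
    """Empirical Shannon entropy per byte: a crude stand-in for a
    computationally bounded observer's randomness test."""
    counts = {}
    for b in data:
        counts[b] = counts.get(b, 0) + 1
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Highly regular data: a tiny program explains it, and the bounded test sees it.
structured = b"abab" * 2048

# PRNG output: also generated by a tiny program (the seed), but it looks
# random to this cheap test -- high time-bounded entropy, low epiplexity.
pseudo = random.Random(0).randbytes(8192)

print(byte_entropy(structured))   # ~1.0 bit/byte
print(byte_entropy(pseudo))       # close to 8 bits/byte
```

Both byte strings are outputs of very short programs, but only the first reveals its structure to a bounded observer.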
5. How they estimate epiplexity in practice (important)
The theory is abstract, but the paper gives practical estimators using neural network training.
5.1 Prequential coding (fast heuristic)
Key idea:
- estimate model information from the area under the loss curve above the final loss
If loss drops slowly and substantially:
- the model may be absorbing more structural information from data
Pros:
- simple
- can reuse standard training runs
Cons:
- not fully rigorous as an explicit time-bounded code for the learned model
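The area-under-the-loss-curve idea can be sketched in a few lines. Everything below (function names, the trailing-window baseline) is my own minimal reading of the heuristic, not the paper's code:

```python
import numpy as np

def prequential_info_estimate(losses, final_window=10):
    """Heuristic model-information estimate: the area under the training
    loss curve that lies above the final (converged) loss.

    `losses` is a per-step loss curve; `final_window` is how many trailing
    steps to average for the converged-loss baseline.
    """
    losses = np.asarray(losses, dtype=float)
    final_loss = losses[-final_window:].mean()
    # Only excess loss above the converged level counts as structure absorbed.
    return np.maximum(losses - final_loss, 0.0).sum()

# A slowly decaying curve implies more absorbed structure than a fast one,
# even if both reach a similar final loss.
slow = np.exp(-np.arange(100) / 50.0)
fast = np.exp(-np.arange(100) / 5.0)
print(prequential_info_estimate(slow) > prequential_info_estimate(fast))  # True
```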
5.2 Requential coding (more rigorous, slower)
Key idea:
- explicitly code the learned model via a teacher-student training process
- use cumulative KL terms (teacher vs student) to estimate model code length
The paper presents this as a more principled way to estimate epiplexity, but with extra computational overhead (they note it is typically slower than prequential coding).
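A toy sketch of the cumulative-KL idea (my drastic simplification, not the paper's algorithm: real teacher/student distributions come from trained networks, not hand-written arrays):

```python
import numpy as np

def kl(p, q):
    """KL divergence between two categorical distributions, in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def requential_code_length(teacher_dists, student_dists):
    """Accumulate KL(teacher || student) over training steps: the cost of
    transmitting the teacher model to a student that gradually catches up."""
    return sum(kl(t, s) for t, s in zip(teacher_dists, student_dists))

# As the student converges to the teacher, per-step KL terms shrink toward
# zero, so the cumulative sum stabilizes at the model's code length.
teacher = [np.array([0.8, 0.2])] * 5
student = [np.array([0.5, 0.5]), np.array([0.6, 0.4]), np.array([0.7, 0.3]),
           np.array([0.78, 0.22]), np.array([0.8, 0.2])]
total = requential_code_length(teacher, student)
```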
Practical recommendation (paper’s stance)
- use prequential for quick ranking / rough comparisons
- use requential when more accurate estimates are needed
This is a strong part of the paper: it does not stop at theory; it gives a measurement pipeline.
6. Key experiments / evidence (what the paper shows)
6.1 Deterministic computation can create useful information (for bounded observers)
Using elementary cellular automata (ECA rules), the paper shows that different deterministic rules produce very different decompositions:
- some rules yield mostly simple/predictable data
- some yield mostly random data (high time-bounded entropy, low epiplexity)
- some mix randomness and structure (higher epiplexity)
This supports the idea that deterministic processes can create learnable structure from the perspective of a bounded learner.
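Elementary cellular automata are easy to reproduce, which makes the point concrete: the same deterministic update scheme, with different rule numbers, yields trivially predictable data (rule 0) or data that looks random (rule 30). A minimal generator (my implementation of standard Wolfram-style ECAs, not the paper's code):

```python
import numpy as np

def eca_step(state, rule):
    """One step of an elementary cellular automaton with wrap-around."""
    left, right = np.roll(state, 1), np.roll(state, -1)
    idx = 4 * left + 2 * state + right       # 3-bit neighborhood -> 0..7
    table = (rule >> np.arange(8)) & 1       # Wolfram rule lookup table
    return table[idx]

def run_eca(rule, width=64, steps=64, seed=0):
    """Run an ECA from a random initial row; return the steps x width grid."""
    rng = np.random.default_rng(seed)
    state = rng.integers(0, 2, width)
    rows = [state]
    for _ in range(steps - 1):
        state = eca_step(state, rule)
        rows.append(state)
    return np.stack(rows)

# Rule 0 collapses to all-zeros (trivial structure); rule 30 looks random
# to cheap tests despite being fully deterministic.
trivial = run_eca(0)
chaotic = run_eca(30)
```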
6.2 Factorization/order matters
The paper shows order-dependent effects in:
- one-way-function-like setups
- chess data formatting/ordering
A particularly interesting result:
- a reverse chess ordering can yield higher epiplexity and better OOD transfer on some downstream chess tasks, even if the in-distribution setup is not obviously better by standard metrics
This is a very practical takeaway for data formatting and curriculum design.
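The one-way-function setups above can be made concrete with a hash function: a dataset of (x, h(x)) pairs carries the same joint distribution, and hence the same Shannon information, whichever element comes first, but predicting h(x) from x is cheap while predicting x from h(x) requires inverting the hash. A minimal illustration (my own toy, not the paper's setup):

```python
import hashlib

def forward_pair(x: bytes):
    """A toy one-way-function data point: (preimage, SHA-256 digest)."""
    return x, hashlib.sha256(x).digest()

x, h = forward_pair(b"some message")

# Forward order (x -> h): the target is a cheap deterministic function of
# the context, so a bounded predictor can drive its loss to zero.
assert hashlib.sha256(x).digest() == h

# Reverse order (h -> x): the same joint distribution, merely reordered,
# but recovering x from h means inverting SHA-256, which no known
# polynomial-time observer can do.
```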
6.3 Likelihood modeling can require learning more than the generating process
Through induction and emergence examples (e.g., masked latent information, ECA dynamics), the paper shows that a bounded predictive model may need to learn:
- inversion strategies
- latent inference procedures
- emergent pattern abstractions
that are not explicitly present in the short data-generating program.
This is one of the most important conceptual contributions in my view.
6.4 Epiplexity and natural data / pretraining
The paper estimates epiplexity across natural modalities and data formats and argues:
- language tends to exhibit higher epiplexity than images under their estimation setup
- tokenization/representation choices (e.g., VQ tokenization) can change epiplexity
- epiplexity may explain part of why some pretraining data sources transfer better than others
They also connect this to Adaptive Data Optimization (ADO) and show evidence that data selection strategies that improve downstream/OOD performance can coincide with higher estimated epiplexity.
7. Why this matters for data-centric ML (my take)
This paper gives a useful lens for questions like:
- Why does data ordering matter?
- Why can synthetic data help even when “no new Shannon information” is added?
- Why do some corpora transfer better than others despite similar perplexity?
- Why can training loss fail to predict downstream transfer?
The key distinction is:
- loss / entropy-like terms capture residual unpredictability
- epiplexity captures how much reusable structure the model had to build
That decomposition is not a full theory of transfer, but it is much closer to the phenomena we actually observe in pretraining.
8. Strengths
- Ambitious and original conceptual framing
- Connects ML practice with cryptography / complexity / MDL in a non-trivial way
- Provides both theory and practical estimation procedures
- Explains several empirically familiar phenomena (ordering, emergence, synthetic data usefulness) in a unified language
- Opens a path toward data selection theory
9. Limitations / Cautions
- Estimation is still approximation-heavy (especially prequential proxy)
- Epiplexity is not a direct guarantee of OOD performance on a specific task (the paper explicitly emphasizes this)
- The framework depends on the choice of observer/model class/time budget
- Practical estimation can be expensive (requential coding)
- Some claims are currently more explanatory/interpretive than directly actionable at large scale
10. Practical Takeaways for Research
If you work on pretraining, data curation, or synthetic data:
- Evaluate datasets beyond held-out perplexity / in-domain loss
- Test data ordering / factorization / formatting as a first-class variable
- Consider whether an intervention increases learnable structure vs just reducing noise
- Use epiplexity-like proxies (even crude ones) to rank dataset variants or curricula
- Treat “more data” and “more useful structure” as different objectives
11. One-sentence summary
This paper introduces epiplexity as a compute-aware measure of structural information in data, offering a new lens for understanding synthetic data, emergence, pretraining data value, and OOD generalization beyond classical entropy and likelihood.
