[Project Notes] InSpatio-World and InSpatio-WorldFM
TL;DR
As presented in the paper and on the project page, InSpatio-WorldFM pitches a clean idea: a real-time frame model for spatial intelligence, where each frame is generated with explicit 3D anchors and implicit spatial memory instead of full video-window decoding.
The currently public inspatio/inspatio-world repository is related, but it does not look like a literal release of that paper architecture. What it actually ships is a practical three-stage novel-view video pipeline:
- Florence-2 captions the input video
- DA3 estimates geometry and poses
- an offline renderer produces point-cloud render + mask videos
- a Wan-based causal latent generator synthesizes output video in short blocks with KV-cache reuse
So my read is: the repo is best understood as a paper-adjacent, engineering-oriented public release that operationalizes some of the same ideas, especially explicit 3D conditioning, but with a noticeably different implementation shape.
What I looked at
I combined three sources:
- the paper: InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model (arXiv:2603.11911)
- the official project page: inspatio.github.io/worldfm
- the repository you pointed me to: github.com/inspatio/inspatio-world
One important naming detail up front:
- the paper and project page point to inspatio/worldfm
- the repo you shared is inspatio/inspatio-world
That mismatch turns out to matter, because the public repo looks more like a runnable product pipeline than a faithful paper release.
What the paper is actually proposing
The paper’s core claim is that video-based world models are not the only route to interactive world simulation. Instead of generating a whole temporal window with strong inter-frame dependence, InSpatio-WorldFM proposes a frame-based paradigm:
- generate each frame with low latency
- enforce geometry using explicit 3D anchors
- preserve scene appearance using implicit spatial memory
The paper frames the system in two stages:
- an offline stage produces multi-view-consistent observations and 3D anchors
- an online stage uses a frame model for fast inference
Its main method ingredients are:
- PixArt-Sigma as the initial image-generation backbone
- PRoPE camera pose encoding inside attention
- a hybrid memory design:
- explicit anchor: point-cloud rendering at the target view
- implicit memory: reference image tokens
- a three-stage training recipe:
- Stage I: pre-train image generator
- Stage II: middle-train a spatially controllable frame model
- Stage III: distill it into a few-step real-time generator
The reported headline is that the distilled system can run interactively with roughly:
- 10 FPS at 512x512 on A100
- 7 FPS on RTX 4090 in single-step mode
The important conceptual point is that the paper is arguing for frame-first world modeling, not just faster video diffusion.
What the public repo actually ships
The repository is centered around run_test_pipeline.sh, and its structure is much more concrete than the paper’s abstract system diagram.
At a high level, the public pipeline is:
```mermaid
flowchart TD
    A["Input folder of .mp4 videos"] --> B["Step 1: Florence-2 captions via scripts/gen_json.py"]
    B --> C["new.json with video path, text, geometry paths"]
    C --> D["Step 2a: DA3 depth + pose extraction"]
    D --> E["Step 2b: convert_da3_to_pi3.py"]
    E --> F["Step 2c: render_point_cloud.py"]
    F --> G["render_offline.mp4 + mask_offline.mp4"]
    G --> H["Step 3: inference_causal_test.py"]
    H --> I["Wan text encoder + video VAE + causal generator"]
    I --> J["Block-wise denoising with KV cache"]
    J --> K["Novel-view output video"]
```
This is not a single-image frame model interface. It is a video-to-video novel-view generation pipeline that expects input videos and produces output videos.
What each stage does
Step 1 uses scripts/gen_json.py to:
- load the middle frame of each input video
- caption it with Florence-2
- write a `new.json` manifest with: `video_path`, `text`, and geometry output paths
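To make the manifest concrete, here is a minimal sketch of what writing a `new.json`-style entry could look like. The `video_path` and `text` fields come from the description above; the geometry-path field names are my own placeholders, not the repo's actual schema:

```python
import json
from pathlib import Path

def build_manifest(video_paths, captions, out_path="new.json"):
    """Sketch of a new.json-style manifest. Only video_path and text are
    known fields; the geometry path keys below are illustrative guesses."""
    entries = []
    for video, caption in zip(video_paths, captions):
        stem = Path(video).stem
        entries.append({
            "video_path": video,   # input .mp4
            "text": caption,       # Florence-2 caption of the middle frame
            # hypothetical geometry output locations (not the repo's schema):
            "depth_path": f"geometry/{stem}_da3.npz",
            "render_path": f"renders/{stem}_render_offline.mp4",
            "mask_path": f"renders/{stem}_mask_offline.mp4",
        })
    Path(out_path).write_text(json.dumps(entries, indent=2))
    return entries

entries = build_manifest(["clips/scene01.mp4"], ["a living room with a sofa"])
```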
Step 2 uses Depth-Anything-3 to build geometry:
- `depth/depth_predict_da3_cli.py` runs DA3
- `scripts/convert_da3_to_pi3.py` converts DA3 outputs into the format expected downstream
- `scripts/render_point_cloud.py` renders an offline point-cloud video and mask video from a user-provided camera trajectory
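The geometric core of the offline render step is standard: lift the depth map to a point cloud, transform it into the target camera, and splat it with a z-buffer. A minimal NumPy sketch of that idea follows; camera-to-world 4x4 poses and a shared intrinsics matrix are my assumptions, and the real `scripts/render_point_cloud.py` additionally handles color, filtering, and whole trajectories:

```python
import numpy as np

def render_point_cloud(depth, K, src_pose, tgt_pose):
    """Project a depth map from a source camera into a target camera.
    Poses are 4x4 camera-to-world matrices (an assumption for this sketch).
    Returns a per-pixel value image and a validity mask."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(float)
    # unproject pixels into the source camera, then into world coordinates
    cam_pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)
    world = src_pose @ np.vstack([cam_pts, np.ones((1, cam_pts.shape[1]))])
    # move points into the target camera and project them
    tgt = np.linalg.inv(tgt_pose) @ world
    z = tgt[2]
    uv = (K @ (tgt[:3] / z))[:2].round().astype(int)
    render = np.zeros((h, w))
    mask = np.zeros((h, w), dtype=bool)
    zbuf = np.full((h, w), np.inf)
    for (u, v), d, zi in zip(uv.T, depth.reshape(-1), z):
        if 0 <= u < w and 0 <= v < h and 0 < zi < zbuf[v, u]:
            zbuf[v, u] = zi      # keep the closest point per pixel
            render[v, u] = d     # stand-in for splatting the point's color
            mask[v, u] = True
    return render, mask
```

With identical source and target poses, every pixel reprojects onto itself and the mask is fully valid; moving the target camera opens holes in the mask, which is exactly what the downstream generator is asked to fill.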
Step 3 runs the generator:
- `inference_causal_test.py` loads text, source video, render video, and mask video
- `pipeline/causal_inference.py` performs block-wise denoising
- the default config uses `num_frame_per_block: 3`
- the model keeps KV cache and reuses the previous prediction as context
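The block-causal control flow can be sketched abstractly like this. The denoiser internals here are stand-ins, but the loop shape mirrors what the config and inference script describe: blocks of `num_frame_per_block` frames, a KV cache that grows across blocks, and the previous block's prediction (`last_pred`) fed back as context:

```python
# Toy sketch of the block-causal loop in pipeline/causal_inference.py.
# Each tuple stands in for a denoised frame and records which block
# context it saw; the real code runs a transformer over latents.
def generate(num_frames, num_frame_per_block=3):
    kv_cache = []        # grows as blocks are processed
    last_pred = None     # previous block's output, reused as temporal context
    output = []
    for start in range(0, num_frames, num_frame_per_block):
        block = list(range(start, min(start + num_frame_per_block, num_frames)))
        # stand-in "denoiser": records cache depth and whether context exists
        pred = [("frame", i, len(kv_cache), last_pred is not None) for i in block]
        kv_cache.append(block)   # cache this block's keys/values for later blocks
        last_pred = pred
        output.extend(pred)
    return output

frames = generate(7)   # 7 frames generated as blocks of 3 + 3 + 1
```

Note how this differs from a strictly frame-independent model: each block after the first depends on cached state, which is precisely the divergence from the paper's framing discussed below.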
This implementation detail is the biggest clue that the released code is not the same thing as the paper’s clean “generate each frame independently” story.
Where the repo matches the paper
Even though the implementation shape differs, the repo still reflects several of the paper’s key ideas.
1. Explicit 3D anchors are real, not just rhetoric
The paper emphasizes point-cloud rendering as an explicit spatial anchor. The repo concretely instantiates that:
- DA3 estimates depth and poses
- `render_point_cloud.py` projects point clouds along a target trajectory
- the resulting render and mask videos are encoded into latent space and fed to the generator
That part is very aligned with the paper’s spirit.
2. Conditioning is hybrid
The paper’s “explicit anchor + implicit memory” idea also survives in the public code:
- explicit anchor: render video + mask video
- implicit memory: source/reference video latents
In causal_inference.py, the generator concatenates and reuses:
- reference latent content
- rendered latent anchor content
- prior predictions for temporal context
So the public system is still clearly built around geometry-conditioned generation, not pure text-to-video sampling.
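A toy sketch of what that hybrid conditioning could look like as a tensor operation. The shapes and the channel-concatenation layout are illustrative assumptions, not the repo's exact tensor contract:

```python
import numpy as np

def build_condition(ref_latent, render_latent, mask):
    """Stack implicit memory (reference latents) and the explicit anchor
    (render latents + validity mask) along the channel axis.
    Inputs: (frames, channels, height, width); mask has 1 channel."""
    assert ref_latent.shape == render_latent.shape
    return np.concatenate([ref_latent, render_latent, mask], axis=1)

ref = np.zeros((3, 16, 8, 8))     # implicit memory: reference video latents
render = np.zeros((3, 16, 8, 8))  # explicit anchor: point-cloud render latents
mask = np.ones((3, 1, 8, 8))      # validity mask for the render
cond = build_condition(ref, render, mask)
```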
3. Real-time engineering is a first-class concern
The repo is visibly optimized around inference practicality:
- support for 1.3B and 14B configs
- distributed inference support in `inference_causal_test.py`
- KV-cache reuse
- low-memory handling via dynamic model swapping
- shell-level orchestration for multi-GPU Step 1 and Step 2
This is consistent with the paper’s focus on responsiveness and deployability.
Where the repo diverges from the paper
This is the most interesting part.
1. The public model does not look like a pure frame model
The paper’s core pitch is frame-based generation. But the repo’s inference path is explicitly temporal and block-causal:
- `num_frame_per_block` defaults to 3
- inference loops over frame blocks
- the previous prediction `last_pred` is injected as context
- the generator keeps a transformer KV cache
That is closer to a causal video latent generator with short temporal chunks than to a strictly frame-independent online renderer.
2. The backbone is different
The paper says Stage I uses PixArt-Sigma as the foundation image model.
The public repo instead builds around Wan components:
- `WanTextEncoder`
- `WanVAEWrapper`
- `WanDiffusionWrapper`
- checkpoints based on Wan2.1-T2V-1.3B and Wan2.1-I2V-14B-480P
That is a substantial architectural shift. It suggests that the public release is either:
- a later implementation path, or
- a productized variant that prioritizes practical inference over paper-pure faithfulness
3. I do not see the paper’s PRoPE-style camera conditioning in the released inference path
The paper spends real effort describing PRoPE and comparing it with alternative camera-pose encodings.
In the public repo:
- trajectories are used to render offline point clouds
- target extrinsics are generated and saved
- but the inference path I inspected does not clearly inject camera pose tokens into the transformer in the paper’s described way
That does not mean the release has no geometry control. It obviously does. But it seems to achieve that mostly through rendered 3D conditioning, not through a visible, paper-style camera-token interface.
4. The input contract is different
The paper is framed as single reference image + target pose -> target view.
The public repo’s interface is:
- folder of videos
- caption extraction from the middle frame
- depth and pose estimation over video frames
- output novel-view video
So the public artifact feels less like a minimal research demo for the paper formulation and more like a full demo pipeline for viewpoint-controlled video generation.
My best synthesis
I think the cleanest interpretation is this:
- The paper presents a research thesis: frame-based world modeling can beat the latency limits of video-window world models.
- The project page markets that thesis as a real-time, explorable generative world system.
- The public `inspatio-world` repo is a runnable release that borrows the same high-level philosophy, but implements it as a more pragmatic pipeline using:
  - Florence-2 for captioning
  - DA3 for geometry
  - offline point-cloud rendering
  - Wan-based causal latent generation
In other words, the public repo looks less like “here is the exact paper system” and more like “here is the current open-source pipeline we can actually ship.”
What I find most interesting
The most interesting design choice is that the public code treats geometry as an external artifact pipeline, not as something the generator must infer from scratch online.
That choice has real engineering advantages:
- easier control over camera motion
- clearer debugging boundaries
- ability to swap out geometry modules independently
- less pressure on the generator to solve all of 3D reasoning internally
It also explains why the repo feels more robustly engineered than the paper abstraction. The system is split into separate captioning, geometry, rendering, and generation stages, each with a specific responsibility.
Limits and caveats
There are a few practical caveats worth stating directly.
- The repo looks very early-stage as a public artifact: the cloned snapshot has only one visible commit.
- I did not find a `LICENSE` file in the repo tree, even though the README links to a license in a different repository.
- The README itself says the current release is not yet speed-optimized, and that a more optimized version aligned with the live demo / technical report would be released later.
So I would not read this repository as the final, fully documented reference implementation of the paper. I would read it as an initial public drop.
Takeaways
- The paper’s main idea is compelling. A frame-first world model with explicit 3D anchors is a strong answer to the latency limits of video world models.
- The public repo is useful, but it is not paper-pure. It is a more hybrid, pipeline-heavy system than the paper’s high-level framing suggests.
- The strongest continuity between paper and code is geometry anchoring. Point-cloud rendering is the bridge between the research story and the release.
- The biggest divergence is the generator itself. The shipped code looks like a Wan-based causal video-latent model with caching, not a strictly independent frame model.
- This makes the repo more interesting, not less. It shows what often happens in generative systems work: the public, runnable artifact is an engineering compromise between research elegance and practical deployment.
Links
- Paper: arXiv:2603.11911
- Project page: inspatio.github.io/worldfm
- Repo analyzed here: github.com/inspatio/inspatio-world
