How many tokens does Sora use to encode one second of high-resolution video (1920*1080)? (February version) | Manifold

5,342 expected

Resolves when we find out the actual number.

If they do not use tokens, this resolves N/A. That seems highly unlikely, since OpenAI has repeatedly stated that they use Diffusion Transformers.

Only the latent diffusion model part counts. If they also use Transformers for the VAE compression, tokens in that part are ignored.
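To make the counting rule concrete, here is a rough sketch of how a token count for the latent diffusion part could be computed once the numbers are known. Everything numeric here is an assumption for illustration only: the function name `latent_video_tokens`, the 30 fps frame rate, the 8x spatial and 4x temporal VAE compression, and the 2x2 patch size are hypothetical and are not Sora's published values.

```python
import math


def latent_video_tokens(width, height, frames,
                        spatial_down, temporal_down,
                        patch, patch_t=1):
    """Count transformer tokens for a clip after VAE compression and patching.

    All compression factors and patch sizes are hypothetical inputs,
    not values confirmed for Sora.
    """
    # Size of the latent volume produced by the (assumed) VAE.
    lat_w = math.ceil(width / spatial_down)
    lat_h = math.ceil(height / spatial_down)
    lat_t = math.ceil(frames / temporal_down)
    # Each spacetime patch of the latent volume becomes one token.
    return (math.ceil(lat_w / patch)
            * math.ceil(lat_h / patch)
            * math.ceil(lat_t / patch_t))


# One second of 1920*1080 video at an ASSUMED 30 fps,
# with ASSUMED 8x spatial / 4x temporal compression and 2x2 patches.
tokens = latent_video_tokens(1920, 1080, frames=30,
                             spatial_down=8, temporal_down=4, patch=2)
print(tokens)
```

Changing any of the assumed factors moves the result by large multiples, which is why the answer depends so heavily on details that have not been disclosed.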

For reference:

The original ViT uses a 16 by 16 grid of tokens (256 in total) for a picture of 256*256 pixels. This architecture does not use a VAE to compress to latent space.
Gemini 1.5 Pro uses about 300 tokens per second of video.
LLaVA-UHD uses up to 5k tokens for 4K-resolution images.
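The ViT reference figure is plain patch arithmetic, sketched below. The helper name `vit_token_grid` is made up for this example, and the count excludes the extra class token that ViT prepends.

```python
def vit_token_grid(image_size, patch_size):
    """Return (tokens per side, total patch tokens) for a square image.

    Counts patch tokens only; the extra class token is not included.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side, side * side


# 256*256 pixels with 16*16-pixel patches -> 16 by 16 tokens.
side, total = vit_token_grid(256, 16)
print(side, total)
```

The same helper gives a 14 by 14 grid for a 224*224 input, since no VAE sits between the pixels and the tokens in this architecture.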

#AI Video Generation

Related questions

How many seconds will Sora take to generate 10 seconds of video?

Does Sora use DPO?