Will future large video (understanding) models use pixel loss or embedding loss?
Mini
7
Ṁ92
2028
35% Pixel loss
27% Embedding loss
38% Neither

Examples of models with pixel loss (see the sketch after this list):

  • MAE

  • iGPT

  • LVM
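
As a rough illustration (not any specific model's implementation), a pixel-loss objective regresses raw pixel values directly. The tensor shapes and the plain-MSE choice below are assumptions for the sketch:

```python
import torch
import torch.nn.functional as F

# Hypothetical batch of video clips: (batch, time, channels, height, width).
pred = torch.randn(2, 8, 3, 64, 64)   # model's reconstructed pixels
target = torch.rand(2, 8, 3, 64, 64)  # ground-truth pixels in [0, 1]

# Pixel loss: regression on raw pixels (MAE-style MSE, simplified here to
# cover all pixels rather than only masked patches).
pixel_loss = F.mse_loss(pred, target)
```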

Examples of models with embedding loss (see the sketch after this list):

  • I-JEPA
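
In contrast, an embedding loss compares predictions to the output of a target encoder, so the objective lives in latent space. A minimal sketch, assuming toy linear encoders (in I-JEPA the target encoder is an EMA copy of the context encoder and receives no gradients):

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for the context/target encoders (assumed dims: 768 -> 256).
context_encoder = torch.nn.Linear(768, 256)
target_encoder = torch.nn.Linear(768, 256)

patches = torch.randn(2, 16, 768)         # (batch, num_patches, patch_dim)
with torch.no_grad():
    target_emb = target_encoder(patches)  # targets get no gradient

pred_emb = context_encoder(patches)       # predictor path (simplified)

# Embedding loss: regression in representation space, never touching pixels.
embedding_loss = F.mse_loss(pred_emb, target_emb)
```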

If people end up using a diffusion model (DDPM) to pretrain the large video understanding model, then this resolves to Pixel loss.
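
For reference, the DDPM objective is noise regression in pixel space, which is why diffusion pretraining counts as pixel-level here. A simplified continuous-time sketch (the schedule and shapes are assumptions, and the model call is a placeholder):

```python
import torch
import torch.nn.functional as F

x0 = torch.rand(2, 3, 64, 64)                 # clean frames in [0, 1]
noise = torch.randn_like(x0)
t = torch.rand(2, 1, 1, 1)                    # stand-in noise level in (0, 1)
x_t = (1 - t).sqrt() * x0 + t.sqrt() * noise  # simplified forward process

eps_pred = torch.randn_like(noise)            # placeholder for model(x_t, t)
ddpm_loss = F.mse_loss(eps_pred, noise)       # epsilon-prediction MSE on pixels
```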

This will resolve at EOY 2027 by consulting expert/public opinion. Among all factors that decide the resolution, the paradigm used by the SOTA video understanding model will be the most indicative.

Discrete cross-entropy loss (transformer + VQ-VAE) will resolve to Neither.
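
For completeness, a sketch of that Neither paradigm, with a hypothetical codebook size: the VQ-VAE turns frames into discrete codes, and a transformer is trained with cross-entropy over those codes:

```python
import torch
import torch.nn.functional as F

vocab_size = 1024                              # hypothetical codebook size
logits = torch.randn(2, 16, vocab_size)        # transformer predictions
codes = torch.randint(0, vocab_size, (2, 16))  # VQ-VAE token targets

# Discrete cross-entropy over token indices: neither pixel regression
# nor embedding regression.
ce_loss = F.cross_entropy(logits.reshape(-1, vocab_size), codes.reshape(-1))
```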


I thought Pixel loss would increase with DDPM.

Here is the ambiguous part: what if someone uses V-JEPA as the encoder, a diffusion model as the backbone, and something else as the decoder?