Follow-on from https://manifold.markets/Jasonb/will-a-gpt4-level-efficient-hrm-bas, since I'm interested in the possibility (or impossibility) of architectural innovations more broadly.
Resolution criteria:
The architecture must be meaningfully different from a transformer: either not transformer-based at all, or a significant fusion of a transformer with other components. To clarify, something similar to the incorporation of Mixture-of-Experts would not count, but diffusion-based LLMs would (though they would also need to meet the other criteria).
The model must be significantly better than previous LLMs in some important respect. For example: it achieves much higher performance for the same amount of training data, it matches frontier-model performance with far fewer parameters, or it lacks some failure mode common to current or future transformer-based LLMs.
It must be generally on par with transformer-based LLMs at most tasks. If it merely excels in a few areas but is otherwise not very useful, it won't count.