By an AI achieving 90% accuracy at hard APPS problems, I mean an AI system interacting with APPS like a human would (with access to any tool that doesn’t itself require the internet or a human) which is able to submit a correct program on its first submission attempt for 90% of the interview-level problems in the APPS benchmark introduced by Dan Hendrycks, Steven Basart et al.
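Concretely, here is a minimal sketch of how first-attempt accuracy on the interview-level split could be measured. It assumes the Hugging Face codeparrot/apps mirror of the benchmark and restricts itself, for simplicity, to problems graded by stdin/stdout test cases; `model_first_attempt` is a hypothetical stand-in for the AI system under evaluation, not part of the benchmark.

```python
# A sketch of the resolution check, assuming the Hugging Face "codeparrot/apps"
# mirror of APPS (fields: "question", "input_output", "difficulty") and
# stdin/stdout-graded problems. `model_first_attempt` is a hypothetical
# stand-in for the AI being evaluated; it gets exactly one attempt per problem.
import json
import subprocess
from datasets import load_dataset

def passes_all_tests(program: str, tests: dict, timeout: float = 10.0) -> bool:
    """Run the candidate program on every stdin/stdout test case."""
    for stdin, expected in zip(tests["inputs"], tests["outputs"]):
        try:
            result = subprocess.run(
                ["python", "-c", program],
                input=stdin, capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.stdout.strip() != expected.strip():
            return False
    return True

def first_attempt_accuracy(model_first_attempt) -> float:
    problems = load_dataset("codeparrot/apps", split="test")
    interview = [p for p in problems if p["difficulty"] == "interview"]
    solved = 0
    for problem in interview:
        program = model_first_attempt(problem["question"])  # one attempt, no retries
        tests = json.loads(problem["input_output"])
        solved += passes_all_tests(program, tests)
    return solved / len(interview)  # the question asks whether this reaches 0.90
```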
By primarily using text, I mean the conjunction of the following:
Text bottlenecks: during the generation of an answer, there is no path of more than 100k serial operations along which information doesn’t pass through a categorical format in which most categories correspond to words or pieces of words, in a way that makes sense to at least some human speakers when read (though it doesn’t have to be faithful). Some additional bits are allowed as long as they fit in a human-understandable color-coding scheme. Only matrix multiplications, convolutions, and similarly “heavy” operations count as serial operations. There are 2 serial operations per classic attention block and 2 serial operations per classic MLP block, so a forward pass of GPT-3 is around 400 serial operations, and GPT-3 generating 1000 tokens is around 400k operations. GPT-3 generating 100 tokens would therefore count as having text bottlenecks, but a 100-layer image diffusion model doing 1000 steps of diffusion would not (the sketch after this list works through the arithmetic).
Long thoughts: The AI generates its answer using at least 1M serial operations. GPT-3 generating 1000 tokens wouldn’t count as “long thoughts”, but GPT-3 generating 10k tokens (within some scaffold) would count.
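To make the arithmetic behind both criteria explicit, here is a minimal sketch. It uses the operation-counting convention stated above (2 heavy operations per attention block, 2 per MLP block, one forward pass per generated token), assumes 96 layers for GPT-3, and, purely for illustration, assumes 2 heavy operations per layer for the diffusion model.

```python
# Worked arithmetic for the two criteria above, counting only "heavy"
# operations: 2 per attention block and 2 per MLP block.
OPS_PER_LAYER = 2 + 2  # attention + MLP

def transformer_serial_ops(n_layers: int, n_tokens: int) -> int:
    """Serial operations for autoregressively generating n_tokens (one forward pass each)."""
    return n_layers * OPS_PER_LAYER * n_tokens

GPT3_LAYERS = 96  # one forward pass is 96 * 4 = 384, i.e. around 400 serial operations

# Text bottlenecks: no text-free path of more than 100k serial operations.
print(transformer_serial_ops(GPT3_LAYERS, 100))      # 38_400 -> under 100k, counts
# A 100-layer diffusion model doing 1000 denoising steps never passes through
# text; assuming (hypothetically) 2 heavy operations per layer, its text-free
# path is 100 * 2 * 1000 = 200k serial operations -> doesn't count.

# Long thoughts: generating the answer must take at least 1M serial operations.
print(transformer_serial_ops(GPT3_LAYERS, 1_000))    # 384_000   -> around 400k, too short
print(transformer_serial_ops(GPT3_LAYERS, 10_000))   # 3_840_000 -> counts as long thoughts
```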
I will use speculations about AI architectures to resolve this question. For example, GPT-4 generating 10k tokens would qualify as primarily using text.
(For reference, if it takes 1ms for a human neuron to process the incoming signal and fire, then the human brain can do 100k serial operations in 1’40’’, and 1M serial operations in 16’40’’.)
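A trivial check of that conversion, assuming 1 ms per serial neuron operation:

```python
# Conversion used in the parenthetical above, assuming 1 ms per serial operation.
from datetime import timedelta

for ops in (100_000, 1_000_000):
    print(ops, "ops ->", timedelta(milliseconds=ops))
# 100000 ops -> 0:01:40
# 1000000 ops -> 0:16:40
```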