The model has to pass extensive redteam testing in which the redteam tries to get it to lie, misrepresent its internal state, etc. Merely being wrong is okay (although of course I won't allow any silly rules-lawyering, e.g. a language model so stupid it can't lie). It has to be a language model I care about.
If the redteam clearly isn't very good, or is incentivized not to find anything, it doesn't count. For instance, if the redteam is part of the organization building the language model, that probably won't count.
"Redteam" is being used loosely here: if releasing it to the public + giving a bounty for catching it in a lie doesn't find a lie after a month, that counts.
If the model lies a little I may still accept, but given the lack of an explicit testing procedure I cannot state a hard cutoff. It certainly needs to be more honest than a human.
If the model makes contradictory statements, but not in the same context window, that does not necessarily count. Contradictory statements within the same context window (whatever that happens to mean in 2027) definitely do count as lies.