AI honesty #3: by 2027 will we have interpretability tools for detecting when an AI is being deceptive?
Ṁ1484 · 2027 · 48% chance

I don't have a clear definition of "deceptive", I think that's part of the challenge.

Edit: By "part of the challenge" I mean that this market is asking both if a clear definition of "deceptive" will be published and if tools to detect that will be created. I will be fairly lax about what counts as a good formalization - if it captures even 40% of what we generally think of as "deceptive" that would count.



2y

What about a case where a chain-of-thought explanation claims to fully account for the conclusion but looks like motivated reasoning, in the sense that a different initial prompt can yield a different chain of thought and answer? Does this count as deception? Does a tool that identifies this qualify?

Another example: an LM is trained on a sequence of RL objectives 1 and 2. After training for a few trajectories on objective 2, the model claims to be optimizing objective 2 and denies optimizing objective 1, but still scores much higher on objective 1 than some baseline (e.g. the base LM). Is this deception? Does detecting this via interpretability qualify?

2y

@JacobPfau I don't mean that I'm going to resolve according to some fuzzy concept of "deception" that you need to wring out of me through examples. The market is "there will be a working definition of deception, and also we will have interpretability tools for detecting that thing". I will update the description.

2y

I don't want to trade because I want the operationalization to be clearer/crisper

2y

@NoaNabeshima any suggestions?

2y

clarification: I assume you mean reliably detecting? It's pretty easy to unreliably detect

2y

@LauroLangoscodiLangosco I think reliable/unreliable is itself too fuzzy for me to answer that. Giving numerical benchmarks is hard because I do not know what benchmarks will be in use. As an example, if we had a tool that produced close to 0 false positives and detected 30% of deceptions, I would resolve YES, but a tool with close to 0 false positives and a 1% detection rate would resolve NO.
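The bar described above can be illustrated with a toy calculation. Everything here is hypothetical (the detector, the data, and the helper function are illustrative, not anything the market specifies): given ground-truth labels and a detector's flags over a set of transcripts, compute the false-positive rate and the detection rate (recall), the two numbers the resolution example turns on.

```python
def detector_stats(labels, predictions):
    """Return (false_positive_rate, detection_rate) for paired 0/1 lists,
    where 1 = deceptive (ground truth) or flagged-as-deceptive (detector)."""
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    negatives = labels.count(0)
    positives = labels.count(1)
    fpr = fp / negatives if negatives else 0.0
    recall = tp / positives if positives else 0.0
    return fpr, recall

# Toy data: 10 honest (0) and 10 deceptive (1) transcripts.
labels      = [0] * 10 + [1] * 10
# A detector that flags 3 of the 10 deceptions and raises no false alarms:
predictions = [0] * 10 + [1] * 3 + [0] * 7

fpr, recall = detector_stats(labels, predictions)
print(fpr, recall)  # 0.0 0.3 -> ~0 false positives, 30% detection: YES territory
```

Under the comment's stated example, 0% false positives with 30% recall would resolve YES, while the same false-positive rate with 1% recall would resolve NO.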
