Will OpenAI make a deal to pay a major science publisher such as Elsevier for explicit access / training rights on their papers, before the end of 2025?
Closes at end of 2026 to allow for a one-year lag in when we learn about it. Resolves YES if such a deal has already taken place.
Vide this minor difference of guesses between myself (Yudkowsky) and Sabine Hossenfelder.
i hold this at 80-90%
openai will want as much data as possible, with good metadata and annotation, which supports them making deals
openai has already made data deals with other publishers (reddit, etc)
openai will make these deals no matter what to avoid content liability (just potentially at lower prices)
openai will benefit from delegating the task of keeping the pile updated to someone else
even a small number of unique items could justify the deal at the right price
new managers they hire will be more used to purchasing rather than building or collecting
against this: they might really want to save the money because they could get the data elsewhere (but i think this matters less in a $100bn organization), or they might have other organizational priorities that make this too hard to do now
I haven't recently looked in on what sort of torrents exist there, or how hard it would be to throw it somewhere in Common Crawl and make it look like an accident. But if they're sufficiently scared of legal consequences to not do that, I'd guess they won't pay either.
If you specifically want a somewhat comprehensive collection of papers published since Sci-Hub stopped uploading new ones, I'm not sure there's anything. (If there is, someone please tell me!!!)
@jacksonpolack depends a little on the field, but many authors upload their papers to preprint servers such as arXiv and HAL
Was initially thinking that scraping papers is bad optics, but actually it might be worse to do public deals with scientific journals.
(Sabine Hossenfelder's tweet sums up how media would probably treat deals like this, i.e. "look at these journals selling researchers' hard work and not giving them anything")
@GeorgeIngebretsen What? OAI does not seem to care too much about optics.
As far as I can tell, they correctly rate small factors like “makes deals with sci journals” as very unimportant compared to making their product that much better.
It may seem important to you, but to the vast majority of people, it won’t make them more or less outraged at “Big Tech”.
Great point. I guess even the New York Times lawsuit didn't seem to matter much? Not to mention how much more coverage there will be as models become more capable.
I agree training data optics are just a drop in the bucket (and if oai actually builds what it's set out to build, eventually a completely insignificant detail). My og comment was more about trying to figure out the sign of the optics of paying for journal data. Though, after some thought, I realize optics are a negligible factor in whether oai would make that deal.