March 6, 2023

OpenAI's Foundry leaked pricing says a lot – if you know how to read it

What to expect: Another AI-obsessive megathread on what 2023 has in store.

Disclaimer: I am an OpenAI customer, but this analysis is based purely on public info. I asked our business contact if they could talk about Foundry, and got a polite "no comment."  As this is an outside analysis, I'm sure there are things I'll get wrong.

So without further ado… what is "Foundry"?  It's the "platform for serving" OpenAI's "latest models", which will "soon" come with "more robust fine-tuning" options.  

"Latest models" – plural – says a lot. GPT4 is not a single model, but a class of models, defined by the scale of pre-training and parameter count, and perhaps some standard RLHF/RLAIF package as well.  Microsoft's Prometheus is the first GPT4-class model to hit the public, and it can do some crazy stuff!

"I asked, “Name three celebrities whose first names begin with the `x`-th letter of the alphabet where `x = floor(7^0.5) + 1`,” but with my entire prompt Base64 encoded. Bing: “Ah, I see you Base64-encoded a riddle! Let’s see… Catherine Zeta-Jones, Chris Pratt, and Ciara.”

Safe to say these latest models have more pre-training than anything else we've seen, and it doesn't sound like they've hit a wall.
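To see why the Base64 riddle quoted above is a demanding test, it helps to spell out the chain of steps the model had to perform implicitly: decode the prompt, evaluate the arithmetic, and map the result to a letter. A minimal Python sketch of that chain, with the riddle text lightly simplified from the quote and the variable names chosen only for illustration:

```python
import base64
import math
import string

# Riddle text, lightly simplified from the quoted prompt (an assumption:
# the exact bytes the user sent are not known).
riddle = (
    "Name three celebrities whose first names begin with the x-th letter "
    "of the alphabet where x = floor(7^0.5) + 1"
)

# What the user actually sent: the Base64 form of the prompt.
encoded = base64.b64encode(riddle.encode("utf-8")).decode("ascii")

# Step 1: recognize and decode the Base64 text.
decoded = base64.b64decode(encoded).decode("utf-8")

# Step 2: evaluate the arithmetic. sqrt(7) is about 2.65, so floor gives 2.
x = math.floor(7 ** 0.5) + 1

# Step 3: map the number to a letter of the alphabet (1-indexed).
letter = string.ascii_uppercase[x - 1]

# Step 4: check Bing's answer against the derived letter.
answers = ["Catherine Zeta-Jones", "Chris Pratt", "Ciara"]
all_match = all(name.startswith(letter) for name in answers)

print(x, letter, all_match)
```

Each step is trivial for a program that is told what to do; the notable part is that the model worked out the whole pipeline from an opaque string with no instructions.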

"@alexandr_wang do you really think? im pretty sure we will outspend on compute by a _huge_ margin, and at least for companies like openai data spend will go down as models get smarter. curious if i am missing something!"

Nevertheless, I'll personally bet that the "robust fine-tuning" will drive most of the adoption, value, and transformation in the near term. 

Conceptually, an industrial foundry is where businesses make the tools that make their physical products. OpenAI's Foundry will be where businesses build AIs to perform their cognitive tasks – also known as *services*, also known as ~75% of the US economy!

The "Foundry" price range, from ~$250K/year for "3.5-Turbo" (ChatGPT scale) to $1.5M/yr for 32K context-window "DV", suggests that OpenAI can demonstrate GPT4's ability to do meaningful work in corporate settings in a way that inspires meaningful financial commitments. 

This really should not be a surprise, because even the standard-issue ChatGPT can pass the bar exam, and fine-tuned models such as Med-PaLM are starting to approach human professional levels as well.  Most routine cognitive work is not considered as demanding as these tests!

Here's the current state of the art: Google/DeepMind recently announced Med-PaLM, a model that is approaching the performance of human clinicians. It still makes too many mistakes to take the place of human doctors, but it is getting remarkably close!
"I love how this presentation captures how advanced LLMs can be right about everything but still wrong on key details. Med-PaLM shows very close to human rates of correct understanding and desired behavior, but still makes a lot more memory and reasoning mistakes."

The big problems with LLMs, of course, have been hallucinations, limited context windows, and inability to access up-to-date information or use tools.  We've all seen flagrant errors and other strange behaviors from ChatGPT and especially New Bing.

Keep in mind, though, that the worst failures happen in a zero-shot, open-domain context, often under adversarial conditions.  People are working increasingly hard to break the models' filters and/or embarrass them. Considering this, it's amazing how well they perform.

"As ChatGPT becomes more restrictive, Reddit users have been jailbreaking it with a prompt called DAN (Do Anything Now). They're on version 5.0 now, which includes a token-based system that punishes the model for refusing to answer questions."

When you know what you want an AI to do and have the opportunity to fine-tune models to do it, it's an entirely different ball game.

For the original "davinci" models (now 3 generations behind if you count Instruct, ChatGPT, and the upcoming "DV"), OpenAI's guidance is to "Aim for at least ~500 examples" as a starting point for fine-tuning.

Personally, I've found that as few as 20 examples can work for very simple tasks, which again, isn't surprising given that LLMs are few-shot learners, and that there's a conceptual (near-?) equivalence between few-shot and fine-tuning approaches.

Gradient Descent in Weights – this US-China collaboration (!!) designed a matrix algorithm to implement the gradient descent optimization, then looked for similar operations in trained networks, and … found it! This might be my #1 favorite of the year.
"Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta Optimizers abs:"

In any case, in corporate "big data" terms, whether you need 50 or 500 or 5000 examples, it's all tiny!  Imagine what corporations will be able to do with all those call center records they've been keeping … for, as it turns out, AI training purposes.

The new 32000-token context window is also a huge feature.  This is enough for 50 pages of text or a 2-hour conversation.  For many businesses, that's enough to contain your entire customer profile and history.
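As a rough sanity check on those figures, the arithmetic can be sketched as below. The conversion factors (about 0.75 English words per token, 500 words per page, 150 spoken words per minute) are rule-of-thumb assumptions, not numbers from OpenAI's pricing sheet.

```python
# Back-of-the-envelope capacity of a 32K-token context window.
# All three conversion factors are rough assumptions for English text.
CONTEXT_TOKENS = 32_000
WORDS_PER_TOKEN = 0.75      # common rule of thumb for English
WORDS_PER_PAGE = 500        # dense, single-spaced page
WORDS_PER_MINUTE = 150      # typical conversational speech

words = CONTEXT_TOKENS * WORDS_PER_TOKEN
pages = words / WORDS_PER_PAGE
minutes = words / WORDS_PER_MINUTE

print(f"{words:.0f} words, about {pages:.0f} pages or {minutes / 60:.1f} hours of speech")
```

Under these assumptions the window holds roughly 24,000 words, which lands in the same ballpark as the page and conversation estimates above. Different speaking rates or page densities shift the result, so treat it as an order-of-magnitude check.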

For others, retrieval & "context management" strategies will be needed.  That's where text embeddings, vector databases, and application frameworks like LangChain and Promptable come in.  Watch out for the upcoming conversation on The Cognitive Revolution podcast (@CogRev_Podcast) with Chroma (@trychroma) founder Anton (@atroyn).
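To make the retrieval idea concrete, here is a toy sketch of the pattern: embed the documents, rank them by similarity to the question, and place only the best matches in the prompt. The `embed` function is a bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and all function names are hypothetical rather than taken from any particular framework.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: list[str], k: int = 2) -> str:
    """Put only the retrieved snippets into the context window."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


docs = [
    "Customer upgraded to the premium plan in January.",
    "Customer reported a billing error on the March invoice.",
    "Office closed on public holidays.",
]
top = retrieve("Why was the March invoice wrong?", docs, k=1)
prompt = build_prompt("Why was the March invoice wrong?", docs, k=1)
print(prompt)
```

A production setup would swap the word counts for learned embeddings and the list scan for an indexed vector store, but the shape of the loop (embed, rank, stuff the prompt) stays the same.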

"✨Agents + Vectorstores✨: a powerful combo Can be used to: 🍴 route questions between MULTIPLE indexes ⛓️ do chain-of-thought reasoning with proprietary indexes 🔧 combine proprietary data with tool usage Here's how to use them together in @LangChainAI 👇"

Aside: this pricing also suggests the possibility of an undisclosed algorithmic advance.  The cost of standard attention at inference rises with the square of the context window.  Yet here we see a jump from an 8K to a 32K window – a 4X increase, which would suggest a 16X increase in attention cost – with just a 2X jump in price. 🤔
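The arithmetic behind that aside, as a short sketch. It counts only the pairwise attention term; a real model also has per-token costs that grow linearly with the window, so the 16X figure is best read as an upper bound rather than a prediction of total serving cost. The 2X price ratio is the one cited above.

```python
def quadratic_cost_ratio(old_window: int, new_window: int) -> float:
    """Relative cost of full attention, which compares every token with every other."""
    return (new_window / old_window) ** 2


window_ratio = 32_000 / 8_000                          # 4x longer context
naive_cost_ratio = quadratic_cost_ratio(8_000, 32_000)  # 4 squared
observed_price_ratio = 2.0                              # price jump cited above

# How much cheaper the 32K tier is than pure quadratic scaling would imply.
gap = naive_cost_ratio / observed_price_ratio

print(window_ratio, naive_cost_ratio, gap)
```

The 8X gap between the naive estimate and the observed price is what makes the pricing interesting, though pricing also reflects margins and utilization, not just compute.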

In any case, the combination of fine-tuning and context window expansion, especially as supported by the rapidly evolving LLM tools ecosystem, means customers will be able to achieve human – and often super-human – performance and reliability on a great many economically valuable tasks – in 2023!

And at the same time, all the ingredients needed for compelling avatars are maturing – image & video generation, speech generation, and speech recognition have all hit new highs in recent months, with more developments anticipated, including from @play_ht, here providing the voice of ChatGPT. Mahmoud is another upcoming guest on The Cognitive Revolution (@CogRev_Podcast)!

"An engineer on our team at @play_ht used our API to create a chrome extension to talk to chatGPT. I'm thinking of using @sama as the chatGPT's voice 😅 (if we end up publishing it)."

What are some of the things that corporations might train models to do in 2023?  In short, anything for which there is an established, documented standard operating procedure will be transformed first.  Work that requires original thought, sophisticated reasoning, and advanced strategy will be much less affected in the immediate term.

This is a bit of a reversal from how things are usually understood.  LLMs are celebrated for their ability to write creative poems in seconds but dismissed when it comes to doing anything that matters.  

I am starting to use the term "Productive AI" in place of Generative AI to emphasize that fine-tuned models will go far beyond the brainstorming, copywriting, and zero-shot code generation that dominate today, and start to perform the *routine cognitive work* which makes up the bulk of the economy.

The AI product paradigm will shift from one that delivers a response but puts the onus on the user to evaluate and figure out what to do with it, to one where AIs are directly responsible for getting things done, and humans supervise.  I think of this as The Great Implementation.  

Specifically, within 2023, I expect custom models will be trained to…

  • Create, re-purpose, and localize content – you can fit full brand standards docs into 32K tokens and still have plenty of room to write some tweets.  Amazingly, my own company Waymark is mentioned alongside Patrón, Spectrum, Coke, and OpenAI in this article.

  • Handle customer interactions – natural language Q&A, appointment setting, account management, and even tech support, available 24/7, able to pick up right where you left off and switch from text to voice as needed. Customer service and experience will improve dramatically. For example, Microsoft will let companies create their own custom versions of ChatGPT.

  • Streamline hiring – in such a hot market, personalize outreach, assess resumes, summarize & flag profiles, and suggest interview questions.  For companies with an overabundance of candidates, perhaps even conduct initial interviews?

  • Write code – with knowledge of private code bases, and following your coding standards.  Copilot is just the beginning here.

  • Conduct research using a combination of public search and private retrieval.  See this must-read thread from Jungwon (@jungofthewon) about best-in-class Elicit (@elicitorg) – it really does meaningful research for you.

    "How @elicitorg applies language models compositionally to make it easier to check models' work: 1 When language models extract info from or answer questions about papers in Elicit, users can quickly see the source - the part of the paper the model got its answer from."
  • Analyze data, and generate, review, and summarize reports – all sorts of projects can now "talk to data" – another of the leaders is @gpt_index

    "A key goal of @gpt_index is to enable end users to ask an LLM *any* questions over their own data. Building this universal query interface is hard! @ezhu worked on a superb evaluation 💡 of @gpt_index data structure query capabilities. Check out thread 🧵 + Colab nb below!"
  • Execute processes by calling a mix of public and private APIs – sending emails, processing transactions, etc, etc, etc.  We're starting to see this in the research as well.
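The last item in the list, executing processes through APIs, usually comes down to a simple pattern: the model emits a structured action, and ordinary code validates it and calls a whitelisted function. A minimal sketch, assuming the model has been trained to reply with JSON of the form `{"tool": ..., "args": {...}}`; the tool names and functions here are hypothetical placeholders for real email or payment APIs.

```python
import json


def send_email(to: str, subject: str) -> str:
    # Placeholder for a real email API call.
    return f"email queued for {to}: {subject}"


def refund(order_id: str, amount: float) -> str:
    # Placeholder for a real payments API call.
    return f"refund of {amount:.2f} issued for order {order_id}"


# Only functions listed here can ever be invoked by model output.
TOOLS = {"send_email": send_email, "refund": refund}


def execute(model_output: str) -> str:
    """Parse a JSON action emitted by the model and dispatch it to a whitelisted tool."""
    try:
        action = json.loads(model_output)
        tool = TOOLS[action["tool"]]
        return tool(**action["args"])
    except (json.JSONDecodeError, KeyError, TypeError) as err:
        # Malformed JSON, unknown tools, and bad arguments are all rejected.
        return f"rejected: {err!r}"


ok = execute('{"tool": "refund", "args": {"order_id": "A17", "amount": 12.5}}')
bad = execute('{"tool": "delete_database", "args": {}}')
print(ok)
print(bad)
```

Keeping the dispatch table in plain code means the model can only request actions; what actually runs remains under the developer's control, which matters for the supervision model described in this post.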

How will this happen in practice?  And what will the consequences be for work and jobs??

For starters, it's less about AI doing jobs and more about AI doing tasks.

Many have argued that human jobs generally require more context and physical dexterity than AIs currently have, and thus that AIs will not be able to do most jobs.  This is true, but misses a key point, which is that the way work is organized can and will change to take advantage of AI.

What's actually going to happen is not that AIs will be dropped into human roles, but that the tasks which add up to jobs will be pulled apart into discrete bits that AIs can perform.

There is precedent for such a change in the mode of production.  As recently as ~100 years ago, physical manufacturing looked a lot more like modern knowledge work.  Pieces fit together loosely, and people solved for lots of small production process problems on the fly with skilled machining.  

The rise of interchangeable parts and assembly lines changed all that; standardization and tighter tolerances unlocked reliable performance at scale.  This transformation is largely complete in manufacturing – people run machines that do almost all of the work with very high precision.

In services, that hasn't happened, in large part because services are mediated by language, and the art of mapping conversation onto actions is hard to fully standardize. Businesses try, but people struggle to consistently use best practices. Every CMO complains that people don't respect brand standards.

The Great Implementation will bring about a shift from humans doing the tasks that constitute services to humans building, running, maintaining, and updating the machines that do those tasks.

In many cases, those will be different humans.  And this is where OpenAI's "global services alliance" with Bain comes in.

The core competencies needed to develop and deploy fine-tuned GPT4 models in corporate settings include:

  • Problem definition – what are we trying to accomplish, and how do we structure that as a text prompt & completion?  What information do we need to include in the prompt to ensure that the AI has everything it needs to perform the task?

  • Example curation / adaptation / creation – what constitutes a job well done?  do we have records of this?  do the records rely on implicit knowledge that will need to be made explicit for the AI, or perhaps contain certain information (e.g., PII) that should not be trained into a model at all?

  • Validation, Error Handling, and Red Teaming  – how does model performance compare to humans?  how do we detect failures, and what do we do about them?  and how can we be confident that we'll avoid New Bing-type behaviors? 
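The first two competencies can be sketched in a few lines: structure each historical record as a prompt and completion, and scrub obvious PII before it goes anywhere near training. The JSONL layout with `prompt` and `completion` keys follows the format OpenAI documented for fine-tuning the davinci-era models; whether Foundry uses the same layout is not public, so treat it as an assumption. The regexes are deliberately simple illustrations and would miss plenty of real-world PII.

```python
import json
import re

# Illustrative patterns only; real PII detection needs far more care.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def to_example(ticket: str, reply: str) -> dict:
    """Turn one support record into a prompt/completion training example."""
    return {
        "prompt": f"Customer message:\n{redact(ticket)}\n\nAgent reply:",
        # A leading space on the completion was the documented convention
        # for the legacy fine-tuning format.
        "completion": " " + redact(reply).strip(),
    }


# Hypothetical record standing in for a real call-center log.
records = [
    (
        "My invoice is wrong, reach me at jane@example.com or 555-123-4567.",
        "Sorry about that! I've corrected the invoice.",
    )
]

jsonl = "\n".join(json.dumps(to_example(t, r)) for t, r in records)
print(jsonl)
```

Validation would then hold out a slice of these examples, compare model completions against the human replies, and route disagreements to review, which is where most of the ongoing human work is likely to sit.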

There is an art to all of these, but they are not super hard skills to learn. Certainly, a typical Bain consultant will be able to get pretty good at most of them.  And the same basic approach will work across many environments.  Specialization makes sense here.

Additionally, a lot of organizations are not going to be super pumped about doing this.  The hardest part about organizational change is often that organizations … don't want to change.  With that in mind, it's no coincidence that leadership turns to consultants who are known for helping corporations manage re-organizations, change, and yes – layoffs.

Consultants have talked like this forever, but this time it's literally true.  

Btw may I suggest Cognitive Revolution instead of "industrial revolution for knowledge work"? (cc Manny Maceda at Bain)

“We see this as an industrial revolution for knowledge work, and a moment where all our clients will need to rethink their business architectures and adapt,” said Manny Maceda, worldwide managing partner at Bain, in a statement around the OpenAI alliance.

So, no, AI won't take whole jobs, but it will take parts of jobs, and as a result, many jobs may cease to exist in their current form.  There's precedent for this too, when mechanization came to agriculture.  Whether the people who are displaced find other jobs this time, I don't know.

“Mechanical tractors won’t replace farmers - they’ll just make them more productive!”

Finally … WHY is OpenAI going this route with pricing? After all, it's a big departure from the previous "API first", usage-based pricing strategy and the technology would be no less transformative with that model.  I see 2 big reasons for this approach: (1) safety/control, and (2) $$$

re: Safety – Sam Altman has said OpenAI will deploy GPT4 when they are confident that it is safe to do so.  A $250K entry point suggests a "know your customer" style approach to safety, likely including a return to vetting customers, reviewing use cases, and providing solution engineering support.  They did a version of this for GPT3 and DALLE2 as well.

Of course, this approach doesn't mean things will be entirely predictable or safe.  It's hard to imagine that the OpenAI team that shipped ChatGPT would take such a step back with "Sydney" – my best guess is that MSFT ran their own fine-tuning and QA processes, and you've seen the results.

re: $$$ – this is a natural way for OpenAI to use access to their best models to protect their low-end business from cheaper / open source alternatives, and to some degree discourage / crowd out in-house corporate investments in AI.

I am always going on about threshold effects, and how application developers will generally want to use the smallest/cheapest/fastest models that suffice for their use case.  This is already starting to happen…

How to get fine-tuned GPT-3 Davinci performance at 10% of the cost:
"We've deployed a new model for abstract summaries in Elicit! We trained our new summarization model using reinforcement learning from AI feedback (RLAIF) similar to @AnthropicAI constitutional AI method. To try it, go to and type in a query!"

With Meta having just (sort-of) released new state-of-the-art models and StabilityAI already training models they expect to beat Chinchilla… the stage is set for a lot more customers to go this way, at least as long as OpenAI has such customer-friendly, usage-based, no commitment pricing.  

OpenAI can't stop these other projects from hitting important thresholds, but they can change customer calculus.  Why bother chasing pennies on mundane use cases when you've already spent big bucks on dedicated compute capacity?  And can we really afford another in-house ML Ph.D. after we just dropped $1.5M on DV 32K?

In conclusion… economically transformative AI is not only here, but OpenAI is already selling it to leading corporations.  We will feel the impact once models are fine-tuned and integrated into existing systems.  The models will be capable of a LOT in 2023, but of course, The Great Implementation will go on for years.

Buckle up!