Sovereign AI in Poland: Language Adaptation, Local Control & Cost Advantages with Marek Kozlowski

Marek Kozlowski discusses Poland's PLLuM project, which aims for AI sovereignty by training small, locally-adapted models. This approach addresses English bias, ensures local control and privacy, and offers cost advantages for specific languages and cultures.

Show Notes

Marek Kozlowski, Head of the AI Lab at Poland's National Information Processing Institute, discusses project PLLuM (Polish Large Language Models). He shares how countries like Poland can achieve AI sovereignty by training small, locally-adapted models for specific languages and cultures, ensuring control, privacy, and cost advantages. The conversation delves into challenges like frontier models' English bias, EU regulations, and technical strategies like "Language Adaptation" on base models. Discover how transparently created, locally-controlled AI offers a viable path for nations to maintain their technological destiny.

PSA for AI builders: Interested in alignment, governance, or AI safety? Learn more about the MATS Summer 2026 Fellowship and submit your name to be notified when applications open: https://matsprogram.org/s26-tcr

LINKS:

Sponsors:

Google AI Studio:

Google AI Studio features a revamped coding experience to turn your ideas into reality faster than ever. Describe your app and Gemini will automatically wire up the right models and APIs for you at https://ai.studio/build

Agents of Scale:

Agents of Scale is a podcast from Zapier CEO Wade Foster, featuring conversations with C-suite leaders who are leading AI transformation. Subscribe to the show wherever you get your podcasts

Framer:

Framer is the all-in-one platform that unifies design, content management, and publishing on a single canvas, now enhanced with powerful AI features. Start creating for free and get a free month of Framer Pro with code COGNITIVE at https://framer.com/design

Tasklet:

Tasklet is an AI agent that automates your work 24/7; just describe what you want in plain English and it gets the job done. Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai

Shopify:

Shopify powers millions of businesses worldwide, handling 10% of U.S. e-commerce. With hundreds of templates, AI tools for product descriptions, and seamless marketing campaign creation, it's like having a design studio and marketing team in one. Start your $1/month trial today at https://shopify.com/cognitive

PRODUCED BY:

https://aipodcast.ing

CHAPTERS:

(00:00) Sponsor: Google AI Studio

(00:31) About the Episode

(03:17) Sovereign AI in Poland

(04:41) The Case for Localization

(13:38) The PLLuM Project's Mission (Part 1)

(20:25) Sponsors: Agents of Scale | Framer

(22:47) The PLLuM Project's Mission (Part 2)

(22:47) Defining Polish AI Values

(35:32) Sourcing and Curating Data (Part 1)

(35:38) Sponsors: Tasklet | Shopify

(38:46) Sourcing and Curating Data (Part 2)

(44:40) Small Models, Big Advantage

(58:21) Training and Domain Adaptation

(01:12:22) Compute, Talent, and Geopolitics

(01:22:50) Forming International AI Alliances

(01:27:41) Decentralized AI and Conclusion

(01:31:47) Outro

SOCIAL LINKS:

Website: https://www.cognitiverevolution.ai

Twitter (Podcast): https://x.com/cogrev_podcast

Twitter (Nathan): https://x.com/labenz

LinkedIn: https://linkedin.com/in/nathanlabenz/

Youtube: https://youtube.com/@CognitiveRevolutionPodcast

Apple: https://podcasts.apple.com/de/podcast/the-cognitive-revolution-ai-builders-researchers-and/id1669813431

Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk


Transcript

Introduction

Hello, and welcome back to the Cognitive Revolution!

While we often discuss "Sovereign AI" in the Silicon Valley AI bubble, we rarely hear directly from the technical leaders who are actually running national AI projects.

And so, today I'm very glad to share my conversation with Marek Kozlowski, who's leading project "PLLuM" — which stands for Polish Large Language Models — in his role as Head of the AI Lab at the National Information Processing Institute of Poland.

Poland, with a population of 38 million and a GDP of roughly $1 trillion (roughly 10% and 3% of the United States, respectively), is an interesting and in some ways representative case study. It clearly doesn't have the resources required to compete with the US and China at the AI frontier, but it does have strong technical talent, a real sense of pride in its language and culture, and a deep desire to control its own technological destiny and avoid domination by global superpowers.

So, what does that mean in practice?  

As you'll hear, Marek's strategy rests on the core belief that by training small models for a particular local language and cultural context, countries like Poland and projects like PLLuM can compete with the latest frontier models while retaining control, preserving data privacy, and achieving a major cost advantage.  

In this conversation, we dig into the strategic realities that motivate projects like PLLuM and the technical challenges they have to overcome to succeed, including:

  • how today's frontier models, which are trained on overwhelmingly English and Chinese data, fall short in other languages,
  • why this problem is actually getting worse from one generation to the next as frontier model developers prioritize things like coding performance over support for niche languages,
  • how EU regulation prevents European AI builders from conducting massive web-scrapes and forces them to rely on more focused data curation projects,
  • how the Polish government is thinking about investing finite resources in data, compute, and talent,
  • the "Language Adaptation" techniques Marek’s team layers on top of Llama and Mistral base models to inject local knowledge without starting from scratch,
  • why they haven't yet had to worry about developing a constitution or other explicit articulation of values for Polish AI systems,
  • and why government agencies and national champion companies are often better served by smaller models, fine-tuned for specific tasks and served locally, than by massive generalist models served from the cloud.


Overall, Marek's mix of realism about the challenges of competing with global leaders and his positive vision for transparently created, locally-controlled AI is a great window into what AI leaders around the world are thinking and doing to maintain AI Sovereignty.  

With that, I hope you enjoy this deep dive into the meaning and training of Polish AI, with Marek Kozlowski.


Main Episode

Nathan Labenz: Marek Kozlowski, head of the AI Lab at the National Information Processing Institute of Poland. Welcome to the Cognitive Revolution.

Marek Kozlowski: Welcome, everyone.

Nathan Labenz: I'm excited for this conversation, too. We met not too long ago at an AI event in Las Vegas, the Enterprise Technology Leadership Summit, and I thought it would be really interesting to double click on everything that you're doing. Because in the United States, and in the sort of Silicon Valley AI circles that I spend most of my time in, there is this ongoing conversation about sovereign AI. And I think it's funny that a lot of this conversation happens in the Silicon Valley bubble and makes a bunch of assumptions about what other countries feel the need to have, or aspire to create, and what's driving those decisions. I don't hear too much from primary sources, from people who are actually doing the sovereign AI projects around the world. So I was excited to meet you and learn more about what you're doing in Poland. Poland, obviously, is a country with a lot of technical skill, a very distinct culture, its own language, and a proud tradition. So I'm really interested to get into it and figure out what sovereign AI means in the context of Poland.

Marek Kozlowski: Once again, thank you for the introduction and for presenting the idea. I frame it slightly more broadly: not only sovereignty, but also creating localized LLMs. Localized can mean national LLMs, but also domain-oriented LLMs. I am promoting, maybe not creating, the idea of localized LLMs, meaning LLMs adapted to a language or a domain. In that language or domain, they have higher-quality understanding of text, and they are able to generate higher-quality text. So building localized LLMs, adapted to a language or a domain, has two goals: first, to improve understanding in that domain or language, and second, to give the possibility to generate higher-quality text, in linguistic and cultural aspects. That is the idea, and our goal is to create models that are an order of magnitude smaller than the popular closed LLMs, but that in the aspects of the language, the culture, or the domain have the same quality as models ten times bigger. And they are open source, transparent, secure, and as organic as we can make them.

Nathan Labenz: Yeah. OK, great. That's a great start. Can we maybe take one step back and talk about why this is needed, first of all from a capabilities perspective? Famously, and gosh, it's been a minute, but I think it was the GPT Instruct series, the model originally was text-davinci-002 if I recall correctly, one of the first models that OpenAI trained to follow instructions. They reported, basically: we just trained this thing to follow instructions in English, and lo and behold, it seemed to be able to follow instructions in other languages too. That was obviously a strong example of emergent capabilities and transfer learning, positive generalization, all these sorts of phenomena that had previously been elusive, and that in many ways characterize the phase change we've gone through from earlier AI systems to these more general AI systems, with positive transfer being a huge one. But that's where they started: just English, and, oh my God, it works in other languages. Since then, of course, they've done a lot of work to collect data in other languages, to try to even things out. And my sense from benchmark data is that they have made pretty good progress, but performance is still best in English. You can think of performance getting worse the farther a language is from English in the language tree, and correspondingly, the fewer resources it has. Low-resource languages are obviously going to be a bigger challenge than high-resource languages. That's my sense of--

Marek Kozlowski: Yeah, and I can add to your insights. First of all, 90% of the training data is English and Chinese. Even if you look at the biggest open-source or the biggest closed LLMs, 90%-plus of the data is English and Chinese; only 10% or less is other languages. It varies, but in some models the Polish language, to use this example, is about 1% of the corpora, or even less. And this means that the vast majority of the skills and competences are gained from English and Chinese instructions. Of course, a trained model has huge competences in transfer learning; you can say it extrapolates very easily between tasks. For example, even if the instructions contain lots of mathematical calculations prompted or commanded in English, and I ask for them in Spanish, a very large model can do it in intermediate steps: it can translate the commands on the fly from Spanish to English and map the knowledge from English to produce the solution, even if it was never trained on Spanish examples of how to calculate mathematical formulas. But most importantly, this works very well, and it works the same way as we or our kids learn a language. First we learn how to listen and understand, then how to write and speak. And if we learn a new language and receive some commands, we try to map those commands to what we know from our primary, native language. The same is going on inside the LLMs. For example, models that were not trained on a huge volume of Polish text are still able to be communicative and create text that is understandable, but there are statements and phrases that very easily identify them as non-native. Take writing emails in Polish. They use formulas typical for English, like "I hope you stay in good health and condition." This is typical for English, but not for Polish. Even if you translate it word for word, it's communicative and understandable, but not typical for our language and culture.

Nathan Labenz: So is there more to say about how the leading commercial models are underserving the Polish market than that? I have the sense that there is a little more to it than just cultural idiosyncrasy, because even when I look at, say, an MMLU benchmark, it does seem like performance degrades across the language spectrum, right? The highest MMLU score is in English; it does seem to get worse in other...

Marek Kozlowski: I know, but when you look at the benchmarks: we live in a world where we are biased by the benchmarks. The MMLU benchmark, for example, is mostly multiple-choice questions where you choose solution A, B, C, or D. It does not test the ability to communicate fluently in the language. Most benchmarks don't test how good the model is at producing longer forms, longer writing or longer sentences. They usually test understanding, extractive competences, summarizing competences, and knowledge about facts and the world. There are very few benchmarks that test how good a model is at generating longer forms of text in languages other than English and Chinese, for example Danish, because it's much harder. So in Poland we created the benchmark PLCC, the Polish Linguistic and Cultural Competency benchmark. This benchmark enables us to evaluate how good a model is in different subcategories: not only grammar and vocabulary, but also our culture, tradition, and history, and many others. We would like to evaluate not only how good the model's wording is, how good it is with phrases typical of our tradition, and its knowledge of history, but also how good the models are with ambiguous words in Polish, for example. But this benchmark still doesn't validate how good the longer sentences are in Polish, how good the model is at producing longer structures in the Polish language.

Nathan Labenz: Yeah. Interesting. So is it fair to say that the primary focus of your work in creating Polish native models is on these sort of softer skills? It doesn't sound like you're focused on closing the benchmark gap or the reasoning gap that exists between English and Polish. It's more about, as you said, culture, values, tradition, history, cultural competency.

Marek Kozlowski: It's because I think language is not only wordings. A model can have a very broad vocabulary, but it should be able to use it properly in context. And sometimes language is not only the words; it's culture, tradition, history, everything mixed together. To create a model that behaves like a native, you have to inject not only the knowledge of how to create grammatically correct sentences, but also how to use special idioms or phrases in special contexts, which places are typical for Polish history, or maybe which places are viral now. Generally, you have to mix history, grammar, vocabulary, art, entertainment, culture, and tradition, everything together, to create a language ability that is similar to a native's. But you ask why we are doing this. First of all, as I mentioned, we believe in the idea of localized islands, islands adapted to the language, so that there are alternatives as similar to the big models as possible. The second issue is the competency gap. We have to develop our people, our engineers, to have the skills to build our own models, because maybe in a few years the market will change, maybe the models will be closed, or maybe some models will be forbidden. There are plenty of models currently that we in the European Union are not able to use because of the AI Act; the licenses of the Llama models, Kimi, and many others contain a statement that they are prohibited for use in the European Union. Maybe we will be forced to use this knowledge to build our own models. Maybe they will be a little bit worse than the Chinese or American ones, but they will be our own. Sometimes it's better to have the competences to build something even a little bit worse than not to have them at all. "We can" sometimes means more than you think.
But with the PLLuM family, because we create a family of models, we also believe in transparency, because we show how we built it from scratch. Two weeks ago we released a publication of almost 100 pages on how we built these models. And we released not only the publication, the recipe book, the cookbook; we also published on Hugging Face samples of our datasets, the instructions and preferences. We would like to show not only the open weights, because open source is not only open weights; there should also be samples of open data and a cookbook describing how we did it, step by step, in a very detailed manner. This is important to us, because even the most popular open-source models right now are only open weight; there are no samples of the instructions or preferences used to train them. We would like to go a step further, to be as transparent as possible. We also invest a lot in organic data, because we believe, and we have shown, that quality matters across the three stages of training a model. First is pre-training. It's similar to teaching kids a new language: you identify the words, how to create structures from those words, and some pieces of information. But a child, after this type of learning, is not able to solve mathematical calculations or write an essay. You learn the language, but you don't learn the competencies. The next stage is SFT, supervised fine-tuning, where you learn how to solve downstream tasks: write a poem, write an essay, summarize this article, perform some calculations. You learn the competencies, like children in school, where you have math, geography, chemistry, and many others. After that you have alignment, preference learning: like marking what the children have done on a test, where the marks tell them what should be corrected. The same things are done with the LLMs. In pre-training, you show the model hundreds of billions of tokens to learn the language. Then, in the SFT stage, if you show the model synthetic instructions, synthetic meaning produced by other LLMs, and they are linguistically poor, they degrade the model. At any stage of training, when the model sees poorer data, it degrades; the quality of its linguistic construction of sentences goes down. So we focus mainly on creating organic datasets, organic instructions and preferences, and even when we use LLMs to produce such instructions, we have humans check them to improve their structure and quality. I think these are the novelties. First, it is open source, open data, and open cookbook. Second, it's transparent, because the cookbook describes what we did step by step, and we showed the samples. We also focus on organic data, organic instructions, organic preferences. And I think this is a reason why the GPTs and other big models are so good: they have plenty of manual instructions, and they don't show them, because that is the intellectual property of those companies. And we also secured our models on our own, because we discovered that models that were secured for English speakers can be hacked much more easily than models secured for Polish speakers. Those are the novelties, briefly speaking.
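
To make the three-stage pipeline Marek describes concrete, here is a minimal sketch using the Hugging Face trl library. The model name, dataset files, and configuration values are placeholders for illustration, not PLLuM's actual assets, and trl's exact argument names vary somewhat across versions.

```python
# Sketch of the three stages: pre-training teaches the language, SFT teaches
# competencies, preference learning (here DPO) provides the "grading".
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

base = "some-org/base-model"                     # hypothetical checkpoint
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Stage 1 (pre-training) is a separate causal-LM run over billions of raw
# tokens and is omitted here.

# Stage 2: supervised fine-tuning on instruction/response pairs
# (a JSONL file assumed to have a "text" field with formatted examples).
sft_data = load_dataset("json", data_files="instructions.jsonl")["train"]
SFTTrainer(model=model, train_dataset=sft_data,
           args=SFTConfig(output_dir="sft-out")).train()

# Stage 3: preference learning on "prompt"/"chosen"/"rejected" triples,
# analogous to marking which of two answers a human preferred.
pref_data = load_dataset("json", data_files="preferences.jsonl")["train"]
DPOTrainer(model=model, args=DPOConfig(output_dir="dpo-out"),
           train_dataset=pref_data, processing_class=tok).train()
```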

Nathan Labenz: I have like seven follow-up questions I want to ask about various parts of that. Maybe we can break it down by inputs to AI: obviously the big inputs are data, compute, and talent, and you touched on data and talent there. I also do want to come back to the safety training, because that's always a keen interest of mine. But maybe let's start with the goal. You've spoken about it somewhat, but I think one big challenge that we have in the United States, especially in the context of the geopolitical competition in AI, is that there's a lot of talk about wanting AI with democratic values to win. We don't want Chinese values, or maybe we're even bold enough to say we want American values to be the values that the AIs embody and propagate through the world. That obviously raises a big question: what are those American values? And I can certainly say that there's no single agreed-upon answer for that, right? What American values are is hotly contested on an ongoing basis. And that leaves the AI companies to come up with their own best guess of what that should be. And that too is often sharply criticized, because it's too woke, or it's not woke enough, or it's right-wing extreme, or it's describing itself as Hitler.

Marek Kozlowski: The LLMs are, in a way, a compressed representation of what we have on the web, on the internet. What topics are the most important, what information is the most peculiar, is somehow reflected in the LLMs. If you have problems, political problems, religious problems, everything that is there is also reflected somehow in the compressed LLMs, because LLMs are compressed memory repositories. They are compressed stores of the memory of the internet.

Nathan Labenz: Certainly, all that stuff is baked in. Though I don't know how far the leading American companies have come today in terms of filtering the training data. I know that there are some techniques that are like: we're going to get rid of all the bad pre-training data and just try to show this--

Marek Kozlowski: Yeah, that's a typical step. Even in our project there is what's called data curation, because, as I mentioned, in the pre-training stage 90% of the data is web data, and plenty of web data is crappy. You're not able to use it, because your model will not be stable. In this data curation step there are two sub-stages. First is deduplication: you remove the same information that is repeated very often on the internet. Sometimes it shrinks the data by a factor of two, because there are plenty of duplicates on the internet. And there is also filtering out: we filter out the data that is very crappy, of low quality. For example, text with plenty of special characters, excessive punctuation, plenty of words not found in our vocabulary. There is plenty of such disturbed data that should be cut, because it will have an impact on the stability and quality of the models. And I think you mentioned that the big companies have tools that are able to eliminate not only poor-quality data, but also, for example, certain theories, certain points of view, a much broader selection, not only the linguistic aspects of the data. So it's the same as censorship. The Chinese models, if you ask what happened in Tiananmen Square, are not able to give you any information. Generally, the people who build models are able to isolate or ban some important information, and for people who are not aware of that, the model will be a reflection of the world without some part of it.
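
As a rough illustration of the two curation sub-steps Marek names, deduplication and then quality filtering, here is a toy Python sketch. The documents, vocabulary, and thresholds are invented for illustration; real pipelines use fuzzy MinHash-style dedup and learned quality classifiers at much larger scale.

```python
import hashlib
import re

# Toy stand-ins for the real inputs: a crawled corpus (with a duplicate and
# a noise document) and a reference vocabulary.
raw_docs = ["Przykladowy dokument ...", "Przykladowy dokument ...", "@@##!!"]
polish_vocab = {"przykladowy", "dokument"}

def dedup(docs):
    """Exact dedup via hashing; real pipelines add near-duplicate detection."""
    seen, out = set(), []
    for d in docs:
        h = hashlib.sha256(d.strip().lower().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(d)
    return out

def looks_clean(doc, vocab, min_words=2):
    """Heuristic quality filter; all thresholds here are illustrative."""
    words = re.findall(r"\w+", doc.lower())
    if len(words) < min_words:                 # too short to be useful
        return False
    special = sum(not c.isalnum() and not c.isspace() for c in doc)
    if special / max(len(doc), 1) > 0.3:       # too much punctuation/noise
        return False
    known = sum(w in vocab for w in words)
    return known / len(words) > 0.7            # mostly in-vocabulary words

corpus = [d for d in dedup(raw_docs) if looks_clean(d, polish_vocab)]
print(corpus)  # the duplicate and the noise string are both dropped
```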

Nathan Labenz: Yeah. But there are at least two layers to this, right? I mean, there is the pre-training data filtering, and I could genuinely believe that the Chinese models are trained on data that's so thoroughly filtered as to never have seen any document about Tiananmen.

Marek Kozlowski: I think it's not even in the pre-training stage. I think it's in the last stage, because, as I mentioned, there are three stages of training models: pre-training, SFT, supervised fine-tuning, and preference learning, sometimes called reinforcement learning from human feedback, though there are other methods like DPO or ORPO. And I think it's in this stage that they secure the model, teaching it how not to behave.

Nathan Labenz: Right. So that's what I want to get at for what you're doing in the Polish context. I don't know what the Chinese companies are doing, but I do know that the American companies are developing their model specs or their constitutions. It's basically this super long document that says: this is how we want our AI to behave. And to their credit, they're starting to be reasonably transparent about what those are, so that at least the public has a sense of what they're going for. But again, in the US context it's pretty contentious, because everything is contested here. In the Polish context, is it like that, or is it an easier time? Do you have a constitution for what you want Polish AI to be?

Marek Kozlowski: There are some strategies like that: strategies for how the AI should behave, or maybe rather how it should not behave. For example, it should be ethical, it should not blame anyone, and it should avoid some very risky topics, say, topics involving hate speech. There are some areas that are the same for the models from China or the USA, areas where there is a risk that the model behaves in an unethical way, or in a way we could be blamed for, that is rude, at least rude. And of course, there we get into some political tensions. But generally, we don't currently have huge constraints like the ones you mentioned. We don't have a constitution with plenty of points that you have to obey. We mostly feel the models should be as ethical as we can make them, but we don't give too many constraints to the model, because I think we are at a different level of development than the Chinese or US governments or companies; we are a few years behind them, two years or three, it's hard to say. We do have our own regulations, but they don't cover how the model should behave; rather, they cover what kind of data we are able to use for training. We have many constraints like that, focused on the data rather than on how the model should behave.

Nathan Labenz: Yeah, interesting. So, I mean, I've never even been to Poland, so obviously I should be very humble in terms of my ability to describe it. But one high-level fact that I know is that the large majority of Polish people identify as Catholic.

Marek Kozlowski: Yes, well, it seems to be. For many years it used to be a safe statement. Currently, I think it depends on how big the city is; the answers of inhabitants of cities versus inhabitants of villages would vary.

Nathan Labenz: Yeah. So how do you think about that dimension? I just happened to have done an episode not long ago about Catholic AI, with a company that is literally building AI that embodies Catholic values, specifically for religious Catholics. But in your context, you've got this sense that, okay, maybe a majority of people are Catholic, but maybe that's on the decline, and maybe it depends on an urban-rural divide. Is there some sort of decision-making process where you think: okay, how Catholic should our Polish AI be? And does it vary in different situations? Are you getting explicit about articulating goals there?

Marek Kozlowski: No, I think we take much more liberty now. We don't currently have ideas to create such a reflection of our world. But as I mentioned, models can very easily be constrained by preference learning, and you can teach a model to behave in a particular way. For now, as we produce the family of models, we produce not only chat models, but also instruct models and base models. We give companies the possibility to use any kind of model, because we know that some constraints may have a disruptive effect on some business cases. But generally, I don't think that we, as the builders of the models, should decide for all the people how to skew them.

Nathan Labenz: Gotcha. Okay. Yeah, very interesting. Do you envision that this will become something, as you presumably go on to train more and future models and they become even more powerful... and I don't know to what degree you aspire to serve a consumer use case versus empowering businesses in the country. But do you think this becomes a challenge at some point? Do you envision a future where there is a sort of Polish constitution for AI that actually seeks to answer that question? And if not, how do you think you ultimately get around it? Because it seems to be a very central thing that the American companies feel they need to grapple with. So if you think you can avoid that problem indefinitely, I'm kind of wondering how.

Marek Kozlowski: I think we have much harder problems. You have the AI constitutions in the companies on the US market, but, for example, you can very easily use all the data you have without many constraints. Of course, there is a problem with some lawsuits and many other cases, but it's a long process, and I think most of the companies in the USA can take that risk, because they are profitable enough to pay for some adverse decisions by the judges, the arbitral decisions, let's say. In the European Union, though, we have the AI Act and our local regulations, for example the acts concerning authorship rights, copyright. And legally speaking, these documents have a much harder impact on the quality of our models. I think they are a big enough constraint that we don't go any further. As I mentioned, in the European Union we have the AI Act concerning general-purpose LLMs, and we also have our local regulations, for example on authorship rights. Both of them combined create much harder constraints than any kind of constitution, which is, I think, more flexible than our regulations. We are not currently thinking about an AI constitution, though I know that maybe in one or two years something like that will appear. But currently, in the European Union and in Poland, we wrestle with complying with the regulations that already exist.

Nathan Labenz: Yeah.

Marek Kozlowski: And I think they are much harder and much more impactful than the ones you mentioned in the USA. For example, the AI Act or the authorship rights regulations can eliminate 80% of the data from your training datasets, and that has a huge impact on the quality of the models.

Nathan Labenz: Yeah, okay, that's interesting. So turning to data then, we can check back in on the state of the Polish AI constitution in a year.

Marek Kozlowski: Okay.

Nathan Labenz: On the data front, you mentioned that in the biggest open-source models, maybe 1% of the data is Polish. Quick back-of-the-envelope math: I think the Llama models have been trained on maybe up to 15 trillion tokens. I don't know if they disclose their data mix, but that would cash out to something roughly on the order of 100 billion tokens in Polish that the biggest projects might be using. I understand you have quite a bit more data than that, but also your last comment--

Marek Kozlowski: We don't have a trillion tokens. As I mentioned, even in Llama, about one percent, or maybe less, is the Polish language. We have several hundred billion tokens now. We don't have even a trillion tokens, because, as I mentioned, the deduplication stage and the filtering-out stage eliminate lots of data, and we don't have one trillion tokens after these data curation steps.

Nathan Labenz: So where are you getting your data? And your comment about the difference, the sort of regulatory arbitrage that the American companies are potentially taking advantage of: are they able to use some Polish data that is on the internet, but that you can't use because of these rules?

Marek Kozlowski: I remember that some time ago there were people who analyzed the crawlers going over websites in Poland. They identified plenty of Anthropic crawlers, and plenty of robots.txt pages on those websites that disallow the Anthropic crawlers from getting the data. It means there are plenty of crawlers from US-origin companies that are crawling Polish data even when they are not allowed to do it. Because, as I mentioned, it's much harder, for example, to go to court in the USA, accuse them of using the data, and be able to fight them in a US court, even if you have some proof that they used data that carried disallow clauses.

Nathan Labenz: And so in that way, you're, if there's a hundred billion tokens that they're getting off the internet, It sounds like you can only use a fraction of that, and then you had to go elsewhere to find the few hundred billion tokens that you've collected.

Marek Kozlowski: Yeah. For example, we're using the central libraries, some sources of data that are not web data. As I mentioned, the vast majority of the data used by the big vendors, and also by us, is web data. But we also have some data that is not published on the web, and we can use it, of course, to some extent; but as I mentioned, it is the minority of the data. Still, even for us, even with access to local organizations and so on, the vast majority of the data we use is web data. And the problem is the same for all the other players. Maybe we can more easily identify some websites that are not easily crawled by external crawlers. But generally, I think most of the companies, OpenAI or Anthropic, still have maybe 80% or 90% of our data.

Nathan Labenz: Yeah. So where else are you going to get data? Like what is your data process?

Marek Kozlowski: We have, for example, some mass collections of data, and also, for example, what is called the library of science; there are plenty of publications and so on. We also have some private bilateral agreements with publishers for data not published on the web. But as I mentioned, that is only a fraction of the data we have in our corpora.

Nathan Labenz: And you also mentioned doing a lot of human review?

Marek Kozlowski: Yeah. I think our advantage is not in the data used for the pre-training stage, because, as I mentioned, 80% of that is probably also in the Anthropic or OpenAI repositories. It's that we have dozens or even hundreds of annotators who create the manual instructions and preferences. They give us the ability to create new data that is not published on the internet.

Nathan Labenz: Is there like a Polish equivalent of Scale AI or Labelbox that you're working with to do this? Or is this a project that you've--

Marek Kozlowski: No, we have our own internal tools; it's not crowdsourced.

Nathan Labenz: So you guys have built your own platform for human preference data collection?

Marek Kozlowski: Yeah, the human preferences and the human instructions are built locally, internally. Of course, we publish some samples of them to show the structure of our instructions and preferences and some examples, but most of them are still a closed asset.

Nathan Labenz: So how do you think about that? One question that I've been thinking about in the context of this whole sovereign AI discourse is: obviously, as a national government, you can have different goals and different strategies for what you're trying to do. One goal is, as you alluded to, we want to make sure we have our own data and our own talent base, and compute, which we'll get to in a minute, so that if we get cut off, or who knows what might happen, we have some sovereignty over what's going on, right?

Marek Kozlowski: Yeah, some possibility to develop in a different way. That is the first issue; as I mentioned, "we can" can mean something more than you think. If you can do something, even a little bit worse, there is still the competency and the possibility to make new moves. But I think there is also a second issue: I believe the agentic AI revolution will be based on small localized models. Why? Because, first of all, there are some branches or sectors of the economy, and even parts of our public sector, where we are not allowed to use cloud-based solutions. There are regulations, or the risk is too high, and there are demands to have on-premise models. When you have on-premise models, you always have challenges like the GPUs you have to buy and the energy consumption. And usually, when you realize that you need to buy 16 GPUs and pay for the energy, you always go to downscaling: using as small a model as possible to achieve the expected goal. And in our experience, businesses, but also the public sector, don't demand exactly ChatGPT, a general-purpose LLM that is able to solve 1,000 tasks. Usually in business and the public sector you have demands for 10 or 20 use cases, and you are able to create smaller models that solve those tasks at the same level as the big LLMs used few-shot, and host them in on-premise deployments. And when you go to agentic AI solutions, where there are plenty of agents, meaning plenty of models, used to resolve complex scenarios, you really have to downscale, using the smallest possible models to be energy-effective. The economic aspects are crucial now. And I think this is where the small localized models can play well.
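
For a sense of why on-premise deployments push teams toward smaller models, here is a back-of-envelope sizing sketch. The bytes-per-parameter and overhead figures are common rules of thumb, not measurements from Marek's deployments.

```python
def gpu_memory_gb(params_billions, bytes_per_param=2, overhead=1.2):
    """Rough GPU memory estimate: 2 bytes/param ~ fp16/bf16 weights,
    plus ~20% overhead for activations and KV cache (illustrative)."""
    return params_billions * bytes_per_param * overhead

for size in (8, 70):
    print(f"{size}B params: ~{gpu_memory_gb(size):.0f} GB")
# 8B params:  ~19 GB  -> fits on one mid-range GPU
# 70B params: ~168 GB -> already needs several high-end GPUs
```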

Nathan Labenz: Yeah, that makes a lot of sense in the business context. And in my experience, I would say the same has been true when I'm really trying to dial in performance for a particular use case, and that's all I care about, and I know the model is going to be deployed in a controlled environment where, because of the way the system is set up, I know what the inputs are going to be, I know what the outputs are going to be, and I know that I have other layers of control. Then I can just kind of dial in to one task or a few tasks, and often a cheaper model with the right training can do just as well.

Marek Kozlowski: Yeah, especially because mostly people are now using the cloud LLMs in a few-shot manner. They are very powerful, able to resolve a very broad number of tasks, thousands of tasks, and people use them in a zero- or few-shot scenario. It means they integrate their own systems with the APIs of the cloud-based LLMs and use them out of the box, because you just create the prompt and use the output, that's all. But when you need to create a much more controllable solution, a closed solution, an on-premise solution, you're not able to use cloud solutions, and you have to go through the different aspects of those decisions. Do I need a multimodal or text-only model? Do I have training datasets? Maybe, if I have datasets, I can supervised fine-tune a smaller model and achieve the same quality as the few-shot cloud-based solution. In many of the deployments we've done, we've found that when you have a few or ten different use cases, and you create at least 1,000 or 3,000 instructions for them, and you supervised fine-tune smaller models, you achieve almost the same quality, or sometimes even higher quality, than using the very big cloud-based LLMs in a few-shot approach. Of course, you have to prepare training datasets, at least 1,000 instructions; a higher number of instructions is better, but 1,000 is enough. Mostly organic ones, or maybe semi-automatically created with a human in the loop. Then, with 1,000 or more instructions, you can SFT smaller models, and on that one task or a few tasks you will have the same quality as using the zero- or few-shot cloud-based LLMs. I think this is the future, because when the business calculates the risks, the money, the possibilities, how it controls the solution, and what the impact of its decisions is, it will finally, in these agentic AI environments, choose the small localized models and fine-tune them to its demands. And there is also one risk I will mention, because we discussed it recently with Savagodas, one of my colleagues: in the Anthropic models, the Claude LLMs, the quality of their ability and knowledge of the Polish language and culture is going down. They have decided to focus more on software developer assistance, and when they focus on that fraction of the market, their quality in, for example, generating text in niche languages goes down. Imagine that in Poland you apply such a model from Anthropic, you integrate it with your environment, your ecosystem, and over the next months or years the next releases of this model decline on the competencies you demand. Then you have to choose another model, or roll back if you can; if you are not able to revert to the previous model, you have to choose another vendor. I think building huge integrations on cloud LLMs carries a real risk, because, as I mentioned, in the coming years they can change their target objectives and stop focusing on the Polish or Czech languages, because those are not the market for them.
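
Here is a hedged sketch of the pattern Marek describes: adapting a small open model to one narrow task with on the order of 1,000 instruction pairs. It uses LoRA via the peft library so the job fits on modest hardware; the model name and data file are placeholders, and this illustrates the general technique rather than PLLuM's actual recipe.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

small = "some-org/small-3b-model"        # hypothetical small base model
model = AutoModelForCausalLM.from_pretrained(small)

# LoRA trains a few million adapter weights instead of the full model.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"]))

# ~1,000 task-specific instruction/response pairs, ideally human-written
# ("organic"), stored as JSONL with a formatted "text" field.
data = load_dataset("json", data_files="task_instructions.jsonl")["train"]
SFTTrainer(model=model, train_dataset=data,
           args=SFTConfig(output_dir="task-model",
                          num_train_epochs=3)).train()
```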

Nathan Labenz: Yeah, that's really interesting. I've never heard that before. So just to make sure I understood correctly: you are seeing worsening performance over time in Polish, on topics of Polish culture and general world knowledge, as the Claude models have progressed through generations?

Marek Kozlowski: Yeah, with the Anthropic models, Claude, and I think it was the Haiku model. But we also identified this problem with GPT models. For example, there are GPT models that are not going up in the quality of their Polish language and cultural competencies; some of them even go down with the next releases. I think the problem is much broader: the creators of the models analyze the market, and if they need to focus on some competences and improve them, sometimes there is a trade-off where other competences go down, in this case the Polish cultural and linguistic competences. I can check it; it was a Claude model from Anthropic, and that version goes down on our PLCC benchmark compared to the previous releases.

Nathan Labenz: Wow. Okay, that's a really interesting data point. And I guess it maybe answers the next question I had for you. My general working model has been that the frontier AI model developers want as much data as they can get, and if you had data for them, they would be happy to take it and maybe even pay you for it. What you're saying suggests: well, maybe not always, because they're also trying to make the smallest models they can while they're doing all this distillation. They're going for efficiency, trying to serve the core use cases they're getting paid for, which is a lot of coding. So maybe if you showed up at their doorstep with a few hundred billion tokens' worth of Polish data and said, hey, would you like to use this, at this point they might say: not really, because we aren't that focused on that use case; we'd rather generate however many billion more tokens of coding tasks and use those instead. Would you guys ever consider... I know you have some open data, but not all of it is open. If I'm thinking as the government, another goal I might have is: I want my retail users, my general public, to be as well served by AI as possible. And I don't know if you have stats on what the Polish retail consumer is using right now. Are they going to ChatGPT? Gemini? Something else? Mistral? You may know; I don't. But if I were the Polish government, I'd say: okay, here's what my people are doing, they're using these other companies, and we've gone and collected all this data. Is there some sort of deal to be made with the AI companies, where you might say, hey, we'll either give you this data or license it to you, you pay us for it, and that way you can incorporate it into your process and serve the Polish market better? Is there some sort of trade there?

Marek Kozlowski: It's a good point of view, and I think it's the natural next step: you are not able to get more data without some assistance or cooperation with other players. But as to the previous statement about serving our citizens as well as we can, that is still one of the goals of our projects. PLLuM, Polish Large Language Models, is a family of models, but not only a family of models; there are also the assistants and chatbots for citizens and city inhabitants. We don't focus only on the models themselves, because the models are a very good asset, but on how to build, based on those models, the chatbots, the RAG-based solutions, that can work for citizens nationwide, but also for city inhabitants, local chatbots, for example for the city halls. And it's worth mentioning that sometimes we are trying to create better and better models, but the problem is on another level, the deployment level. For example, it's not a problem that the model is a little bit worse or better; the problem is that there are no chatbots for the cities' municipal halls at all. So there are two issues. First, the deployment issue: the model should be a bit customized, supervised fine-tuned, to be able to work as part of the chatbots and assistants for citizens and city inhabitants. And the second issue, as I mentioned: maybe it's time for some cooperation, because only cooperation gives you the ability to improve your datasets and improve your models. I think it's a very good step. We are developing at our own pace, but we know there is a point where you are not able to go further, and you have to be supported by someone else. It's normal; it's the same in business. You develop to some level, you reach some point, but then you have to be supported by better or different players to be better in the end.

Nathan Labenz: Yeah. So do you know what that kind of market share breakdown is today? And is there a sort of established goal that you have to win market share with the models or is it--?

Marek Kozlowski: It's very hard to talk about winning, because the PLLuM models are not a corporate initiative. It's not private money; these are public funds. The project is supported and funded by the Ministry of Digital Affairs. It's a consortium of six institutes and universities, now eight, because it was enlarged in the second year. We are a public initiative, and if you are a public initiative, you don't think too much about retail investment or the number of customers. We are much more focused on being as open as possible; legal, because we have to be compliant with both sets of regulations; as transparent as possible; organic, because that improves the linguistic abilities; secure; and usable by the public sector as much as we can make it. Because mostly in the public sector, the models should be closed, on-premise deployments.

Nathan Labenz: Yeah, gotcha. You've shared a lot about how you train these models, but what is the base? You're not doing all the pre-training from scratch, right? My understanding is--

Marek Kozlowski: No, no. I can say more about it and explain it in more detail. We tried to create models from scratch, that is, from random weights. But the problem is the number of tokens you have for the pre-training stage. As I mentioned, when you look at the April reports on the DSLM models, they prove that even for an 8-billion-parameter model you need at least one trillion tokens to have stable pre-training, stable meaning it gives you a moderate- or high-quality base model. In our case, as I mentioned, deduplication and filtering out gave us around 200 billion tokens. That was too little to create a model from scratch. So we used, of course, the Llama models. Now the Llama models are much more closed, but one year ago they were still open, and the Llama license was not as prohibitive in the European Union as it is now. So usually we use the Llama-based models and the Mistral-based models, and we continue pre-training. We perform language adaptation. Language adaptation means we continue pre-training them on our corpora of Polish texts. After that, we have a new base model, and this new base model can go through SFT and preference optimization in the second and third stages. But as you mentioned, we are not able to create even a moderate-quality or good-enough model without one trillion tokens, and we don't have one trillion tokens in the Polish language. Now we have made some experiments with a mixture of languages: not only Polish, but other languages mixed in, to reach that one trillion tokens and do pre-training from scratch, from random weights. But the results will come in a few weeks.
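
Here is a minimal sketch of the language-adaptation step as described: continued causal-LM pre-training of an existing base model on a Polish corpus. The checkpoint name, corpus path, and hyperparameters are placeholders; a real run would also pack documents into fixed-length blocks and train across many GPUs.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "mistralai/Mistral-7B-v0.1"   # one of the base families mentioned
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token        # causal-LM collator needs a pad token
model = AutoModelForCausalLM.from_pretrained(base)

corpus = load_dataset("text", data_files="polish_corpus.txt")["train"]
corpus = corpus.map(
    lambda ex: tok(ex["text"], truncation=True, max_length=2048),
    batched=True, remove_columns=["text"])

# "Continue pre-training for a few epochs": the same next-token objective
# as pre-training, just starting from the existing weights.
Trainer(model=model,
        args=TrainingArguments(output_dir="adapted-base",
                               num_train_epochs=2,
                               per_device_train_batch_size=1),
        train_dataset=corpus,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False)).train()
```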

Nathan Labenz: Okay. On this language adaptation step, I have a couple of questions there. One is, do you continue to mix in English? Do you try to preserve the model's ability to speak English? Or after this language adaptation, does it just only speak Polish?

Marek Kozlowski: Of course, when you have cascade learning, where you use a base model that was already trained and perform language adaptation on it, there is always the problem of forgetting: some knowledge from the previous learning stages is forgotten. But some of it still persists. Even after we continue pre-training for a few epochs on our Polish data, the models still have the competencies, for example, to write something in English. We don't prune any competencies in the other languages. They are somewhat forgotten, because there is the problem of forgetting in cascade learning, but generally speaking, we don't prune anything manually. We only take the base model and perform language adaptation, continued pre-training for a few epochs. This is how we improve the abilities in the Polish language. Of course, there is a trade-off: some other languages go down, but they are not pruned entirely.

Nathan Labenz: And that's also where the world knowledge comes from, right? And by world knowledge, I mean all these sort of local details of life. I'm sure you can give better examples than I would, but I'm thinking of things like: what are the names of the Polish candies that kids like? Or how does one file a document if you want to sell a car to somebody else? There's surely some filing process. All these sort of little details, they're absorbed in that stage as well, right?

Marek Kozlowski: Yeah, the general knowledge is acquired in pre-training. But you're asking about factuality, how good a model is on facts, for example regulatory issues, legal issues that change over time. That is always a problem in any kind of LLM, because you pre-trained the LLM, for example, on data collected up until March 2025, and it doesn't have in its memory store information about changes in the law, or some regulations, or even events and the names of new politicians after that time point. Generally, we use RAG approaches for this. There is a retrieval stage, and when the knowledge base is up to date, it's much easier to stay current. Then, from the retrieved results, you use the model to synthesize the answer. This is how we address the factuality issue, because I think none of the LLM providers, even the big ones, can re-pre-train every time a new interval of data comes up.
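
A minimal sketch of the retrieve-then-synthesize pattern Marek describes for keeping answers current. The embedding model, documents, and the llm callable are placeholders; production systems use a proper vector database and retrieve more than one passage.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any embedder works
docs = ["Regulation X was amended in June 2025 ...",
        "To register a car sale, file form ABC at the city hall ..."]
doc_vecs = embedder.encode(docs)

def answer(question, llm):
    """llm: any text-in/text-out callable, e.g. a locally served model."""
    q = embedder.encode([question])[0]
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1)
                           * np.linalg.norm(q))
    context = docs[int(np.argmax(sims))]   # top-1 retrieval, for brevity
    prompt = (f"Using only this context:\n{context}\n\n"
              f"Answer the question: {question}")
    return llm(prompt)
```

Updating the document store keeps answers current without retraining the model, which is the point Marek makes about factuality.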

Nathan Labenz: Yeah. Do you think this would work for companies? This is a bit of a digression, but it's been a question on my mind for a long time. Almost two years ago now, I did an episode of the podcast with a company called MosaicML. What they were doing, among other things, was this sort of continued pre-training for businesses. They would go into a business and say: let's get all your tokens. This could be all the Google Docs you've got, the Slack history, all these various things. Let's compile that, and now we can pre-train on that. And hopefully the model will start to speak your internal native dialect of whatever language you're speaking.

Marek Kozlowski: I call it, as I mentioned, domain adaptation. For example, you have closed data: internal documents about your customers, reports that are not public, and you would like to pre-train the model on this data to make it more adapted to the domain. We have done such a project for the biggest bank in Central Eastern Europe, PKO BP. It's one of the biggest banks in Europe and the biggest in Central Eastern Europe, and we performed what you mentioned, domain adaptation. We adapted the models to their domain: they have their own closed domain data, and we continued pre-training the models on that data. I think it's a very good approach. We proved that the results vary across tasks, but there are tasks where domain adaptation gives you a huge gain in quality, in some financial measures, and so on. But there is one remark I have to make: I think only huge companies have enough data to make domain adaptation worthwhile. After deduplication and filtering, you have to reach at least around 10 billion tokens; if you don't have 10 billion tokens, it's not worth performing domain adaptation. And to end up with 10 billion tokens in a domain corpus, you need at least 30 billion tokens before the deduplication and filtering stage. And 30 or 40 billion tokens of closed internal data, I think, only a few companies have. Maybe not just a few, but fewer than 100 in Europe.

Nathan Labenz: Yeah, I guess my intuition is that it depends what you count, right? I started a company that's 40 people, and I don't know how many tokens we have, but across all the Slack messages, all the Google Docs, all the Jira tickets, all the contract proposals we've sent and the revision history on all of those, I do feel like it adds up pretty quickly. So maybe one of the barriers is just how deep these companies are willing to mine into their own data, like whether they're actually willing to go get email data from their employees.

Marek Kozlowski: I think if you count this data, there are maybe billions of conversation tokens, hundreds or thousands of agreements or proposed agreements. But when you sum them up, then deduplicate and filter them, it's very hard to get to 10 billion tokens.

Nathan Labenz: Yeah, interesting.

Marek Kozlowski: Yeah, you can count it. Ten billion tokens is roughly ten billion words. And you can imagine that, after deduplication, you need at least 30 or 40 billion raw tokens to end up with 10 billion tokens in a domain corpus. It's not so easy. It seems easy to many people, but when you start counting, it's not so easy to get 10 billion tokens.
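
His arithmetic is easy to check. If deduplication and filtering keep roughly a quarter of raw tokens (an assumed retention rate consistent with his 30-40 billion to 10 billion figures), only fairly large raw collections clear the 10-billion-token bar:

```python
# Rough model of corpus shrinkage through deduplication and filtering.
# The 25% retention rate is an assumption implied by the figures above.
def usable_tokens(raw_billions: float, retention: float = 0.25) -> float:
    """Tokens (in billions) left after deduplication and quality filtering."""
    return raw_billions * retention

for raw in (10, 30, 40):
    print(f"{raw}B raw -> {usable_tokens(raw):.1f}B usable")
# 10B raw -> 2.5B usable   (well under his 10B threshold)
# 30B raw -> 7.5B usable
# 40B raw -> 10.0B usable  (just clears the bar)
```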

Nathan Labenz: And I assume that what you're counting probably doesn't include individual employees' email histories and all that sort of stuff. That's kind of out of scope.

Marek Kozlowski: I think with emails, when they are emails between, for example, the sales force, the call center, sales emails, you can use them. But there are also kinds of emails that are not suitable to use because of undefined problems with intellectual property, or cybersecurity risks, and so on. It's not so easy to use just any kind of email. Some emails are too, I would say, too risky or too... delicate.

Nathan Labenz: Sensitive. Yeah. Yeah. Yeah.

Marek Kozlowski: Sensitive.

Nathan Labenz: Yeah, that's really interesting. It also makes me wonder whether new organizational structures will be advantaged along some of these dimensions, because I totally understand the difficulty that would arise if you said: okay, everybody, we know you've been working here for all these years sending all these emails, and by the way, we're going to take all of that and put it into our training process. You might have a revolt.

Marek Kozlowski: I would put it differently: most organizations, even big ones, are not aware of what kind of data they have. Before you adapt or train AI, you first have to clean up your data stores: identify what data you have, what is clean and what isn't, what is high quality and what is low quality. The data curation process, the data organization process, everything around the question of how to organize your data and find the high-quality subfraction of it, is a problem in itself. Very often companies try to integrate, deploy, or even train AI without this data curation and data organization process, and usually it collapses.

Nathan Labenz: Yeah.

Marek Kozlowski: I think this is the most important thing: to be aware of what data inventory you have, what quality levels the data has, and what data you can use without violating internal or external regulations. That is the most important step. After this step, when you have properly identified, well-described, and well-organized data sets, you can go on to invest in AI training and AI deployment based on them.
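
A first pass of the data curation he describes might look like the sketch below: inventory the documents, drop exact duplicates, and flag obviously low-quality text. The thresholds are illustrative assumptions; real pipelines add fuzzy (MinHash) deduplication, language identification, and PII scrubbing.

```python
# Toy data-curation pass: exact deduplication plus crude quality heuristics.
import hashlib

def curate(docs: list[str]) -> list[str]:
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate
        seen.add(digest)
        words = doc.split()
        if len(words) < 20:
            continue  # too short to carry signal (assumed threshold)
        if len(set(words)) / len(words) < 0.3:
            continue  # highly repetitive, likely boilerplate (assumed threshold)
        kept.append(doc)
    return kept
```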

Nathan Labenz: That's all really interesting. Thank you, definitely great food for thought for me. Going back for one more question on models: I know you mentioned that some of the Chinese models have licenses that don't allow you to use them in the EU.

Marek Kozlowski: Mostly it started with the newer Llama models: in the license there is a prohibition on using them in the European Union. And now the Kimi models, the Chinese models, have the same; you are not allowed to use them in the European Union. I think it's because of the AI Act, the chapter that took effect in August 2025. It demands that for general-purpose models you create a model card: what data was used in training, how it was secured, what data sets were used, what resources were used to create the model, and many other points. And I think those model providers don't want to disclose what data they used, or even what the training stages looked like.

Nathan Labenz: Yeah. If that weren't a problem, would you be open to using Chinese models or are Chinese models not appealing for other reasons?

Marek Kozlowski: I think it depends on the task. For example, I would have a huge aversion to the risk of using a Chinese model to create long-form texts: stories, essays, emails, and so on. When you generate longer forms of text, the probability that evidence of censorship becomes noticeable goes up. But when you use a model for tasks like understanding or analytical extraction, for example extracting specific information from documents, I am open to using Chinese models, because the risk of censorship in typical analytical, extractive tasks is very low.

Nathan Labenz: Yeah, that makes sense. Turning to compute and talent, the other two big legs of the AI stool: how AGI-pilled would you say the Polish government is? It's pretty remarkable that all this is going on at the governmental level already. I'd say that speaks to a pretty situationally aware and generally agile government. But how committed is the government, and how big a deal does it understand this to be? Downstream of that are questions about how much funding there is.

Marek Kozlowski: As you mentioned in the first minutes of our conversation, there are three pillars of the AI revolution. First, data: data organization, data curation, and generally treating data as the crucial input for AI training. Second, compute power: the GPUs, the AI factories, the data centers, and so on. Third, talent, meaning the people. Combined, those three pillars create the fuel for the AI revolution.

In Poland, I think we are currently focused mostly on the AI factories, meaning buying as many GPUs as we can and building data centers for GPU workloads. Of course, we have projects like PLLuM, which is a very good example. I think there are only two or three projects similar to PLLuM in the European Union, where the Ministry of Digital Affairs funds a consortium of universities and institutes able to develop the models and the competencies. In that way they also support talent, because there is money for people.

But I don't think we can compete with the US market, because there AI engineers are paid like NFL players. I heard that the best AI engineers and researchers have contracts like NFL quarterbacks; they are treated as stars. We don't have that maturity in our decision-making to pay people, even overpay them, for niche competencies, because in the European Union, and especially in Poland, it's much harder to state the final objective function you want to reach: for example, 10 million customers, or $10 million per week in subscriptions. It's much harder for the public sector to define objectives that are easily monetized and easily evaluated. In the USA, I think they can justify those huge contracts because they can estimate, even with margin buffers for the future, what innovation a given talent can produce and how that innovation will pay them back. When I was in Las Vegas, at ETLS, and mentioned that we are working on LLMs in the public ecosystem, that the Ministry of Digital Affairs funded us and we created a publicly funded consortium, most of the people I met were surprised that the public sector invests in LLMs. In the USA it's hard to imagine the public sector having the intuition, the knowledge, and the money to invest in such a sexy and revolutionary topic as AI.

Nathan Labenz: So how do you think this will evolve over the next couple of years? Obviously, the amount of resources the frontier companies are putting into their current and future models just continues to grow, and that's expected to continue.

Marek Kozlowski: I think we already see it now. Look, for example, at the GPT models. When you compare them, of course, the reasoning abilities keep going up across different kinds of models, but generally GPT-5 was not such a huge improvement over GPT-4. There are some reasoning capabilities, but generally there's a plateau. The models keep improving, but beyond some level the improvements are very steady: horizontal improvement rather than vertical. Development is not as sexy as it used to be. Remember ChatGPT in 2022, GPT-4 and multimodality in 2023 and 2024. There were moments in the history of this AI revolution that were so shocking, that so captured our imagination, that each year we had something huge, something that changed the rules of the game. Now the models improve much more steadily; there are no huge jumps. And I think we are now starting to count the costs: the cost of energy, and the cost relative to what the models are actually used for. It's the same with people: you meet someone for five minutes and you are impressed or maybe not, and afterwards you have to evaluate whether it's worth meeting them again, maybe for five minutes, maybe for hours. I think we're coming to a time when we try to evaluate the real cost of these tools, how we can use them, and what they give us if we use them properly. Like a verification stage. And I think this verification stage will show us that we don't need very huge LLMs. We should invest in small, localized LLMs, especially when you are working with on-premise solutions.

Nathan Labenz: Yeah, I have somewhat mixed feelings about that. On the one hand, the models that already exist are amazing artifacts, and very often, especially if you take the time to do the supervised fine-tuning and really dial in their performance, they can work perfectly well for all sorts of use cases. At the same time, the leading companies are saying: we're nowhere near done, this is definitely going to keep going, you should expect more progress, we're going to have AI scientists and AI researchers. How much does your strategy depend on, or how would it change, if it turns out there's not so much a plateau, but you do still see significant capability jumps, albeit with exponentially more resources required to achieve them? How do you navigate that world if a $10 billion training run really is that much better than a $1 billion training run?

Marek Kozlowski: I think the most important issue is, first of all, what is demanded and expected by our customers. As I mentioned, we are very often biased by general-purpose benchmarks, but they don't match the benchmarks and expectations that a business has. If you have a business and you know you would like to use AI in certain places, and that in those places the AI should reach at least certain metrics, that tells you exactly what benchmarks you should create and then which model is able to reach those expected metrics. I think this is the most important point: very often we analyze general-purpose benchmarks, which evaluate factuality, reasoning, extractive competencies. But for a business the problem is slightly different. For example, they need something that writes a beautiful email to the customer, or an email that enables cross-selling. Very often we don't know what should be done, because the business doesn't define its requirements explicitly. I would start from there: what would you like to improve in your business? What tasks would you like to hand over to the AI? Next, create benchmarks for those tasks. And only then choose the LLMs. Most of the business cases I have seen don't demand reasoning; you can do them with normal LLMs, without reasoning stages. I think we should slow down a little and analyze what actually needs to be done and what has a large business impact, not just because it's sexy and good for public relations, but because it makes money for the business or produces savings.
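
The workflow he outlines, defining the business task first and benchmarking candidate models against it, can start as small as a task-specific eval set and a scoring loop. The cases and the keyword-match scoring rule below are placeholders; a real benchmark would use expert-written cases and task-appropriate metrics.

```python
# Tiny task-specific benchmark harness: score any model on *your* tasks
# instead of general-purpose leaderboards. Cases and scoring are placeholders.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    prompt: str
    must_contain: list[str]  # crude success criterion for illustration

BENCHMARK = [
    Case("Write a follow-up email offering our premium plan to a customer "
         "who asked about pricing.", ["premium", "pricing"]),
    Case("Extract the contract end date from: 'This agreement terminates on "
         "2026-03-31.'", ["2026-03-31"]),
]

def score(generate: Callable[[str], str]) -> float:
    hits = 0
    for case in BENCHMARK:
        output = generate(case.prompt).lower()
        if all(term.lower() in output for term in case.must_contain):
            hits += 1
    return hits / len(BENCHMARK)

# Usage: compare a small local model against a frontier model on these tasks,
# e.g. score(small_local_model) vs. score(frontier_model).
```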

Nathan Labenz: Yeah, there's often a massive disconnect, I think, between general business culture and AI culture.

Marek Kozlowski: I think there are two trains, but they are not running in sequence, one after the other; they are running next to each other. One is much faster, the other goes at its own pace, and as I said, very often there is no crossing between them. There are two tracks, but the crossing is far ahead of us.

Nathan Labenz: Yeah, that's really interesting. Who are your allies in this? You mentioned using multiple languages, and I assume that's in some sort of partnership, maybe with other neighboring countries' national institutes. I'm curious how you think the international dynamics will play out. Historically, in the Cold War, we had the US and the USSR, two great powers engaged in proxy conflict all over the world, and a lot of other countries understandably said: this is ******** from our perspective. There was a movement of countries that said: we don't really want to be in either of your camps; we'd rather be independent, and we don't want to be a pawn in the beef you have between yourselves. Now it's the US and China that are the two big poles of AI power. How do you think countries, I sometimes say countries three through 193 on the AI power rankings, will react? Do you see alliances forming, countries working together to share resources and data sets, to try to create some sort of third way in the AI space?

Marek Kozlowski: There are some movements in the European Union. For example, there are international projects that gather people from different countries to do something together. But there is a fundamental problem: when you want very fast products or very fast outcomes, you have to centralize. Ideally you would have a federation, with everything spread out, different people in different countries collaborating with each other, and prosperity rising everywhere in a distributed but coordinated way. But when you want outcomes very fast, products in months rather than years, you usually have to centralize the assets in one place. Those are two opposite approaches, and you cannot do both at once.

If you want to do it the ideal way, you create unions: unions of countries, of states, of partners, networks, consortia with hundreds of stakeholders, to spread the knowledge and the power everywhere. But under strong pressure for fast outcomes and new models, we always prefer centralization. Look at Silicon Valley: you have the huge USA, but 90% of the startups are in Silicon Valley. That is centralization in one place where the money and the assets are, even though from an economic and social point of view the best thing would be to distribute those companies across the whole country. That is the tension: if you want to monetize something and get very fast outputs, you have to centralize, but what's best for the economy and for society is distribution across the country and across the continent.

What about geopolitics? I think there are still two players. China and the USA have the two biggest economies, and they have money. I heard that Chinese companies now pay researchers the same as the US ones; the contracts are currently similar, which means researchers are paid well enough in China that they won't be taken over by US companies. In Europe, there is Mistral, the European-funded startup. It's not a startup anymore, but it was two or three years ago. Now I heard that 30 or 40% of the shares are held by Microsoft, so it's not as independent as it used to be, because there are stakeholders from the USA. Generally, I think the question is slightly different: whether China and the USA are going to remain rivals, or whether there is also a chance for cooperation. That is the question. Maybe there is still a chance for cooperation.

Nathan Labenz: Yeah, from your lips to God's ears. Maybe just one little follow-up, and I think this has been excellent; I really appreciate all your time and all these thoughtful answers. Is there anything you've seen, on a technical or socio-technical level, that can help with cooperation on decentralized AI? Here I'm thinking about things like the NEAR Protocol. I recently did an episode with the creator of the NEAR Protocol, Illia Polosukhin. There's also the Intelligent Internet, which is Emad Mostaque's project, and others as well. These things share the idea that if we create the right scheme, it might be somewhat cryptographically enabled.

Marek Kozlowski: There is a topic called federated learning: you can use different data sets, somehow anonymized and secured, in a way that lets you identify and protect the sensitive data while still using the data to train your models. There are different kinds of ideas around federated learning, but I don't know of any huge deployments of such approaches, even though it is a very good approach: networks of federations cooperating with each other and sharing data in a secured way, as I mentioned. But if we're talking about business and the economy, I don't think we are at the level to use it yet, because we are still stuck on the prior problem: most companies don't know what data they have, what its quality is, or what its value is. If you aren't able to measure your own in-house data repositories, how can you go further and create a data mixture or a network of data repositories?
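
For reference, the core move in federated learning is that only model weights travel between parties, never the raw data. A minimal sketch of federated averaging (FedAvg) under that assumption, where `compute_gradient` is a hypothetical stand-in for each party's local training step:

```python
# Federated averaging (FedAvg) in miniature: each party trains locally on data
# that never leaves its premises; the server only averages the weights.
# Real systems add secure aggregation and differential privacy on top.
import numpy as np

def local_update(weights: np.ndarray, local_data, lr: float = 0.01) -> np.ndarray:
    # Placeholder for one institution's local training on its private data.
    gradient = compute_gradient(weights, local_data)  # hypothetical helper
    return weights - lr * gradient

def fedavg_round(global_weights: np.ndarray, client_datasets: list) -> np.ndarray:
    # Each client starts from the shared weights and trains on its own data.
    client_weights = [local_update(global_weights.copy(), d) for d in client_datasets]
    # Only the updated weights are sent back and averaged; the data stays local.
    return np.mean(client_weights, axis=0)
```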

Nathan Labenz: Yeah.

Marek Kozlowski: I think this is maybe for the future. In the future, there is a chance there will be distributed data repositories, with security levels and anonymization, used by huge consortia to exploit as much data as possible. But I don't think this is for next year.

Nathan Labenz: Yeah. So much depends on whether there really is a plateau or whether the frontier companies are just going to continue to keep scaling successfully.

Marek Kozlowski: And now I heard that the plateau is caused by a lack of organic data. The biggest companies have already collected almost all the organic data that can be collected from the internet. Some of them even scan books that never appeared on the internet to keep the organic data gains coming. But the problem remains: I think the reservoir of organic data is almost exhausted.

Nathan Labenz: Yeah, that's why we're now seeing all these simulated worlds, and the strategies to overcome that are going to be fascinating to watch. I, for one, will bet on them working, but to a certain degree it certainly remains to be seen. Again, really fascinating conversation. It's been awesome to get your perspective. Anything else you want to share, or anything we didn't touch on that you want to comment on before we break for today?

Marek Kozlowski: I can recommend our arXiv paper on the PLLuM family of models. It was released on arXiv in the first days of November, and I recommend that readers and viewers of this podcast look inside it. The PLLuM family is in the title of the paper.

Nathan Labenz: Yeah, that's P-L-L-U-M.

Marek Kozlowski: P-L-L-U-M, yes. The arXiv paper is almost 100 pages.

Nathan Labenz: And the P, of course, is for Polish. It's the Polish large language model.

Marek Kozlowski: Polish language models.

Nathan Labenz: Marek Kozlowski, this has been amazing. Thank you for being part of the cognitive revolution.

Marek Kozlowski: Yeah, thank you very much.

