Trust and Validation in AI | Rawkode Academy

Overview

About this video

What You'll Learn

Why black-box AI makes proprietary data, training provenance, and enterprise trust inseparable.
How The Pile, Hugging Face models, and copyright questions shape dataset risk.
Why citations, clean source data, and human verification matter for hallucinated outputs.

JJ Asgar joins to discuss IBM watsonx.ai and why enterprises distrust today's black-box AI. The conversation covers training data provenance, The Pile, Hugging Face, hallucinations, copyright, and the human problem of verifying AI outputs.

Transcript

Generated from the English captions. Timestamps jump the player to that moment.

0:00 Introductions

0:00 Welcome to Cloud Native Compass, a podcast to help you navigate the vast landscape of the cloud native ecosystem. We're your hosts. I'm David Flanagan, a technology magpie that can't stop playing with new shiny things. I'm Laura Santa Maria, a forever learner who is constantly breaking production. Alright. This episode of Cloud Native Compass is all about AI. So David decided to input a request into ChatGPT for a sonnet about Kubernetes. David, why don't you start us out? What did ChatGPT say? In digital realms where servers and sway, a symphony of nodes and rhythmic dance, there lies a master orchestrate and play Kubernetes,

0:21 AI in Practice: Kubernetes Sonnet

0:43 the captain of advance. Its containers, like actors on a stage, hold microcosms of code's vibrant art. With Docker's grace, they come alive, engage. A ballet where technology takes part. Scaling with grace like stars and boundless skies, replicas bloom whether at its will. Resilient adheres the failure's cry, yet presses on each challenge to fulfill. Oh, Kubernetes, conductor of the cloud, you orchestrate our dreams forever proud. I can't believe I can shred that. So, we have a guest on. So JJ, can you tell us who you are and what you do and how it all works, the whole thing, all of

1:21 Introducing the Guest, JJ Asgar

1:32 it? Yeah. Hi. JJ Asgar. I'm a developer advocate for IBM now, and that means a lot. I wear a lot of hats. That's what it boils down to. I kind of engage in different organizations and different spaces and try to represent developers. And they're kind of their persona, I think, is the project management term. I'm trying to learn the vernacular. What can I say? And right now, I'm focusing on a product called watsonx, which is an AI platform that's built off of open source. And I feel like your audience would be interested in

2:14 What is watsonx?

2:14 hearing a little bit about that space. Yeah. Yeah. Kind of an interesting topic to really bring up. Now I know that there's the Linux Foundation's AI and Data Foundation, and then there's obviously the CNCF, which is another suborganization of the Linux Foundation. So is Watson AI part of, like, the AI and Data one? Is it the cloud native one? Because I wanna, like, dig all the way down into this landscape. Let's go for it. So Yeah. Well, first of all, I gotta be because I am an IBMer, and I have to have I have

2:46 to get the branding Yeah. watsonx.ai is the official name of the AI platform for watsonx. We also have a couple other products coming down that do other portions that are required for a real AI platform to work. I don't think I'm allowed to publicly say those words quite yet. I don't actually know. But the one that your community in general would be focusing on most likely is watsonx.ai. Now that one so when you kinda tie that up to the foundations and the organizations that our industry uses, like LF AI and things like that, what we have done as IBM is

3:34 kinda stepped into the LF AI to help build the ecosystem for enterprise-grade AI, an AI platform that allows enterprises to be comfortable with using AI. I'm gonna just go into a quick little tirade about why because you're probably like, JJ, why wouldn't any enterprise be happy with this? Right? Like Yeah. Tell us a bit about why enterprises maybe don't like AI. That's an interesting question. Yeah. Yeah. So there's this whole world right now of people being like, AI is gonna take our jobs. Right? You know what it is. Right? Everyone

3:48 Why Enterprises Distrust Current AI (The Black Box Problem)

4:14 knows this. Well, we can make fun of it, but the truth of it is as much as people think ChatGPT or all the other major disruptors out there have come into the ecosystem and started giving you sonnets about Kubernetes, which is actually surprisingly good. Yeah. Seriously. Well, ask him for a sonnet about Kubernetes. Later. Okay. Yeah. No. Anyway. And as much as people think that's entertaining and interesting, when you start looking at how business is actually done, there isn't a very good safe environment for this information. So everything that ChatGPT gets, it learns off

4:58 of. It grabs. So if you look at the Samsung issue, for instance, with people giving out the proprietary information, and then all of a sudden, someone else gets that proprietary information about the Samsung thing. Hopefully, we can find Yeah. The link to that. That is just a microcosm of the problems when it comes to business. Because right now, ChatGPT is a complete and utter black box. There is no way that the owners of ChatGPT or OpenAI or Microsoft who has invested billions into it will ever give us the data that ChatGPT

5:30 AI for the Enterprise

5:36 has been built off of. And if you start going down that train of thought, all of a sudden, you recognize you can't give your proprietary information. Whether it be a PDF of how your HR system works or the schematics of an F-15, you can't use ChatGPT with it because you don't know who's gonna get it on the other side. This is where watsonx comes into play. A really great analogy for watsonx versus the industry right now is ChatGPT is Napster. watsonx is iTunes.

6:00 watsonx: The Trusted Enterprise AI Solution (Napster vs iTunes)

6:11 If you think of it and you put it in that paradigm, all of a sudden, things start making a lot more sense on the enterprise level. Because now you can go to banks and say, hey. We can give you a foundational model that we can give you all the data that was built off of this. Obviously, with some money, business y stuff happening. Right? We're not gonna just, like, give it to you, of course, build the relationship, etcetera. So but we can give you that give you the training on top of it, and then you can put your

6:37 proprietary stuff on top of that so you can have the chatbot that turns out that Aetna is your insurance and give you all the information you need about Aetna or whatever. Right? Which is, believe it or not, something that's insanely powerful. Right? So, anyway, when you start going down that track, you start seeing more things in that space. Go ahead, David. I mean, I get asked more questions about, like, open source sitting high, but you go next. Yeah. I just wanna make sure I understand the proposition correctly.

7:08 Developer Concerns: Bias and Data Augmentation

7:13 Right? So you've done this comparison. Right? You're saying Napster versus iTunes. People use, say, ChatGPT and OpenAI to go and ask any question because it's a model that is trained on multiple billions of parameters. Right? And all that information, like you said, is a black box. We don't understand it. It doesn't really allow you to, I don't know, dig into sort of niche subjects with high cardinality. Often. Yeah. Yeah. Exactly. With watsonx, I get to decide what that data is and I get to then query it through a similar style interface. Is that the is

7:52 that what watsonx would offer me as a developer, like a model where I say, here's all of my data. Have my life. Right? And then I can ask it questions and it's gonna give me answers. But if that's true, which is cool. Right? I have this little question where I'm like, okay. Is there a bias where it's only gonna confirm the stuff that I've given it in my data? Like, is it gonna be able to be slightly more objective? Can you feed outside properties to

8:20 augment and enrich? Like, how does that all work? That's that is a very great question. And that's a natural progression. Right? So let's go a little bit deeper than the CIO, CTO level, and let's go down to the senior executive and start thinking about this. Yeah. Yeah. Right. So you like you like that? That was a good was that was good. So so, yes, it's a very valid question. What what is AI? A good friend of mine, Carl, actually mentioned this the other day. What is AI? It's just a yes man. Right? When you actually look at it, what

8:24 Understanding AI Models and Foundational Data

8:57 is AI? It takes percentages of the possible questions you're asking it, finds the highest percentage of what you were looking for, and gives you that answer. That is what AI is in a nutshell. It's a yes man. It's a crony. Right? So what we need to do is give accuracy to the crony to get the answers you were looking for very specifically. Now we have something called foundational models, which is I think there's four of them. I don't know exactly what I'm supposed to say publicly right now, so I'm just gonna say there's four foundational models.
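
A minimal sketch of the "highest percentage" idea JJ describes, assuming made-up scores for three candidate words. The scores, the words, and the function names are invented for illustration; real language models score tens of thousands of tokens and often sample from the probabilities rather than always taking the top one.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {word: math.exp(score - peak) for word, score in scores.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}


def most_likely(scores):
    """Return the candidate with the highest probability, plus that probability."""
    probs = softmax(scores)
    best = max(probs, key=probs.get)
    return best, probs[best]


# Invented scores for the word that follows "Kubernetes is a container ..."
example_scores = {"orchestrator": 3.2, "sandwich": -1.0, "database": 0.4}
word, prob = most_likely(example_scores)
print(f"{word} ({prob:.0%})")
```

The point of the sketch is the "yes man" behaviour: the code always returns whichever candidate scores highest, whether or not that candidate is true.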

9:16 Open Source AI: Hugging Face and The Pile

9:27 And they're all built off of data that we have agreed is safe to use. Now let's pivot quickly and talk about models in the open source ecosystem. In the open source ecosystem, there's something called Hugging Face. If your audience doesn't know about it, the easiest way to describe Hugging Face is as the GitHub of AI. It gives you a place to put models up that you can build off of previous models. They also give some really nice I think the term is shims on top of AI development work. So you don't have to do

10:07 all the stuff around it. You can write some very simple Python to be able to leverage stuff from Hugging Face. Now this is not hyperbole. 95% of those open source models are built off of something called, literally, The Pile. If you Google The Pile AI, it is a massive dataset out there that is, like, I think, 800 gigs last time I looked at it that people Prep. Yeah. That is all just text and information on the web that people just shoved together. So LLMs or linguist language models. Language learning models.

10:50 Large language models. Yeah. Largest language model. Yeah. Largest language models. None of us actually know. None of us know what this means. Anyway. Exactly. So the majority, the 95% of these LLMs out there are actually trained on The Pile. And trust me, Google it. It comes up in a Wikipedia article. It's this whole big dataset problem that they have. Well, you've said something right there. Right? Like, we have this open source governed centralized knowledge of Wikipedia. Like, why is that not the base for these models? So great great

11:20 The Wikipedia AI Challenge

11:27 great question. Before we get on that branch, let me finish what I was trying to talk about Yes, sir. The Pile problem, which is The Pile isn't clean. There is a lot of proprietary, pirated material inside of The Pile data. There are books that are completely ripped off. So when you start looking into copyright law, right, and all of a sudden finding out people have written and their books are in The Pile and the LLM is trained on their book without permission and they create something off of that,

12:08 that brings in some really interesting conversations about how our copyright law works and then, with that, how patent law works and then, with that, licensing. And all of a sudden, what am I looking for? The pyramid scheme? Oh, not only pyramid scheme, but the house of cards that is our industry, all of a sudden starts falling apart. Because, frankly speaking, if we allow, which right now, at this exact moment in time when we're recording, a federal judge in at least the US government has said anything

12:21 Copyright, Patents, and Industry Fragility

12:49 from an AI that is created cannot be copyrighted. Right? That is this exact moment in time. I think that was literally yesterday. Yeah. It was. Yeah. But, again, that's only here in the US. Right? And being a US centric company at IBM, we obviously are paying very close attention to this. This all brings up a whole conversation of, like, how do we keep going down this path without possibly destroying our industry? Right? People joke about how we're barreling forward to the end times leveraging AI and all that jazz. But when you actually look at how we protect

13:31 and can use the entities that are governments to enforce, yes, I have not stolen your idea, and it is my right to sell this to people for goods and services. If all of a sudden AI can create all that stuff and they create something very close to that and they can start selling that as their proprietary thing, all of a sudden, businesses don't function. Right? As soon as we get a real AI that creates the next Taylor Swift song, I mean, if we call it Taylor Swift bot, I don't know. What right? Like,

14:07 you get what I'm saying. You you see all of a sudden the the onion gets scarier and scarier and scarier. So we we have to build this wall to make sure that the world we we we know right now, which is going to change because we're gonna try to create avenues to be successful. But if we're not, it's scary, people. It's scary. Yeah. So why not Wikipedia? Ah, yes. Great question. Yeah. I'm I'm I'm gonna bring it back to a little bit of levity because otherwise, we're just gonna go screaming down the black hole. But We've already

14:34 Why Public Data (Wikipedia) Isn't Ideal for Training

14:44 passed the VP level. We've already passed the director. We're already, like, into the thought leadership, like Alright. Engineering people. Right? Thought leaders. Yeah. There you go. I do wanna say, well, at least from my understanding of how Wikipedia works and if you parallelize it, sure, with The Pile, the amount of errors inside of Wikipedia and conflicting information on top of the ability to actually get the information out of Wikipedia and train it usefully, there really isn't actually as much useful information to get. It goes back to my simple example of a bunch of

15:28 PDFs to look for suggested areas to find stuff. In my mind, that is the easiest path for most people to grasp when it comes to understanding the power of AI. Right now, we've passed the whole if then statements of AI. We've passed the whole ecosystem of give me avocado chairs or whatever from DALL-E. Right now, we need to look at how we can make our lives a little bit better. There are LLMs or not even LLMs. I think they're considered classification. But the point to me is that you can shove a bunch of PDFs

15:33 Practical Enterprise Use Case: The AI Librarian

16:13 into a model and then say, hey. I'm looking for information about our growth over the last two quarters. And maybe that's only in a chart, right, inside of the PDFs of of your business logic. There are elements that can read all those charts and figure out, hey. Turns out growth over the last quarter is is in this graph and it's over the last four years. This is the fourth spot right here. Okay. It looks like it was 50%. It can respond back. It looks like your growth was over about 50% referenced in this diagram in

16:47 this PDF. Right? So it becomes like a really, really great librarian if you think of it that way. And then if you can tell that story to an enterprise, every enterprise, right, has massive amounts of PDFs of all their policies, procedures, and everything like that. If you can create a librarian, think of whatever dystopian sci fi that has a librarian in it, which is always a thing. Right? If you can give that to an enterprise and say, I can give you a way to do that with your trusted data that I know will not leave our borders

17:20 because that is core to our business. Maybe you should look into that. I mean, but that also assumes that the librarian is not making up data. I don't know if you heard about the lawyer that asked for somebody to go through all of the legal history for something and it turned out the AI made up a fake court filing to prove something. Yeah. And the judge caught it, and it was a big deal. That's in the US again. But still, like, I'm a little worried about that librarian

17:26 The Human Problem of Verification and Hallucinations

17:57 not quite being correct. But does this come down to a misunderstanding of where people think AI is today? Like, I think, you know, when I speak to my wife, I speak to my family, I speak to friends, and they're talking about ChatGPT. Right? I mean, they're all trying it. They're all playing with it because it's getting so much coverage. They don't understand the generative part of it. They think it's giving them knowledge. They don't know that it's just all made up. Now we do because we are in this industry. We follow the news. We

18:27 read these stories. But to most people, it's a fact machine that's artificially intelligent and is gonna give you the correct answer. So, I mean, was the lawyer liable that they know it was fake, that they did not verify it? Sure. But what did they expect from the AI? Like And that's one reason why we have to get AI to cite the stuff. Right? That's not too far away to be able to, like, where did you get this information from? The natural progression is citation and, you know, trust but verify. But, again, that also relies on the model

18:59 that you've created and which model you've chosen, which black box you've decided to do, where the data has been trained like, where the data is actually trained off of. Again, this all goes back to the other problem that we don't talk about as an industry. And as soon as you start playing more in this space, you recognize that it's not the LLM or the model. I know we're interchanging LLM and model. I wanna acknowledge, first of all, that that's wrong. Right? But most people have exposure to LLMs. So that's why it's becoming, like, the

19:05 AI Needs Trust

19:32 Kleenex conversation. Again, I want to acknowledge that that's wrong. There are very specific terms here, but just to get in there for the conversation, I wanna make sure that that's clear. I know what I'm talking about, but I'm even making this mistake. The core value, the core problem of this whole narrative is that we're talking about the compute. We're talking about the compiler. We're not talking about the source code, which is the data. Right? And the data is what actually gives the compiler, which gives you the answers of the AI,

20:06 the information. We need to figure out a way that we can have trusted data that exists in the space that we know that won't create the court filing or has the ability to create a court filing. Right? And we have that ability to have that conversation and trust that the AI does it. But, again, it goes back to The Pile, right, which is a bunch of untrusted data where, again, I'm gonna be an IBM shill for a second and say, we have engineers who are dedicated to cleaning The Pile. We actually have humans that are going through

20:08 IBM's Data Cleaning and Trust Strategy

20:39 The Pile, making sure that it is trusted data to build our foundational models off of. So we pull from the open source ecosystem. So we actually do have like, that's the other part of IBM's whole model. And why we have such a good relationship, if you look into it, with Hugging Face specifically. We've built partnerships directly with Hugging Face to pull from the open source ecosystem because we, as IBMers, know that the open source ecosystem is gonna drive the winds of the sail, the ship, or whatever enterprise y statement you wanna make there of the

21:12 industry. And we acknowledge that. So we are we are spending a lot of time and effort in building and building bridges to Hugging Face and taking the open source stuff from Hugging Face and finding a way to make it safe for enterprises to work. We have whole teams of people inside of our research org at IBM that that's all they do is they clean data. I don't envy them at all because that's gotta be boring, but we do have these people who do it. And I have I I was at a training relatively recently about this, and it was it was they

21:45 IBM's B2B API Approach ("Red Solo Cups of AI")

21:49 they were proud because they were able to say that this sliver of this knowledge inside of this open source dataset, we can now talk to our downstream clients to say, no. We can provide this level of confidence with this level of information, and you put your level on top of it and you build some prompt engineering on top of that. And then all of a sudden, you're getting the answers you're looking for. Is that sanitized data that is now verified pushed back to Hugging Face for other people to use? That's well, because we're investing

22:26 so much money at this exact moment in time, no. That is gonna be one of our selling points is that we can give you the core data. We can give you the dataset that we have built our model on. And, of course, the model that we're gonna be offering to you through the watsonx platform will be so large and so processor intensive. We need to run it on our cloud. I mean, what does IBM have? It's literally business machines. Right? So we have a lot of compute power to do this stuff. So we can run that model that

22:56 we but we can give you the actual dataset. And we can say with legal penalties that this is the dataset that builds this model so you can actually push this out to and put your level of information on top of it to get the information you're looking for. K. Can we I wanna make sure so I like, let's let's take all the things we've covered. Right? What's next? We've got clean, trusted data. People people can come along and say, okay. I'm gonna use this for my organization because I have this level of trust. Right? Now

23:11 Specific Use Case: Kubernetes Monitoring

23:26 I'd like to understand, like, a real use case that any developer listening to this episode would be like, yeah. That sounds really, really sweet. Right? And I'm wondering, like, could we take Kubernetes events? Could we take metric servers? Could we push this all into watsonx? Can we derive insights and predictability into our workloads on our Kubernetes cluster as an example. Right? Is that something you're seeing people do? Well, the answer is yes with a lot of asterisks behind it. Or I'm sorry. I'm a senior engineer. It depends.

24:09 But as much as the conversation we had around watsonx is, the best part of this whole narrative is that I use this term in a lot of conversations I have, but it is valid in this one. We want to be the red Solo cups of AI. If you don't know what a red Solo cup is, watch any US college movie. Right? The beer pong cups. Right? Beer pong cups. Yes. That's correct. Yes. Exactly. I'm not in the US, but that's all I know them as. That's it. Yeah.
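
For developers, the practical shape of this "red Solo cups" idea is the plain REST call JJ outlines in the next couple of minutes: one HTTP POST carrying an API key, a model name, and a small JSON body. A minimal sketch of that kind of call follows. Everything specific in it is an assumption made for illustration: the URL, the `model_id` and `input` field names, and the bearer-token header are hypothetical and are not watsonx.ai's documented API. JJ mentions the `requests` package; this sketch uses the standard library's `urllib.request` so it runs without extra installs, and `requests.post(url, json=..., headers=...)` would be the equivalent.

```python
import json
import urllib.request

# Hypothetical endpoint for illustration only. Check the vendor's API
# reference for the real URL, field names, and authentication scheme.
API_URL = "https://example.invalid/v1/generate"


def build_request(prompt, model_id, api_key, url=API_URL):
    """Build (but do not send) a JSON POST request for a text-generation call."""
    body = json.dumps({"model_id": model_id, "input": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


def generate(prompt, model_id, api_key, timeout=30):
    """Send the request and return the decoded JSON response."""
    request = build_request(prompt, model_id, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Splitting the call into `build_request` and `generate` keeps the payload inspectable and testable without touching the network, which is handy when the real endpoint sits behind an API key.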

24:43 But Solo makes an insane amount of money on those things because they're everywhere. Right? They're just the way you do that stuff. IBM wants to be that of AI. We do not want you to interface directly with watsonx. We do not want to be B2C. We want to be B2B. We want to give you an API with a trusted environment to be able to work off of that. Now the reason why I'm saying this is because the developers are listening to your

25:13 podcast right now. What I am offering to you is a simple, non crazy, right, REST API with watsonx that you can trust that you can just literally use requests from Python, right, to do a simple POST to the back end with a certain API key and a couple of requests of, like, which model you wanna push it through. It's a really simple JSON or yeah. JSON. And then it comes back with what you want. So you do all the heavy lifting inside of watsonx. But for your application, you can literally just leverage it a little

25:50 bit higher. Right? You just add one little request, and it comes back with a nice little blob of information. So the practicality of it is, again, it depends. Right? But the idea is we we're trying to build the railroad for you here so you can get the information you're looking for and be able to pull intelligently back. Did that answer your question? So It did. Yes. Thank you. I was gonna say, so, basically, what you're trying to do is you're trying to say, okay. See all these booths at KubeCon. Imagine a number of these companies actually running on

26:22 top of watsonx underneath, but you just don't know it yet. You as a consumer wandering around KubeCon. Yeah. Okay. Okay. We're building phone lines, building the railroad, whatever you wanna call it. Okay. Whatever major infrastructure change you're thinking of, in essence, we're trying to do that for AI. And You're trying to be Bell Labs is basically what you're telling me. You're trying to be the old school Bell Labs that's building out the original stuff that eventually everybody builds on top of. Yep. And then on top of that too, like, I haven't even mentioned quantum inside of this

26:51 The Challenge of Proving Data Origin and Trust

26:55 space. Right? There's a spur that goes into that ecosystem. Buzzwords. Don't just throw buzzwords, JJ. I have a bunch of PhDs. All they do every PhD physicist all day every day. All they do is look at this weird ass computer that's gonna take all our lives or something. Yeah. Exactly. That makes sense. I guess my question always comes down to there's a people problem under this to me. Right? Under AI in general. No matter where it ends up, if it gets integrated into the cloud native ecosystem or if it

27:20 People Problem in AI

27:28 stays in the, like, AI ecosystem, I guess, for lack of a better term. But, you know, we can get the models to we can get the AI to start saying, here's the citations. We can get people to try to verify everything, but we have that trust but verify. But how many people sit there and actually look at the citation? Right? Like, that always was a thing in science, remember, was people don't always check the citations. You have to learn to be very good about checking those. It doesn't come to you naturally most of the time.
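
Part of the citation checking Laura describes can be automated. The sketch below assumes a hypothetical format in which each citation carries a `source_id` and a `quote`, and it only checks that the quoted text really appears in a trusted copy of the source. The document names and contents are invented for illustration.

```python
def normalize(text):
    """Lowercase and collapse whitespace so small formatting differences don't matter."""
    return " ".join(text.lower().split())


def check_citations(citations, sources):
    """Return one (citation, status) pair per citation.

    citations: list of dicts with hypothetical keys "source_id" and "quote".
    sources:   dict mapping source_id to the full text of a trusted document.
    """
    results = []
    for citation in citations:
        text = sources.get(citation["source_id"])
        if text is None:
            status = "missing source"    # the cited document does not exist
        elif normalize(citation["quote"]) in normalize(text):
            status = "verified"          # the quote really appears in the source
        else:
            status = "quote not found"   # the source exists but does not say this
        results.append((citation, status))
    return results


# Invented example data.
sources = {"q3-report": "Revenue grew about 50% over the last four quarters."}
citations = [
    {"source_id": "q3-report", "quote": "revenue grew about 50%"},
    {"source_id": "q3-report", "quote": "revenue doubled"},
    {"source_id": "smith-v-jones", "quote": "the court held otherwise"},
]
for citation, status in check_citations(citations, sources):
    print(citation["source_id"], "->", status)
```

A check like this would catch a made-up court filing, because the cited source would simply be missing. It does not catch a real quote used to support the wrong conclusion, so a person still has to read the flagged and unflagged citations alike.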

28:01 So how do we, like if I think about, let's say, that some company decides, okay, I'm gonna build a monitoring tool that goes through all of my live metrics and analyzes everything using AI and comes out with an idea of, okay, this system probably is gonna fail, has like a twenty percent chance of failing in the next twenty four hours. Right? And let's say that's an eventual thing because that's an evaluation a person might do. But it gives you all the citations. Who's gonna go back and look at their logs to verify it right now? Right? If I'm

28:35 told, oh, it works. My reaction is, okay, it works. Right? Yeah. So how do you deal with that people problem when it comes to this AI question of, well, we do have to verify it somewhere. Otherwise, we get pages that wake us up at 3 AM for nothing because nothing actually went wrong because the AI got it wrong because it analyzed some other incident from somebody else that just happened to have a correlation here or something. Right? Like, how do we get there? I don't know if that makes any sense. But how do

29:09 we fix the chicken little problem? Can I try and broaden that question? Because I think they're both thinking some something very similar. Right? And I don't know if I trying to get there. Yeah. David, you might be able to say this better than I can, so go for it. I I I don't think so. But I had that question, and then you asked something really similar. And I'm like, oh, if we just expect, like so let's try it. Right? We we've got all we've got a world of AIs. Every developer is out there talking to a different AI. I'm just gonna call

29:20 Ethical Dilemmas

29:35 them singularities and AIs. Right? I I know this deeper whatever. Now we're all asking these questions. We're all getting answers, and we mentioned licensing. We mentioned copyright. Derivative works are obviously a huge challenge when it comes to people when we're using these answers to then put something out into the world. So there's this ethical dilemma as well, which I think ties into what Laura was asking, is that if 10 people go and ask AI how to write a good song or a catchy song or a number one hit and then they all go start to use this,

30:06 that's 10 different AIs that then somehow need to be answered the question of backtrack. Do we need another AI to answer the question of was this written by an AI? Do these AI companies need to work together to provide a transparency log to inputs and outputs to prove that something was an artifact from their algorithm? What's the future look like? This is clearly a tough question or a tough predicament. So how does IBM tackle that? How does watsonx tackle that? Are there conversations with other companies? This was actually a wonderful

30:38 Ethical Dilemmas and AI Transparency

30:43 working session at FOSSY. We had this whole and I can link to the Etherpad that captured all the notes for this. And it was basically the open source if you don't know what FOSSY is, it's what OSCON used to be. But FOSSY has broken off from it's the same group of people, but not under the O'Reilly banner. It's under the Software Conservancy. They're the ones who go after the GPL people. Right? They have the GPL lawyers, and they're an interesting breed. Let's just put it that way. Lot of

31:20 Stallmans. Lot of Stallmans. Anyway, the interesting conversation that happened was how can you verify and say that this data is not stolen? That was what it boiled down to, which is, I think, kind of where you're going there. K. And it always went back to this end entity, this tarball. Sure. It was called a tarball. Make sure that this tarball, which most likely is a binary, isn't stolen. It always went back to the data. It always kept going back to if I can

32:00 prove to you that this tarball came out of this data block through the model that I've created and it is this data is verifiable, that is the only way that you can backtrack through. Problem is models aren't models don't have, like, a SHA. Right? They don't have some way to say that this was the model that I used to do off because the technology just doesn't exist. Right? And as much as we want to add all that stuff onto this stuff to be able to say that, yes, this is this exists, the ecosystem is already so large

32:35 and has moved so rapidly ahead of us where you can get an older version of ChatGPT on Hugging Face. You didn't know? They've open sourced those models. The models, not the data. The models are on Hugging Face so you can play with ChatGPT on your local laptop. They're shit, but they do exist. The challenge is we have no answer for that in the space. And the way that IBM was answering it is through doing The Pile cleaning, being able to show you the models, and with legal penalties, with literally pen to paper, say,

33:12 yes, we can show you exactly what's going on. But ChatGPT, Microsoft, I can tell you they will never do that. Right? As much as we want to say that... when you go into the disruptors, and I'm just using that as a term to cover the, I say this with love, the normies. When they think about ChatGPT, right, those are the disruptors, the ones that are gonna be using it to write the paper for their history class or whatever. Right? The models and the data that they

33:51 that exists in that space, the companies behind it, like OpenAI, will never give us those datasets. They will never do that. And then there's Microsoft investing their time and effort with, like, Copilot. Right? That's the one I kind of just skipped over, but it falls in the same space as ChatGPT. As much as they claim that they didn't take code from GitHub, as much as they claim you can have Copilot only look in your org, we're computer professionals. We know that's never true. Right? And as soon as you get that

34:27 data swerved into the model, you have no way to pull that data out. People forget that. Like, models are additive. There's no reverting. Once you've trained something, there's no way to revert it and, like, pull it out. Not easily. No. You could destroy the model and recreate the model without the data. But then again, if that model's already out there... again, it's a compiler. You've gotta think of it as a compiler, and all of a sudden, a lot more things start making sense. So Microsoft will never give us a legal

35:00 affidavit saying that, no, I did not take any private repos from GitHub. And I looked through all of GitHub looking for all the license files to make sure the attribution is correct on all of the open source projects I used. That alone is a Herculean effort. Like, what if you got trained off of this code somebody just put up on GitHub, and there's no default license? Legally speaking, Software Freedom Conservancy couldn't represent them in court going after Microsoft because of that. Right? And again, it goes back to what we were

35:35 starting with at the very beginning of this conversation, which is that we have no visibility into this space, because the technology moved so quickly without checks and balances that we are now at a point where... okay. Bad analogy, and it just kinda hit me right now. Stick with me for a second. You know I'm the queen of bad analogies, so go for it. In essence, we've created a bunch of printing presses. Right? All of a sudden, we created printing presses all over the world that you don't have to organize anymore.

35:56 The "Printing Press" Analogy and Global Impact

36:12 They can just start printing out information for you. And now what's stopping you from selling those books? Right? Because there's no validation. As long as you've got that initial printing press with the plates in it to give you the books to shove out... that is what AI has done. It's given us the ability to send out that information very quickly. At least when the Internet first hit and people were worried about pirated books and stuff like that, the ecosystem created blobs of ways of security and pathways to

36:47 getting these things. Right? And validating them: the iTunes of the world, if you will. But here, that horse is already out of the barn and already in the next town. Right? We've got a lot of catching up to do, and the problem also is that this is global. Right? Like, this is not just... our friends in China, they could create LLMs that create patents for them. And US law means nothing over there, right? They have their own set of laws and their own ways of doing technology, and they have a lot of computing power

37:27 over there. Right? So, like, I'm not trying to be doom and gloom. I'm just trying to express this to the audience that we're talking to here. As much as you think of the joke of, "Hey, ChatGPT, give me an application," and it gives you an application in five minutes that does blah, you're gonna be spending twenty-four hours debugging what the hell that application does, because you can't trust it. And there's this whole ecosystem around that that people don't recognize, and it actually spurs out to a lot of other stuff. Anyway,

37:59 Reflections and Hope for the Future

37:59 sorry. I got on a roll there. It's okay. Alright. So the TL;DR is we're all doomed. It's all fucked. Go hug your loved ones. Hug your thing? Just turn off the computer. Yeah. Hugging Face. Hugging the face of the... Turn it off, and the next podcast will be coming to you generated by AI. Our faces will be moving, but we will not be the ones speaking. And I'm just gonna... yeah. I guess we're at the end of this. I mean, like, there's a ton more to dig into.

38:30 And who knows? I mean, if there's more you wanna hear on this topic, by the way, there is a Discord that you can join. It's called Rawkode Academy, and there's a channel specifically for Cloud Native Compass. And I'm calling it out right now because I have it open in another tab. But if you wanna ask more questions, maybe we can do another episode on AI someday in the future. You can see. But yeah. Yeah. Let me try a positive side. Right? Like... Or positive. Either one. Alright. I don't know if there's a bit

39:04 of lag there. No. I was just thinking, like, for me, right, I'm not in the ML space. I'm not in the AI space. This was all new to me as ChatGPT and OpenAI came out, and Google Bard and that sort of stuff. Right? To me, those were the only options, but I think we're in a really fortunate position. But there are other options. There's a lot of movement. There's a lot of, you know, not volatility, but there's new ideas coming all the time and there's a lot more open

39:29 source happening too, and there's a lot more trusted execution environments, like we're hearing about with watsonx. I think there's a lot of positives to come from AI, even though it's easy to point out the scary bits, the negative bits. Right? But I'm still extremely hopeful, based on everything that I've seen in the past and I've heard today from JJ, that it's gonna have a net positive impact on my life and hopefully other people's lives. Like, I'm excited for the people working on this. Okay. I'm excited. And at the same time, like,

40:00 I have more history in, like, ML and things like that from Python, but also from science. Like, just thinking through that. Like, to me, I'll be honest, all of this is stats all the way down, and stats I have a hate relationship with for the rest of my life. Yeah. I just act her in. I'm just, like, do my homework. Lies, damned lies, and statistics. I mean, come on. Like, that's exactly how it works. I think Rawkode is a little fiesta in that sense. Anyway, but who knows? Maybe we'll get really lucky,

40:20 Final Thoughts

40:31 and it works out. But, JJ, is there any last, like, thoughts, comments, whatever? Because we're already longer than we normally are, but this is a really interesting conversation. And maybe you have any last things, last plugs, last whatever. Where can we find you online? All that fun stuff. Yeah. Well, first of all, I'm pretty easy to find online. JJ Asghar most places. If you are interested in watsonx, I do need to plug the URL: ibm.biz/dev-watsonx. The dash is the actual dash, not the word "dash", but the actual dash. dev-watsonx, ibm.biz. I do want to acknowledge that

40:41 Guest Final Thoughts & Contact Information

41:16 it is hard. You think Kubernetes is hard. You think our cloud native ecosystem is hard, and it is. When you start playing in the AI space, be prepared to be confused. Question your ethics and morality, and never ask what a developer advocate is, because I did once, and it told me I was, in essence, a CIA agent killing people. It was a little weird. What? Okay. Yeah. Yeah. There's a reason I had not created a ChatGPT account. Yeah. Or any AI. No, this was just an open source LLM. It was even better. It was just like

41:59 some generic open source LLM that basically described developer advocates being sent out by Obama to kill Russian entities or something like that. It was really weird. It was really weird. Even more exciting lives than I realized. On that note, thanks for coming out, JJ. I hope it was fun. Thanks for having me. I hope you all enjoyed. David, last thoughts? I know you're lagging all of a sudden. Meh. Meh. I guess that's the answer. AI: meh. No. It's probably good. I'll put on that note. Thanks to y'all for listening. Thanks for joining us. If you wanna keep

42:15 Hosts' Wrap-up and Outro

42:43 up with us, consider subscribing to the podcast on your favorite podcasting app or even go to cloudnativecompass.fm. And if you want us to talk with someone specific or cover a specific topic, reach out to us on any social media platform. Until next time, when exploring the cloud native landscape... On three. On three. One, two, three. Don't forget your... Don't forget your compass.
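
A note on the 32:00 point that models do not carry anything like a SHA: the sketch below is one hedged illustration, not something described in the episode. It assumes a publisher has the model artifact and the training files on disk and wants to publish digests alongside them. The function names (`sha256_of_file`, `provenance_record`) and the manifest layout are hypothetical. A record like this only documents a claim; it cannot prove the model was trained on those files, which is the gap JJ describes.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(model_path, data_paths):
    """Build a hypothetical manifest tying a model artifact to its input files.

    Assumption: file names are unique, since they are used as keys.
    The record states what the publisher claims was used; it does not
    prove that training actually happened on these files.
    """
    data_digests = {
        Path(p).name: sha256_of_file(p) for p in sorted(str(p) for p in data_paths)
    }
    record = {
        "model_sha256": sha256_of_file(model_path),
        "data_sha256": data_digests,
    }
    # Hash the canonical JSON form so the whole record has one identifier.
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    record["record_sha256"] = hashlib.sha256(canonical).hexdigest()
    return record
```

Changing any byte of any listed data file changes `record_sha256`, so a third party holding the same files can at least check that the published record matches them.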

Meet the Cast

David Flanagan

@rawkode

Laura Santamaria

@nimbinatus

JJ Asghar

@jjasghar

Documentation

IBM watsonx.ai developer page

Additional Resources

The Pile AI dataset

FOSSY working session Etherpad notes on AI data provenance

More from Cloud Native Compass

Navigating Kairos: Immutable Operating Systems with a Cloud Native Twist

Flatcar Linux: A Modern OS for the Always-On Infrastructure

Platform Engineering: Asking "Why"? with Evelyn Osman

AI-Augmented Programming

Observability for Developers: What You Need to Know?

The Future of Sustainability in Open Source