Introduction to Kubeflow | Rawkode Academy


Overview

About this video

What You'll Learn

Sets up Kubeflow as a Kubernetes platform for running machine learning workflows.
Explains how notebooks, pipelines, and CRDs fit together for end-to-end model development.
Shows a chest X-ray pneumonia classifier trained with Kale, Katib, and KFServing.

Michael Tanenbaum walks through Kubeflow 1.3, covering the architecture, Notebooks, and Pipelines, then trains an image classification model on the Kaggle chest X-ray pneumonia dataset and automates the workflow with Kale, Katib, and KFServing.

Transcript

Generated from the English captions. Timestamps jump the player to that moment.

0:55 Introduction and Housekeeping

0:55 Hello, and welcome to today's episode of Rawkode live. I am your host, Rawkode. Now before we get started, I just wanna cover a little bit of housekeeping. First, please subscribe to the YouTube channel. This helps other people find this content. My goal is to produce valuable learning materials for us all on the cloud native landscape, and I hope that you can subscribe and share that knowledge with us. Also, I wanna encourage you to join our Discord channel where, you know, if you have questions or you want to experiment or even suggest new technologies that we should cover, the

1:26 Discord channel is the best place to do that. We've got a few hundred people in there now talking about cloud native, Kubernetes, and everything in between. Again, come and join us there. And I also want to thank my employer, Equinix Metal, who provide the time, resources, and everything else that allows me to do this as part of my job. So thank you Equinix Metal. Feel free to check out that platform by using the code Rawkode live. This will get you $50 in credit. That is roughly around one hundred hours of compute on a modest instance. So take it, enjoy it, play

1:54 with it. Alright. Now, today we're gonna be taking a look at Kubeflow, an open source program and technology that allows you to run machine learning workloads on Kubernetes. I am by no means an expert in machine learning. However, I am joined by Michael Tanenbaum, Kubeflow maintainer, who's gonna guide us through everything that we need to know to get started today. Hey, Michael. How are you? You there? Hello. Yes. I am indeed here. I'm I am both getting the live and the slightly delayed audio, so it's a little confusing for me. I have done that so many times. So

2:36 many times. Yeah. So let me think. Perhaps I I think I think if I if I resolve to use a single perhaps not. What's what's your suggestion, David? Do you have another tab open with the YouTube page? Oh my goodness. What is okay. Don't worry. Like I said, I've done it before. Well well said. We we resolved it. Thank you. Alright. Welcome to or thank you for welcoming me. Hello. Greetings. I'm here. Awesome. That's the that's the first tech up done. Right? That that's done, we can move on with nothing but glorious successful demos here on end. And that's the way.

3:14 Guest Introduction: Michael Tanenbaum

3:27 That's that's how we that's that's the only only way I like to do business, David. Thanks. I presume you asked me to introduce myself or or Yeah. Please feel free to tell us a little bit about you, and then we'll we'll start talking about Kubeflow. Sounds good. Okay. My name is Michael Tanenbaum. I am a principal solutions engineer in my day job at a company called Arrikto. We are one of the biggest contributors to open source Kubeflow, and my community participation is around the special interest group for on-premises deployments, where we are working to help make it

4:06 easier for folks to deploy in their own data centers, and providing reference architectures, particularly bare metal, because ML workloads are very performance sensitive. And bare metal, no hypervisor, get it on out of there, is sort of a mantra that we're trying to encourage. And certainly there's lots of good reasons to do that: security, cost. We can talk more about that if you're interested. But I live in New York City, so I'm speaking to you today from Red Hook, Brooklyn. And what else? And previously, I've I've been in

4:45 the Kubernetes space for a long time. Before that, I was in the Mesos space for any cloud native dinosaurs in the in the audience. I worked at Mesosphere for a couple years as well. And I started my career as a as a junior developer working on analytics tools at the World Bank, and since then have done a million different things all around the software world. So I love software and have ever since I was a little kid, and that's that's sort of what got me into it. I have no formal education. It should be noted in anything related to

5:20 computer science. I studied Chinese in college. But but, yeah, self taught. It's one of the things I love about our our industry, right, is people don't really care what's on your what's on your what your diploma is, you know, what's hanging on the wall. It's just can you do the job that, we need done today? And, even if you can't, can you Google effectively? So I couldn't agree more. I have no formal education in computer science or computer engineering either. I am a complete that managed to faint his way into his first role and has

5:50 just kept learning ever since. If it wasn't for Google, I probably wouldn't have this It definitely keeps me Well Keeps me from getting fired, I'm sure. Alright. Cool. Why don't we thank you for sharing all that. It's really nice to get a little bit of personal information as well. It's just a general overview of you. So that's awesome. Do you wanna then tell us a little bit about Kubeflow? We can go straight into slides if you wanna do that, or you can maybe give us a Sure. Yeah. Yeah. That that sounds good. So speaking today, not from, you know, the

6:14 What is Kubeflow? (Introduction & Problem)

6:25 position of my not from my day job, but speaking about it from the perspective of being a community member. So everything you'll see today is open source and available for folks to play around with. So, yeah, I do have some slides. I think that might help illustrate a little bit about Kubeflow for folks who are new to the project. So there we go. Perfect. Alright. So the Kubeflow project is approximately, you know, in a formal way, about two years old, three years old at this point, 2018. We are just coming up to

7:03 the 1.3 release. And by just coming up to, I really mean just coming up to. I think the RC is being cut today. I should know, one of my colleagues here is the release manager. But I don't wanna put him on the spot, right, as, you know, I said on the air that we were gonna release RC. But yeah. Anyways, so, Kubeflow 101, just as a broad intro, you know, speaking from the perspective. I know your audience is well versed in the world of cloud native, so speaking to an educated audience about these sorts

7:36 of tools here. So why was Kubeflow founded? Well, the reality is that today, we don't really have great DevOps, SRE sorts of practices around machine learning workloads. They tend to be very snowflakey. They tend to be pets. They tend to have tremendous operations bottlenecks. After all, machine learning is a relatively new field. In my practice and in my day job, I see lots of folks who are coming to the machine learning world from a more operations background, and then we see people who are coming from a data science, data engineering background

8:24 now being tasked with, you know, really bringing this into production, bringing this into a format where we can leverage, the benefits of of ML, as opposed to, at a at scale, right, at the at enterprise scale or or organizational scale, you know, for for nonprofits and public sector folks who would who would like to use it. So the Kubeflow project was started to facilitate that, and it is, as we'll see in a minute, a collection of many different services, many different pieces of software. But the standardization around, Kubernetes, which I've, you know, have described I

9:06 would describe as an operating system for cloud native, really. I mean, it's it's what we deploy distributed software onto, and you can be fairly, certain at this point that, anyone you need to move, any any of your clients, any of your collaborators will also have, Kubernetes. Anyone else in the in the ecosystem will have at least some Kubernetes in their, available to them. It makes sense to, put it on top of, Kubernetes. So the project was started at Google. Obviously, Kubernetes will will would be would be the preference for for them as well. And, and the Kubernetes scheduler,

9:42 is is great at at scheduling, diverse workloads. So why not take advantage of it? Right? So we've got the we've got the the environment portability sorted. We've got the scheduling sorted, and we've got the scaling sorted. Right? So leveraging containers and and, that that or, you know, that paradigm for for delivering a scalable software. So that's the mission. Alright? How do we help people get their their hard work developing machine learning models to a place where we can, provide a scalable and repeatable, pathway to, production? And maybe it makes sense to take just one second to talk about what we mean

10:30 by production for our machine learning model. David, would you agree? Yeah. Go for it. Okay. So what we mean by putting an ML model into production, is, going to sound very simplistic probably to a lot of your, listeners, in the sense that it's just an endpoint. It's just a service that we can send it some payload of information and get back another piece of information. That's that's all we need to do. The issue is how do we automate as much of that process as possible while ensuring security, isolation, and so on. Does that make sense? I

11:09 do. Yeah. Yeah. That's so so. An ML model at the end of the day is just a is just a formula. It says, you give me, you know, the the pieces of information that I need, and I will give you back a guess. Could be a guess about is the, are we coming up to a stoplight? If you're a self driving car, could be, does this person have a a medical condition, which is the example we'll be looking at. So why Kubeflow? What was sort of the the genesis of of the reason you know, the

11:38 genesis behind it? There are a lot of tools that that can help in cloud native workflows. And the issue is that there's not really a a good way to stitch them all together. So you can think of Kubeflow as an open ecosystem where you can put in the tools you want, take out the tools you don't, but you're guaranteed a degree of interoperability between all of the all of the tools, and the base set of tools will always work together and cover really most use cases. Any questions on that? Make sense? Yeah. I think I'm I'm still

12:17 with you. You have not lost me yet. Okay. Good. And the this is from the the Kubeflow 1.0 user survey from last year. The the newest one, the 2021 version, will be coming out very shortly. Some of the issues around getting a model to production are, of course, the bottlenecks associated with, you know, finding, you know, the right operations people who can take the the data scientist work output and then deploy it to production. But there's lots of other stuff too. You know? Machine learning models are being used to enhance decision processes, some of which may have legal or

12:55 compliance implications. Making a loan decision, for example, would be one of those. Making a hiring decision would be would be another. And so the more automation that we can get around documenting what the model is doing, why why it was it it produced a a certain thing and bringing that cost down as well as time is is of great value because there's lots more models that folks want to put in into production, and 90% of them never get there. I saw a statistic fairly recently that said that for the average American bank, it's over half a million dollars just in compliance cost

13:34 to get a model into production. So Yeah. Lots of, yeah, lots of lots of headache and costs there. And, you know, the general practice is that it takes, you know, a good six to nine months to get a model into production. And for example, if you're if you're let's pick an example. If you're a, oh, I don't know, a streaming video player and content producer. I certainly remember when Tiger King was all the rage. It is no longer the rage. So if I'm developing a recommendation engine using machine learning that helps viewers identify other shows that they might like

14:17 to watch, other programming they might like to watch. It doesn't help me very much if my model that was trained six months ago gets, you know, deployed after everything is after that data is stale. Right? So there's a notion of currency in machine learning models that doesn't exist beside you know, outside of revving libraries and other other stuff, for for software development, more broadly. The data needs to be needs to be current. There's a a declining, there's a declining return to the value of data, as, as time goes on. So what's in the box? You're welcome for not including the meme.

14:54 Kubeflow Architecture & Components Overview

15:00 As I mentioned before, Kubeflow has a ton of different stuff in it. The the data scientist or machine learning engineer or data engineer will be familiar with certain frameworks. Excuse me. TensorFlow, PyTorch, MXNet. Those will be those will be familiar. They, of course, also have their fans and detractors and their lineage. PyTorch was developed by, came out of Facebook. If I'm not mistaken, TensorFlow is from Google. MXNet is is developed and promoted by, AWS as an example. But they're all they're all open source. And so this you can think of the top of this diagram here as as the interaction

15:43 with the the end user will interact with with these libraries via some containerized environment. It could be a Jupyter Notebook as an example. It could be VS Code. It could be RStudio. Lots of different ways to to interact with the software with the software stack proper. And the beauty of containers is that not only they they come up very quickly, but also that, you know, we can standardize on a set of common libraries. And then as we need to make adjustments, of course, layer on top of them. Under the covers, what we're what we're also

16:22 delivering as part of Kubeflow is a set of operators that relate. I'll focus down here to start. A set of operators that relate to the specific machine learning libraries. So let's take the TFJob operator as an example. So one of the great advantages of distributed computing is that things that would take forever on a single machine take far less time when we're able to effectively chunk up that work and distribute it among many machines. Right? We can't make the computers any bigger, nor would we necessarily want to. And there, to Waleed's point, thanks for the comment.

17:05 There are actually quite a few more, that are are very close to being ready for prime time. So stay tuned. If you're interested in that, definitely definitely check that out. But how how do these what are these operators? Well, to quote my my dear friend Jared Dillon, an operator is just a piece of software, deployed with its own runbook for for managing that software. In other words, a custom scheduler, you could think of it that way in some ways for for these operators that you have here. And so what we're able to do is take our standard our standard code that we

17:41 would use TensorFlow or something like that and very effectively using the operator along with the Kubernetes API, Kubernetes scheduler, distribute that across all of our resources. So if if you're if you're doing machine learning on on on huge amounts of data, you can take something that would take a a week to train on a single machine. You can bring it down to half an hour. Right? Just depends on the resources that you have available to you. So that's that's really where we see some of the the efficiencies, for folks and and one of the most

18:17 exciting elements, of the, of the software stack. We have a a, Jupyter it's it's a little bit of a misnomer. I think they're actually the notebooks working group is contemplating changing the the name of it, but essentially a a notebook or coding environment, web app, and controller. We'll see that in a second. It's all self-service, and supports, in you know, securely injecting secrets into the environment so you don't have plain text secrets floating around. Resource dependencies, you know, does this does this coding environment need to be need to have access to a GPU or some kind of special

18:56 resource, some extended resource in Kubernetes, etcetera. We also have quite a few attendant core services that are necessary for productionizing machine learning workloads. Pipelines, for example, folks may be familiar with tools such as Airflow or others that provide directed acyclic graph, DAGs as they're known, functionality so that we can have retry logic, branching, all of that kind of stuff. Kubeflow Pipelines is built on top of Argo, you can see here, and interacts natively. So the beauty of the system working together is that all of these bits and pieces know

19:42 how to talk to each other. There's a metadata store. So what did we do? There's KFServing, which is built on top of Knative and provides the ability to publish a model as an endpoint. Right? So you you know, if I'm a data scientist, I I don't really know that much about, you know, operations. I don't know really anything about Kubernetes, perhaps. And I still wanna be able to start doing model deployments in a in a safe way. And so KFServing is built on top of Knative, as I mentioned, and leverages Istio's ability to segment traffic as well as do

20:21 things like canary deployments, blue green, all of that kind of stuff. So if there's any language here that I'm using that, you know, you think folks might not be familiar with, I'm happy to dive into it. And then lastly, and this is super exciting also, automated hyperparameter tuning. So we're gonna get into this a little bit later on, but, and I don't like the analogy, but this is sort of where the art of machine learning and data science comes in as opposed to the science.
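
For readers coming from an operations background, it may help to see what "hyperparameter tuning" means in miniature: hyperparameters are the settings chosen before training starts (learning rate, batch size, and so on), and tuning means trying combinations and keeping the best one. The sketch below is a plain-Python random search, not Katib's API; `toy_validation_score` is a made-up stand-in for a real training run, and the candidate values are assumptions chosen only for illustration.

```python
import random

LEARNING_RATES = [0.001, 0.01, 0.1]   # hypothetical candidate values
BATCH_SIZES = [16, 32, 64, 128]       # hypothetical candidate values


def toy_validation_score(learning_rate, batch_size):
    """Stand-in for a training run: a made-up score that peaks at lr=0.01, batch=64."""
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(batch_size - 64) / 1000


def random_search(trials, seed=0):
    """Try random hyperparameter combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(trials):
        params = {
            "learning_rate": rng.choice(LEARNING_RATES),
            "batch_size": rng.choice(BATCH_SIZES),
        }
        score = toy_validation_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search(trials=20)
print(best_params, round(best_score, 3))
```

In a real setting each trial is a full training job, which is why automating and parallelizing the search on a cluster is attractive.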

20:52 I don't think those things really I don't think art and science necessarily sit on opposite ends of a of a spectrum. I don't think they necessarily oppose one another, but this is where there is a certain amount of of finesse that you need to be able to read as a human that the the actual training operation isn't gonna provide you. So that's where that's where an experienced data scientist or machine learning engineer will will have an advantage, and and Katib helps automate the workflow for for picking those choices. If we look just to for folks who

21:24 The Machine Learning Workflow

21:26 may not be familiar with what an ML workflow actually looks like generally, we we can segment these. These are all these diagrams are all available on kubeflow.org, by the way. So that's the the source for them. We can segment it into into two phases. The first would be an experimental phase, which is identifying what you're trying to do, identifying the problem, and then collecting and analyzing the data. Excluded from this is the the bulk of the of of the time. You know, I've I've had to explain to family members and others that programming, you spend, you know, 95%

22:02 of your time stuck. Most of the time, we're not trying to reinvent the wheel. We don't have to develop some fun new algorithm. Something we think should be working is not working, and we're spending time debugging that. So let's call that a one to 19 ratio, right, of time productively plowing ahead with our project and time trying to find an answer to what's not working, at least if you're a programmer at my skill level. In data science, it's worse. It's probably, like, 99 to one or or one to 99

22:37 ratio because cleaning the data, amalgamating the data, coalescing the data, getting access to the data, eats up a a huge a huge huge amount of time. So, you know, anything that we can do and and that's a that's a hard problem. There's not, you know, there's not a ton outside of, you know, bringing in more humans to help with that that that we can do. After all, if you had the ability to, make decisions about your data, then your job would be done. That's that's kind of the the point. Right? So so we really wanna speed this up.

23:12 Then, you know, you need to code your model, pick, pick, how you want the the structure of the model to work. You need to see, you know, what kind of results you get. That's a sort of experimentation. You need to tune that, the hyperparameters, which we'll talk about in a second. Then you need to do this whole thing again. When it comes to production, it's a similar set of steps, but but a little bit different. So we've we've coded, excuse me, a mechanism for transforming the data into there's a sort of industry standard term that,

23:43 for folks, watching. If you're ever, you know, if you ever wanna be the person who, like, sounds, like, really educated, just, you know, refer to what's known as a model ready data frame. You need to get the data into a a, state, into a format, into a form that the model can understand. You tell the model, this is what you should expect in terms of data coming in. So there's there's that. We'd like to automate that. That's that's where the notion of having a pipeline really makes a big difference. Then we have to train the model. We have

24:14 to we'll we'll show that in a second. And then we have to actually serve the model. We have to deploy it in some in some way that, we can see what the we can have other microservices or other services in our environment speak to the model. Ask the model a question. Say, hey. I've got, I've got, Dave over here. He's, he's his last five shows that he watched were x y z a b c. I guess that would be six shows. And and tell me what I what I should recommend to him next. And then over time, as as more and

24:49 more people, you know, choose not to watch Tiger King or choose not to watch don't don't pick up on those recommendations, the model is performing worse. Right? Like, that that ship has sailed. So there is a a process by which we must update, monitor the quality of the the model's output, and then update it on a on a regular cadence. And indeed, there's lots of exciting work going on and some organizations that are indeed using continuous learning, what's known as continuous learning in production. So just to to, you know, hammer the point home, these are

25:28 the the these are the the Kubeflow components that map to the different the different stages. You'll notice the the data cleansing problem, and data collection problem is, is left untouched. Oftentimes, in the wild for folks who might be interested, you'll see the the, what we used to call the SMACK stack, but now I guess it'd be the stack stack with Kubernetes. Is that amazing? But Kafka, Cassandra, Spark, you know, these types of of tools, streaming or batch to be able to to ingest large amounts of data, and then we can pass it into a into a process for for transforming it.
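
The production steps described above (transform the data, train the model, serve it, monitor it) form a small directed acyclic graph. A minimal sketch of that ordering idea in plain Python follows, using only the standard library's `graphlib` (Python 3.9+). The step names and the `run_pipeline` helper are hypothetical, and this is not the Kubeflow Pipelines SDK; it only illustrates how a DAG determines which step may run after which.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each maps to the set of steps it depends on.
PIPELINE = {
    "transform_data": set(),
    "train_model": {"transform_data"},
    "serve_model": {"train_model"},
    "monitor_model": {"serve_model"},
}


def run_pipeline(dag):
    """Visit each step only after its dependencies and return the execution order."""
    order = []
    for step in TopologicalSorter(dag).static_order():
        # A real pipeline engine would launch a containerized task here.
        order.append(step)
    return order


print(run_pipeline(PIPELINE))
```

Because this example is a simple chain, the order is transform, train, serve, monitor. A pipeline engine adds retries, branching, and artifact passing on top of the same dependency structure.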

26:13 At the production level, you'll notice that pipelines span all of this. Right? We would have in our pipeline a could even have a transform data step. We even have a train model step. We would have a serving step and then a monitoring step as well, and here's how the frameworks line up there. So I'll take a pause, actually. I'm I'm sure you have questions, Dave, or someone from the audience. I love how you started with, you know, it's just an endpoint that answers a question for you. And then you kinda look under the covers and you're like, oh, shit. Like,

26:32 Discussion: ML Challenges, History, and Ethics

26:48 there's like so many different different parts here, so many different pieces of the stack and the architecture and wow. There's an awful lot going on here. And I'm very excited to see just how Kubeflow is stitching all this together and making it better for people to consume. Because like, is I mean, is this a standard machine learning pipeline? Like, you know, 12 or more different pieces of software? It's just it's complex. Yeah. Yeah. Yeah. I used to joke, you know, that that Kubeflow is more complicated than Kubernetes in in many ways. Right? To to have a functional Kubernetes cluster, you

27:29 know, let's see. What would you need? You need the the API. Right? So you need some some control plane, you know, resources scheduler. You'd need, you know, kube-proxy, CoreDNS, kubelet. What am I missing, Dave? That's pretty much the building blocks of a distributed application. You just need networking. I need discovery for DNS. I need compute and a scheduler. And then, like, let the workloads go and and do their thing and, obviously, an API to coordinate that. But I think what I didn't appreciate until you kinda walked me through these slides is and I love

28:07 the way you said you just start with an endpoint. Right? But that's just once the model is trained, but there's all of that legwork to get to that point. And, you know, when you had that graph and you're saying, well, you know, some organizations, it's taken them six to nine months, and it cost them half a million dollars. And, like, there's just so many different components here. And I I'm assuming that's probably consistent across most machine learning workloads is that it's the training and evolution of the model that is the bit that needs coordinated. And the faster that you do that, the

28:35 the better velocity you've got in shipping new versions of that model is how you make your money. Is that fair to say? Yeah. Yeah. I mean, that's that's thank you for raising that. I mean, that that's the that's the the between the lines message. Right? It still pays significantly. You know, it's still an ROI deeply ROI positive endeavor to to actually spend that half million dollars. That's just in compliance cost. Right? So so that's that's not even including the the labor time and the infrastructure and all the stuff that went into it. And, you know, you can see the impact

29:14 of it. I mean, AI is everywhere. I'm I mean, I'm using a a camera application on my iPhone to serve as my webcam. You know, your your your telephone or with the Apple architecture, it's got multiple special purpose cores on it. Right? Some are are neural cores. Some are image processing cores. You know, there's a, you know, a high speed and low speed CPU cores. There's there's this this is this is everywhere. You know? If I move my head, the, the the bubble around me, you see shifts. Right? Like James James Bond. Oh, this is

29:47 cool. James Bond. So so how's it doing that? It's it's recognizing my face. In real time, that's crazy and terrifying. You know? The COVID vaccine. Another perfect example. The COVID vaccine uses mRNA. mRNA is super unstable. That's why these vaccines spoil so quickly. Right? Need to be kept in liquid nitrogen and all this stuff. It's because the molecule degrades, and being able to predict where that molecule degrades transformed the world. Right? I'm I'm very much looking forward to getting back into society when all of this blows over. Right? Like, so so it's everywhere. It's everywhere.

30:31 You know? Okay. Could we And the reason that yeah. Sorry. Go ahead. I was just curious. Like, you know, obviously, like, so we're we're we're looking at Kubeflow today. This is a way to kind of I I don't know if it's to automate or orchestrate or to simplify all of these moving parts on Kubernetes. But what was it like before Kubernetes? Was it just the Wild West? Oh, well, before Kubernetes, you know, it's so interesting I find in in technology. You know, do should we should we kill the slides for a second? Just just chat

30:58 this one out? I mean, I I don't know. Yeah. Yeah. Yeah. Okay. Where'd your audio go? No. I'm not hearing anything. I think it when the screen share dropped, it kind of cut you out. How about now? Oh, yeah. It's back now. That was weird. Okay. Yeah. That's that's fun. Yeah. I I could see the little microphone bar. Yeah. What was it like? What was it like? So, you know, one of the things that blows my mind is, just how, you know, the left hand doesn't talk to the right hand. Right? I mean, there wasn't a

31:51 lot of distributed computing going on. But at the, but at the same time, but at the same time, there you know, everything was being done. There wasn't a lot of distributed computing going on, but it was possible. Right? Like, there there were there were options for this. But the people who are working on these these training and data science questions weren't talking to the people who are doing a lot of the distributed computing stuff. Right? And they're certainly the people who are doing the distributed machine learning at that time, if you could even call it machine learning. I mean,

32:27 you gotta remember a lot of these these discoveries around machine learning are are extremely recent, I mean, really in the last ten years. You know, neural networks were for fifty years. You know, they they were conceptualized in the sixties. For fifty years, everyone thought it was a ridiculous and impossible thing until somebody figured out, okay. Let's do back propagation, and, well, it'll work. And it and, you know, there's a there was a great article. Oh my goodness. I can I I'll share a link with you talking about, the the folks who, the the academic team, who who

32:58 discovered that you could do deep neural networks? And, and with, you know, with no product, they they sold their company for, like, in an auction for, like, $50,000,000 to Google. It's just like an acquihire. You know? Anyways, so it is you know, that's the one side. It's very new. On the other hand, you know, the folks don't don't talk to each other. And then we also concurrently had had hardware breakthroughs. Right? I mean, the this is some I I'm not a a gamer, personally, but, you know, the the gaming world, we owe them a debt of gratitude in

33:36 a way because by purchasing high end GPUs, they essentially subsidized the development of of GPUs such that they became useful for machine learning. Right? Which is something that I don't think most people really think about. So you had really, what you had is people a a confluence of events. One being we now have Internet scale data. Right? We now have problems we wanna solve that previously we didn't have the the mathematical algorithms to be able to to process in a way that would result in anything at all valuable. Right? So, I mean, it doesn't matter if

34:20 if I'm not you know, if I wanna have a self driving car, if I don't have a machine learning framework or or the algorithms to be able to to produce something that will learn how to drive, I'm not gonna even attempt it. Right? And then, and then we had GPUs. Right? And that that really gave us a a lot of a a lot of horsepower. You know? And and the scale of these things for things that are meaningful, you know, I I GPT-3, which is a an NLP model, you know, got a lot of attention, when

34:52 it came out fairly recently. That to train that model required, I'm I'm not so good with, with electromagnetism. It was, like, three gigawatt hours of electric does that is that higher? That seems like a lot. Right? That is a lot. It was it was the it was three hours of a full nuclear power plant's output. Okay? So that's that's the meaningful metric by analogy for me. It's it's a lot. Right? So these are computationally extraordinarily expensive operations, but the yield is there. There's there's a reason to do it. Right? So that's that's really how I see

35:28 it. We didn't really have a lot of distributed computing. People could get away with using really big boxes with a couple GPUs attached to them. And the objectives were less ambitious than they are today, if that makes sense. Yeah. That does make a lot of sense. It helps solidify everything that you're saying and makes me understand that evolution a little bit better. You know, I could even think back to probably around ten years ago as well, and everything, at least at the organizations at the time, was all about ETL pipelines and doing everything in

36:01 big batch machines at night when it was maybe cheaper, or using spot instances, stuff like that. But now, you know, with things like Kubeflow and TensorFlow and all this other stuff, we're more real time. We're streaming machine learning or training models, and I think that's probably driven a lot of, I don't know, change in the tools that we use as well. Like, it's not just, here, send a zip archive to someone at 10PM, and they'll process it at 3AM over four weeks, and then you'll get it back. Yeah. Yeah. Exactly. Exactly. You

36:32 know, I remember, back in my Mesos days, just to have a silly example, China Unicom was a huge and very public user of Apache Mesos, just to go in the Wayback Machine here. And the reason that they made the migration to use Mesos was because when someone wanted to figure out what their cell phone bill was, they had to wait, like, forty eight hours for a combination of batch processes and just compute time. Right? And, you know, who would accept that now? Like, that's,

37:05 you know, that's crazy. Anytime I log on to my bank and they have a maintenance window, I consider getting a new bank. I consider it, like, inexcusable. I do that with every company. If I see a maintenance page, oh, we're down for scheduled maintenance for the next six hours, I'm like, what is this? Like, that just gives me so much fear about the way that they handle their data, work with their data, and work with their systems. There's very few cases where I think that's

37:30 acceptable these days. I refer to that as the onion ring and the french fry problem. I like onion rings. I like french fries. If I order french fries and I see an onion ring in my basket of french fries, now I'm worried about what's going on. You know what I'm saying? I like that. Alright. So, back to it. Yeah. Do you want me to pop your slides back up? Please do. Yeah. Alright. So what's new in Kubeflow 1.3? I know we wanted to give a quick update. I'll just go through this. This is a big feature release

37:53 What's New in Kubeflow 1.3

38:03 that we're coming up to. So we've got new and updated UIs. So we have a serving UI now that's got integrated logging and monitoring. We've got a new UI for hyperparameter optimization experiments. We've got TensorBoard coming. TensorBoard provides you with monitoring not from an endpoint health perspective, but from a quality of model perspective, if that makes sense. We talked about it before, you know, don't keep recommending Tiger King. We have a way to monitor and manage volumes. A lot of these datasets are big. How do we avoid copying them a million times, and how do we get visibility into that?

38:41 We have integration with an open source project that we call Kale. This is part of the stack developed by one of my colleagues here at Arrikto, in my day job, and better integration with pipelines. So all of this is, you know, getting to the point where we've got so many people in production. You know, we really wanna make this as easy and as clean as possible. We are also including, it's been out in the world for a little bit, the ability to run VS Code and RStudio as your IDE,

39:17 like I said, if it's containerizable, which what isn't, you know, then you're generally good to go. But this will come out of the box, with new example notebook images for TensorFlow 2.0 and PyTorch. And then lastly, we've made some enhancements to KFServing, such as making it easier to do canary deployments. Super important, right, with a model in particular because, you know, it's different than the software development life cycle where, okay, sure, we can, you know, put something in staging

39:52 and use, you know, anonymized production data that we pull back into staging. The issue that we'll see with machine learning models, you'll see in a second, is that we don't really know what exactly is gonna happen till we get out there. Right? So you wanna really take your time and roll these out progressively over time. And, you know, just some additional library support there. We've upgraded Istio. Istio 1.9 will be shipping. The manifest working group, led by,

40:32 Yannis Zarkadas, also one of my colleagues at Arrikto, has completely changed the way that we deploy it. So it's pure Kustomize now. You can deploy Kubeflow 1.3 with just basic kubectl. We've got multiuser pipelines, so now you can isolate pipelines appropriately for production deployments. And, you know, pod affinity, pretty standard stuff, but just sort of catching Kubeflow up to the latest and greatest, really, in the Kubernetes community. I mean, pod affinity is not latest and greatest, but we have to think about how we expose these knobs and dials

41:12 you know, to an end user who is very likely unfamiliar with Kubernetes objects and deployment models. So that's the state of affairs with new stuff in 1.3. And I promise we'll be getting to a live coding example in a mo. But I wanted to just provide an analogy for what the coding example will be demonstrating. Because, you know, code that makes no sense seems of limited value. So if this is

41:41 Understanding ML Concepts: The Fork Analogy

41:57 your first time being exposed to deep learning, I apologize that I'm the one to introduce you to it, but I don't have a PhD in anything, let alone mathematics. So if I can learn it, you can too. And I'll be borrowing an analogy from Andrew Trask's Grokking Deep Learning, which is published by Manning. Highly, highly, highly recommend it as a way to conceptualize and understand what we're actually doing here. So we are going to use an analogy of a fork analyzer. And what I mean by that is we are going to attempt to create

42:00 Introduction to Machine Learning by Example

42:36 a clay mold. You can think of it as a block of clay, and we're gonna try and make a mold. And the mold is not going to be for the purposes of pouring liquid metal and producing additional forks. Instead, the goal of our exercise here is to produce a mold such that anything that is a fork should fit into the mold, and anything that isn't a fork should not fit into the mold. Right? So the mold will tell us, is the thing I'm putting into it a

43:05 fork? And the way it will determine that is whether the fork sits in the mold comfortably or whether it, you know, gets bumped by the sides of the mold or something like that such that it doesn't sit comfortably. Does that make sense? Yes. Okay. So I apologize for my heinously terrible graphic design skills, but, you know, it is what it is. So here's our clay mold. It's a block of clay. Right? And I will use you as my audience foil here. So, David, tell me. Do you think this is a

43:46 fork? Are you smarter than a clay mold? That is a fork. Okay. So I would agree with that. And so what we will do is we will take this rendering of a fork, and we will press it into the mold. Alright? So what's the outcome of that? Well, now we've got a fork shaped mold. Right? But it's this fork shape. Right? So let's see how well our mold performs. What's next? David, time to shine. Fork or nonfork? Fork. Alright. Agreed. I think this is a fork. Alright. So let's dip it into

44:26 the mold and see what happens. Well, you can see on the screen that actually, if we put this fork into the mold, the mold would tell us that it's not a fork. Right? It wouldn't sit comfortably in the mold that we've made. Right? Correct. The mold would say, this ain't no fork. We would say it is. So the mold goes, ugh. Alright. Fine. This is a fork too? Alright. I will change my mold. Right? The mold will change shape, right, as it's clay, to accommodate this. You told me this is a fork. Fine. Alright. It's

44:59 a fork. I believe you. Whatever. I'll change myself to accommodate this weird, you know, I don't know, Victorian style tea fork or whatever it is. Alright. So next up, fork or nonfork? I'm gonna stick with fork. Agreed. This is a fork as far as I'm concerned too. But let's actually just take a second here and observe some differences. The forks that we've seen so far have four tines. This one has five tines. The forks that we've seen before all merge down at the base of the tines. This one has a perplexing gap.
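
The mold-updating loop in this analogy (press in a labeled example, and reshape the clay only when the mold disagrees with the label) can be sketched as a tiny perceptron. This is an illustrative sketch, not code from the demo: the three features (has tines, tines join at the base, has a handle), the example objects, and the labels are all assumptions invented for the example.

```python
# Each example: (has_tines, tines_join_at_base, has_handle) -> 1 if fork, else 0.
# Features, objects, and labels are hypothetical and exist only for illustration.
EXAMPLES = [
    ((1, 1, 1), 1),  # dinner fork
    ((1, 1, 0), 0),  # comb: tines, but no handle
    ((0, 0, 1), 0),  # spoon: handle, but no tines
]


def predict(weights, bias, features):
    """Return 1 ("fits the mold") if the weighted score is positive, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def train(examples, epochs=50):
    """Perceptron rule: reshape the 'mold' only when it disagrees with a label."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in examples:
            error = label - predict(weights, bias, features)
            if error:
                mistakes += 1
                # Nudge each weight toward (or away from) the features just seen.
                weights = [w + error * x for w, x in zip(weights, features)]
                bias += error
        if mistakes == 0:  # every example now sits in the mold correctly
            break
    return weights, bias


weights, bias = train(EXAMPLES)
for features, label in EXAMPLES:
    print(features, "label:", label, "prediction:", predict(weights, bias, features))
```

Like the clay, the weights start out blank and only move when a labeled example is misclassified, so the result is entirely determined by which examples and labels are shown.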

45:35 So if we were to put it in the mold, we would say, wait a second. This isn't a fork. The mold would reject it. So the mold has to update itself. Right? As we dip it in, now it's going to make space in the clay to account for forks like this. Next up, fork or nonfork? Definite fork. Definite fork, agreed. What do you think is gonna happen when we try to overlay it? We have to rotate it. Right? If we put it in horizontally, it's not gonna work. Right? So now what we're saying to the

46:03 mold is a fork is a fork regardless of which way it's presented to you. Right? You need to account for different rotations of a potential fork. Right? So we'll layer that in. And the mold at this point learns, ah, so orientation doesn't matter. Right? Like, as long as I can sort of standardize the way that I'm observing these, as long as they have these characteristics. And you're starting to see with the overlay what exactly a fork looks like. Right? There's a little bit of gap between the tines. There are tines. There's a part of

46:34 the fork where it all merges together, and there's a handle. Right? Let's look at another example. Fork or nonfork? I mean, I guess technically it's a fork, but I wanna say not a fork. It's so... For the purposes of this highly contrived example, it is a pitchfork, exactly. But you're right. I'm gonna call a pitchfork a fork. It has fork in the name. I'm not a linguist. You know? Don't hate me. But let's see what the model would say. Well, the mold would say, nah. This doesn't fit in comfortably.

47:11 Right? This is not a fork. Okay. So now the model has to take into account a fork that has an odd ratio of length of handle to length of tine. Right? And that would be different. Think about a rake. Right? That also looks like a fork, but it's not. Right? So think about that. So now the model says, David, you're killing me here. Right? Does the handle even matter at all? And the model in this case is saying, I guess not. Right? The thing that seems to be important are these tines and

47:46 the fact that they come together and there's a little space between them and yada yada. It seems like the handle is irrelevant. Right? So you know what? I'm gonna ignore handles. I'm just gonna chop a section of myself out so that any handle that comes, you know, through, I'm just gonna disregard that. Apologies. My dog is harassing me. I'm gonna completely ignore handles. Alright? Now is this a fork? What do you think? No. No. But similar to a fork. Right? Yeah. So if we were to compare this, you know, the model would say,

48:24 okay. Yeah. That's a fork. From what I can tell, in order for something to be a fork, it has to have tines and it has to come together at the bottom. Right? But that's actually not true. Right? It must have a handle to be considered a fork. Right? And there's some ratio of handle to tine length that we're not exactly sure what it is, but we know for sure, if you're telling me this isn't a fork, then I know I need to be looking at the handle. Right? So the mold would say, it looks like a

48:49 fork to me. The comb admits it, you know, comes out of the closet, admits that it's a comb. And the model then says, okay. I need to consider the handle when evaluating, but apparently, I don't really care how long the handle is. Right? So this is the process. All credit due, thank you, Solomon, all credit due to Andrew Trask, who wrote this book. Is a spork a fork? I think we have to answer all the difficult questions today, don't we? Well, it's so funny. So my girlfriend

49:24 is a lawyer. She's gonna hate me for referencing this. My girlfriend is a lawyer, but she studied linguistics in college. And there was a Supreme Court case here in the United States where, you know, you're not allowed to destroy financial records. Right? Like, if the government wants to investigate, you can't just throw it all in a dumpster fire and say, I don't have any records. Right? And there was someone who was caught fishing, fishing fish that he wasn't allowed to fish, and he threw the

49:54 I don't know, not illegal, I guess illegal, the fish he wasn't supposed to catch, he threw overboard before the police boat caught up with him. And so the government argued that a fish was a record, that you have thrown a fish out, and in this case, the fish is the record. Right? But his lawyers argued, a record's a record. A fish is a fish. So these sort of ontological questions, these definition questions are super important. And I think your comment highlights just how important it is to label things

50:26 correctly, right, and to understand cultural perspective. Like, as far as I'm concerned, this is not a fork, but I don't speak for every person on planet Earth. I have no idea. Like, maybe this is a fork to somebody else out there, or maybe there's only one word that exists in a language, human or extraterrestrial, to describe this object. So the model only knows what you show it. Right? And its universe is restricted

50:56 to the data that you train it on, the images of forks being the data that we've trained it on in this case. Does that make sense? Yes. Am I doing okay? Am I winning the fork game? Yeah. So far, you've done a great job. Yeah. Okay. Yeah. I would encourage you to take up data labeling if you ever have extra time. But you can see this is the challenge. Right? How do we label data? How do we ensure that the data is labeled correctly? How do I, if

51:24 I have a model that's continuously learning, how do I know that someone isn't feeding me a bunch of combs to attack my model? Now my model's gonna start thinking that combs are forks. That's not good. So... That's a good analogy. Are you okay for a question just now? Yeah. What's up? Please. Yeah. As many as you'd like. So if, you know, I am putting this online as a quiz and encouraging people to answer questions to help me train my model and there are bad actors that are, you know, throwing in bad data, is

51:54 that easy to identify and work with? Does that ruin the whole model? Yeah. It does, and it is a challenge. So I'm trying to remember. I have to remember the name of the podcast. I don't think it was the Go Time podcast, but it's the same network of podcasts. Oh, thanks, Solomon. Really appreciate the feedback. It's the same network of podcasts. Anyways, they had a very interesting discussion with some of the folks from Mozilla who are, you know, Mozilla, really, you're doing really important work, and

52:43 thank you. I'm sure, you know, it's a challenge. But one of the issues is, you know, there's really a dearth of material for, like, natural language processing in languages that don't have a huge community of speakers. Right? So they have a great system whereby, essentially, people can contribute, right, spoken versions of text that's read. And then that goes through a peer review process, if that makes sense. Right? So people will double check to confirm that you're not attempting to submit bad data. The other resource that I would encourage folks

53:33 to take a look at is David Aronchick, who's one of the cofounders of Kubeflow and currently runs open source machine learning over at Microsoft. He gave a KubeCon talk this past KubeCon, the most recent one, in which he spoke about different attack vectors for machine learning models and ways to mitigate them. But, yeah, if you're taking input from the community, you know, I think we've seen that quite a bit with sort of the issues of news bubbles on social media networks, right, where the stream of information, the stream of news

54:17 that gets the most reaction, you know, continues to feed on, and it leads to extremes in perspectives and exclusion of counter perspectives. So, yes, the ethics around AI are incredibly important, and everyone needs to pay extreme attention to them and be aware of who you're excluding. Think about this. I think 999 for you all, 911 for us, you know, an emergency response, if you've only trained on people with native accents, right, 20 Yeah. 20 of America's foreign born. Right?

55:00 So guarantee that these... I mean, heck, we pronounce things differently, and we speak the same language. So, you know, these questions can't be afterthoughts. They must be preliminary thoughts, prerequisites to using this. So... Nice. Yeah. I guess it helps to be Google. Right? I mean, you can just launch something called reCAPTCHA, claim it's for security, but really you're training all your internal models. Exactly. Exactly. Please click the one with the motorbike on it. So we do have a couple of questions. I don't know if you wanna take them now, or whether you wanna take them as we

55:35 go. I'll let you decide if they're appropriate for the moment, or if you're gonna cover it later, just say. So while it is asking, do you have any advice for people that wanna run it on restricted networks on prem? Yes. Come to the Kubeflow on prem special interest group meetings. Those are on Thursdays. We also have a dedicated mailing list that we use for our SIG. So George is helping us out too, so maybe he can share a link to that. Our meetings alternate between European friendly times and

56:22 West Coast friendly times and Asia Pacific friendly times. They're on Thursdays. We'd love for you to come join our Kubeflow Slack. We have a channel there too, and we'll be more than happy to help you. As Jeff Fogarty, who's one of the coleads of the special interest group and runs production Kubeflow clusters on premises for a big American bank, says, if it's not behind the proxy, it doesn't count. So you will find lots of kindred spirits there. Let's see. I have other questions in the chat here. Happy to take them.

57:02 Solomon, a great question regarding Istio. So the question is, why was Istio chosen as a service mesh for Kubeflow? I think that this is where the politics of open source come into play. Istio, obviously, a Google and IBM collaboration. And so I think that was the way to go. For a long time also, you know, Linkerd at the time didn't have a sidecar proxy model, and so it didn't get anywhere near the kind of performance that Istio did. It, you know, had some scaling issues. Now they moved to a

57:38 sidecar model as far as I understand. And now everyone and their brother-in-law has a service mesh. So to the question of, you know, are there plans to allow for different service meshes? Yes. And simply by asking that question, you are now mandated by law to come to our working group meetings, because with one of our technical coleads, Marlow Weston, who works at Intel, we are spearheading an attempt to get Service Mesh Interface implemented to allow any service mesh a day in the sun and to, you know, decouple in some way a dependency on Istio.

58:19 Super early days. Do not get your hopes up. But, you know, I think there's a lot of really exciting work going on in that regard too. So, yep. Awesome. Perfect. Should we jump back to our forks? We're done with forks. Oh, done with the forks. Got it. That was the whole kit and caboodle. I was just getting on a roll there. I feel like you've cut me off too soon. I was just getting into the swing of it. Listen, I can't decide if it would be

58:51 a great job or a terrible job. You know, it's certainly not very stressful. But, you know, do I wanna be a full time fork analyzer? I know. But if you sit and do, like, 30,000 of those in a day and you click one wrong, and then you can't stop thinking about it, and you wake up in the middle of the night and you're like, what if I have destroyed this model by clicking fork and it was, like, you know, a spork? Like, oh shit. There were no rules on that. I

59:00 Working with Kubeflow Notebooks

59:13 not for me. Too stressful. Too stressful. Okay. Yeah. No. I I agree. I agree. You know, better better to to avoid any, you know, additional stress burden in your life. So just to let you know, I have up here an example of Kubeflow. Can you see that? Okay. Yeah. Can you zoom in a bit on that? It's a little difficult to read this. Yeah. Let me see. Let me zoom it. How about that? Let's go two more. Two more? Okay. Fingers crossed everything works as intended. Yeah. I think that's alright. If anyone has a problem reading that, just drop in the

59:32 Live Demo: Kubeflow UI & Notebooks

59:57 comments. We'll go more. But it looks okay for me. Okay. Cool. And just to let you know, I have clicked away from seeing the screen because I am 100% cheating on this live coding exercise. Because if I goof it, it's not gonna be fun for anyone to watch. I mean, trying to debug it. So, yeah, should we get started? Yeah. Let's go for it. Any questions come up, I'll keep you updated. Yes. Thank you. That's what I was hoping for. Alright. So this is the Kubeflow UI. Every user has their own namespace,

1:00:33 or a workspace that translates to a Kubernetes namespace. Let's open up notebooks. Right? So I need an environment in which to start my development. So what can I do? I can give my environment a name. We'll call this X-ray. And I can pick an image that I wanna use. I can also use any kind of custom image. Most people, you know, will have an image that they're using for a project. Right? So your team is all sort of aligned on one thing. I can specify my

1:01:12 CPU and memory guarantees. So this would be a request. For the audience that's gonna be using this, the end user data scientists, you don't necessarily wanna put a limit on it. That's generally sort of an administrative task. Let's see here. Oh, this is not giving me my configurations. Oh. Uh-oh. Alright. Time to go to work. No. No. No. We're good. We're good. Did I tell you it was cooking show style, or did I tell you it was cooking show style? So here's a cluster that's actually working. Woo hoo.
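
The CPU and memory guarantees mentioned a moment ago correspond to Kubernetes resource requests on the notebook container, with limits left unset. A minimal sketch of that fragment as a Python dictionary follows; the container name, image, and quantities are placeholders chosen for illustration, not values from the demo.

```python
import json

# Hypothetical notebook container: name, image, and sizes are placeholders.
notebook_container = {
    "name": "xray",
    "image": "registry.example.com/team/notebook:latest",
    "resources": {
        # Requests are what the scheduler guarantees to the pod.
        "requests": {"cpu": "500m", "memory": "1Gi"},
        # No "limits" key: capping usage is left to cluster administrators.
    },
}


def has_limits(container):
    """Report whether the container spec sets any resource limits."""
    return "limits" in container.get("resources", {})


print(json.dumps(notebook_container, indent=2))
print("limits set:", has_limits(notebook_container))
```

In practice an administrator could still enforce caps at the namespace level, which fits the remark that limits are generally an administrative task.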

1:01:53 Alright. So I have an example that's stood up, but we're doing it live. And I gotta zoom this one in, of course, because I did the same on the other one. And just tell me when to stop. How do we want it? Yeah. I think that should be alright. Let's take it from there. Okay. Cool. Thank you for your flexibility. So let's call this X-ray two. We're gonna do the same thing we did before. We're gonna give it a nice big drive. I think I have, like, 200 gig on here. So

1:02:25 another 50 shouldn't hurt anybody. I can add extra volumes. I wanna be able to interact with Kubeflow pipelines. I could also, at this point, specify whether I wanna attach this to a GPU. I believe I have a GPU in this cluster, but I don't wanna attach it to a GPU because it's something I advise folks against, because most of the time, the GPU is gonna be idle. And the beauty of Kubernetes is we have shared resources. So make a pipeline, have one of the pipeline steps use the GPU, and then let

1:02:55 other people use it. Right? Does that make sense? Makes sense. Definitely. Alright. So this is gonna take a second to come up. In the background, I am going to SSH into... So what's this spinning up right now? Is this the virtual machine inside of Kubernetes? A virtual machine inside of Kubernetes. Shame on you. You know it's not. No. I know you're just trying to feed me the line here, but, no. In fact, no, David. It's not. This is a container. This is a pod that's being deployed to my

1:03:35 namespace. And there's a Jupyter, like I said, it's gonna be renamed, a Jupyter web application. It's a Jupyter controller that manages that. And if you'll notice here, there are configurations that I can apply, and this uses an object called pod defaults. I don't know how common the use of pod defaults is outside of machine learning and Kubeflow in particular. But suffice it to say, the way that this works is, what I'm doing when I actually click one of these configurations, and that could be, you know, for example, you

1:04:15 know, in this notebook, I want an environment variable that has the up to date secrets for our private container registry or a private, you know, git repository or whatever we're doing within our environment. Right? Alright. So anything like that. Or it could be, you know, maybe I'm about to do some stuff that requires a specific type of hard disk at a minimum. So, you know, you can access it that way. And what the pod default controller actually does is,

1:04:48 what the Jupyter web app controller does is it puts a label that will be applied to the deployment, the stateful set that's actually being generated behind the scenes, abstracted for the end user, GUI driven. It will apply a label to that, and then the pod defaults controller knows to observe that label and then to inject, you know, a config map or a secret or anything like that. You know, either it's an environment variable or it's a mounted, what do they call it, projected volume, whatever.
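
The label-then-inject flow described here can be sketched as a PodDefault-style manifest plus a small matching check. This is a hedged illustration: the manifest follows the general shape of Kubeflow's PodDefault resource (a label selector plus the things to inject), but the resource name, namespace, label key, secret name, and environment variable are hypothetical, and the `selects` helper only imitates label matching rather than reproducing the real admission webhook.

```python
import json

# Hypothetical PodDefault: every name, label, and secret here is a placeholder.
pod_default = {
    "apiVersion": "kubeflow.org/v1alpha1",
    "kind": "PodDefault",
    "metadata": {"name": "registry-credentials", "namespace": "my-namespace"},
    "spec": {
        "desc": "Inject private registry credentials",
        # Pods carrying this label receive the injected settings below.
        "selector": {"matchLabels": {"registry-credentials": "true"}},
        "env": [
            {
                "name": "REGISTRY_TOKEN",
                "valueFrom": {
                    "secretKeyRef": {"name": "registry-secret", "key": "token"}
                },
            }
        ],
    },
}


def selects(default, pod_labels):
    """Imitate the selector: every matchLabels pair must appear on the pod."""
    wanted = default["spec"]["selector"]["matchLabels"]
    return all(pod_labels.get(key) == value for key, value in wanted.items())


print(json.dumps(pod_default, indent=2))
print(selects(pod_default, {"app": "xray", "registry-credentials": "true"}))
print(selects(pod_default, {"app": "xray"}))
```

Ticking a configuration in the notebook form corresponds, in this sketch, to adding the `registry-credentials: "true"` label to the generated pod, after which the environment variable would be injected at admission time.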

1:05:23 You're making me dig into my, like, CKA studying. It's been a while since I took that test. Thank goodness they made it valid for three years now, though. That's a huge benefit. I think that feature used to live as a pod preset back in the day before it was deprecated. Correct. I'm assuming it's now just an admission controller, and Kubeflow's version is called pod defaults. But it's not something I've used personally. At least the pod default version, I mean. But it sounds like the same thing as the old pod presets.

1:05:50 Exactly. Right. This is this is like a successor to that. Yep. Exactly. Okay. So what did I do? I just clicked I just clicked connect, and, now I am presented with a JupyterLab environment. Right? So, you have a a file browser, some plug ins. We'll take a look at those in a minute. First things first excuse me. I have to cough. First things first, we need to get our data. So in order to get our data, we need my, Kaggle key, which I don't know where I put. Actually, I do know where I put oh, alright. You can share it. But, basically,

1:06:34 we're very trustworthy here. Yeah. I just don't wanna shame myself. Let me pull up a Finder window here and drag and drop. Please hold. Alright. So I need my Kaggle key. Kaggle is a great place if you're getting started with all this stuff and you wanna get some good datasets to experiment with, yada yada. Highly, highly recommend checking out Kaggle. The other thing I'm going to do, just because it's really boring to go through it one by one, is copy in a requirements file. This is

1:07:14 a Python notebook. So there we go. Requirements. Okay. So I'm gonna pop open a terminal. So this is like if you did kubectl exec -it -- bash or something like that, same sort of thing. I'm gonna do a few things. First, I'm gonna alias l... oh, there we... Ergodox, be my friend. There we go. I'm gonna alias l as ls -alh. Okay. Alright. See, what can we see in here? I've got some environment specific stuff, some Jupyter specific stuff. Nothing too crazy. First thing I'm gonna do is put my

1:07:51 key where I need it to be. So I gotta make a .kaggle folder. I'm gonna cp kaggle.json. Mhmm. Kaggle. There we go. kaggle.json to .kaggle. I don't know why it didn't autocomplete like that. Alright. So let's l .kaggle just to make sure it's there, and it is. Okay. Great. So next up, what are we gonna do? We're gonna do pip3 install -U kaggle. That's gonna give us the Kaggle CLI, which will be super helpful. Alright. Next up, we are going to download a dataset. How about that? So we

1:08:35 are going to do kaggle datasets download. Download. What do we want? Okay. Let me just copy the repo here. Can you try pressing command plus on that a few times and see if it zooms in? Yeah. I think there actually is a way to change the console. There was a presentation mode there. I don't know if that does anything, though, under view. Oh, really? Where did you see that? The top, the presentation mode. And that doesn't do it. Yeah. Just try command plus. You're on a Mac, or you're on Linux, Windows? Yeah. Yeah. I'm on a

1:09:12 I'm on a Mac command and I've got so many like modifier keys on the Ergodox. Nobody that has an Ergodox and has come on the stream has a pleasant time typing. Seems to be the way. He's like paying you Ergodox users. No. No. No. It's it's it is the best. But when you ask me to do something different than what I'm used to, I don't oftentimes I have to remember. Plus, I use the Dvorak keyboard layout. I I mean, that blew it up, but it's for the purpose that we'll we'll we'll zoom it out. How about when we get into

1:09:50 the notebook? There we go. Yeah. If we're not doing much on a terminal, if it's just getting dependencies, we can just move on. Yeah. Let me control. I don't think downloading a YouTube URL is gonna work with Kaggle though. Not yet anyway. Alright. Copy. No. Alright. I'm just gonna type it out. So this user on Kaggle is awesome and you should totally check out everything that he's got on there because these are some really fun datasets. So this time, we are going to be trying to create an image classification

1:10:28 model similar to the fork. But instead of predicting forks and nonforks, we are going to be looking at chest X-rays to determine if the X-ray presents with pneumonia or does not. And one can imagine many valuable applications for that, particularly, say, in remote places that may not have access to radiologists all the time, that kind of thing, or as an enhancement, a double check to a clinical decision making sort of process. Next up, we're going to unzip chest X-ray. There we go. So you can see here, lots happening on the screen. Basically,

1:11:12 the way that the images are labeled, that is to say, how we know whether something is a healthy, normal image or a pneumonia image, is by the directory. And so the modern machine learning frameworks all have very useful helpers to just say, you know, here's the zero and here's the one. Right? A binary outcome. Either this person's got pneumonia or they don't. Right? That make sense? Yeah. Alright. So next up, what we're gonna do is just make our lives simple by doing pip3 install --user -r requirements.txt.

1:12:00 And we're gonna go through all of this, and it's gonna install all of the libraries and so on. You know, such a great oh, let me just open this up and edit it because there is something that is not a kosher file. Save file. Let's try this again. I like looking at this. You know, it takes a second, but I like it because it just demonstrates, like, so many people contribute, like, stuff that's super important, you know, that we all rely on. And, like, we really build

1:12:42 on the shoulders of everybody else, you know, the people who have contributed stuff before, and it's what lets us do what we do. And so I don't begrudge it. I'll put it that way. It's like a good reminder to be appreciative for everything that's been invested. Yeah. There's so much amazing open source software. For any problem you can think of, you can go to npm, crates, you know, the Go registry, PHP. You search for it. Someone's generally done it for free, made it available, and it's there for you to use, and

1:13:17 hopefully contribute back to it. It always amazes me. Exactly. Open source is a wonderful thing. Yeah. Yeah. And there's, you know, there's libraries that, like, everything in the world relies on, you know, maintained by one lady in a basement somewhere. You know? So it's yeah. It's it's incredible. It's like the it's a it's a real free for all. Okay. So we've got our dependencies installed, our our libraries installed. Now we will come over to the launcher, and we will open a Python three notebook. Ta da. How about that? I will give of course, I mentioned

1:13:40 Live Demo: Training an Image Classification Model

1:13:58 Paul Timothy Mooney, from Kaggle. I also wanna thank the Coursera course by Andrew and his collaborators from Google. So, you know, this is where I've learned a lot. And, you know, it's all open source, but, you know, I wanna give credit where it's due. I borrowed some of their techniques and a little bit of their display code here. So the first thing we gotta do is import some stuff. So let me actually Jupyter has this weird thing where, like, it really cares, like, when you name

1:14:40 a notebook, like, as to whether you end up with, like, 10,000 files. So let's do import os. We're gonna import random. We're gonna import numpy as np. Import tensorflow as tf. Import tensorflow.keras as keras. So Keras is like a sugar layer on top of TensorFlow. This makes your life a lot easier. Then we're gonna get even more specific. We're gonna say from tensorflow.keras.optimizers. We'll talk about what these are. I even tracked down a neat little GIF to sort of illustrate it. Again, thank you, Internet.

1:15:26 We're gonna import the RMSprop optimizer. And from tensorflow.keras.preprocessing.image, we are going to import ImageGenerator. Okay. Oops. Sorry. I misspelled that. It's ImageDataGenerator. I'm a % cheating. I find live coding in front of other people, like, the most stressful thing. Okay. So then we're gonna do some other stuff that'll be familiar. We're gonna import matplotlib.pyplot as plt. We're gonna import matplotlib.image as mpimg. Nope. None of that was correct. mpimg. Yep. And then just so that we get consistent results, we're gonna set

1:16:25 the random seed, you know, just so we can do apples to apples. Alright? And we will execute the cell with shift enter. No module found. Oh, because I missed a p. That's yeah. Thing I always have to tell noncomputer folks is that the thing about computers is if if the computer says you did it wrong, it's because you did it wrong. You know? I'm just gonna copy and paste in this is just we're just gonna look at the length of a directory because our images are in the directory. Right? And I just wanna be able to use that to say, this

1:17:01 is how many images we have. So we have 1,342 normals and 3,876 pneumonias. This is a very small dataset. You know, this should never be used for anything serious at all. But you'll see that we are still able within this context to, I mean, you know, if we had 25 gigs of images, we'd have to check back next week. You know? Is that a sort of normal ratio, like, three to one? Is that still the golden ratio for machine learning? Like, if we wanna do pneumonia detection, we

1:17:40 need three of them to every one non pneumonia? Or is that just complete serendipity? This is complete serendipity. You know, to the best of my knowledge, the golden rule is more is better, more diverse is better. Right? So, you know, equally weighted among the groups that you wanna be able to make predictions for. Right? So I believe, in fact, that this dataset comes from, I wanna say, Vietnam, if I recall correctly. Suffice it to say, it's gonna be

1:18:16 a population of patients from Vietnam that may or may not be, you know, applicable to everybody else. Right? Okay. Does that make sense? Yeah. Alright. So just keeping an eye on the time. There's a I'm gonna do a copy paste for the boring bits because I wanna be able to spend time on the exciting bits. And here, what I'm gonna do this is a really handy technique that ever since I learned about it, I do it all the time where we'll just define a base directory. We've got a training directory, so this is very helpful. Right? So among our training data,

1:18:50 you can see here it's all under the subdirectory train. You could see it over here too. Right? So here we go. Chest X-ray. Train, test, and validation. So training is the dataset we'll use to train on. The testing is to keep some data outside so that the model can't cheat and update itself based on, you know, the forks we're gonna test it against in the future, because then it'll learn about those forks or, in this case, the X-rays, and it won't be representative of new data

1:19:27 that it's never seen before in the real world. Right? Yep. So all I'm doing here is just a little directory manipulation. Okeydoke. And if we add up the training for normal and pneumonia, there'll be 5,218 training examples and 624 testing examples. Alright? So this will be used to validate. This is very handy that the dataset comes like this. It almost never does. There are some really helpful helper functions that will split your dataset for you, like the appropriately named train_test_split tool that comes with scikit-learn. So let's actually you know, let's be data scientists

1:20:14 for a moment and take a look at some examples of, you know, what this what this data actually looks like. So we'll set an index for ourselves for comparison purposes. Alright. And let's take a look. Let's call this normal. Right? So we're gonna look at a normal example. I'm gonna give it an image path, and that's going to be gotta join this up. Train normal dir dir. Yep. Here we go. And then within that, I want train normal, f names, and I wanna pull the, obviously, pick index. That's why we made it. Okay. And then

1:21:05 oh, it's a little hard to do with the zoom, but we'll get it. We'll get it. Alright. We can scrunch this a bit. There we go. Alright. So then what we're gonna do is we're gonna have an image variable and mpimg. Remember, we made that name up at the top. We're gonna do imread on the image path, and then we're gonna do plt.imshow on the image. I'm not a professional data scientist. If there's, you know, folks who take offense to any of this, know that I do this

1:21:42 very much as a novice. Alright. Here we go. So these are grayscale images. I'm I'm displaying them, in a, you know, a sort of fun color, you know, highlight way to make it just look a little bit different. But here's a person who does not have pneumonia. Alright? So there will be a test at the end. Are you smarter than the model? Apropos the, forks discussion earlier. So now let's compare that with a pneumonia example. And I've already you know, no need to watch me type poorly. So Feel free just to paste whatever. Like, can just run. Yeah. If

1:22:19 I have questions about the code, I'll just throw them at you. Don't worry about it. Okay. Cool. Cool. Cool. Yeah. Sorry. I wanna make sure we get through it. So, anyways, okay. So we're doing this. Now we're doing it with a pneumonia example. Okay. You know, it looks pretty pretty different. Sorta I wanna use the word gloopy, but I don't think that's a a Scrabble accepted word. So I was thinking more haunted. It looks more haunted. You're thinking what? More haunted. More haunted. There we go. Yep. Okay. So I think I well, so this is interesting. Right? Like, these are these

1:22:52 are different cases, and this one is only barely more haunted. Right? So, you know, not not so super simple here. So it fails already. You and me both. I mean, you know, if I didn't go to if I didn't go to a a data science graduate school, I definitely also did not go to medical school. So, you know, you put those two things together. So it's gonna be a nightmare for you and me over at the hospital, and some people are gonna die. Alright. So what am I doing here? This is how we use Keras to,

1:23:25 construct a model. So we're creating a variable, model. It's going to be a sequential model, a type of Keras model. And what we're defining here is a couple of things. The first thing we need to do is give instructions to the model about what kind of data it can expect. So oh, these are three color channels. My bad. Okay. So these are color images. So the first thing that we're gonna do is we're telling the model, hey. You're gonna get an image. It's gonna be 150 by 150 pixels, and there's gonna

1:24:05 be three color channels. So for every pixel, there'll be a a value, for each of the color channels. Then, what we're what so what you're gonna need to do is, sort of unravel that. Right? And so it's a a a flat thing. But the first thing that we're gonna do is we're gonna do a this is a convolutional neural network. So what's a convolutional neural network? Basically, it's like applying a filter. I'm gonna show you an example of what the model sees a little bit later later on. It's very similar. You think of it very

1:24:37 similarly to, like, the way that your phone, you know, will put a an image, will put a filter on an image. Right? We're we're looking at the pixels. We're seeing the values for each of the color channels, for each of the pixels, and we're making some decision about how to adjust it based on its the the pixels that surround it. Okay? So you can imagine there's lots of different types of they're called kernels that lots of different types of filters. Right? So here we have a three by three. So we're passing you can think of it like,

1:25:07 you know, when when folks used to look at slides over a light box, you know, and they would scrunch in, you know, to look and then they would go, you know, across it to really look very closely at a at a slide image or a photo negative. This is a very similar idea. And, here, we're gonna do, 16 different filters, and the filters will be, three by three. And then based off of the output that we get where where we we adjust the input, we'll come in with its own, values for those those pixels, and then

1:25:41 the output will be, based off of, how those pixels relate to each other. So here we're doing, for example, max pooling. It's a two dimensional image. We're doing max pooling, which means that it's gonna evaluate the the filtered result. Right? So it might say, you know, whatever the algorithm is that's applying to that that three by three, those nine pixels. And it's saying, you know, take the max two by two of those. Find the max values, right, for those for for in a two by two for that and use only that. Then we pass it into another layer where we're actually,

1:26:20 doing another layer of of convolution. Right? So here we're doing 32 filters. Again, three by three, and then we're gonna pull it down to a two by two. And we go through that one more time. After we get through we'll talk about activation in a second. After we get through all of that, what I wanna do is then flatten this out. So I'm getting a a a a an array of of values, right, that are, you know, between zero and one typically, some normalized values. And then I'm going to then I am going to, put those into

1:27:02 a neural network with 512 neurons. So you can think of this as a dense network, so every neuron is connected to every other neuron. And, essentially, the way that machine learning works is it'll push it through, observe which of the neurons contributed the most to the error that was observed. Right? So, you know, which neuron was responsible for calling this, you know, if it makes a mistake and says, I think this person is healthy and does not have pneumonia, it'll then recalculate. Right? Just like when we got a new fork,

1:27:39 we have to recalculate the mold. It'll recalculate in a process called back propagation, that automatic well, by virtue of, partial differentiation, we'll be able to associate the end result, namely the error, how bad the prediction was, how off it was in terms of its confidence that this person was symptomatic of of pneumonia or had pneumonia, which neurons contributed the most to that error. And so we're gonna change those neurons in the opposite direction the most. And which neurons had relatively no impact on the decision or on the error, let's leave those alone. Right? Does that make sense?
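The model described above can be sketched in Keras roughly as follows. This is a minimal sketch rather than the notebook's actual code: the 150 by 150 by 3 input, the 16 and 32 filters, the 3 by 3 kernels, the 2 by 2 max pooling, and the 512-neuron dense layer come from the walkthrough, while the 64 filters in the third block and the single sigmoid output neuron are assumptions.

```python
from tensorflow import keras

# Assumed reconstruction of the CNN from the walkthrough.
# From the session: 150x150 RGB input, 16 then 32 filters, 3x3 kernels,
# 2x2 max pooling, a 512-neuron dense layer.
# Assumptions: 64 filters in the third block, one sigmoid output neuron.
model = keras.Sequential([
    keras.Input(shape=(150, 150, 3)),                    # 150x150 pixels, 3 color channels
    keras.layers.Conv2D(16, (3, 3), activation="relu"),  # 16 filters, each 3x3
    keras.layers.MaxPooling2D((2, 2)),                   # keep the max of each 2x2 patch
    keras.layers.Conv2D(32, (3, 3), activation="relu"),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation="relu"),  # filter count assumed
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),                              # unravel into one long vector
    keras.layers.Dense(512, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),         # probability of class 1 (assumed)
])

model.summary()
```

Under these assumptions nearly all of the trainable parameters sit in the 512-neuron dense layer, because every one of those neurons carries a weight for every flattened value.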

1:28:18 Yeah. I think so. Okay. The activation is just what does it take for this neuron to actually fire? Right? Like, there we don't want every neuron to fire all the time. There should be some minimum of of signal that the neuron's receiving in order for that to even be to even be considered at all by the by the end result. So ReLU is a is a contraction of a rectified linear unit, if I recall. So you just think about, like it it almost looks like a, like a a call option graph. So, you know, nothing.

1:28:56 Don't give any signal. But if you get past this threshold, then have a linear relationship with with output to input. Sigmoid, if you if you remember from school, is gonna give you, like, an s shape. Right? So far extremes will never go above two bounds. But as you approach zero, you start to to see a significant increase and then leveling off of the the slope of the curve. Alright? Does that make sense? Sort of? It's good this is going. Ask me any questions. No. It's just It'll make a lot more sense once we see an an example of

1:29:34 it. Right? I hope so. What the model sees. Yeah. So sorry. No. But it's doing stuff. You're explaining it well. It's just not something that I am familiar with, but I'm looking forward to seeing the step by step through it. So, you know, keep going. It'll make more sense. This is why I picked a CNN with images, because you can see what the model sees in a way. Alright? So, here, I'm gonna create two image data generators. This is the handy helper function I was talking about before.
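The two activation functions mentioned a moment ago, ReLU and sigmoid, are small enough to write out by hand. The sketch below is plain Python for illustration only; Keras ships its own implementations, and the sample inputs are arbitrary.

```python
import math


def relu(x: float) -> float:
    """Rectified linear unit: no signal below zero, linear above it."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """S-shaped curve bounded by 0 and 1, steepest around x = 0."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # Rearranged form for negative inputs so math.exp never overflows.
    e = math.exp(x)
    return e / (1.0 + e)


# Arbitrary sample inputs to show the shape of each curve.
for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  relu={relu(x):.3f}  sigmoid={sigmoid(x):.3f}")
```

ReLU gives the "call option" shape: flat at zero, then a straight line. Sigmoid flattens out at both extremes, which is why it suits a final yes-or-no output.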

1:30:07 And then we're gonna give it some information about batch size, how much, you know, we want them to we want it to automatically resize each each image. It's a binary class in the sense that you either have pneumonia or you don't. We're gonna define batch size here. It's gonna error on me because I did not in the interest of simplicity, I did not include this, but we're going to include it. Gonna define some some variables known as hyperparameters. Let me copy and paste these in. We'll talk about these a little later, but I just wanted to have these variables,

1:30:44 in memory so it doesn't whine at me. So it's gonna we're gonna define a batch size. We're gonna define a, source directory and so on. Now you didn't realize this, but when if I may go back to the fork example, we made some significant choices when going through this whole exercise. One of which was that after every fork, we made an adjustment to the model. Right? Which is, okay. Right? But may not be actually what we want. Right? Like, we may want to average all of the forks. Right? All of the what let's not change the model. Let's take

1:31:27 the average of the changes to the model suggested by the multiple forks that we've looked at in a batch, and then make changes based on the average suggested by that batch. Does that make sense? Like, you don't wanna go too overboard. Otherwise, I'll get a model that's like, you know, Victorian tea fork and pitchfork. Right? Like, maybe I can average them together. That'll give me a healthier sort of look. That's batch size. Right? The other thing that we didn't talk about at all is, like, for example, what kind of sand

1:31:57 are we using in our clay? Clay is made from sand. Right? You know, how thick are the grains? Right? If they're coarser, I will have less ability to you know, I'll have less ability to to create fine detail. But maybe that's good. Right? Like, maybe I don't want the model to expect, like, you know, these sort of, baroque, design features of a of a fork. Right? So this the same idea here. This is how we define the, the the the nature of the mold, if you will. Alright. So we've got this going on. Now we need

1:32:37 to compile our model. Alright. And what we'll do here is we'll just call the compile function, and we'll use an optimizer called RMSprop. And, basically, the optimizer says excuse me. The optimizer says, okay. You'll tell me how I'm gonna calculate loss. Right? Tell me how I know how wrong I am. Right? In this case, we're gonna use a loss function called binary cross entropy, which is a way to measure how off binary classification examples are. But, like, now tell me, like, how I should actually adjust myself based off of that.
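To make "how wrong am I" concrete, here is a minimal hand-written sketch of binary cross entropy. It is not Keras's implementation; the labels and predictions are made up for illustration, with 1 standing for pneumonia and 0 for normal as an assumed encoding.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross entropy over a list of labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip so that log(0) can never occur.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical labels: 1 = pneumonia, 0 = normal.
labels = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]   # confident and correct
mostly_wrong = [0.1, 0.9, 0.2, 0.8]   # confident and incorrect

print("loss when right:", binary_cross_entropy(labels, mostly_right))
print("loss when wrong:", binary_cross_entropy(labels, mostly_wrong))
```

Confident wrong answers are punished far more heavily than hesitant ones, which is the signal the optimizer then uses to decide how to adjust the model.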

1:33:22 Like, should I go really fast in the direction, like, of, of finding a minimum to my error? Should I go slower? Do I wanna have momentum? Right? Like, there's lots of lots of things that we can, take into consideration for the, for the way that the model should update itself, for the for the way that the mold should update itself. Right? Alright. So we'll compile this. Now here we go. Now now comes the fun part. Alrighty. So we are going to define a variable called history. Looks I I remember finding this to be, like, a really

1:33:56 weird, a really weird sort of thing. But, basically, you can think of it as, like, like, the tape in a cash register. Right? And we're gonna call a fit exercise, and what that's gonna do is it's going to train on the training data. It is going to use the test data for validating that, and we're gonna train it for a certain number of epochs. So how many times through the data do we want to, do we wanna iterate? And there are a bajillion different features, like, ways you can things you can specify, within this. Like, you can do

1:34:30 my goodness. There's there's, like, an an unknowable amount. You can have callbacks to stop training based on something, you know, that happened. Let me just kick this off. You can, you know, have, you know, if you if you minimize your loss to a certain level, you know, you can have, you can, you know, you can you can introduce a skew. You can have a holdout dataset that only gets used at the end. I mean, there's lots and lots of different things that you can do when you're when you're fitting a model. So what are we doing here? So this is

1:35:05 this is a training process. You're observing it in real time. So here are the number of, they're called steps. So the number of processes that are occurring and the number of batches that are being flowed through. Here's our loss function. Right? And here's our accuracy. So at this point, the model is 67% accurate, now going up to 68. And how many epochs did we decide to... this is the sort of finesse part. So we're doing one epoch here. Okay. That's fine. So just in, like, really

1:35:50 simple terms. Because this has been seeded with images that it knows have pneumonia and images that it knows don't have pneumonia, it's looping over them, applying the model, and then the back propagation is when it gets one wrong and it knows it's wrong, it feeds it back through the model, tweaking the parameters, and tells the accuracy if it gets one right. Yes. Exactly right. Exactly right. And thanks to the beauty of calculus, we know which neuron, which signal propagator, was the most responsible for the problems and which one was irrelevant.
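The loop just summarized (predict, measure the error, nudge the parameters, repeat for each epoch) can be shown in miniature with a single sigmoid neuron. This is a sketch under stated assumptions: it uses plain gradient descent rather than RMSprop, the toy data is invented, and with only one neuron the chain rule behind back propagation collapses to a single step. It also averages the suggested changes over each batch before updating, as described for batch size.

```python
import math


def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(samples, epochs=200, batch_size=2, learning_rate=0.5):
    """Fit one neuron, p = sigmoid(w * x + b), with mini-batch gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):                       # one epoch = one pass over the data
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            grad_w = grad_b = 0.0
            for x, y in batch:
                p = sigmoid(w * x + b)            # forward pass: the prediction
                error = p - y                     # gradient of cross entropy w.r.t. w*x + b
                grad_w += error * x               # how much w contributed to the error
                grad_b += error                   # how much b contributed to the error
            # Apply the average change suggested by the whole batch.
            w -= learning_rate * grad_w / len(batch)
            b -= learning_rate * grad_b / len(batch)
    return w, b


# Invented toy data: negative inputs are class 0, positive inputs are class 1.
toy = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b = train(toy)
accuracy = sum((sigmoid(w * x + b) >= 0.5) == bool(y) for x, y in toy) / len(toy)
print(f"w={w:.3f} b={b:.3f} accuracy={accuracy:.2f}")
```

A real network repeats the same idea across millions of weights, and an optimizer such as RMSprop additionally scales each weight's step size.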

1:36:27 And so the beauty of tools like RMSprop as the optimizer is that they can account for that difference in scale, right, that difference in impact, in order to adjust on a per neuron basis exactly how much to adjust the formula for propagating signal for each neuron. Right? Right. So I push one through. It's a pneumonia. I correctly predict that it's pneumonia. Good. I wanna encourage my model to use that as a thumbs up. Like, whatever you just did was good. Use that a lot. Right? Like, those are

1:37:08 the signals you gotta be looking for in terms of pneumonia. Right? And then it'll go back and make those, you know, changes, make them more ingrained. Right? Now I go forward. I find out that I goofed up. Okay. Well, I'm gonna come backwards and adjust that so that my loss, my my error is smaller. All of machine learning is just reducing your loss. Picking a good finding your data, picking a good, loss algorithm, way to calculate loss, and then trying to make that loss smaller. That's that's the ballgame. Right? Does that make sense? Yes. Yeah. So what we are, observing

1:37:42 here, is highly characteristic of, machine learning exercises, and that is that we have a loss. Let's ignore that for now because it doesn't really mean a whole lot. And And we have an accuracy with the training data, and the the training data is 75% accurate. Not bad. Right? I mean, I don't know anything about X rays. I don't know anything about the the, you know, the the the not abstract algebra, linear algebra that's sitting behind this in terms of, like, you know, how the the the tensor flow is optimizing the tensors, you know, to tensors are just, multidimensional

1:38:22 matrices. You know, how it's doing all of that, I have no freaking clue. But I, myself, as an end user, have now created a model that is not 75% accurate, but 70% accurate. So remember, we split our data into training and testing. And the training data was given to the model to, like, you know, like Rocky, like, okay. I'm gonna learn from this. Right? And then, you know, here's the big competition. You know, it's not your standard sparring partner. These are images you've never seen before. Right? And here, you can see that my accuracy on my validation is

1:38:56 only 70%. Right? And so oftentimes, what you'll see let's actually run this for a a few more epochs, to to make the make the point, even more apparent, and we'll continue on, while this is running. We'll do four. A GPU would have been nice, but I didn't wanna overcomplicate things. What you'll observe is very common. It's known in the if you've ever heard this term, overfitting, which basically says, I am excellent at predicting the four forks that I have seen. You give me one of those forks, I'm a crush that. Like, no way I'm screwing up

1:39:35 those four forks. You bring me a rake. I've never seen it before. As far as I know, it's got the right you know, like, the world is different than what I've been trained on. Right? So it is highly possible. Like, here you can see, like, you know, it is highly possible to get incredibly high accuracy in your training data, but it's so specific to your training data that, you know, I don't know, a fork from somewhere else in the world is not gonna get recognized as a fork even though it's a fork. That make sense? So this is

1:40:09 where we were talking art and science. This is where an an experienced, data scientist will be able to, you know, see very quickly, like, wait a second. You're telling me I got 92% accuracy and I have, like, a dataset of 5,000 images? This is overfit. This is this is basically memorizing this this dataset and not generalizing to X rays of people with with pneumonia or not. Right? Does that make sense? It does make sense. Yes. Okay. Cool. I may actually want to interrupt this because this could take a hot sec, but, it's okay. We can we can

1:40:48 we can do other stuff in the meantime. Alright. So next up, I am going to just copy and paste in some, visualization stuff. This is not, you know, overly exciting. And then once that, runs, the above cell runs, then, we will be able to see some output here. So let me let me queue this up. And then here we go. So the first thing we're gonna visualize is the relationship between, between, the epoch will be our will be our x axis. Then we will observe accuracy and loss and then validation accuracy and validation loss. Right? And

1:41:36 so we'll be able to see what we'll observe almost assuredly is that the you know, look here. Right? Here, I've got validation accuracy of 72%, and I've got, training accuracy of 93%. Right? And what you'll observe is that, you know, it'll start to memorize that training dataset, but in fact, our validation accuracy, how well it works on data it hasn't seen, will in fact, at a certain point start to go down because the model is is not is is overfitting. The model is is too specific to the training set, and it is picking up on things that are in fact

1:42:09 not generalized, qualities of a of a a pneumonia patient, but rather these specific 5,000, you know, people's idiosyncrasies. Is it fair to say then that the dataset is the most important part of these models? Or is Oh, yeah. For sure. For sure. The dataset is everything. In fact, there are new it's it's it's really exciting. I mean, this might if we watch this in five years, we might be you know, it it'd be like trying to encourage someone to program COBOL these days. Like, we may we may look like absolute medieval, you know, dinosaurs. And the reason for that, I mean, there

1:42:50 are products on the market. Apple produced one for their software development kit. I forget what it's called. But, basically, the entire thing is a UI. You know, it looks like any Apple designed product that has two boxes. Drag test data. Yeah. Drag category one here, drag category two images there. And then you don't see anything about it, and it produces a model. The reality is that even though this type of very specific, you know, the tuning, the model structure, how many layers, what kinda convolutions, should it be dense, should it be ReLU,

1:43:35 sigmoid, all these sorts of details are far less important than the dataset, which is to say that you could have way more performant models with a bigger dataset, as opposed to, like, what is the impact of, like, fiddling with your model architecture or, like, the associated stuff. Now, these are still valuable areas of research, but, you know, it is really, really, really all about the dataset. So another one that I saw recently, I think it's called MindsDB, where it's a imagine a SQL database where you say,

1:44:11 okay. I made a table, and here are the the features. So the the parameters that or the the details about the the row, And here's the outcome. Right? So, you know, think of a silly example, like an ecommerce example. Here's, you know, the SKU or the the the the product ID of the first thing this person bought, second, third, fourth, and, you know, here's what they bought fifth. So I wanna be able to predict with four what the fifth one is gonna be. So you just, like, tell it, like, give me another column that is predict five.

1:44:47 Like and it'll it'll just do this all behind the scenes. And in particular, you know, there's in Kubeflow, there's some great automation around this as well, which we'll see in a second, to be able to, evaluate, things programmatically. That's why I've defined things like, you know, epochs, learning rate, and batch size as hyperparameters to to fiddle with that. And I'd it is I really wanna do an a demo of this, but it's possible to to fiddle with your your model architecture too. Right? Like, you could say put in another layer, remove a layer, add a, you know, more neurons, less neurons. Those

1:45:23 could be variables that you could experiment with also. How specific can you scroll back up to the convolution and the the thing? Like, see, there's, 12 lines of code there. Is this the main bulk of our model and how specific is that to detect a pneumonia? Like, if I were to throw a different set of images at it, like, you know, a football and, you know, a golf ball, Is it still gonna be the same pattern? Do I have to tweak what I'm changing in the code? Like, how does that work? Yes. So you would not have to

1:45:53 you could throw any binary image classification, you know, that has three channels and could be, you know, resized reasonably to 150 by 150, into this. There's nothing at all special about it. After all, if I did it, it can't be that special. Right? Like, I literally, like, I'm not a specialist on this. But you would have to retrain the model. Right? So you couldn't take an extant model and use it as is for footballs and I wanted to say, like, footballs and soccer balls, but I think you probably meant

1:46:37 a soccer ball in the first place. So one, footballs and baseballs. And and and you would have to then do you have two options. Either you can just take the raw model and and just train it again. Just basically point this notebook to a different set of directories. Or, you can do what's called transfer learning, which is, in many cases, what we're doing here is nothing more than passing a series of filters over stuff. Right? And then observing the output. So then with transfer learning and this can be very expensive. Right? With transfer learning, what we do is

1:47:11 imagine you just stuck another layer on the bottom, right, that said, yeah. Okay. Start with the model as it exists. Right? Because it's a pretty good model for for two dimensional images. I like it a lot. I don't wanna pay the compute and time and hassle to redo everything associated with them. I might not even have a chance to. It might not be possible for me. I might not have access to the source data. But I have this new dataset, and I and it is a binary image classification. So, I'm gonna have it hold this set of

1:47:44 manipulations constant, and then just fiddle with the new stuff that's underneath. Only change that. That's called transfer learning. Okay. Awesome. Yeah. We've got a question there, which I'll throw up just now. You know, I think our epochs are just about done. Yep. Alvin asks, how did the 512 neurons get matched to the flattened-out pixel array? Does each neuron get assigned some pixels? Ah. Let me show you. So you see here, there's a flatten layer. That's what did it. And yes. So it got flattened out. It basically just, you know, take,

1:48:24 I'm trying to think, like a Lego, basically. Take a block off the top, put it over here, take the next block off, put it next to it, and it flattens it all out into just a bunch of numbers. Those numbers go into a neuron. A neuron is just a formula, like a y equals mx plus b type of thing. The b is your bias. The m is your slope, and then the x is your input value. And then all of the neurons are

1:48:59 connected to all of the other neurons. So based on the backpropagation, some will be elevated in terms of their signal amplification, some will be muted, right? And as a result of that, the model learns what neurons indicate what outcome and to what degree. Nice. Thank you. Mhmm. Do we have some visuals? YouTube video. I'm sorry to go. No, no. There's a great YouTube video that does a fantastic job visually animating all of this process. So anyways, here we go. Let's take a look at some visualization. So we started off
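The flatten-then-neuron explanation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's code: the 4 x 4 x 8 feature-map size, the random weights, and the ReLU activation are all assumptions made for the example. Only the 512-neuron count comes from the question. In a standard fully connected layer, no neuron is handed its own subset of pixels; every neuron receives every flattened value and applies its own weights (the "m") and bias (the "b").

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature maps from the last pooling layer:
# height 4, width 4, 8 filters (sizes chosen only for illustration).
feature_maps = rng.random((4, 4, 8))

# Flatten: lay every value end to end, like Lego blocks set side by side.
flat = feature_maps.reshape(-1)  # shape (128,)

# Dense layer with 512 neurons: every neuron sees every flattened value.
n_neurons = 512
weights = rng.normal(scale=0.05, size=(flat.size, n_neurons))  # the "m" values
biases = np.zeros(n_neurons)                                   # the "b" values

# Each neuron computes y = m . x + b over the whole flattened vector.
pre_activation = flat @ weights + biases

# Assumed ReLU activation: negative signals are muted to zero.
activations = np.maximum(pre_activation, 0.0)

print(flat.shape, weights.shape, activations.shape)
```

Training then adjusts `weights` and `biases` through backpropagation, which is what amplifies some connections and mutes others.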

1:49:37 Live Demo: Visualizing Model Layers

1:49:41 unusually high. The seed obviously updated, as this was our second time running through it. But what you observe is that there is a significant delta, right? And we got very lucky. On this fourth run through, we got a higher accuracy than in any previous. But typically what you see is that, you know, there is a decline over time. And this is the loss, so just the inverse of the accuracy. Not the exact inverse, but, you know, accuracy is success, loss is failure. So you can see that the curves point in the mirror-image way.
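To make the "not the exact inverse" point concrete, here is a small sketch in plain Python. It assumes the loss is binary cross-entropy, a common choice for a two-class problem like this one; the transcript does not name the loss function at this point. The labels and predicted probabilities are made up for illustration. Both sets of predictions classify every example correctly, yet their losses differ, because loss also measures how confident each prediction was.

```python
import math

def binary_cross_entropy(labels, probs):
    """Mean binary cross-entropy; probs are predicted P(label == 1)."""
    eps = 1e-7  # keep log() away from zero
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def accuracy(labels, probs, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum(1 for y, p in zip(labels, probs) if (p >= threshold) == bool(y))
    return hits / len(labels)

# Hypothetical labels (1 = pneumonia, 0 = normal) and two sets of predictions.
labels = [1, 1, 0, 0]
confident = [0.95, 0.90, 0.05, 0.10]
hesitant = [0.55, 0.60, 0.45, 0.40]

for name, probs in (("confident", confident), ("hesitant", hesitant)):
    print(name, accuracy(labels, probs), round(binary_cross_entropy(labels, probs), 4))
```

Accuracy only counts right and wrong answers, while loss keeps shrinking as correct predictions become more confident, which is why the two curves mirror each other without being exact inverses.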

1:50:15 That make sense? It does. Alright, great. So now let's get to a look at what the model is actually seeing. So here we go. These map to our layers, right? So what did we say? We wanted 16 different filters passed over the image. Okay? And here we can see that this is a pneumonia example. Alright. So what we're doing, we're passing a filter over it. And then we're taking the max pool of that and then passing 32 filters over it, if I recall

1:50:52 correctly. So let's see what happens. Here, you can see that the pixel height here is, like, 75 or something, right? Then when we condense it, when we pool it, we're only picking the loudest of all the pixels in that filter. And as a result, our image shrinks. And here you can see that I've got 32 filters running across this, but it's a smaller image. And then I'm going to do that exercise again. And now you can see that the scale is getting smaller. Here's the max pooling result of that layer

1:51:31 above, right? And what you notice, in the same way that the fork mold changed over time by virtue of what you plowed into it, you see sort of the same concept here, where we're observing certain commonalities among these things. It seems as though, for example, if I were to summarize, there's a certain vertical pattern, right, that is associated in these examples. In some cases, they're vertical on the side. In some cases, they're vertical in the center. And in some cases, they're completely blank. That's

1:52:08 also helpful to know. Let's, is this using, I think it's using a random input. Let's just make sure. Yes, random choice. Alright. Let's run this again. Let's see if we can get a normal. Okay. So here's a normal. Okay. So what do we notice here at the very bottom layer? I hope it's big enough for folks to see. But what I see here is that rather than the vertical line standing out quite so much, I'm starting to see more horizontal lines that seem to sort of line up with the ribs, right, with the rib cage.
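The pooling step described in this walkthrough ("only picking the loudest of all the pixels") is why each successive layer's images get smaller. A minimal NumPy sketch follows. The 2 x 2 window is an assumption, since the transcript does not state the pool size, and `max_pool_2x2` is a hypothetical helper written for this example rather than a function from the notebook.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window."""
    h, w = feature_map.shape
    h2, w2 = h // 2, w // 2
    trimmed = feature_map[:h2 * 2, :w2 * 2]      # drop an odd trailing row/column
    windows = trimmed.reshape(h2, 2, w2, 2)      # group pixels into 2x2 blocks
    return windows.max(axis=(1, 3))

# A tiny made-up feature map so the result can be checked by eye.
small = np.array([[1, 3, 2, 0],
                  [4, 2, 1, 1],
                  [0, 0, 5, 6],
                  [1, 2, 7, 8]])
pooled = max_pool_2x2(small)
print(pooled)

# A 150 x 150 map (the input size mentioned earlier) halves in each dimension.
image = np.random.default_rng(0).random((150, 150))
print(max_pool_2x2(image).shape)
```

With this window size a 150-pixel dimension drops to 75, which lines up with the "75 or something" pixel height read off the plot.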

1:52:40 And it's certainly not nearly as pronounced, the vertical lines in here. So this is what the model is seeing, and this is what it's using to make a determination, based on all of the outputs that it sees here, what class it should be in. So, yeah, that's the ballgame. This is actually using the model that we just trained. Let me find it for you. It's in here somewhere. We are actually plugging this into the model, if I could just find the line that is doing it. Oh,

1:53:30 it's here somewhere. It's so zoomed in. Oh, here we go. Successive feature visualization model. So we got the model, and we're asking it to predict x, right? So it's taking an image example, manipulating it, normalizing it, doing everything we did, and then it's plowing it through the model itself. And this is what the model is seeing. So you're really, you know, quite literally seeing what the model is seeing. I know we've kind of been on the air for a long time. I could show you how to turn this

1:54:00 Deploying with Kale

1:54:00 into a pipeline. This has been sort of a data science primer, less so a Kubeflow primer. Up to you. If you have the time, yeah, we can do that next step. Let's see how we move this over. Sure. Yeah. In the words of the late, oh, goodness. Come on. Not now. I'm trying to remember his name. Mitch Hedberg. Anything's in walking distance if you've got the time. Okay. So this is Kale. Kale is an open source tool that basically, instead, normally, the workflow, you know, what you would have to do is you'd

1:54:38 have to take all this code. You'd have to basically take it out of the notebook, move your imports into every cell, make each cell its own function, you know, then use Kubeflow Pipelines domain-specific language, DSL code, to write the pipeline. Kale says, you don't wanna do any of that. We'll give you a button. You click the button. This is what's happening here. So you'll notice I have this handy-dandy little pencil icon. That's an overlay. So I can call this. I can define what's going on. So this is an

1:55:07 import step, right? Okay. Now in a pipeline, I do not need to do any kind of printing of output, right? So we'll skip this. This is my hyperparameter cell, so I'm gonna call this my pipeline parameters, aka I want it to, like, goof around with this, right? I want it to know to adjust this automatically. We'll see how that goes. Again, batch size: how many forks do I wanna flow through the mold before I make changes on my model? Alright. And we're gonna put this down

1:55:43 to one epoch, just for time purposes, to train faster. Alright. So now I come to my first pipeline step. Not a super exciting step, but we'll call it something like preprocess data. And it doesn't depend on anything. You can specify, you know, resource dependencies also, but it doesn't depend on any other pipeline steps because, guess what, it's the first pipeline step. Next up, Matplotlib inline. This is definitely a skip. The pipeline, Kubernetes, doesn't care about seeing a picture, so we're gonna remove these. Yep. Okay. Then we've got, why don't we call this the compile model step?

1:56:23 After all, we are compiling the model, and it depends on preprocessing the data. No great surprise there. Okay. Now we move right along to our training step. So we'll call this train, and it depends on compile. Woo hoo. Alright. And then below, I just wanna make sure I remove all of these and just skip them so that we don't end up with complaints from the scheduler. Alright. Now the last thing I wanna do, because I do wanna use hyperparameter optimization, automatic hyperparameter optimization, is I have to tell the, you know, the hyperparameter

1:57:16 optimization software that comes with Kubeflow needs something to shoot for. So what do we wanna do? Rule number one, we wanna minimize our loss. Cut down on the loss. If we've got loss, it means we're doing something wrong. So we wanna find the version of this story, the version of the model, that has the least loss. So I'm gonna give this a name. I'm gonna call it X-ray. So these are all Kubeflow CRDs, actually, for your ops folks on the line. So an experiment is a CRD. A pipeline

1:57:56 is a CRD. Give it a description, and they have relationships with one another, like an experiment may have many pipelines, that kind of thing. Or a pipeline may have many experiments. I don't know. That's the problem with visual tools. You never know what things are actually called. Alright. So now what I wanna enable is hyperparameter tuning with Katib. So what I will do is define the search space. So it noticed that learning rate was something that I had defined, right? So that was one of the hyperparameters.
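The cell tagging done earlier in this chapter (preprocess data with no dependencies, compile model depending on it, train depending on compile) amounts to declaring a small dependency graph. The sketch below shows that idea with Python's standard `graphlib`. It is an illustration of dependency ordering only, not how Kale or Kubeflow Pipelines is implemented; the underscored step names and the placeholder step functions are hypothetical.

```python
from graphlib import TopologicalSorter

# Step name -> set of steps it depends on, mirroring the notebook tags.
dependencies = {
    "preprocess_data": set(),
    "compile_model": {"preprocess_data"},
    "train": {"compile_model"},
}

# Placeholder step bodies; in the real notebook these are the tagged cells.
def preprocess_data(log):
    log.append("preprocess_data: images loaded and prepared")

def compile_model(log):
    log.append("compile_model: layers defined and compiled")

def train(log):
    log.append("train: model fitted")

steps = {
    "preprocess_data": preprocess_data,
    "compile_model": compile_model,
    "train": train,
}

# A topological sort yields an order in which every dependency runs first.
order = list(TopologicalSorter(dependencies).static_order())

log = []
for name in order:
    steps[name](log)

print(order)
```

Cells tagged as "skip" (the print statements and the Matplotlib cells) simply never appear in a graph like this one.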

1:58:34 Let me close this and take you right back up. Where did I define this? Right up here in the pipeline parameters. Right here, right? So it's gonna automatically pick that up from that cell, and it's gonna be a range. Let's say, on the one hand, 0.001, and max, we'll say, I don't know, 0.1. And let's, you know, let's go crazy. Let's say that we go in steps of 0.01. Alright? So the learning rate says: okay, once you figure out how to change the mold, don't go crazy with

1:59:08 changing the mold. Figure out how much you think you should change the mold and then divide that by a hundred, right, or a thousand in this case, and go incrementally from there, because it is very easy to overshoot the runway, so to speak, on that and end up on the wrong side of a gradient descent. Epochs, we can manipulate this. We're not going to, because we don't wanna change the amount of time it takes to run this thing. We're gonna have a batch size min of eight. Let's call it a max of one

1:59:39 twenty eight, and let's go crazy and say, I don't know, 32 is the step. So let's go that way. We can define the algorithm for how it should pick different hyperparameters, and it's smart enough to sort of look back and see what seems to be working. And then what's our objective? We wanna minimize the validation loss. Alright? So I'm gonna dial this down to 6. We'll have three parallel trials. Again, Kubernetes. This is where we get all of the Kubernetes goodness coming to life. But, you know, thank goodness, someone who's not familiar with

2:00:16 that and doesn't wanna learn it doesn't have to, to do their job. And now what Kale is doing is validating the notebook. The back end I'm using for snapshotting with Kale is Rok. And once this snapshot completes, it's a fairly sizable dataset, so it may take a second. There's quite a few blocks to look at. What it's gonna do is compile the notebook into Kubeflow DSL code for me so that Kubeflow Pipelines knows what to do with it. Then it is going to

2:01:00 upload the pipeline for me, hopefully. Then it's going to create an experiment around that, and then it is going to create a Katib experiment whereby it breaks out each of the different, you'll see it. For each run, it's gonna pick, okay, try batch size this, learning rate that. It'll determine which are the combinations it's gonna run with first. So here we go. It compiled. Oh. Uh-oh. API. Oh, I know. Why don't we give this a different, yeah, let's make it a new experiment. X-ray two, pipeline description, X-ray two. Let me

2:01:45 close this. Let me compile and run. Naming things. What a disaster. Alright. And then we'll be able to observe this running. Does it matter that your new experiment doesn't have a name? Probably. Yeah. I thought I saw it had a name. Alright. Let's, third does. Does it matter that it doesn't have a name? Yes. It matters, David. Would have been very helpful if you'd include me into this. I'm just watching. I'm just here to... Oh, yeah. This is great. Oh, look at this. This has never happened before. I'm so glad it's happening now.
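The search space entered before the compile-and-run step can be written out explicitly. The sketch below enumerates the candidate values from the ranges given in the demo (learning rate from 0.001 to 0.1 in steps of 0.01, batch size from 8 to 128 in steps of 32) and then draws six trials to run three at a time. Several things here are assumptions: reading "dial this down to 6" as the maximum trial count, using plain random sampling as a stand-in for whichever Katib algorithm was selected, and the `grid` helper itself, which is hypothetical. How Katib actually uses the step value depends on the chosen algorithm.

```python
import itertools
import math
import random

def grid(minimum, maximum, step):
    """Values from minimum upward in fixed steps, never exceeding maximum."""
    n_steps = math.floor((maximum - minimum) / step + 1e-9)
    return [round(minimum + i * step, 10) for i in range(n_steps + 1)]

# Ranges as entered in the demo.
learning_rates = grid(0.001, 0.1, 0.01)
batch_sizes = grid(8, 128, 32)

# Every (learning rate, batch size) pair the tuner could choose from.
combos = list(itertools.product(learning_rates, batch_sizes))

# Assumed: 6 trials in total, 3 running in parallel, picked at random.
rng = random.Random(0)
trials = rng.sample(combos, 6)
rounds = [trials[i:i + 3] for i in range(0, len(trials), 3)]

print(batch_sizes)
print(len(combos), "combinations,", len(rounds), "rounds of parallel trials")
```

The batch-size candidates start 8, 40, 72, which are the values that show up for the first three trials later in the demo.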

2:02:30 There's a little jitter here. I don't think it likes being blown up to this extent. I'm not surprised. I mean, it looks great on our side, just FYI. Good. Good. Okay. So what's this gathering suggestions stage? Uh-huh. So this is what I meant before when it has to make a decision, right? We're doing three parallel trials. No trials have run yet, right? So what should I use as my wacko combination of hyperparameters to kick this bad boy off? I have to make a decision on that. Okay. So I'm gonna zoom out just

2:03:08 a scooch here because it's a little aggressive. Here we go. Alright. So what is it doing? So for these three trials, it is gonna vary the batch size from eight to, whatever this is. Let's find out. From eight to 40 to 72. Alright? And each one of these will be its own pipeline. So I can go ahead as a data scientist or data engineer and click on this. And here you can see my pipeline is running. So the first thing it did was the create volume step. I can click on any of these steps and see the logs

2:03:49 for the pod. I can see the pod spec, the pod YAML. You know, one area where I think folks who are looking for ways to contribute, you know, please jump in, is to sort of simplify a lot of the stuff that gets thrown in here. And you'll also note that the way that it's containerized is just powering this all through as a plain text file to the underlying image. It's like, maybe not, like, long term. You know, it might

2:04:29 be something that we wanna, as a community, sort of work on. But we're waiting for the pods to come up. I can actually show you what this looks like from the terminal. So let me do a clear on this. I know it's a little bit smaller. Bear with me. Let me do a watch. kubectl get po is gonna be launching in my namespace. So here you can see these are the pipelines that are waiting to get scheduled. Here are the ones that are running X-ray two. This set of hyperparameters will get its own

2:05:03 sort of identifier. And then here are the ones that are waiting to run. So each of these is a pod. You know, you'll see similarities between the unique identifiers. That's because, you know, for this one, it might be that only one parameter has changed. So for this first batch of three, they're gonna have the same five-character identifier in the middle here. Where's my pipeline, actually? Here we go. You wanna boot up here, bud? Do I have something that's got a little

2:05:40 more action? Experiments, HP tuning, running three. Trials running. Are you, though? Am. Kubernetes lying to you. Maybe, yeah. Welcome to my life. Let's go back and just, like, look at one that actually ran before, which I deleted so that I looked cool. That was a mistake. Alright. Here we go. Here's one that I didn't delete from that UI. Okay. So here's my DAG. I'm gonna simplify it. So, yeah, here's all my steps. So I can click into them. I can see the logs. I can see, you know, all the other stuff. And here is my training exercise.

2:05:59 Viewing Running & Completed Pipelines/Experiments

2:06:24 You can see the same loss and accuracy metrics that we had before. So suffice it to say, the operational benefit of using a tool like Kubeflow, and Kubeflow is the leading open source Kubernetes-native platform for this type of work, is that now, as a data scientist using Kale, rather than delivering a Jupyter Notebook, which is sort of like a jail for my code, right? It needs to come out of there to be useful. Instead, I can deliver a scalable and repeatable pipeline that can then be iterated upon, chained onto, branched off

2:07:07 of to advance the project. And in fact, we could have a serving component at the end of it as well. And, you know, here's an example with a different model. But here, I have a model that's being served just off the end of a pipeline. And I can see the logs for it. I can see the predictions that are coming out as well as the input data that's coming in. And I have metrics. This is all new in 1.3. My colleague, Kimonis, and some of the other

2:07:46 folks in the working group contributed this. So, yeah, this is what greets the eager data scientist looking for adventure. Awesome. Thank you very much. Will I pop us back over to video mode? Yeah. Awesome. Can you still hear me? Can I still hear you, actually? That should be the question, based on earlier. No. I can't hear you. Oh, you're muted. No. You're gone. Is it your end? What do you think? Alright. But I can't hear you. You know what? Okay. That's frustrating, but we are at the end of our session. I just want to say

2:07:59 Conclusion and Wrap-up

2:09:11 thank you very much. I think that was amazing how you managed to take such a complex subject and, you know, you didn't just introduce us to Kubeflow and the new features, but you actually took us on a full journey of data science and how we could apply that to real-world situations like detecting pneumonia in X-rays. I found the whole thing riveting. I've learned so, so much. Thank you very much for that. Kubeflow looks amazing. I hope other people have really enjoyed today's session and will take a look at it too. So, Michael, although you can't say goodbye, you there we

2:09:40 go. We salute. We'll do it perfectly. Thank you very much for joining me today. I will speak to you soon. Have a good one.

Meet the Cast

David Flanagan

@rawkode

Michael Tanenbaum

@tbaums


Additional Resources

Grokking Deep Learning

Kaggle Chest X-Ray Pneumonia Dataset by Paul Timothy Mooney
