Monitoring Kubernetes with Prometheus & Robusta


Overview

About this video

What You'll Learn

Interpret Prometheus alerts with automatic context before deciding whether they are urgent.
Trigger CrashLoopBackOff and OOMKill alerts, then trace each one through Robusta enrichment.
Build custom playbooks that search Stack Overflow and inject action guidance.

Natan Yellin and Eric return to walk through Robusta's playbooks and enrichers, demoing CrashLoopBackOff and OOMKill alerts, custom Stack Overflow enrichment, and the Robusta SaaS platform on top of Prometheus and Kubernetes.


Transcript


Generated from the English captions. Timestamps jump the player to that moment.


2:48 Welcome and Introduction

2:48 Hello, and welcome back to the Rawkode Academy. I am your host, David Flanagan, and today is another episode of Rawkode Live, a show where we take a look at cloud native software to help make your Kubernetes, cloud native, microservices lives a little bit easier. Today, we have a follow-up episode as we take a look at Robusta, and specifically, we're gonna take a look at how Robusta can improve or enhance our monitoring of Kubernetes clusters with Prometheus. Joining us today... excuse me. I am sitting in the freezing cold. It's minus five and the heating is broken in my building.

3:22 I'm just gonna get that out there in case I start shivering. Joining us today are two wonderful guests that we've had before, Natan and Eric. Hey. How's it going? It's going good. Thank you for having us back here. I tried so hard not to mention that it was freezing cold, but I could already feel my teeth chattering as I talk. So I just... Yeah. Like I said before this thing, the things that you do for the Internet. I mean... Guys, if you haven't hit that subscribe button. Yeah. I'll just let you do all the talking today. I appreciate it.

3:56 Alright. For anyone who hasn't seen our last episode or isn't familiar with you both on LinkedIn and Twitter, etcetera, can you both tell us a little bit about you and what you've been up to? Yeah. So what we do is we do runbook automation for Kubernetes, based on Prometheus. And that's a mouthful, so I'm gonna break that down with an analogy. So I wanna start with a nontechnical example that I'll use to explain what this is and why it matters. So let me share my screen. And I'm gonna start off with my favorite website on the Internet, I mean, of course, Reddit.

4:07 The Kubernetes Alert Problem (Why Robusta?)

4:31 And what I have up on here is a subreddit called, what is this thing? And this is a subreddit where people go on and they post pictures of different things, and they say, what is this thing? So, like, someone here posted a picture of this blue disc that they found in the bag of crisps. I think crisps is British for some kind of snack, but... Yeah. It's British for chips. Okay. Yeah. Yeah. Exactly. And they said, what is this thing? And then people go in the comments, and they try and guess what something is. And this is

5:03 a metaphor for something related to Kubernetes, but I'll get there in just a moment. So I wanna take an example that someone posted a few years ago. This is the number three post of all time. And someone posted on here, I found this box in the attic, and, you know, what is this thing? And you look at the box, and it says on here, this is radioactive material, route one or two, whatever that means. No person shall remain within one meter of this container unnecessarily. And, of course, it's, I guess, in The US, so they translate that to three feet.

5:34 If you didn't figure out, you should get the hell away from this. And radioactive units of this thing and so on and so on. And they posted this on Reddit, and everyone on Reddit said, why are you posting this on Reddit? Department of Energy. The Department of Energy came up to their house, and that's like the guys with the spacesuits. So they have on the spacesuits, and they have that thing that goes beep, the Geiger counter, and they're going all over the place. And the person is certain that he's gonna die, and he's gonna be in the hospital,

6:01 or he got, like, a special iodine or whatever. And after this whole big thing and all this excitement, it turned out that it was all a false alarm and that this box just had some old radium paint. And for people who aren't familiar with it, they used to paint watches with this radioactive paint so they would glow in the dark. And it wasn't good, and a lot of people got cancer because they worked in the factories and they would lick it. So you shouldn't go around licking this paint, but, like, if you find it in your

6:31 attic, you're not gonna die. Everything is fine. Nothing happens. And now, I don't know, do you wanna guess where I'm going with this? I'm still wondering what the little blue disc was. But... Oh, the blue disc, but it's too far off topic. So I recommend it afterwards. Yeah. I'm pretty sure we'll spend the next three hours after the stream on this subreddit, just looking through all the most random stuff. But I'll let you get back onto your point. I'm embarrassed to admit how many of these things here are purple.

7:04 So, yeah. So going back to this though, the point here is that this is really representative of something that happens to us all the time in the world of SRE and DevOps and when we're operating stuff in production. What happens all the time is you get some alert in your Slack channel, and it could potentially be really bad. And you wanna understand as fast as possible, is this an alert that is going to take down my production? Is this something I need to look at right now? Or if it's 3AM, can I go back to sleep

7:32 and just deal with this in the morning? And now when you get to bigger companies, then it's also a matter of responsibility. So just like you yourself didn't know by looking at this box whether, like, it was a big deal or not, you had to call that team with the Geiger counters and with the spacesuits. In a big company, you're not always the person who even has the context or the knowledge to look at an alert and understand what to do with it. So what we're trying to do with Robusta is when you see a

7:57 radioactive box like this, we're trying to say, you know, this box isn't so bad. Like, okay, you'll call up someone and dispose of it in a safe manner, but, like, it's okay. You can go back to work and relax. Or we're trying to say, you know, this box is actually really bad. Get out of your house right now. And that's what we call runbook automation. And in the Kubernetes context now, if I jump over to, like, a concrete example, then we're taking Prometheus alerts, like this Prometheus alert that said a pod is crash looping, and we're trying to get you as fast as possible

8:27 to understand what does this alert mean? Does it mean something? Should I stop everything I'm doing and have people investigate what's going on in production right now? Or is this something like that radioactive box that isn't such a big deal and can wait? Okay. I get it now. I understand the Reddit hook. So I have to say, I'm gonna stop thinking about it. And I'm sure there are other people there who won't stop thinking about it until we say. So that blue thing, the blue chip in the bag, they put the they do quality control and

9:00 they, like, run some blue chip that's, that's metallic through. And then they're supposed to identify all the blue chips and take them out of the process. And if a blue chip escapes the process and gets to the end, like something was faulty with the equipment. So the blue chip shouldn't have been there. When the blue chip goes missing, they know when the machines are, like, the production lines has broken down. Uh-huh. There we go. Today, I learned. Now we can stop thinking about that. Okay. Yeah. So David, you've heard a little bit about Robusta before. You've seen it a little

9:32 bit. So does it make sense? Yeah. It does make sense. Yeah. I think, you know, one of the things I think is really important when it comes to operating Kubernetes, you know, something that I do quite a lot is trying to improve the signal to noise ratio, specifically with monitoring, as it is very easy for you to get alert fatigue, for things to be alerting all the time, because you know what, Kubernetes is eventually consistent, nodes go away, pods go away, everything is completely ephemeral to a certain degree and that

10:08 reconciliation process will cause alerts depending on how you have them tweaked. And then something I'm guilty of, I guess, is well, sometimes I just change the parameters of those alerts so that I'm actually not notified as quickly as I should be, but I'm trying to reduce that noise and just get more of the signals. And that's a wrong approach to do it and I'm really hoping that what you're going to show us today is actually going to give me ways for me to improve that just so that I get more sleep most of the time

10:39 I guess would be really nice. If you get me more sleep, I'll be a happy man. And I still have the original Robusta t-shirt. I don't know if you're still giving them away, the kubectl get sleep one, but that thing is great. Yeah. I like my sleep. You got a UI, yes, because I'm so lucky in a little bit but we still have some... You're one of the lucky people with originals. Right. Yeah. So, to just touch on what you said, then, yeah, it's absolutely a goal.

11:08 There's, two challenges. One is always a good set of default alerts. So we're doing some work there, and we're taking existing alerts and trying to put it together into a good collection and to really add on context to, like, what does this alert mean? And then the other part of what we're doing is to really, like, when an alert fires, all the data is there. The problem isn't that the signal isn't there. The problem is that the signal is buried in data. So you don't really need, like, another monitoring system. You don't need more metrics.

11:34 You don't need more alerts. Like, all the data is there. You don't need more logs. You have all the logs. You just need to see the right data at the right time. So part of, like, one way you could describe what we're doing is we're doing monitoring orchestration. So we're orchestrating, like, all your existing monitoring, but putting it together and tying it together in a coherent picture. Alright. Awesome. Yeah. So I'll just say two quick words about the architecture. And then from there, I think we'll just jump into a live demo. And at the end, maybe we'll jump back

11:57 Robusta Architecture & How it Works

12:06 over here, and I can speak a little bit about what's new since last time for people who have seen this before, and what we're going to be doing in the next year. So just to get started, a quick reminder: when you're using Prometheus, then typically you have a setup where you have Prometheus that generates alerts, and that sends them to Alertmanager. And Alertmanager has some extra functionality, like grouping the alerts together and waiting a certain amount of time before it forwards them. And then it sends that by webhook to a destination like Slack or like MS Teams.
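
As a rough illustration of the "normal setup" just described, here is a minimal Python sketch of grouping alerts before forwarding them. The payload loosely follows the shape of an Alertmanager webhook body (a `version`, a `status`, and a list of `alerts` with `labels`), but it is heavily trimmed, and the pod names, the `group_alerts` and `to_message` helpers, and the message format are all invented for this example. It is not how Alertmanager is implemented.

```python
from collections import defaultdict

# Trimmed-down Alertmanager-style webhook payload. Real payloads carry more
# fields; the pod names and label values below are invented for illustration.
payload = {
    "version": "4",
    "status": "firing",
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "KubePodCrashLooping", "namespace": "default", "pod": "crashpod-a"}},
        {"status": "firing",
         "labels": {"alertname": "KubePodCrashLooping", "namespace": "default", "pod": "crashpod-b"}},
        {"status": "firing",
         "labels": {"alertname": "KubePodNotReady", "namespace": "default", "pod": "web-1"}},
    ],
}


def group_alerts(alerts, group_by=("alertname", "namespace")):
    """Bucket alerts by the chosen labels, similar in spirit to Alertmanager grouping."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


def to_message(key, alerts):
    """Render one plain notification per group: a name and a count, no context."""
    pods = sorted(a["labels"].get("pod", "?") for a in alerts)
    return f"[{key[0]}] {len(alerts)} alert(s) in {key[1]}: {', '.join(pods)}"


groups = group_alerts(payload["alerts"])
messages = [to_message(key, alerts) for key, alerts in groups.items()]
for message in messages:
    print(message)
```

The messages say which alert fired and where, but nothing about why, which is the gap the rest of the episode is about.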

12:35 And this is the normal setup. And then you get in Slack, you get alerts like this, which are great, and they sometimes indicate real problems, but they don't have context. So here you can see an example. Here you have a crashing pod, but it doesn't have context on why the pod is crashing. And I like to say, when all you have is a time series database, then everything looks like a time series. So time series databases are excellent for writing rules and identifying when problems occur. They're not good at telling you why those issues

13:04 occurred, though. So what we're doing with Robusta is we're adding on an additional component. We're leaving the first half of this exactly the same. So the alert comes from Prometheus to Alertmanager. And then instead of going directly to Slack, it gets sent to Robusta. And Robusta now has rules that go in to gather data about why that alert fired and what's the context. This is an observability engine here, or runbook engine if you prefer, that knows how to take that alert and correlate it with those Kubernetes objects. So there was an alert on the pod. Go to

13:31 that pod, look at it, pull in the logs. There was a pod that was pending. Go pull in the output of kubectl get events. Show me the reason why it's pending. And we have rules here in this observability engine, so we can go and we can gather all the data, and then we just forward it on to destinations like Slack. Okay. So the runbook engine sits there, it receives all the Alertmanager alerts, and then you have some sort of utility. It just goes and collects logs and other data and then forwards them on to my, so the

14:04 Slack or whatever I wanna send those alerts to. So you're just kind of enriching that alert with the information that people need to actually understand if it is signal or noise and what they have to do about it. Right? Exactly. And the data and the enrichments here, they could be really simple or really complex. So, like, a simple enrichment is: a pod is crashing, just fetch the logs for me. So now in Slack, I get the alert with that pod's logs. And I'm thinking about, like, how a big company works. Like, now the developer

14:31 you can route that directly to the developer. And even if the developer doesn't know that much about Kubernetes, he can immediately see the logs there, and then you can triage it and decide whether it's a development issue that the developer should look at or if it's an infra issue that someone from the infrastructure team should look at. So there's simple stuff like logs, and then there's more complex stuff like, a pod with CPU throttling, go and look at the memory limits, and look at, sorry, go and look at CPU limits, the CPU requests, perform some actual

14:58 complicated analysis on that and tell me why it happens. So sometimes we're just pulling in the right data, and sometimes we're taking a known alert and we're going really deep on that and telling you why it happens. Is this static or dynamic? And, like, I mean by that, have you already defined what the collection sources are when an Alertmanager alert triggers, or is this something that I enrich with my own logic and code? Could I call some arbitrary API? Like, what does this runbook engine look like from my point of view? Oh,

15:27 excellent question. So what it looks like, if you wanna go and you wanna define stuff yourself, then it looks like just a YAML definition. So you have here a YAML file that says, when there's this Prometheus alert for KubePodCrashLooping, go and enrich that with the logs. But most people don't need to do that. So you can go and you can add on your own rules, and a big part of our focus in 2023 is gonna be to make this even easier and to add on better documentation and to make it super easy to add on your own rules.
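
The trigger-plus-actions rule described here can be sketched in Python. This is only a model of the idea, not Robusta's implementation: the `PLAYBOOKS` dict mirrors the general shape of the YAML being discussed (the key names `on_prometheus_alert`, `alert_name`, and `logs_enricher` are assumptions based on what is said in the stream), and `fake_fetch_logs`, `ACTIONS`, and `enrich` are hypothetical helpers written for this example.

```python
# Hypothetical playbook, written as a Python dict that mirrors the shape of
# the YAML described in the stream: one trigger plus a list of actions.
PLAYBOOKS = [
    {
        "trigger": {"on_prometheus_alert": {"alert_name": "KubePodCrashLooping"}},
        "actions": ["logs_enricher"],
    },
]


def fake_fetch_logs(namespace, pod):
    """Stand-in for a real Kubernetes log lookup; returns canned text."""
    return f"(last log lines of {namespace}/{pod})"


def logs_enricher(alert):
    """Attach the crashing pod's logs to the alert."""
    labels = alert["labels"]
    return {"title": "Logs", "body": fake_fetch_logs(labels["namespace"], labels["pod"])}


# Registry mapping action names used in playbooks to Python callables.
ACTIONS = {"logs_enricher": logs_enricher}


def enrich(alert, playbooks=PLAYBOOKS):
    """Return the alert plus the output of every action whose trigger matches."""
    enrichments = []
    for playbook in playbooks:
        wanted = playbook["trigger"]["on_prometheus_alert"]["alert_name"]
        if alert["labels"].get("alertname") == wanted:
            for name in playbook["actions"]:
                enrichments.append(ACTIONS[name](alert))
    return {"alert": alert, "enrichments": enrichments}


crash = {"labels": {"alertname": "KubePodCrashLooping", "namespace": "default", "pod": "crashpod-a"}}
other = {"labels": {"alertname": "KubePodNotReady", "namespace": "default", "pod": "web-1"}}
print(enrich(crash)["enrichments"])
print(enrich(other)["enrichments"])
```

An alert with no matching rule passes through unchanged, which matches the description of enrichment as something layered on top of the existing Prometheus and Alertmanager flow.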

15:58 But out of the box, we define a base set of alerts. If you use kube-prometheus-stack, we define a base set of enrichments for kube-prometheus-stack. And out of the box, we try and give you something that will work for most people, that you won't need to fine tune, you won't need to customize at all. Alright. Cool. And then the last thing I'll say, and then we'll jump right over to a live demo, is we take in more data, not just from Alertmanager. Since the last time that we talked, I think, we

16:29 took over Kubewatch as the official maintainers, and we'll be announcing very soon Kubewatch version two. So this is a utility that was once by before it was before they stopped maintaining it, that tracks all the different Kubernetes changes by listening to the API server. So we take in data from Kubewatch so we can see, like, when an ingress changes, and we can see when pods change, and we can see when jobs fail, we can get all this data. And we take in data from other sources, like from Elasticsearch, you can send data. We can

16:57 take in data from all these sources, not just Alertmanager, and then tie them all together into a coherent picture. But we're mostly gonna focus today on the alerts. Alright. Awesome. Well, I'm excited to play with it. Are we moving over to my screen now? We are. That's it. I'm done talking. That means I have to take my hands out of my pocket, but we'll do it. So I'm sorry. I have a terminal with a Kubernetes cluster running on GKE. I have two tabs here. One with some demos that you've sent me. So if anyone

17:34 else wants to check these out in their own time, you can go to github.com/robusta-dev and that is kubernetes-demos. The other one we've got is just your Robusta. So if you're completely new to Robusta and wanna check it out, go to the Robusta repository or the docs. I guess both would work. So am I... Oh, I should send you to the docs. Well, I was gonna say, like, am I just gonna go to the docs and deploy Robusta to this cluster? Is it covered in the readmes for the demos? What's step one?

18:11 Installing Robusta via Helm

18:11 Okay. Yeah. So I sent you a demo repository in advance, but you should start at, like, the normal Robusta repository. Yep. Or just click the big Get Started button for the docs. Yeah. Let's just go straight to that installation. Alright? So alright. Zoom. Zoom. Zoom. Okay. So we're going to be installing with Helm. Do we need the Robusta CLI? We do. Alright. Do I have pip? Oh, I think I do. I think it's pip3. There we go. There's another option there that you can just use a Docker container if you don't have pip.

18:55 Yeah. I don't run Docker Desktop that often. I actually don't have access to containers really. Everything's in Kubernetes now. Right? Yeah. Yeah. Okay. Then we're going to do a robusta gen-config. Oh, there's a nice copy button. Might as well use that. And let's see what this spits out. Okay. Do I want to configure a Slack integration? Do we need that for today or will I say no? I think it would be best if there's a Slack we can add it to. I think it would be best. Alright. Connect Robusta to Slack. Sure.

19:39 Nope. Too many Slack channels. That's why wrong profile. Is that it? Yep. Yep. That's it. That was a bit too easy. Try. We do our best. But there's always issues. There's always issues in every live demo. So No. I'm definitely saying no to that one. What is the Robusta UI sync? So that's our SaaS platform. I would recommend enabling that for the demo. Me. So do I need to sign up to Robusta SaaS? Yeah. You just put in here. Just put in an email. You don't need us so you don't need us set up anywhere else.

20:37 Just put in a new email. Yeah. That's it. It'll generate it for you. Alright. So Robusta can use Prometheus as an alert source. You can install a preconfigured Prometheus. Yes, please. Sure. Okay. So this has generated a Helm values file. Oh, bunch of keys in there. Yep. Okay. I'll just show everybody them. That's not important. Right? No. It's fine. Alright. So let's add the Robusta repository and run a helm repo update. And then we do an install. And I'll need to change... Do I have a small cluster? No. I don't think so. Alright. I guess this might just take a wee

22:12 minute. Yeah. No worries. So what time is it where you're at? It's 07:30 over here. You? 05:30. Yeah. Yep. Is it cold where you are? Oh, it's all the matter. It's active. I can't say yes. It's still minus five. Okay. Okay. It looks like we're installed. It looks like I can go to platform.robusta.dev for some sort of web UI. And I'm assuming if I run get pods, we'll see some Robusta stuff happening. There. There we go. Cool. Alright. What do you want me to do next? Let's go back to the docs. And let's just wait, I guess, until the pods

23:13 are up and running. Yeah. So that's just the command there. And then we're gonna crash some stuff. Let's break some stuff in your cluster. That I can handle. And that's some fun. You should have also gotten a message on Slack, I believe. It sends a message just to confirm the Slack channel. So we're gonna wanna bring up Slack too. Alright. So we have all the pods and then you want me to deploy this thing to break stuff? Yeah. That's deploy crashing pods. Alright. We have a crashing pod. So does this mean I'm gonna see something

24:00 Demo: CrashLoopBackOff Alert

24:39 in my Slack channel? You should. Yeah. I don't know why that keeps deselecting. You have to wait, we don't send it on the first crash because this is exactly the type of thing that you spoke about. Right? Like, you have these transient issues with Kubernetes, and you don't wanna alert as soon as something happens that, like, a second later is gonna disappear. So we try and always find the right balance. What's the point where this is really an issue that you then want to get notified about, not just a transient issue. Okay. So we have a

25:21 high severity crashing pod. We have an investigate button. It's telling us it's crashed twice. It's in CrashLoopBackOff and it's collected some logs from the application. Quite a lot of logs from the application. I can't believe you put Java in my cluster. You know I have rules about that. Right? Actually, if you wanna, okay. Take a look at what that is. Take a look at the pod that's running. Computer. I'm not a monster. I only put Java in there. Oh, it's BusyBox. Right. Okay. It's a BusyBox command that outputs. Well, according

25:36 Analyzing the CrashLoopBackOff Alert

26:10 to Google, it was the world's longest Java exception. Well, yes. Slack's warning asked me, do I want to expand this thousand line log message. It is experience. Oh, yeah. That's truncated to 68 lines. Alright. We have a crashing pod. We got a nice alert. What's next? Okay. So let's go over now to the demo scenarios repository that I sent you, and let's deploy some more stuff from over there. Cool. So that demo one was just to show the Slack integration that's posting something to my channel. Yep. And now we'll show an example of

26:59 a different error that occurs where we pull in different data. So... Okay. So these demos then are demos where you're going to show off the enrichment process that you covered on the slides and how that gives us more context on what's actually crashing. Right? Yeah. Exactly. And we've tried to pick the demo scenarios so they don't actually mess up your cluster. Like, I'd love to show how, for example, we have an issue with low disk space, and then we pull in, like, a graph of the disk usage on that node. But if I do that, that's

27:28 gonna actually mess up one of your nodes. Now in your case, that's okay, but we try demo scenarios that won't really mess up a cluster. So, I mean, it's mostly stuff where there's an issue with a specific pod. Oh, yeah. This is just a cluster for today. So if you wanna break it to pieces, it is completely fine by me. Okay. So let's get started. If you go down a little bit, then in the readme, we have there something about the OOM kill. That should be the second one. Yep. So let's just run that from the repo

Demo: OOMKill Alert

28:03 that you checked out. Oh, I've not cloned it yet. Who's that in the chat that said I'm finally breaking stuff? Hey, LP. He isn't it? Alright. So let's apply. Oh, it doesn't exist. I'm just doing what I do best. A typo. There we go. I'm assuming this is going to be a pod that eats up a whole bunch of memory. Indeed. And getting OOM killed. So this one is Java then. Right? It's gotta be. So what do you need to see when your pod is OOM killed? What do you usually check? I wanna check the resource constraints to see what

29:15 the the memory limits are set at. And, obviously, if I can get some information from Prometheus, I'd normally like to see it's a tumbling value over the last thirty minutes to see if this was, like, a blip or if it was a gradual growth, something like that. Yeah. That's that's makes sense. Now we're gonna try and, like, listen to people like you and then gather that information upfront and send it to you. Yeah. Well, go over to Slack, see how we did on this one. Alright. I've got too many tabs, I don't wanna put anything. There we go.
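
The question David raises, whether the memory usage before an OOM kill was a blip or gradual growth, can be sketched as a small classifier over recent memory samples. This is a hedged illustration only: the function name `classify_memory_trend`, the thresholds (`spike_fraction`, the 80% rising-steps rule, the 90% of limit rule), and the sample values are all assumptions chosen for the example, not anything Robusta or Prometheus provides.

```python
def classify_memory_trend(samples_mib, limit_mib, spike_fraction=0.5):
    """Roughly classify memory readings (oldest to newest, in MiB) before an OOM kill.

    Returns "spike" if a single step jumped by at least spike_fraction of the
    limit, "gradual growth" if most steps rise and usage ended near the limit,
    and "unclear" otherwise. The thresholds are arbitrary illustration values.
    """
    if len(samples_mib) < 2:
        return "not enough data"
    # Differences between consecutive readings.
    steps = [later - earlier for earlier, later in zip(samples_mib, samples_mib[1:])]
    if max(steps) >= spike_fraction * limit_mib:
        return "spike"
    rising = sum(1 for step in steps if step > 0)
    if rising >= 0.8 * len(steps) and samples_mib[-1] >= 0.9 * limit_mib:
        return "gradual growth"
    return "unclear"


# Invented readings against a 100 MiB limit, e.g. one sample every few minutes.
leak_like = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
burst_like = [20, 21, 20, 22, 95]
print(classify_memory_trend(leak_like, limit_mib=100))
print(classify_memory_trend(burst_like, limit_mib=100))
```

In practice a graph of the same series, next to the configured request and limit, is usually enough for a human to make this call at a glance.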

Analyzing the OOMKill Alert

29:53 Okay. So we got our high severity pod OOM kill, namespace, node name. Okay. So we got the node allocated memory, that's useful right off the bat. The container name is memory eater. Nice. And we can see the request and a limit right in front of us. So, yeah. That's quite a good amount of information upfront. I'm assuming it's been OOM killed because it's at the hundred meg limit, and then we've got some links. Are these... oh, that's a binary. So it's... Oh, it's an SVG. Yeah. But you can open up the PNG version. We're gonna knock out

30:32 those SVGs soon because that doesn't run well. Oh, sweet. So we get that tumbling value. So I can actually see that it's quite a climb. And the node looks okay. The memory chart for your pod and for your node, and then hopefully, that gives you a lot more context on your OOM killed pod. Yeah. I see a line, a scary line. I don't trust this workload anymore. I guess... Now... Sorry, Nico. It's an example that's optimized for demos. But, like, in a real life scenario, then you'd see spikiness, and then you'd see also if

31:12 it's going up over time and then you know there's a memory leak there as well. Yeah. Like, you know, if this was a real production application and then something gets OOM killed, being able to see that memory climb matters, because you always wanna know whether this is just, you know, consistent growth. You know, it's just running over time and maybe we just need to restart that pod every six hours or something, which is our hacky fix, but we've all done it. Right? Because we have some memory leak we've not debugged yet. Or

31:42 it could be that it's just absolute context situation where, oh, maybe we got a spike of 10,000 users in a fifteen minute window and that caused our Redis cache to just completely fill up and fall over. Like, you know, not until you start to connect these dots of all these disparate systems within your cluster that you really understand what is actually happening here. So Exactly. Yeah. So what we're trying to do is make it easier to connect those dots where we can't connect them automatically. And where you have dots that you know makes sense to connect and to also make

32:11 it easy for you to connect them and even easy for you to say, okay. Well, next time this happens, like, okay. Go and restart this pod after five hours before it reaches the six hour mark when it's at 80% mark. So these attachments on like, I I I don't know how this is configured, this trigger and the actions and stuff like that. But can I add new PNGs with different queries against the Prometheus? Is that something that's an option? Yeah. Exactly. You can indeed. Do you wanna go over to the GitHub, the main GitHub for Robusta

32:40 Customizing Robusta with Enrichers & Actions

32:44 and check out the Helm values file? Yes. There we go. Yeah. And now over here, that's just going to the Helm folder and then right into the default values file. And look in here for. This one? Yep. And that's just how it's configured. Okay. What are these special terms in the Robusta? Yeah. So we have here three different enrichers. One of them is pulling in that data that you saw, the pod OOM killer enricher. The other one is a little bit more generic, it's creating that graph, but it's optimized specifically for the scenario so it

33:45 knows to add on the limits. And then the third one is totally generic. It just says take this pod, go with whatever node it's running on, and grab their graph of the memory from that node that this pod is running on. Nice. Okay. Cool. So I can add more of these. What kind of enrichers are available then to people? Is there a list somewhere? Yep. Yep. Yeah. But this documentation, and this is the main area, by the way, like, that we're really gonna focus on in 2023 and making this way easier to navigate and way easier to use. So this is

34:19 gonna be a big focus of ours. If you look on the left hand side of the page, in the menu, there's something there that says actions. Got it. And we have here all these different actions. Nice. And node bash enricher. I could have some fun with that one. Yeah. I'm assuming I could just write a bash script inline in the YAML and then does this run it or... Yeah. Click on the example config. See, that's awesome. Right? Like, I used to be an SRE and I

34:58 used to give talks about how terrible I was at being an SRE. Right? Because I used to get alerts at 03:00 in the morning and we had a a six month old at this time. Right? So my wife was already hating the fact that she'd only been asleep for ten minutes and my alerts gone off crazy. So I would always just, like, grab my laptop and it would be like a disk alert and say, oh, you're at 96% or whatever. And I would just go on and delete the log directory on a Linux machine. Like,

35:21 because I just wanted to go back to sleep, but I never actually spent enough time to work it out and do it properly. But if I had the ability to just have a bash command and, like, run a du -h against the directories, do a depth on it, and kind of analyze where the actual consumption is coming from and identify the process. Like, I could chain a bunch of these bash commands together and I would have all the information I need to actually fix it once and for all instead of

35:44 just taking the lazy approach. So but, yeah, that's really cool. I'm glad that's the first one I've seen. Yeah. Exactly. And I see there's a comment here in the chat. Wind is asking, does it create just Prometheus rules objects in the cluster? So, yeah, the answer is, yes and no. We do create Prometheus rules in the cluster if you take the default alerts that we provide. And if you already have Alertmanager or you already have Prometheus, then you can disable that. You can just say, use my existing one. Don't create any rules for me. And then the second part of

36:16 this is we're not just creating those Prometheus rules. We're taking those, like, the Prometheus alerts, and then we're tying them together to, like, the different context gathering in the runbook automation. So, essentially, we're doing two things. We're installing default alerts if you don't already have them, and that does create Prometheus rules. And then the second thing that we always do is we take those alerts, and we now pull in the right context. And for that, you can't do that with Prometheus alone because it's not just based on time series databases. It's, like, not just based on looking at

36:46 time series. It's saying, okay. Here's an alert. Look at the labels here. It has this label saying that it's running on this node. Go run this bash command on that node and so on. Nice. So a big part of what we do is not only building the platform that enables you to do that, but we also want to collect the community knowledge. So if you have some insights about some specific alert and you know, as someone who is specialized in that, what exactly you need to do, you can actually contribute your own playbook and even your own alert,

37:21 and then everyone else can enjoy the benefit of it. It's like you said at the beginning. Right? Like, you have all the data there, and you're trying to find the signal in the noise. So what we're saying is, okay. If you're running on Kubernetes, we all actually have very similar alerts anyway because we're all on these different scenarios. I mean, there's, like, known issues with the nodes, then we're all alerting on the same common scenarios. So why not also share the investigation for that? Why not share that as community knowledge? Nice.
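
The label-driven context gathering described above can be sketched in a few lines of Python. This is only an illustration, not Robusta's implementation: the `Alert` class, the `plan_context` helper, the alert name, and the `/var` path are all hypothetical, and the sketch only plans the steps rather than running anything on a cluster.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """A fired alert, reduced to its name and Prometheus-style labels (hypothetical type)."""
    name: str
    labels: dict


def plan_context(alert: Alert) -> list:
    """Return the context-gathering steps suggested by the alert's labels."""
    steps = []
    node = alert.labels.get("node")
    pod = alert.labels.get("pod")
    namespace = alert.labels.get("namespace", "default")
    if node:
        # A node label tells us where a disk-usage command would be worth running.
        steps.append(f"on node {node}: du -h --max-depth=1 /var")
    if pod:
        # A pod label tells us whose logs are relevant.
        steps.append(f"fetch logs for pod {namespace}/{pod}")
    return steps


# Example alert name and labels are made up for illustration.
alert = Alert("NodeDiskPressure", {"node": "worker-1"})
for step in plan_context(alert):
    print(step)
```

The point of the sketch is that the decision comes from the labels on the alert, not from the time series itself, which is why this step sits outside Prometheus.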

37:58 Alright. Let me scan what other ones we have here. Oh, the node graph enricher. Everybody loves a good graph. Right? Pod bash. Does this one execute a command inside the pod? It does. So you would put that if you have an alert that fires on a specific pod, then that would execute on the pod whenever it fires. Yeah. And send you the results to Slack or whatever you define. Okay. Alright. There's quite a lot of choice here. I'm not gonna scroll through all of these, but there's definitely a lot of flexibility in the enrichers that we can add. And

38:41 was that just the event ones? There's, like, remediation stuff. And Yep. So when this sort of fires, we're going to be the hyphen troubleshooting. When this happen. Alright. There's loads. Alright. So we'll take a look at another example? Yep. Let's go back for a second to the demo repository. I think we'll look at one last example that will be good to demonstrate the concept. And I wanna take here, maybe go down a little bit. Just a little bit more. Yeah. I wanna take here an example with pending pods. So skip the next one. The CPU throttling,

39:30 we can skip. Just like the, you know, pull in the relevant data. And, yeah, let's take the pending pods one. I'll clean up after myself. I think it's gone. Now the pending pods is an interesting one because what we're trying to do is we're trying to give everyone the best out of the box experience, and that means we have to make decisions about what we alert on. Right? Because let's say you have a normal healthy cluster and you have an autoscaler defined. A pending pod isn't necessarily a big deal

40:11 because the autoscaler will fix it. You spin up a new node. It takes a few minutes. And then once you spin up that new node, now everything is fine. Right? Like, the pod won't be pending anymore. So for pending pods, you actually don't wanna alert on that right away. So by default, out of the box, we don't send you an alert right when the pending pod happens. We wait fifteen minutes, and then we send it if that hasn't been resolved, and that way we give the autoscaler a chance to fix it. But if you have pods that are pending, you

40:40 still wanna get some form of visibility into that. So that also ties into what we're doing in the SaaS platform, where here's a scenario where we don't wanna alert you on it. But if you're going and you're checking, you're saying, okay. Well, I just deployed what's going on. We still wanna give you that visibility even before it becomes an issue. Okay. Does that mean we're sitting here for fifteen minutes? Because No. I was thinking we could open up I I was thinking we could open up this SaaS platform, and then we'll also tie that into some other stuff

41:03 Exploring the Robusta SaaS Platform

41:07 that we didn't show off, that's in the open source as well. Alright. How does one get to the Robusta SaaS platform? I guess the login button. Yep. Now because I just typed in my email address and I don't have an account, if I click Google auth, is it gonna work? Yep. I think I will need to move this to the right profile, though. You're gonna hack me? I don't really get access to do anything meaningful. No. It's just OAuth. Yeah. Alright. It was just a fairly random

41:59 subdomain string on that Supabase thing there. Yeah. Yeah. Alright. So what do we have here? I don't think I've seen the SaaS platform before. Did we look at this last time? I don't know. And if we did, it looked totally different. Alright. Why don't you give us the guided tour? Okay. So let's click, let's say, what's a good place to take? Let's just take any one of these. Take GKE metrics that you check. Yep. And now you can see here there was a transient issue, but it fixed itself. There was some

42:21 SaaS Platform Features & Timeline View

42:40 issue with the mount that self resolved, so that's just debug priority under the priority there. We're saying this wasn't important. It happened. You might wanna know about it, but it wasn't important. And now if you jump over to the pad screen for a second, then one thing that we can do is we can let you pull in different data here from what's running in your cluster. But for security reasons, we want you to generate a token that's specific to your browser, so to give it, like, access to do that. I think you need just a step three.

43:19 Yeah. Yeah. I've already used my context. Right? So I don't need to worry about that. I can just do this. Okay. So what happened? Let me check. Okay. So we generated a token so that Robusta will have access to pull in a little bit of data, very limited data type pulls from Robusta itself, about what's going on in your cluster. And that will then pull into the SaaS platform some graphs, like, about different metrics, CPU usage, and so on, and to get logs. Okay. So the Robusta that we installed to the cluster, does that have any permissions to

44:14 the rest of stuff? Does it speak to the SaaS platform? Are they completely independent? Like, what's the relationship between those two components? Yeah. So it depends on whether you enable it or don't enable it. So if you start off and you check off at the beginning that you don't wanna enable the SaaS platform, then nothing gets sent. And if you check off that you do wanna enable the SaaS platform, then we're sending just metadata there, like, alerts, crashing pods, so data about what's happening, and it's all with a push model. Now when you go into the UI, though,

44:44 you might wanna see data there. Like, you might wanna see data. Okay. Where is the pod log right now? In order to enable that, then you have to explicitly give permission there. Okay. And is this a short lived token, or is this something I should be cleaning up after a certain amount of time? Does it have an expiration? This is a token that you can remove from the UI itself afterwards. It's not a Kubernetes token or anything. So it doesn't have any Kubernetes meaning. It just has meaning in Robusta itself to authorize your computer. Okay. Got it. Cool.

45:17 So now we get some information about the the pod itself. Right? So we can see the CPU and memory consumption, and we can see that there's more than one of them. Mhmm. Yep. And if you wanted, then you could pull in from here the logs, for example. You can see different data. And one thing that's cool about this is we're doing this on all your different clusters. So even if this was a microservice that was running on 300 different clusters, you would still get a single pane of glass to see all those different clusters in one single place.

45:54 Alright. But let's go back to that transient issue we were speaking about. We were saying, okay. How sometimes this issue that occurs, that's a transient issue. So just click here on the timeline page on the left. Am I on it? Oh, no. That's over. That's fine. Okay. Yep. And then you can see here we can go in here, and we can see some of these transient issues. So I'm gonna want you to can you go first just zoom in a little bit with your scroll bar on the right hand side of this graph? If you use your scroll bar on the

46:31 right hand on the graph itself, put your mouse over the graph. Yeah. And then you can zoom in. Yeah. You can drag it as well. Zoomed in a little bit too much. Oh, never trust me with a mouse. This is what I'm learning today. Alright. Hold on. And the resume and there we go. That's what you wanted? Yeah. So see that icon on the top row? Mhmm. Yep. So if you click there, then that's an example. Yeah. Just click on that. Oh, I just wanna make sure I understand what I'm seeing here. So this these are the different types of things, events

47:15 within my Kubernetes cluster. This is a timeline of when these events happen. I don't understand what the colors are yet, but that's okay. And this change one is not like the others. Why is that? Because this is a special event. Has that got some significance? This comes from the API server. So we're tracking a list of all the different changes that happen. What I would do is go in and run kubectl edit on a deployment or a pod or something in your cluster, even that crash pod. And we're using the UI here to show this, but you could do this even without

47:50 the UI. Like, what we're doing here when we're tracking all these changes, this is based on Kubewatch. And all the functionality here, it's important to emphasize, it's also in the open source. So the UI is just like a thin layer on top. You can change the, like, the time range for one hour. It will be, like yeah. Yeah. Up they'll be above that. I'm just gonna start playing with all the settings now. Mhmm. Yep. Maybe hit the refresh button there. Have I broke it? I may not get the label, the metadata change. Maybe that was just a silly

48:49 change. I'll delete the pod. No. I want do a change. The deletion will definitely pick up on, but do a change. Maybe go back to the change. Yeah. Do edit pods or edit deployment. Do you change the pod or the deployment? I changed the labels on the pod. So try changing the deployment. We might if the pod is part of its deployment, we might only look at the deployment. One moment. So let's change the replicas. Right? This is probably a change I'd wanna be able to visualize. Alright. And I'll give that one moment. Well, what can you do about the demo

49:39 gods? What can you do? Me. Can you check what's in your Robusta runner logs? Yeah. Sure. Do deployment/robusta-runner there. Yeah. Oh. No. I think that's okay. Yeah. I don't think it's that. It should appear in the UI, I guess, in a few seconds. Go back to the UI and see if it appeared yet. Right. So I'll look into that a little bit afterward. Let me go back, I mean, go back to the home page over here. Yep. And you added a crash pod. Right? Mhmm. Yeah. I did. Yes. So if you click here,

50:42 I'll have to take a look later on and see what's going on there. Well, it's noticed that it's not scaled up. Yep. That is true. I'll need to check and see what's going on with the change over there. But let's go back for a second now. Just go back to the main app screen. So just go on the left over to the app screen. Yep. Now remember we put here this example pod that was gonna go into pending state? So just open this up. And if you go down here and you look at the warning, then you can see

51:21 here, okay. It failed scheduling, and we're pulling in here the reasons why it failed scheduling. So you can see here that it failed to be scheduled because three nodes didn't match the pod's node selector. Uh-huh. And what we're doing here is we're pulling in the events to show you the data on what's going on. Even if this is a transient issue, we're still showing you the data on why this is happening. Nice. Quite an issue. Alright? Thank you. I'm gonna have our UI/UX designer also tomorrow, and then she'll go over the stuff

52:04 and then she'll pick pick out the things where you didn't click on the things she thought you would click on, and we'll do improvements on that. So that's always useful. Yeah. I think my Zoom is causing problems, but yeah. I like this kind of level of visibility into everything. I like this timeline view. I think that's very clever. And then, yeah, understanding why the pod isn't scheduling. Like, is this something that you expect to, like, hypothetical question here because I don't know if it does this just now. I don't know if it's going to do this in the

52:35 future, etcetera. But I wonder if a pod fails to schedule and it's because, you know, the CPU or the memory request can't be satisfied by a member in the cluster, then we could probably expect that pending status, which kicks off the cluster autoscaler, will eventually work. But this one was an affinity selector. Like, does Robusta know that there are no nodes with those matching criteria? And can it handle that differently? Is that what it's doing already? Yeah. So we don't have that resolution yet today, but what we've done is we've built

53:12 out, like, the thing with which we can express that resolution. And a big part of what we're doing in 2023 is to, like, increase our coverage on each of those fine edge cases. Yeah. Because, you know, especially with node topologies and different architectures and GPUs now, there's probably a lot of things that you can just make assumptions about. Like if there are no GPUs in the cluster, maybe there's never going to be any GPUs in the cluster. And if there's no nodes in EU East 1, maybe it's because we never deployed EU East 1. There's almost

53:43 just copy and pasted some crappy YAML from the Internet into their cluster. Like, these are probably things you could surface relatively quickly for people, and it doesn't require a lot of knowledge of their cluster. You can make certain assumptions, I guess, based on the the complete lack of a certain criteria. Yeah. I think this is a really interesting space where you can deliver a lot of real value to people with some very simple kind of checks like that. That that's cool. I really like this. Yeah. No. It's kinda like open sourcing the operations. It's like taking how you work,

54:17 forcing, like, the way that we actually think and we handle this. Right? Like, ultimately, this is just a rule in the repository that's written in YAML. And we're, like, taking the knowledge out of your head for how you would handle this error and now turning it into code. It's almost like infrastructure as code said, okay. You shouldn't set up servers ad hoc. And what we're doing is we're doing incident response as code and saying that you shouldn't respond to incidents ad hoc either. Okay. Nice. Alright. We still got a whole bunch of tabs on the Robusta SaaS. Do you wanna

54:48 dive into any of them? I see we've got jobs, nodes, comparison, health, silences. So is there anything here you would like to show off? Arik, anything you wanna emphasize? No. We're constantly adding and adding. So the health view is just to give you, if you have a lot of clusters, a single overview of everything over there. The silences, when there's a Prometheus alert, then we add a silence button in Slack. So with one click, you can silence alerts directly from Slack, and then that shows up in the silences screen. So tell me a couple more things then about

54:56 Robusta SaaS Free Tier & Pricing

55:27 the SaaS platform. For people that wanna get started and start using it, is there a free tier? How do they get started? Is it as simple as what we've shown today? Maybe you can give them a little bit more detail. Yeah. Sure. I mean, so a very extensive free tier. I mean, we currently have our free tier up to 50 nodes. I mean, so that's 50. Five zero. Yeah. You've seen them in shock. I mean, I just yeah. It's not a lot of nodes. You know, obviously, the clusters get much, much bigger than this. But I think it's

56:02 rare, or maybe it's just the people that are calling me asking for help, is that their clusters are normally in that kind of eight to 12, maybe 20 and absolute, personally. That is a very generous free tier. Yeah. We know that. Yep. We know. We're not guaranteeing that we will forever have 50 nodes on there. Although if someone signs on in advance and they're already on, like, Robusta, then we do try and, like, at least be thoughtful. Like, if we ever change the tier, we will try and be thoughtful to people who were already there beforehand.

56:33 And is there a limit on how many clusters of up to 50 nodes I can have on my account? No. We don't wanna force you in that regards. You can deploy however you want. You want, like, to split everything up into clusters. That's fine. You want everything on one mega cluster. That's your choice. That shouldn't affect like, our software shouldn't affect how you choose to deploy. So I could have 10 clusters with 49 nodes each and that's still free tier, or have I misunderstood? Is it 50 nodes total? 50 nodes total. Ah, okay. Yeah. Okay. That

57:02 makes a lot more sense. Alright. Yeah. I just assumed I could have infinite clusters with 50 nodes, but that was just me. I mean, you could technically create a different account for each one, like, we probably wouldn't go after you, but I think the value add from what I'm seeing so far is this multi cluster pane of glass. Like, you know, this UI with apps and timeline and jobs, we've got all the debugging information there. We've got alerts. Like, you know, multi cluster is definitely a growing popular way of running Kubernetes, lots of smaller

57:33 clusters. And that single pane of glass is still a kind of missing piece of the puzzle so far. Especially as it comes to actually being able to apply runbooks or logic or conditional statements to those clusters rather than just have a graph on Grafana that can filter by the cluster. Yeah. Exactly. I think where you're going with this, I like the direction, and I'm very excited to see this. So yeah. There's a question too, I think, also about that you can also route alerts with Grafana. So the big part of what we're doing

58:04 here is to not just route the alerts to the right destination, but to pull in the right context. Like, if I go back to that example with the radioactive box that I started off with, then, like, you wanna find out that there's a radioactive box there, but you also wanna find out what that box means and to see the context there as fast as possible. So our goal is to pull in the right data and show you all the right data at the right time, not just to do the alert routing, which you can obviously do on your own as well.

58:32 Cool. Alright. If anyone else has any more questions, feel free to drop them into the comment section, and we'll pass them on before we finish for today. Alright. So as far as what we've looked at today, we've got all these options for enrichers. If I wanna start going rogue on this and do my own thing, what's the best place to get examples? In the docs. So if you go to the docs, then we have examples there on configuring the enrichers, and we have examples on writing your own enrichers too. So for example, you could do

58:56 Creating Custom Playbooks (Stack Overflow Example)

59:09 an enricher that takes an alert, and it asks GPT, sorry, ChatGPT what you should do about that alert and then sends it to a Slack channel or to Discord. So you could do something like that if you wanted. Have you already done that demo? I haven't. I wish I had. If someone does that demo, though, we will send you swag to your house. We will send you swag and special swag. I will send you a custom swag that no one else in the world has ever gotten if you ask ChatGPT.

59:43 I think that would be generally awesome. Like, I think we've all seen enough of these ChatGPT things on Twitter now where people are filling it up with a bunch of context and then asking a question based on that new information. And you could totally do that with Robusta. Like, here's the logs for my application and here are the scheduling problems for this pod. Like, what is the remediation path here? Thanks. And it would probably tell you, oh, you need to go and spin up a node with a GPU on it or you need to tweak your requests and limits, etcetera. Like

1:00:11 We actually have an enrichment that runs a search on Stack Overflow. So this will be an enhanced version of that. Do you wanna try and configure one that runs the search on Stack Overflow? Does that work? I think so. I mean, let's go to actions. Uh-huh. And I'll just say in the meantime, I think it's under event enrichment. Either event enrichment or miscellaneous. Just look here for Stack Overflow. Yep. Stack Overflow enricher. And in the meantime, I'm just gonna say to Wind, excellent questions. So thank you very much. And, also, we're on Slack, and we

1:00:53 have a community. So if you have any questions, always reach out. Okay. Now take here the example config. And yep. And so you're just gonna copy this. Copy the whole thing. Oh, the whole thing. And now we're gonna add this to your Helm values. So you have to now change the Helm values, and you should already have the generated values. And we just add on a new value here called customPlaybooks. I think yep. C u s t o m. Yep. Playbooks. Yep. And just pop this in here, but just make that, like, a member of an

1:01:35 array. Custom playbooks is an array. Yeah. Okay. And then you have to indent all of this. With that. Right? No. The triggers, yeah, should be the triggers should be backspaced. No. The triggers Yeah. Backspaced. Two backspaces on that line and the line after it. Yeah. On triggers and on. Yeah. That's it. And Oh, yeah. Actions and then trigger. Alright. Okay. Gotcha. Gotcha. See, YAML's hard. Yeah. Yeah. So just save that to the helm. Upgrade. Okay. In fact, I still have my helm command here. And I'll need to change this. Oh, no. Don't do it again.
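
The shape being fixed on screen here (a `customPlaybooks` array whose members each hold `triggers` and `actions` lists) can be sketched as a Python structure. This is a rough illustration only: the trigger name `on_prometheus_alert` and the action name `stack_overflow_enricher` are assumptions based on the conversation, not copied from the Robusta docs, and `check_playbooks` is a hypothetical helper.

```python
import json

# Hypothetical Helm values fragment; trigger and action names are assumptions.
values = {
    "customPlaybooks": [
        {
            # "triggers" and "actions" sit at the same indentation level,
            # as siblings inside one array member.
            "triggers": [{"on_prometheus_alert": {}}],
            "actions": [{"stack_overflow_enricher": {}}],
        }
    ]
}


def check_playbooks(values: dict) -> bool:
    """Return True if customPlaybooks is a list of {triggers: [...], actions: [...]}."""
    playbooks = values.get("customPlaybooks")
    if not isinstance(playbooks, list):
        return False
    return all(
        isinstance(p, dict)
        and isinstance(p.get("triggers"), list)
        and isinstance(p.get("actions"), list)
        for p in playbooks
    )


print(check_playbooks(values))
# JSON output is used here only to show the nesting; the real values file is YAML.
print(json.dumps(values, indent=2))
```

The indentation trouble in the demo comes down to this nesting: the playbook is one element of the array, and its two keys must line up with each other.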

1:02:19 helm upgrade --install. Is that gonna be enough? Yep. Perfect. So now when we cause some sort of failure scenario, it does a search on Stack Overflow with some of the information and then tries to get me a response? Something like that. Yeah. We'll see it in action. I mean, I'm hoping you don't just copy the commands from Stack Overflow and paste them into my cluster. Although that would be fine. I'm not gonna do that. This is actually one of the first enrichers I wrote to kind of play around

1:02:59 with the concept myself and mostly as a toy, like, not as a real production use case, but it's interesting to show the concept. Well, let's go to Robusta. So where's the code for the enrichers? So go into playbooks, and then we have here there should be one here called Stack Overflow. Let's see. Maybe so try Search? Yeah. Try doing a search on Stack Overflow. Yep. There we go. There's integration. So 9262. Okay. So I can add ChatGPT in 25 lines of code, hopefully. Exactly. Yeah. Because I actually have played with the ChatGPT API

1:04:08 and that is just a regular request you're making to an endpoint with some context. I mean, we could totally make that work. Not that I'm suggesting we try and do it right now, but I think this demo needs to exist and has to exist soon. So If you do this, our offer stands. We will send you some incredible customized one off swag that no one else in the world has. Consider it done. Only that. Well, the text on the shirt, we will generate with ChatGPT. I will ask ChatGPT to give me the appropriate quote for the shirt to send you.
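
As a rough idea of how small such an enricher's lookup can be, the sketch below builds a Stack Overflow search URL from an alert name. It assumes the public Stack Exchange API's `search/advanced` endpoint and its `q`, `order`, `sort`, `site`, and `pagesize` parameters; Robusta's own enricher may do this differently, and `build_search_url` is a hypothetical helper. No network call is made.

```python
from urllib.parse import urlencode

# Assumed public Stack Exchange API endpoint (not taken from the Robusta source).
API_URL = "https://api.stackexchange.com/2.3/search/advanced"


def build_search_url(alert_name: str, max_results: int = 3) -> str:
    """Build a search URL that looks up an alert name on Stack Overflow."""
    params = {
        "order": "desc",
        "sort": "relevance",
        "q": alert_name,
        "site": "stackoverflow",
        "pagesize": max_results,
    }
    return f"{API_URL}?{urlencode(params)}"


# The alert name here is just an example input.
print(build_search_url("KubePodCrashLooping"))
```

Swapping the search for a language-model call, as discussed in the stream, would change only this request-building step; the alert context fed into it stays the same.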

1:04:43 Alright. I mean, we could just have ChatGPT take over the stream as well. I'm pretty sure. Just write a script to show off a Robusta demo and improving your Prometheus monitoring and alerting. Yep. We have deployed the Stack Overflow thing, I think. So how do we check? We have a utility in the CLI that generates a demo alert. So you run robusta demo-alert. So it takes a random pod from the cluster and creates an Alertmanager alert on that. And then within a few minutes, you should get a new message on

1:04:56 Demo: Stack Overflow Enrichment

1:05:29 Slack. That, I believe, arrived yeah. That arrived beforehand. That's not from right now. So just give it a moment. You have Alertmanager silences. So, sorry, an Alertmanager wait period, so it takes a moment. Yeah. This one's from six minutes ago, so we just need to be patient. Yep. You're all the way scrolled down there. Right? Yeah. Okay. Cool. Cool. Cool. Mhmm. Just a weird Yeah. Yeah. So while we're waiting, though, I'll say, okay. Here's an alert. So see how there's that silence button at the top of the alert?

1:06:13 So I used to get a lot of alerts in Slack, and then sometimes I wanna silence them, but it was, like, annoying to do, or I had multiple clusters that I'd have to go and silence them on. So sometimes I didn't really do what I should have, and sometimes I didn't silence them, and I would just get that same alert every six hours. So we've tried to make it really easy to silence. And even if you have, like, 10 different clusters, each with their own Prometheus and Alertmanager, then you can push out with one click that silence to all the different clusters.
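
The one-click, multi-cluster silence can be pictured as one payload fanned out to several Alertmanager instances. The sketch below assumes the Alertmanager v2 silences endpoint (`/api/v2/silences`) and its `matchers`, `startsAt`, `endsAt`, `createdBy`, and `comment` fields; the helper names and URLs are hypothetical, and it only prepares the requests rather than sending them. It is not how Robusta itself implements the button.

```python
from datetime import datetime, timedelta, timezone


def build_silence(alert_name, labels, hours, author, now=None):
    """Build a silence payload matching one alert name plus extra labels."""
    now = now or datetime.now(timezone.utc)
    matchers = [{"name": "alertname", "value": alert_name, "isRegex": False}]
    matchers += [
        {"name": key, "value": value, "isRegex": False}
        for key, value in sorted(labels.items())
    ]
    return {
        "matchers": matchers,
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": author,
        "comment": f"Silenced from chat for {hours}h",
    }


def fan_out(silence, alertmanager_urls):
    """Pair the same silence with each cluster's silences endpoint."""
    return [
        (f"{url.rstrip('/')}/api/v2/silences", silence)
        for url in alertmanager_urls
    ]


# Hypothetical per-cluster Alertmanager addresses.
silence = build_silence("KubePodCrashLooping", {"namespace": "default"}, 6, "david")
for endpoint, body in fan_out(silence, ["http://am-cluster-1:9093", "http://am-cluster-2:9093"]):
    print(endpoint, body["endsAt"])
```

Prefilling the matchers from the alert's own labels is what makes the silence a single click instead of a form to fill in per cluster.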

1:06:40 And we prefill all the boxes here. It's really, like, trying to encourage you to silence something if it's noisy. You need to put Mhmm. Oh, no. But if you silence this, we're gonna silence the alert that we have to simulate. The same what? Yeah. And I've heard that we will have a hundred GitHub stars at some point during this stream. A thousand? Sorry. A thousand GitHub stars. At some point during the stream, we're currently on 999. So if you're the lucky person who gets

1:07:19 this thousandth star, then send it to us, and we will send you a T shirt. Oh, I've already starred it. Correct. Whoever gets the thousandth star, just screenshot it and post it on Twitter afterwards. Tag me, tag David, and we will send you a T shirt. Alright. Someone get out there. Win that T shirt. Let's see. We still don't have our Stack Overflow alert just yet. Okay. Let's check the logs. And someone just got a T shirt. I don't know who, but someone just got the T shirt, so congrats. Hey. Nice. Okay. So we think there's no problem?

1:08:05 I don't know. Let's jump over. Let's check the Robusta logs. From the runner? Yep. Yep. Congratulations, JS. You got that T shirt. Alright. Well done. Yeah. I don't see any issue here. Maybe we just need to wait a couple more minutes until Alertmanager will fire it. There's default Alertmanager silences. Mhmm. So it's not silences. It's wait periods. So when you send in the alert, depending on your configuration, it actually takes some time for the alert to arrive. Let me see. I'm trying to think if there's a different way we could simulate it faster.

1:08:43 Yeah. Maybe we have an alert simulation playbook you can take. Is that not what the demo alert does? No. It creates an alert on Alertmanager. Oh, that's why we have to wait the period. Mhmm. Okay. So can you send him the command then for the we're gonna send you a way that does it faster. It's here. It's here. Woo hoo. And now if you click the button, the search Stack Overflow for So this also shows, like, how playbooks can be interactive. Here's an example of a playbook that's taking an action. You can also automatically remediate stuff

1:09:19 when you click that button. You can do it different ways. I think it will be really good to have ChatGPT. Yes. Whoever does ChatGPT is getting the prize. I'm telling you. Hey. This stream's done. I've got some code to write. I'll point everyone in the right direction. Okay. So go over to the Robusta docs, and there's a tutorial there on the left that says custom automation. And that tutorial shows you how to write your own playbook and to load it into Robusta. And the source code, of course, So the source code for this, can you drop in

1:09:57 the source code for the Stack Overflow one so people know what they can go and copy? Alright. We'll drop that into the chat. Which one? The Stack Overflow one. Yeah. I think David's gonna do it anyway. Okay. David might email someone else. So it's just some Python code. So nice and easy for people to work with web APIs and pass in the data. I'm assuming that's event event, which I'm loving the name of and hating it very much at the same time. Yep. Does this event event, get event, does that have all of that enrichment that

1:10:36 was previously done? Like, what do they have access to? I know we see we've got type, we've got reason, but does it have the logs? Does it have the request and memory limits from the other enricher? Like, what do we have access to within this? So each enricher is independent, and what each enricher can access depends on the event there. And I know event event has a really awkward name. But what event event represents, like okay. What you would normally see there isn't event event. What you would normally see is pod event or deployment events.
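
The idea that an enricher's available data depends on the kind of event it receives can be sketched with two stand-in types. These classes are hypothetical illustrations, not Robusta's real event classes, and the fields they carry are assumptions chosen to mirror the conversation (logs for a pod event, labels for an alert event).

```python
from dataclasses import dataclass, field


@dataclass
class PodEventSketch:
    """Hypothetical stand-in for an event about a pod: logs are reachable."""
    pod_name: str
    logs: str = ""


@dataclass
class AlertEventSketch:
    """Hypothetical stand-in for an alert event: name and labels are reachable."""
    alert_name: str
    labels: dict = field(default_factory=dict)


def summarize(event) -> str:
    """Use whichever data the given event type exposes."""
    if isinstance(event, PodEventSketch):
        last_line = event.logs.splitlines()[-1] if event.logs else "(no logs)"
        return f"pod {event.pod_name}: {last_line}"
    if isinstance(event, AlertEventSketch):
        pod = event.labels.get("pod", "(no pod label)")
        return f"alert {event.alert_name} on {pod}"
    raise TypeError(f"unsupported event type: {type(event).__name__}")


print(summarize(PodEventSketch("web", "starting\nout of memory")))
print(summarize(AlertEventSketch("KubePodCrashLooping", {"pod": "web"})))
```

In an editor with type information, this is also where autocomplete helps: the event's type tells you which attributes exist.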

1:11:08 And in this case, it's really an event that happened on a Kubernetes event, like you were listing with kubectl get events. So I know that's a super awkward name and, like, I'm sorry about that. But what you would have access to on the event object depends on, like, what it is. So if you're looking at pod event, you can access the logs. If you're looking at Prometheus event, then you can access the pod that's associated with that. You can access the Prometheus labels, the alert name, and all the different stuff. And you have

1:11:37 autocomplete, of course. So you'll just be able to see there what's available. Nice. Alright. I'll be playing with that later on. So I think we've shown a lot here. We deployed Robusta. We've used the SaaS platform. We've poked around our cluster. We caused a couple of failure scenarios. We've seen the messages come into Slack. We've seen that those are interactive. We've got the Stack Overflow one. We've got the silences. There's a lot of value for people deploying Robusta to the cluster. So maybe if you're happy to, you could tell us, like, what are you working on? What's coming next?

1:12:15 How are you going to take Robusta to the next level? Oh, so let me open up my slides and I'm gonna show exactly that. So is there anything else you want to show on my side before we do that? Like, I didn't mean to No. No. I think that's Alright. Your slides are now live. Take it away. Okay. So I'll speak briefly on what we've done since we were last here. So what we did is a lot of coverage. So, like, that's lots of stuff to cover all these different things, all these different scenarios and add these playbooks

1:12:19 Robusta Roadmap & Future Plans

1:12:47 for them. And the second thing we did was to add in a lot of destinations. So last time we were here, you could send stuff to, I think, just Slack, and we added on MS Teams and Discord and PagerDuty and Cisco Webex and Opsgenie and all these other destinations. So those were the two main things that we worked on here as well as improvements to the SaaS platform, but those were the main improvements in the open source since we were last on the stream. And what we're gonna be working on in the next year is more coverage. So it's

1:13:13 got different types of stuff, major improvements to developer experience to make it even easier to update the playbook definitions. And I edited my slides while you were talking. So I think we're gonna get ChatGPT from David Flanagan, from our favorite Rawkode, and there'll be some more surprises coming, more stuff that we're going to work on. We're very driven by feedback we get. So, like, as the old saying goes, like, send us all your weary and tired alerts. So, like, tell us what's going on in your clusters. Tell us what you want automated or why.

1:13:47 Community & Contribution

1:13:47 Tell us what issues you saw that happened in your cluster that were annoying to investigate, and those will go into the roadmap. Tell us if you wanna send alerts to different destinations that we don't support today. That's why we support now PagerDuty and Opsgenie, and, of course, we got a lot of contributions there from the community who have opened pull requests as well. So I think that's everything. There's some more stuff I could talk about, but I think those are the main things. Awesome. Thank you. Alright. Let me pop this back over to here.

1:14:22 So thank you for... sorry, are you not done? Oh, hey to JS. I need to get your address and stuff, so message me afterwards on Twitter. And it's already late here, so I'll probably reply to you tomorrow. But message me on Twitter, and then I'll put you in touch with Mayan from our team, and you'll get some swag. Yeah. Definitely drop him a DM, get that swag. Alright. Well, thank you both for joining me today. It's been fun taking a look at Robusta again. It's always great to see where these

1:14:45 Conclusion and Farewell

1:14:50 projects are going and the speed and the velocity that these projects are kind of evolving at. So it's just nice to see that you're making people's Kubernetes lives a little bit easier. So thank you again for taking the time. Any last words before we finish up for today? Oh, there is a fun remark. I think last time that Arik and I were here, there was someone in the chat who we noticed and who was creating excellent content on Twitter and who we subsequently added to our team. So

1:15:27 I think that's, like, a good closing of the circle. I think Pohang was here as part of the community last time, and now he's also here as part of the team. Maybe next time also on the screen with us. So, I guess it's good to say, like, we're doing our best to really build that open source community. We are always hiring as well. So if this sort of thing interests you, then please reach out. We're hiring for front end positions. We're hiring for a lot of different stuff. So my DMs are open. Yeah. JS, just include your CV or your

1:15:58 resume when you're requesting your swag T-shirt. You never know. Yes. Yes. Yes, please. And he has been doing excellent work on the Robusta team. We also have the widest Kubernetes thing now that we're doing together. So sometimes some interesting stuff comes out of these streams. Well, yeah, it's nice that we get the opportunity to share knowledge with each other and with others. And then for the people that, you know, wanna get more involved, there's always opportunities. You know, open source is one of the most welcoming places to get involved and start contributing. It doesn't take any prerequisites.

1:16:33 And especially, I've gotta say the Kubernetes and cloud native ecosystem has some of the most welcoming communities I've ever had the honor of kind of participating in. So... Yeah. It's like, you know that old joke, on the Internet, no one knows you're a dog? No. I don't know that joke. I think it's a famous New Yorker comic strip, but it's so true, especially with open source. Like, it doesn't matter who you are or where you are, even what age you are. I first got involved in open source when I

1:17:02 was 14 years old, and I was like an old kid who didn't go outside enough. And then next thing I knew, it was, like, flying me out to different hackathons and conferences. And it's really one of the incredible things about open source, the ability to really connect people and to, like, be judged based on the merit of your code, not on your background, not on what country you live in. Like, none of it even matters with open source. It all just melts away. So thank you, David, for, like, helping also to organize this event and this stream. And I think

1:17:35 at least one person, like, who is in the community, we became closer to also due to a stream like this. So thank you very much. Yeah. Thank you. Awesome. Alright. Well, thank you both again for your time. It's been a pleasure. I'm sure we'll speak again soon. And until next time, we'll be back. Have a great day, everyone. Yeah.

Meet the Cast

David Flanagan

@rawkode

Natan Yellin

@aantn

Documentation

Robusta custom automation tutorial

Code

Robusta Stack Overflow enricher source code

More from Rawkode Live

View all 173 episodes

Hands-on Introduction to Odin

Hands-on Introduction to Iroh

Hands-on Introduction to Yoke

Hands-on Introduction to sympozium

Friday, January 23rd, 2026 - Chevron7

Hands-on Introduction to jujutsu (jj)

More about Robusta

View technology

Hands-on Introduction to KRR (Kubernetes Resource Recommendations)

More about Prometheus

View all 26 videos

Flatcar Linux: A Modern OS for the Always-On Infrastructure

Hands-on with Headlamp: The Kubernetes UI

Hands-on Introduction to Perses

More about Kubernetes

View all 172 videos

Hands-on Introduction to Yoke

Navigating Kairos: Immutable Operating Systems with a Cloud Native Twist

Kubernetes Security Scanning: The 4 Tools You Actually Need

More about Helm

View all 49 videos

Hands-on Introduction to Yoke

Platform Engineering: Asking "Why"? with Evelyn Osman

Build a Production-Ready Kubernetes Cluster with Spectro Cloud Palette | No-Code Tutorial
