Overview

About this video

What You'll Learn

  1. Understand Prometheus as a pull-based time-series monitoring system with a dimensional metric model and labels.
  2. Set up Prometheus and node_exporter scraping, configure scrape targets, and validate collected metrics in the UI.
  3. Build and evaluate PromQL and PromLens queries like rate, ratios, and predict_linear for disk-alerting scenarios.

Julius Volz, Prometheus co-founder, walks through the data model and four metric types, scraping node_exporter, writing PromQL in PromLens, and building a predict_linear disk-fill alert.

Chapters

  1. 0:00 Holding screen
  2. 1:30 Introductions
  3. 1:31 Introduction
  4. 2:30 What is Prometheus?
  5. 2:46 What is Prometheus? (Overview & Core Concepts)
  6. 8:00 Why write your own database?
  7. 8:22 Why build a custom Time Series Database?
  8. 15:00 Running Prometheus
  9. 15:02 Installing Prometheus
  10. 17:44 Prometheus Configuration (prometheus.yml)
  11. 18:00 Prometheus configuration
  12. 19:37 Running Prometheus & Web UI
  13. 21:00 Exploring and understanding Prometheus metrics
  14. 23:31 Data Model & Metric Types (Counter, Gauge, Histogram, Summary)
  15. 35:35 Are Metric Types Strictly Enforced? (Q&A)
  16. 38:00 Querying with the Prometheus UI
  17. 38:28 Querying Internal Metrics (PromQL Basics)
  18. 40:00 Adding the node_exporter
  19. 40:12 Installing Node Exporter
  20. 43:03 Adding Node Exporter to Prometheus Configuration
  21. 45:00 A new UI: PromLens
  22. 45:05 Introduction to PromLens
  23. 47:20 Querying Node Exporter Metrics in PromLens
  24. 48:10 Using the rate function
  25. 50:20 Why PromLens and a look at its editor and features
  26. 50:46 PromLens Features Deep Dive (Visualization, Editing, Explanation)
  27. 1:00:15 Contrived situation - lets fill up the disk
  28. 1:01:13 Predicting Disk Usage with PromQL (predict_linear function)
  29. 1:18:34 Real-World Alerting Example (Kube-Prometheus Disk Alert)
  30. 1:22:30 Other Prometheus Topics (Service Discovery, Alertmanager, Exporters, Remote Storage)
  31. 1:24:00 Final thoughts
  32. 1:25:00 Viewer question: What metrics are important for a web application?
  33. 1:25:15 Q&A: Monitoring Web Applications (RED Metrics)
  34. 1:27:44 Conclusion
Transcript

Full transcript

Generated from the English captions. Timestamps jump the player to that moment.

1:31 Introduction

1:31 Hello, and welcome to Rawkode Live. Today, I am joined by Julius, the cofounder of Prometheus, the founder of PromCon, and more recently the founder of PromLabs, the creator of PromLens. Today, we're gonna be taking an introductory look at Prometheus, and who better to have join me than Julius? Hello, Julius. How are you? Hi there. How's it going? I love your intro. Thank you very much. Yeah. I took a lot of inspiration from The IT Crowd, if you've seen that before. Yeah. Yeah. Yeah. And yeah, some days I love it, some days I hate it. And what's

2:05 very consistent though is that I can never start off the introduction properly and always mince my words. So that's now just a feature of the show. Oh, interesting. When you were speaking, I was just thinking, like, oh, wow, you have such a nicely collected introduction in your head. Yeah. Yeah. Having it in your head and saying it are very different. But I enjoyed this. I'm really looking forward to today as we take a look at Prometheus and the problems that it solves and really give people that, you know, Prometheus 101. Like, where

2:30 What is Prometheus?

2:36 do I get started? What does it do? And how can I make it better? So do you want to just quickly start by telling us what Prometheus is? Sure. I would say Prometheus is a monitoring system first and foremost, which is based on a time series database that it includes internally. And so it's really yeah. It's a time series based monitoring system for monitoring your IT infrastructure, your services, but also your devices, like, devices, hosts, potentially anything, actually, that can serialize data in the format that Prometheus expects. So people also monitor their homes with it, like humidity, temperature, or

2:46 What is Prometheus? (Overview & Core Concepts)

3:19 wind farms, what's the current, like, power that they output, anything really that you can press into these kind of time series. But, really, it was born initially out of myself and Matt Proud at SoundCloud coming from Google and missing a proper monitoring system for more dynamic environments. So SoundCloud already had a cluster scheduler very early on, in 2012. And, basically, the key aspect that Prometheus brings to those environments is integrating with service discovery in such a way that Prometheus constantly has an updated view of what things should be where. And it's a pull based monitoring system,

4:00 so it actively not only knows which things should be where, but also how to reach them and pull metrics from them over HTTP. So whether that's your application server, your networking device, or anything else, as long as it can serve an HTTP endpoint in the Prometheus metrics format, then Prometheus can collect data from it, store it in a time series database, and then you can make use of that collected data either with dashboarding, maybe showing nice stats, what's going on overall in your system, or actually also basing alerts on that collected data. So you can actually wake people up at

4:38 night if the website is down, if it's slow, if something else is broken, if your windmill doesn't output power anymore, for example. And, yeah, it comes with a dimensional data model. So there's a metric name, and then a set of key value pairs that we call labels, similar to Docker labels or Kubernetes labels, that really give you the different sub dimensions of a metric name, like the number of HTTP requests that have been counted on a given server but then broken up by method, status code, something else, etcetera, and the process it came

5:15 from. So you you collect data in that data model, and then it goes together with a query language called PromQL, which we're going to talk more about, I guess, which really makes use of that data model where you can, you know, slice and dice your data as people love saying. So really select exactly the data you want and then aggregate it by any of that dimensionality that you have, but also do more complex operations between whole sets of time series, like binary operations between sets of time series, different transformations, taking rates over counters, filtering, etcetera, etcetera. So pretty flexible.
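Editor's sketch: the dimensional model described above (a metric name plus a set of labels identifying each time series, then aggregation along any label) can be illustrated in a few lines of Python. The metric name, label names, and values below are made up for illustration and are not taken from the video. The grouping is roughly what the PromQL expression `sum by (method) (http_requests_total)` would do over these series.

```python
from collections import defaultdict

# Hypothetical samples: each series is identified by its metric name
# plus its full label set; the last element is the current value.
samples = [
    ("http_requests_total", {"method": "GET", "status": "200", "instance": "a"}, 120.0),
    ("http_requests_total", {"method": "GET", "status": "500", "instance": "a"}, 3.0),
    ("http_requests_total", {"method": "POST", "status": "200", "instance": "a"}, 40.0),
    ("http_requests_total", {"method": "GET", "status": "200", "instance": "b"}, 80.0),
]


def sum_by(samples, metric, by):
    """Sum the values of one metric, grouped by the given label names."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        if name != metric:
            continue
        # Only the labels we group by survive; all others are aggregated away.
        key = tuple((label, labels.get(label, "")) for label in by)
        totals[key] += value
    return dict(totals)


by_method = sum_by(samples, "http_requests_total", ["method"])
print(by_method)
```

Grouping by a different label, for example `["status"]` or `["instance"]`, needs no change to how the data was collected, which is the "slice and dice" property the dimensional model is meant to give.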

5:55 And, yeah, I guess, I mean, that's basically it. So you have the data model. You have the pull based collection. You have the query language. It's pretty efficient in its implementation. I do have to say, though, Prometheus itself, the vanilla Prometheus, is explicitly not a horizontally scalable, cluster based cloud thing where you just press a button after adding more machines, and it just naturally scales up. Prometheus itself is a single host server of which you can run multiple that run in loose federations doing different things. But we explicitly wanted to avoid complex clustering in the system because a lot

6:38 of things can go wrong in complex clustering, and we basically want Prometheus to be very reliable and simple in its architecture and not get stuck on, like, some replication or not being able to talk to other cluster members or so. So, you know, if you want high availability in Prometheus, you just run two of the same server. They do the same collection, the same alert computation, and then later on, you have a component called the alert manager, which actually deduplicates things for you. But they don't actually talk to each other, like the two replicas doing

7:11 the same thing. So I guess that's important to understand. Of course, then you don't really wanna overwhelm those two identical replicas because then both break. So that's kind of the core philosophy, but then there are also reimplementations of Prometheus that take different trade offs. There's, for example, Cortex, initially started by Weaveworks. Now it's more dominantly, I guess, led by Grafana Labs and others, where the trade off in Cortex is, yeah, we want to have something like Prometheus but more horizontally scalable. Basically, you just add more machines, press a button, and it still works.

7:51 But it comes with different trade offs. It's a way bigger beast to operate. Awesome. That is a fantastic overview. And I'm looking forward to diving into each of those different components, particularly PromQL. I know from the events that I go to that it can be a little bit difficult for some people to reason about. And I'm hoping we can just simplify that for them today. There's a question that I've always wanted to ask someone, and I think you're a good person to answer it. And the question relates to why.

8:22 Why build a custom Time Series Database?

8:22 Why would you write your own database? Like, there must be a certain level of... Your own database or a whole own monitoring system? Well, just the database, but, you know, it's such a difficult and challenging task to go into, to write, like, a TSDB specifically here rather than using another database. I mean, like, when we started out in 2012, at least, we looked at everything that was out there, and we were really not happy with any of them in terms of efficiency. So SoundCloud was using Graphite, for example, back then. And, yeah, the format, at least, we were using back

8:58 then and also the data model that it was using, both strictly hierarchical and not dimension based, and the query language that goes along with it were really inflexible and also not efficient in storage, not doing exactly what we wanted, and expecting very regular collection intervals and things that Prometheus doesn't have. And I could say similar things about almost all of the other areas that ended up being in Prometheus, like, yeah, the data model, UI. Although, I mean, Prometheus isn't known particularly for a great UI either at the moment. That's part of the reason PromLens is

9:38 now starting to exist. But when we were looking at all these different areas, how alerts tie into things, really being integrated into the time series system instead of being a separate system, as they usually are, and so on, we really wanted to have it work differently in all of those aspects, so we were really kind of stuck with building our own. By now, I think there are probably more alternatives that we maybe could base ourselves on. But, you know, PromQL and the way it selects data and does certain things, it also has certain specific, both, I think, requirements to the TSDB, but

10:24 also things that it can leave out that more general databases would have to support. Like, for example, if you look at InfluxDB, it supports all kinds of data types, etcetera, etcetera, but that makes it necessarily a way more complex code base versus what we have in Prometheus, which is exactly optimized towards purely numeric time series, exactly this label based data model and nothing else. So it's really, really well optimized towards that currently. And, also, what you had previously, in more static environments where things weren't changing around so much, you had more long lived time series, where you have your server and you just

11:07 like, you monitor it and you see the CPU usage over a year or whatever, and the time series sticks around for a long time. Now, what you have on Kubernetes clusters especially, since the pod name and certain other ephemeral identifiers become part of your time series identity, so as part of a label, is that the actual time series that exist at any given point in time change all the time. So you need to index them in a way more efficient way that is suited for this more dynamic world, and that was also not really much the

11:43 case back then with time series databases. Yeah. Well, I mean, I applaud you for doing it. It's one of those things. I think, personally, for me, having looked into writing my own database just as a hobby and for fun to learn, it is one of the most daunting tasks I think I've ever tried to take on. So... Yeah. I mean, I gotta say, we started out with a really inefficient, dumb implementation, and then we went through several revisions, including, like, a complete re-architecting that works completely differently, which is

12:18 the current storage. And that has become way, way better than the previous iterations at actually, yeah, indexing, changing series over time, and so on and so on. And there are now just many companies employing people full time who are, at least part time, you know, really doing great contributions to making the TSDB, like, way more memory efficient, with faster lookups, and so on and so on. So it's gotten really much better over the last couple of years. Nice. And so we do have our first question, so I'm gonna pop that up. We have a question from Ilya who's asking,

12:54 could you have used a NoSQL database? Would that have worked? Or do you need to use something that is specific for time series data? I mean, we started with that, with LevelDB on the local host, so we didn't wanna have anything that's over the network or so. We wanted, as I said initially, a simple architecture where it's a single process owning everything locally that you can rely on. So if the network breaks down, your monitoring system at least can still work. So a lot of the network based stuff

13:27 wasn't really that attractive for us, and we wanted it to be simple to operate, kind of built in Go. So but, yeah, initially, we had LevelDB, which is an in process key value store, and we based our indexes on that. And you can do that, but at least the design that we had initially, that I came up with, wasn't too great. And it didn't deal well with this changing indexing of series over time. And the current database that Fabian came up with the design for is just a complete rewrite from scratch, encoding everything exactly how

14:09 we need it after already having many years of experience what exactly we do need. And, yeah, I I think I'm not able, like, to dive too deeply into the theory of why a particular underlying NoSQL database would work well or not because I'm not too much of an actual database expert. But, yeah, I can only say that the current revision of what we have, which doesn't use one anymore, works way better than what we Yeah. And I think the point you made earlier as well is really important here is that the time to live or how long we keep

14:46 around our monitoring data or time series data is drastically different to maybe, like, a NoSQL store as well. And that, you know, we have to be able to deprecate or kill off that data on a regular interval, and the sharding model can all change, like, yeah. There's a lot of database work to do there. Yep. So why don't we install Prometheus? Like, let's show people how it works. So I have the prometheus.io website here. Mhmm. Now, how about we just figure it out without me telling you anything? Yeah. Sure. I mean, what I'm thinking here

15:02 Installing Prometheus

15:24 is do we wanna jump straight into the 2.22 RC, or should we play it safe and go with the 2.21? I think it shouldn't matter. I think the RC should be pretty fine, but either is totally fine. There are not a lot of changes in the latest version. Alright. So I have a Linux machine here. Hopefully, with curl, that's a good start. And I will just download. Perfect. What did I get wrong there? Let me try again. Oh, there's a redirect. Okay. So let me, oh, what's the flag for the redirect? It's -L? Isn't it, yeah. Just

16:10 capital L. I think capital L. Right? This isn't supposed to be the difficult part. There we go. Nothing would go wrong today. Okay. Let me actually check. wget. Doesn't matter. Sorry, curl. I'm no longer your friend. So we can extract this, and if we go in here, there are our binaries. So right away when we untar this file, we've got the Prometheus binary and promtool. Yeah. Actually, that's maybe the first thing that we should do: delete stuff we don't need in here, which is completely optional, like delete LICENSE, NOTICE, console_libraries, and consoles.

16:56 And you'll need the -r, yeah, for the console stuff. Those are kind of optional things, like the consoles, the kind of HTML templates you can serve directly from the Prometheus server. It's a very niche use case. Almost nobody uses it. So the remaining three files are kind of what you actually need, or actually you only need the Prometheus binary and a configuration file, which is YAML in true cloud manner, which we all love, of course. And promtool is kind of just like a side helper tool. For example, you can

17:27 use it in CI checks to verify that a configuration file is really valid before you apply it to your Prometheus server or check it into production. And it can do some other linting and helping things, but we don't really need it probably today. So the core thing, you have a Go binary. It will store its data locally in a directory that you tell it to. By default, it will just be ./data in the current directory. And the configuration file that is there, its main job is to tell Prometheus what endpoints to scrape and how to label the

18:00 Prometheus configuration

18:06 data coming from those endpoints. I mean, there's some other settings as well in the config file. It will tell us, if if we do that at some point, how to talk to the alert manager to actually send alert notifications, how to send data to a remote storage system if you want to keep it for later, etcetera, etcetera. But the main thing that we'll be dealing with is basically the the scraping the scrape configs, which you see at the bottom. Maybe just for our demo, let's change the global scrape interval to five seconds so we see results

18:39 and changes coming in faster. So, you know, in in production, that's a pretty aggressive scrape interval. Typically, you go for maybe fifteen seconds to a minute or so. Sometimes people do a second, but that's you know, you need to have the capacity for that. Yep. So this means Prometheus, by default, will go out every five seconds to every target, as we call it, these endpoints that it knows about, and collect the current value of each metric from there. And, yeah, in this default configuration here, you see Prometheus only scraping one thing, and that's itself. So Prometheus,

19:18 obviously, also exposes metrics about itself on port 9090, where it also runs its web interface. And, yeah, this setup, as you see it here, is completely useless. It will just monitor itself, but it's a good way to start and see if everything's working. Awesome. So you're saying we've got a default config for scraping every five seconds, for scraping ourselves. So I could just run... Just run Prometheus. Maybe one additional flag to add that would be nice: there are certain links in the Prometheus web

19:37 Running Prometheus & Web UI

19:56 interface where Prometheus links back to itself. And if this is kind of a virtual machine you're running, and it thinks it's localhost, but if you click the link it doesn't really bring you to that Prometheus because it's in a VM or so, then we could tell Prometheus what its externally visible URL is. So for example, from your main host that you run your browser on, it will be reachable under my-vm, port 9090, or IP something something. So, these backlinks to itself, we can tell it how to generate them properly. So

20:35 So this is a bare metal machine on Equinix Metal. It is publicly available, but I'm assuming we're okay because the Prometheus interface is read only. Is that correct? Yeah. That's fine. Yeah. Unless you enable the admin interface and so on. Yeah. Yeah. By default, it will be read only. People can DDoS it, but I hope they won't. So if you do --web.external-url= and then you need to provide the whole URL. So it's like http://. Yep. And then 9090 should be correct. Yep. And then Prometheus will

21:00 Exploring and understanding Prometheus metrics

21:15 know to modify the links to itself. Nice. So, hopefully, we should yeah. We we see it starting up. It gives you a bit of a blurb about versions. No errors. Good. Let's let's maybe switch directly to the new UI. You see the link in the top right there in new. It's it just looks a bit nicer already. The old UI will go away soon. And if you head to the very top and click on status and targets, we want to be able to see that it actually scrapes the one target it knows about, which is Prometheus itself.
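Editor's sketch: the setup walked through up to this point (a global scrape interval of five seconds, a single job scraping Prometheus itself on port 9090, and an external URL flag so self-links resolve) can be written down as follows. The YAML mirrors the shape of the default prometheus.yml; the host name `my-vm` and the helper `prometheus_command` are placeholders introduced here, not values from the video. Python is used only to keep the example checkable; in practice you would edit the file and type the command by hand.

```python
import shlex

# YAML text in the shape of the default prometheus.yml, with the global
# scrape interval lowered to 5s as done for the demo (aggressive for production).
PROMETHEUS_YML = """\
global:
  scrape_interval: 5s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
"""


def prometheus_command(external_url, config_path="prometheus.yml"):
    """Build the argument list for starting Prometheus with an external URL."""
    return [
        "./prometheus",
        f"--config.file={config_path}",
        f"--web.external-url={external_url}",
    ]


# "my-vm" is a placeholder host name, not the machine used in the video.
command = prometheus_command("http://my-vm:9090")
print(PROMETHEUS_YML)
print(shlex.join(command))
```

The external URL only changes how Prometheus generates links back to itself; it does not change which address the server listens on.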

21:53 So in this case, if you click on that localhost link, yep, see, now it actually applied this localhost external URL trick. So here, we can actually now see the metrics exposed by Prometheus about its own internal operations. It's written in Go, so it'll tell you about, like, garbage collection, how many goroutines there are, etcetera, etcetera. If you scroll further down, there will be more metrics that are prefixed with prometheus_, and those will be more Prometheus specific, like how many samples have I collected, way more stats about its

22:33 internal TSDB, how many targets, etcetera, etcetera. By the way, there is a Chrome plugin, which is really cool. It's called Prometheus Formatter. And if you install it, then you will get this black and white thing in nice colors and with a bit better formatting, these metrics endpoints. I don't know if you wanna install that now, but it's it's I have it, and I always always I love it. Prometheus Chrome extension. Why not? If it's if it's It's called Prometheus Formatter. Oh, there we go. There we go. Yeah. Yeah. Today, I learned. And it just makes it look a bit

23:17 nicer. Hopefully, it's not virus infected. Ta da. Oh, that was nice. Hopefully, it's not virus... thank you for that, as I just installed it on my Chrome browser. This is not investment advice. Okay. So yeah. So it looks, you know, much more readable already and much more friendly. So what we see here is the data model that I talked about. Conceptually, what happens in this format? So when you go to this endpoint, the process that you get metrics from is basically instructed to serialize its current value of each of the time series that it's

23:31 Data Model & Metric Types (Counter, Gauge, Histogram, Summary)

23:56 currently tracking into the Prometheus format. So you get exactly the current value of each of the time series that Prometheus is tracking about itself. So it's not about what it has in its TSDB. So there's not really history included here, like, what was the value a minute ago or so. Basically, the history is only built by Prometheus itself coming along every five seconds in this case and storing that history in its time series database. But every time it comes to this page, it only sees what's the current value. So, for example, currently, the concurrent maximum queries that are allowed

24:33 is 20, and currently, there are zero ongoing Prometheus engine queries. And, like, if you reload the page, the next time it might have a value of one there or five or so, and then Prometheus will just store that current value. Especially those counters that you see, like something something seconds count. Yeah. For example, that one. This is a metric that Prometheus will only see go up. So if you refresh this page a couple of times, then you will only see that metric going up. It's like the total number of seconds that the Go runtime

25:16 has spent doing garbage collection since this process started. So these counters can reset, but only if the process that exposes them and tracks them also actually restarts, because they're only kept in memory. Yep. So, basically, one line, one sample: metric name, and then the label values, and then the actual sample value at the very end. Okay. So let's just break down what you just said there. So we call this the metric name? Yep. So the metric name is kind of the fundamental aspect of the system that you're monitoring. In

25:52 this case, yeah, this is a summary. It's maybe a bit more complex, but let's look at the very simple metric, go_goroutines. The aspect there is, yeah, you're measuring the number of goroutines. In this case, there are not even any sub dimensions, so labels, on that metric, but some metrics do have them. If you scroll a bit further down, maybe we'll find, like, a good simple one, like, further, further, further, until we see something with more labels that's not a summary or, yep. Yeah. Yep. Yep. Yep. Mhmm. This is good because it's still

26:28 just the simple counter metric, so always the same metric name. So conntrack failed connection dialings, but with four different sub dimensions. So this counts, for each of these dialer names and reasons, how many of these failures have happened. And then yeah. And then you have the space, and then you have the actual current value of each of these time series. So each time series is identified by the metric name and the unique set of labels following it. So each line is one time series in Prometheus lingo. Okay. So we got metric names, we've got labels,

27:14 and we got the value. What are the different metric types that we have in Prometheus? And I see that we've got access to something called a counter. We also had a gauge. There was a summary. I mean, what are these different types of metrics? Yeah. Oh, I actually just gave a talk about that a couple of days or a week or so ago. So metric types are, yeah, we have four of them, gauge, summary, counter, and histogram, and they're a bit different ways of measuring different aspects of your system. So, for example, a counter is a metric

27:50 that can only go up, because it's used for counting cumulative totals of either requests that you process or total number of seconds spent in a certain CPU mode or something like this. So those are metrics that only go up. And we have a convention that we end them with _total. That's why this metric is named like that, but it's only a convention, though people should follow it. The gauge is a metric that can go naturally up or down. So memory usage, queue length, etcetera, etcetera. And then you have histograms and summaries. So a histogram

28:33 is used for tracking the distribution of a set of values across a number of histogram buckets. So the most common case is request durations that you're measuring in your process as you're handling requests. You might want to record how many of which latency category you have seen. So, yeah, this is a histogram example here. It's not a particularly beautiful one, but, yeah, it will serve our purpose. So in this case, you know, when you track things into a histogram in your process that's handling a request, you have

29:16 a single object where you create a histogram saying, I want to have a histogram, and it will have the following bucket configuration. So a latency bucket going from zero to, in this case, you know, one hundred milliseconds, and one from zero to two hundred milliseconds, one from zero to four hundred milliseconds. So it's a cumulative histogram. Every bucket starts at zero and thus includes the counts of previous buckets, but then has a given upper bound. So you create that object in your client library, and then every time you actually handle a request of a given duration, you just call a

29:56 method on that histogram, say, observe. And you say, observe two hundred milliseconds. Observe one hundred fifty milliseconds or so. And, internally, the histogram will then increment counters for each of the buckets that are relevant to that event you have just observed. So that's one object in the client library as you're tracking things. But then when it gets serialized out in this text format, we have to kinda spread it out into different time series. So the TSDB in Prometheus doesn't really have a clue about what a histogram is. It only knows the scheme of,

30:33 like, metric name, a bunch of label pairs, and the sample value. So we kind of have to, yeah, serialize it into that format. So the way we do that is just expose each of those buckets of the histogram as a separate time series with _bucket appended to the original base name of your metric. And then we have the le label on there. It stands for less than or equal, which means every request gets counted into this bucket that has a latency of less than or equal to sixty seconds. And yeah. And so, basically, this histogram

31:17 gets kind of expanded into this form. You also get two other auxiliary output metrics in the serialization, which are _sum and _count. These are just the total count of all observations you've made, so the total count of all requests, and the total amount of seconds you have spent in handling all those requests. So you could get the average latency by dividing the sum by the count, for example. Alright. Yeah. So that's a histogram. And so, by the way, you typically wouldn't then later on look at a histogram in its raw form.
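Editor's sketch: the cumulative histogram described above can be mimicked with a small Python class. This is an illustration under stated assumptions, not the code of any real Prometheus client library: the class name `CumulativeHistogram`, the metric name `request_duration_seconds`, and the bucket bounds (0.1, 0.2, and 0.4 seconds) are chosen here to match the spoken example. It shows how one in-process object fans out into `_bucket` series with an `le` label plus `_sum` and `_count`, and how the average follows from the last two.

```python
import math


class CumulativeHistogram:
    """Toy cumulative histogram: every bucket counts all values <= its bound."""

    def __init__(self, upper_bounds):
        # An implicit +Inf bucket catches every observation.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # Increment every bucket whose upper bound covers the value.
        for i, bound in enumerate(self.upper_bounds):
            if value <= bound:
                self.bucket_counts[i] += 1
        self.sum += value
        self.count += 1

    def expose(self, name):
        """Serialize into separate series, one text line per series."""
        lines = []
        for bound, count in zip(self.upper_bounds, self.bucket_counts):
            le = "+Inf" if bound == math.inf else str(bound)
            lines.append(f'{name}_bucket{{le="{le}"}} {count}')
        lines.append(f"{name}_sum {self.sum}")
        lines.append(f"{name}_count {self.count}")
        return lines


histogram = CumulativeHistogram([0.1, 0.2, 0.4])
for seconds in (0.2, 0.15, 0.05):
    histogram.observe(seconds)

print("\n".join(histogram.expose("request_duration_seconds")))
print("average latency:", histogram.sum / histogram.count)
```

Because the buckets are cumulative, the 0.2 bucket here already includes the observation that also landed in the 0.1 bucket, and the +Inf bucket always equals the total count.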

32:01 In PromQL, you would run a function over it called histogram_quantile to calculate quantiles from a histogram, or at least approximate them, because it's never going to be perfect. It depends on the resolution of your buckets. But then you can ask questions such as what's my 90th percentile latency over the entire system or at a given aggregation level? A summary, in contrast, is a metric type that already computes these quantiles for you in the client library. So directly from the process, as we see here, we get the minimum, the maximum,

32:41 and the, you know, twenty fifth percentile and fiftieth and so on. The upside is you directly get a really well computed quantile, which has little error. And if you compute it using directly the summary type in, for example, the Go client library, you can specify the absolute error margin you want to have in the result. The downside is that you cannot aggregate over quantiles at all. It's just statistically invalid. So if you have 10 processes and they each expose their ninetieth percentile latency, let's say, you have no

33:20 idea what the total system's ninetieth percentile latency is. You cannot average over them. It gives you statistically nonsensical results, but you can aggregate with histograms. But those, you know, can be more expensive. If you want to have a good approximation of quantiles from a histogram, you need many fine grained buckets in exactly the right latency places. So it's a really tricky subject. There's a page in the Prometheus best practices docs that really goes deeper into that comparison if someone wants to look at that more. But most of the time, you wanna look

33:56 at systems in aggregate, so you kind of have to use histograms. Okay. Awesome. That's a really great explanation of those four metric types. I'm kind of curious, between the gauge and the counter, are there any optimizations that are made at the database level because you know the counter can't climb down? Like, why is there that distinction there? Yep. So at the database level, the database has no idea. It doesn't know about metric types, period. It does already just, you know, do Gorilla style XOR float compression on sample values and so on. So if they for example,

34:33 if a sample value stays the same all the time or increases very regularly, that already compresses very well, but it has no idea about the type. Maybe if we were doing Prometheus completely from the ground up, or if there's a Prometheus 3.0 at some point, we could think about making these metric types more first class, storing them better and more efficiently directly in the database. Even a histogram could be stored way better, right, instead of as separate time series. It could be one time series with different buckets

35:10 stored more efficiently. And then also make PromQL itself more aware of that, because currently it's not directly aware of metric types. There are functions that expect input time series to behave like a counter or like a gauge or like a histogram, but the function cannot actually ask the TSDB, like, is this the right metric type? Okay. I guess that leads us on to a question we have from Elliot in the chat. It just says: are the metric types purely convention, and are there any plans to maybe make them strongly typed in the future?
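The earlier point that per-process quantiles cannot be averaged, while histogram buckets can simply be summed, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the latency values, the bucket boundaries, and the helper names `quantile` and `bucket_counts` are hypothetical and are not part of any Prometheus client library.

```python
def quantile(values, percent):
    """Nearest-rank quantile; percent is an integer such as 90."""
    ordered = sorted(values)
    # Ceiling division in integer arithmetic avoids float rounding surprises.
    rank = max(1, -(-percent * len(ordered) // 100))
    return ordered[rank - 1]


def bucket_counts(values, upper_bounds):
    """Cumulative counts per upper bound, like the le buckets of a histogram."""
    return [sum(1 for v in values if v <= le) for le in upper_bounds]


# Hypothetical request latencies in seconds from two processes.
fast = [0.1] * 90 + [0.2] * 10   # 100 requests, mostly quick
slow = [1.0] * 8 + [5.0] * 2     # 10 requests, all slow

# Averaging the per-process 90th percentiles...
avg_of_p90 = (quantile(fast, 90) + quantile(slow, 90)) / 2
# ...is far from the 90th percentile of all requests taken together.
true_p90 = quantile(fast + slow, 90)

# Histogram buckets, by contrast, are plain counts and can be summed.
bounds = [0.1, 0.25, 1.0, 5.0]   # hypothetical bucket boundaries
merged = [a + b for a, b in zip(bucket_counts(fast, bounds),
                                bucket_counts(slow, bounds))]

print(avg_of_p90, true_p90, merged)
```

In this made-up data the averaged value lands around 2.55 seconds, while the 90th percentile over all requests is 0.2 seconds. The summed bucket counts are identical to the counts you would get from one process that served all the requests, which is why a quantile estimated from aggregated buckets stays meaningful.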

35:35 Are Metric Types Strictly Enforced? (Q&A)

35:45 And I think you maybe just answered that a little bit there, if there's a 3.0. Well, there's a whole chain: instrumentation, collection, processing of the data in the TSDB, and then PromQL. And in some parts of that chain, they're more than a convention. They're an actual code-level object, and that's strongest in the instrumentation library, where you really say, I want a new gauge, I want a new summary. And you get different methods depending on which type of metric you created. On the

36:19 counter, you get an increment method, but no set method. On the gauge, you get a set method, etcetera. And then in the text format, in this exposition format, it's already slightly weaker. It's there in these kind of comment lines that get partly ingested by the Prometheus server, but don't really make it all the way into the TSDB. For the longest time, it didn't do anything with it, but now it stores this help and type information in memory, and you can query it from Grafana or PromLens or other

36:49 UIs that then actually surface this information for the metric names for users. But then in the TSDB, it's completely lost again. And when you work with stuff in PromQL, you basically have to know what metric is what, and that's why it's important to name your metrics according to conventions. Because, for example, there's the rate function, and the rate function gives you the per-second increase of a counter metric. And when it sees a counter going up and up and up and then down, it assumes, oh, that was a process restart and a reset of the counter, and not,

37:29 like, something that could have happened naturally. So it will act as if that reset hadn't happened and kind of neutralizes it in a way. So if you stick a gauge metric, like memory usage, into the rate function, the rate function will think, whenever the memory usage goes down, oh, that's a counter reset, I'll act as if it didn't happen. It will only return you positive rates. So you'll need to use the deriv function for gauges instead. So, yeah, that's important to know. Awesome. Excellent. So I think the next step, and if

38:00 Querying with the Prometheus UI

38:05 you agree, let's go down that path: we should add an exporter now. Like, we want something maybe slightly more substantial than the metrics that Prometheus exposes about itself. Yep. I mean, we could already start querying the metrics about Prometheus itself. We can see how many samples it is ingesting about itself per second, for example, as a good start. So if you just type, I think, head_append, it should fuzzy find the right thing, and then take the second one. So this is the metric name. And in

38:28 Querying Internal Metrics (PromQL Basics)

38:44 PromQL, if we just give the metric name, then it will return all the time series that have that metric name. In this case, it's only one. So this is a counter metric, with the _total suffix, which tells us the total number of samples that this process has ingested so far into its TSDB. So you could switch over to the graph tab, and we see it going up over time. Right? So it's a counter. So this is completely useless. The one thing about counters is you basically never care about the absolute value. You

39:21 always wanna rate them or do something with them. So let's wrap a function around it called the rate function. And, yeah, if you go to the very end, you will need to provide the time window to average the rate over, and you do that after the metric name in these, what are they called? Square brackets. Yeah. Exactly like this. So this would give you the per-second number of ingested samples, averaged over one minute of ingestion. So we see it currently at a hundred-something samples per second. So now if we add

40:00 Adding the node_exporter

40:05 an actual second target, like the node exporter, we should see this number go up. Okay. So we want to deploy the node exporter to our machine. Yes. Let's do it. Alright. Let's see. Once we do more complicated PromQL, we could probably start using PromLens because it's less of a pain. K. So we have the node exporter. I'm just gonna go to releases and grab the Linux build here. How to split this? Mhmm. Is that my IP address? We'll find out. Yes. It is. Okay. Excellent. So what do we have? We have a license, a notice, and then a node exporter binary. I'm assuming I

40:12 Installing Node Exporter

41:07 just run the binary. Yep. There's tons of flags, but by default, it will do reasonable things. So the node exporter has nothing to do with Node.js. The node in there means a network node or compute node, so a host. And it exposes all kinds of metrics that you would typically expect about a system: CPU, memory, interrupts, network, etcetera. You see a lot of collectors listed here in the log already, so you can enable collectors, disable them, enable ones that aren't enabled by default, configure them,

41:45 have include lists and exclude lists of what devices you actually want to include, like network devices, for example, or file systems if you have a lot of these virtual ones that you don't want to have included. But by default, it will give you reasonable information. So the node exporter mostly gets its information from the proc and sys file systems, but it also does a couple of system calls to get information about the machine it's running on. So just to confirm then, because Prometheus is a pull-based system, our node exporter right now is

42:17 exposing an HTTP endpoint on port 9100. Yep. And currently, Prometheus is not actually aware of that. Right? We're gonna have to change something? Yep. And so this is one kind of cool thing. When Prometheus isn't actually coming by and scraping, the node exporter is just doing nothing. So it's not collecting any data. Since all the operations it does are so cheap and only take milliseconds or so, it can do them synchronously. The moment it's scraped, it just goes to all these proc and sys files, parses them, translates them to Prometheus metrics, and

42:53 feeds them back to Prometheus. So, yeah, we would now need to go into the Prometheus configuration file and add another scrape configuration. And just one comment: currently, we have completely static configurations of targets in this config file. Where Prometheus' power becomes way larger is when you have it, for example, in a Kubernetes cluster and you don't configure any target manually. You just say, talk to the Kubernetes API server, get all the pods you know, and then use a certain labeling scheme to identify them and scrape everything. And then it keeps itself

43:03 Adding Node Exporter to Prometheus Configuration

43:35 up to date. But in this case, we're just gonna do static targets. Yeah. So if you just copy the first scrape config, then we can modify it a bit. So let's call it node. Yep. And use the node exporter port. Yep. Mhmm. Easy. And so you can either restart your Prometheus server, which I guess you already killed anyway, or you could also send it a HUP signal or reload it over the web if you have the admin APIs enabled. But yeah. Oh, nice. Restarting works as well. But it's, of

44:17 course, not something you would typically want to do in production. You wanna reload the config. Everything in the config is reloadable at run time. Yeah. I just don't wanna get into the territory of too many splits on my screen. So I was quite happy Yeah. Yeah. I'm gonna bring it back. Makes sense. Let's go to our status targets again. Yep. Nice. So we have two. Actually, if you just use the browser back button, we should get back to the query. And now if we graph that again, yeah. We see a brief

44:52 outage there because Prometheus was down, and now we see the scrape rate going up. Yep. Perfect. And now, actually, should we start querying using PromLens right away? It's just a so much nicer PromQL interface than the built-in one. I don't wanna just plug my stuff, but... No, no. Show me the way. Show me the good stuff. Okay. So with the free preview version of PromLens that is hosted directly on promlens.com, you can actually just plug in your, yeah, that URL, as long as your browser can... oh, see, that's the downside, though.

45:05 Introduction to PromLens

45:42 You would need to be able to reach it over HTTPS because promlens.com is served over HTTPS, and the browser is not allowed to do mixed-content AJAX. So, yeah, that's the downside. You could run it locally, though, if you just do a docker run promlens/promlens. K. Let's do that. So I'm gonna do your tab. Docker container. One exposed port, which is 8080. Is that all I need? Yeah. I think that's all that you need. Oh, and Docker, of course. Let's check on our level. You'll see why I said that. It's gonna

46:28 be much nicer to work with PromQL. So the main reason is actually the text editor that is currently in PromLens, but the current Docker version doesn't have the very newest stuff yet. Too bad. It's still better than the built-in one. But the text editor will also end up being in Prometheus OSS, just the rest of the PromLens features will not. So Yeah. The port goes first. So let me Yeah. I know better. So you were saying I can just set localhost on a Yeah. Yeah. And now if you just paste the server

47:12 there and yeah. Yep. So that works. Cool. So now we can just type the same metric that we had before, or let's explore node exporter metrics. So if you type node, yep. So the up-to-date version would also show you help strings and type information and all that here. We don't have that yet. So, yeah, we could start with the CPU usage, for example, which is the second metric you see here. And if you just press return, you can graph it below, and you get a

47:20 Querying Node Exporter Metrics in PromLens

47:50 bunch of CPU usage counters. So what you see here is the total number of seconds spent in each CPU mode and on each core and in each process, but we only have one node exporter process, so that's not really useful dimensionality for us. So the main things are the cpu label and the mode label. They are counters, so the absolute value doesn't really help us. So PromLens already suggests that you add the rate function here. It kind of auto-detects that this is probably a counter. And now if you graph it,

48:10 Using the rate function

48:32 you will actually see proper CPU usage. I have to maybe explain why it makes sense to use the rate function on CPU usage seconds. So imagine you only have one core and one mode, to make it simpler. Right? You have a CPU usage, and it can be between zero and one. And imagine your CPU is at 100% usage. Then per second, it would spend one second of CPU time. Right? And if it's at 0% usage, it would spend zero seconds per second in

49:14 the CPU. So if we take the per-second rate over this counter, this is basically giving us the usage in cores, per mode and per core. So if you have multiple cores and sum over them, you might actually get, if you have four cores, a number up to four. You could then multiply it by a hundred to get to a percentage. And yep. So you probably wanna zoom in a bit in the graph by pressing the minus button. This one. Because we don't have a lot of data collected yet in the Prometheus server.
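To make the reasoning about rating a CPU seconds counter concrete, here is a minimal sketch of a per-second rate with counter-reset handling. It is a simplification and not Prometheus's actual implementation: the real rate function also extrapolates to the edges of the time window, which is left out here. The sample values and the function names `increase_with_resets` and `simple_rate` are hypothetical.

```python
def increase_with_resets(samples):
    """Total increase over (timestamp, value) samples.

    Any drop in value is treated as a counter reset to zero, so the
    value after the drop counts as increase since the reset.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def simple_rate(samples):
    """Per-second increase between first and last sample (no extrapolation)."""
    elapsed = samples[-1][0] - samples[0][0]
    return increase_with_resets(samples) / elapsed


# Hypothetical CPU seconds counter for one core and one mode, sampled every 5 s.
busy = [(0, 100.0), (5, 102.5), (10, 105.0), (15, 107.5)]
# Same workload, but the counter restarted from zero between t=5 and t=10.
restarted = [(0, 100.0), (5, 102.5), (10, 1.0), (15, 3.5)]

cores_used = simple_rate(busy)             # CPU seconds spent per second
percent_used = cores_used * 100            # one core, so 0..100
cores_after_reset = simple_rate(restarted)

print(cores_used, percent_used, cores_after_reset)
```

With these numbers the first series works out to 0.5 cores, or 50% of one core. The restarted series still yields a positive rate because the drop is interpreted as a reset. That same interpretation is the reason a gauge that legitimately goes down gives misleading results when passed through rate.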

49:56 So just there, you see, for example, idle. It's almost 100% in idle mode, and then you have a tiny bit of usage in the other modes down there, user and system and so on. And, yeah, you have a bunch more metrics about the machine coming from the node exporter. Okay. So that visualization is really great, and I love the fact that it offered to add that rate. I mean, I haven't really talked much

50:20 Why PromLens and a look at its editor and features

50:34 about PromLens itself yet. Personally, I can't really use PromQL without PromLens anymore. But so PromLens, to give a bit of background: I freelanced the last four years around Prometheus, just doing consulting, training, custom development for companies. And this year, I decided, hey, I have a really cool idea for a new PromQL query builder tool that should exist, but I can't really justify just building it completely in my free time and not even, you know, trying to do

50:46 PromLens Features Deep Dive (Visualization, Editing, Explanation)

51:08 something commercial out of it. And so I created a company, PromLabs, and PromLens is kind of my first venture into this whole business of, yeah, trying commercial software. So it's meant to be run on premise as a Docker container, and you give it a license key, and then it can enable features. For example, at the top, where you currently have to manually enter your Prometheus server, it can actually talk to Grafana, and it will give you a selector of all the Prometheus data sources in your Grafana installation. It also allows you to create shared

51:45 links that you can send around on Slack and so on to query pages. The main value of this tool is that it's basically a PromQL power tool, I would say, because, a, the text editor for PromQL is way better than what's currently in Prometheus, though I want to open source that and bring it into Prometheus as well. But what it gives you is, once you have more complex queries with different binary operations and filters and so on, in just the text version, it becomes really hard at some point to understand

52:23 which sub-selector and subquery is selecting what data, with what labels on that sub-selector, etcetera. So what it does is it visualizes the query that you input as a tree here, with every subexpression becoming its own tree node and telling you what labels are on that tree node and with what cardinalities. So, for example, you've got 48 different values for the cpu label in this case. And, yeah, it shows you how many results there are. So you can really quickly spot, like, oh, in this sub-selector,

53:07 there's actually zero results, so the entire expression will never return anything. It will also tell you if there's an error, and where exactly that, for example a group-by matching error, happens in your tree. I had one consulting session with a client where I had access to their Prometheus server, and I shared my screen. We went through all of their alerting rules, which are important to work correctly, and half of them were broken. I just copied and pasted them into PromLens, and in PromLens, we could see pretty much immediately, like, oh, yeah. This sub-

53:44 selector is not even the right metric name. It always selects zero series, so this alert will never fire. And, yeah, so it gives you a lot of deeper insights into the structure. Now the other thing is it also includes a form-based editor. So for every node here, you can either press e for edit or hover over it, and there's a little icon. Yeah. You can edit any part of the query using completely form-based editing. So you can say, I want an aggregation, and I wanna aggregate over the following labels, etcetera, etcetera.

54:18 You could go to snippets, for example, and also say, hey, I want to calculate the quantile from a histogram. Like, let's maybe just do that. Doop doop. And, yeah, there can be more snippets in the future. So it fills out these placeholders, kind of. If you press e again, it will close that. Oops. Yeah. You can also edit the PromQL inline. So if you just double click on a node, then you can actually edit any part of the tree inline as PromQL and so on. So, like,

54:56 it gives you form-based editing. It helps you in a lot of ways. It gives you these little quick actions like add rate, or add sum to the expression. Hey, this is a histogram, do you maybe mean to wrap histogram_quantile around it? And, yeah, it basically just helps you really understand and build your PromQL queries way more effectively, and especially avoid completely broken alerts that, yeah, are way too common. Okay. So that's really, really useful stuff. Let's tackle one of the questions we have in the chat, and then why don't we contrive a situation

55:34 then? Like, we'll run some processes. Let's try and, you know, spike the CPU and the disk usage, and see how we can explore those metrics on the node. Yep. And give people a good feel for PromQL and PromLens. So the first question we have, another one from Elliot, is asking, can we compare the results from different time periods? I'm assuming he's asking, if I view the CPU usage now, can I overlay that with the CPU usage from an hour ago or ten minutes ago? Not really. Not in this tool.

56:09 It's more for building the query itself and less for looking at exactly how the data behaved. It's more about building the correct query, I guess. So the thing is, there's only one graph tab on this page currently, and it has one time range. You could theoretically build a PromQL query which selects the current set of time series for something and then, with a couple of query tricks, ORs in an offset version of the same query, but that's a bit cumbersome. Like, that's not

56:52 what you would typically do. Yeah. But what you can do is really quickly switch what you're looking at. So you can click with the mouse on any of these query nodes. Maybe if you go to the hosted PromLens, you will see it better. So promlens.com. If you go to, let's say, about, and click on this example page. So really quickly, you can just click with the mouse on any of these query nodes and then see exactly the shape of the data at any of the nodes.

57:35 You can also navigate by keyboard with j and k like in Vim. And, yep, like this, there's a bunch more keyboard commands to help you navigate around. If you press enter, you go into editing mode, inline editing of that node, and so on. Yeah. We figured it out pretty quickly. So it's almost like a little gauge or tap that you stick into the system and see, okay, what's the value at this sub-element here? Whereas if you do the same thing in the standard text

58:08 editor, it gets kind of annoying because you have to copy and paste different parts of your query into a separate panel to see what each one evaluates to. And here, you can just immediately see it. Also, at the very top, if you type a metric name, let's say, delete this: demo_api_request_duration_seconds_bucket. No. Sorry. Like, in Here. Yeah. In there. Yeah. If we just type bucket, for example. Here, you see now you get the type and the help string. So if you just go up

58:45 and down with the arrows on your keyboard, you see the actual metadata for each of the metrics being surfaced more nicely. This looks and feels quite familiar. I mean, is this using Monaco, like VS Code? It used to, and now I styled CodeMirror Next in the same way as it looked before with Monaco. Yeah. Oh, and there's also an explain tab at the very bottom that can actually explain the meaning of any node and subexpression to you. So if you head over there, yeah, and then you select the function call, for example, it

59:20 gives you the documentation for the function, for the aggregation that explains what it's doing. If you go on the binary operation, which is the root of this entire expression, the the slash on job thingy Oh, there we go. It even shows you, like, how it matches up left hand side, right hand side. Maybe I'm overwhelming people a bit if they didn't do PromQL before because we didn't talk really about PromQL much yet. But I just I guess, I I went into the mode of, oh, I wanna show all the features of PromLens, but I will stop now.

59:57 No. I think, you know, being able to visualize the queries, explain them, and have all the documentation inline is a really valuable tool, especially as people are learning how to write their first Prometheus queries. Yeah. So I wanna use it in trainings as well for people to understand why queries are working or not working, what exactly a certain node does, and so on. Yep. Okay. So let's... I hand it back over to you and stop the commercial section.

1:00:15 Contrived situation - lets fill up the disk

1:00:30 No problem. Alright. So let's use our Prometheus server with the node exporter. We have the ability to run our PromQL queries and visualize what we have here. So let's try and tackle maybe what would be, you know, if someone is just getting started with the node exporter and Prometheus, some of the first queries. The ones popping into my head are: how can I track the CPU usage over time? Or, in fact, here's a story. So I used to be an operator, an SRE, you know, over the last kind of five

1:01:05 years. And I always used to get woken up at three or four in the morning. And I really, really hate getting woken up at three or four in the morning. And one of the most regular culprits was the disk being filled up by some bad actor. And I always thought it'd be really cool if I could just use some sort of linear prediction to warn me during office hours that the disk was gonna fill up in the evening. Yep. Now is that something that we could contrive here, artificially create? Maybe if I run

1:01:13 Predicting Disk Usage with PromQL (predict_linear function)

1:01:36 dd and write out a large file on this disk, would we be able to track that usage and see the graph going? So we can. Prometheus has a function for that. And by the way, we also always say, if you can avoid it, don't alert on these absolute disk usage values, but rather try to predict when your disk will actually be at zero free space. And so what we typically do, or recommend people to do, is put the, what's the metric called? Like, there's a metric, the node file system

1:02:11 available something: node_filesystem_avail_bytes. It's important to use available bytes, not free bytes, because available excludes those blocks reserved for the root user. So you run out of that earlier than just free. And plug that into the predict_linear function. So there's a predict_linear function that works on gauges like this. And first of all, on the metric name itself, you provide it with this square bracket selector of how much past data you want to look at for this linear prediction. So it does linear regression

1:02:55 over the data points. Typically, you wanna look over multiple hours. Right? Sometimes people do four hours, because disk usage often, at least, is not a very quickly moving thing. So let's say maybe four hours or so. My thought was, though, if I'm gonna start writing a hundred gigs, it's gonna go very quickly. So we maybe wanna look at, like, seconds, so I can kill it, and then we can just see that. So this won't actually give you any results, because to do a linear regression, you need at

1:03:28 least two data points to fall under this window, and we only scrape at a resolution of five seconds. So to be really sure, you want to have at least fifteen seconds there or something a bit larger. Like, a minute is... yeah, let's do a minute. The second parameter to this function is, now that we have this linear regression that we're doing over this, we can predict this value into the future. So we can say, based on the current development that we've seen over the last one minute in this case, what will

1:04:04 it be in, let's say, five minutes? So the thing there is, at this point it expects a normal expression node and not a duration, so you have to provide it as a normal number. If we want to say five minutes, you could do five times 60, or 300. Yeah. So it expects the number of seconds to predict into the future. And so now what you get as the output is the expected disk usage in five minutes. And now we could add a filter condition and say, only output it to me

1:04:44 if it is expected to be less than or equal to zero. Okay. Just say less than zero. So if we say less than zero, yep. So this filters the set of output time series down to only the ones that are going to be full. And alerts in Prometheus work in such a way that the heart of any alerting rule is a PromQL expression, which is expected to output nothing at all if everything is fine and output one or more time series if there is a problem. So then every output time series will be turned into an alert.
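The idea behind the predict_linear step and the less-than-zero filter can be sketched as an ordinary least-squares fit. This is an illustration under stated assumptions and not Prometheus's code: it predicts relative to the last sample rather than the query evaluation time, and the sample data, the mount point labels, and the Python function name `predict_linear` are made up for the example.

```python
def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) samples and
    evaluate it seconds_ahead after the last sample.

    Returns None when the window holds fewer than two usable samples,
    since a regression needs at least two points.
    """
    if len(samples) < 2:
        return None
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


# Hypothetical one-minute windows of available bytes, scraped every 5 s.
filling = [(t, 2_000_000_000 - 10_000_000 * t) for t in range(0, 61, 5)]
steady = [(t, 2_000_000_000) for t in range(0, 61, 5)]

windows = {"/": filling, "/data": steady}   # hypothetical mount points

# Keep only series predicted to run out within 300 seconds, mirroring the
# "< 0" filter: an empty result means nothing to alert on.
alerts = {}
for mount, window in windows.items():
    predicted = predict_linear(window, 300)
    if predicted is not None and predicted < 0:
        alerts[mount] = predicted

print(alerts)
```

In this sketch only the filling file system survives the filter, so it would produce exactly one alert, while the steady one yields no output at all. A window containing a single sample returns None, which matches the remark that the range has to be wide enough to hold at least two scrapes.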

1:05:25 So, for example, if there were two different file systems now for which this condition is true, you would get two different alerts, one for each, unless you aggregate over them in PromQL or so. But by default, two output time series means two alerts. Cool. So, yeah, if we now start filling up the disk really quickly, this might, at some point, start telling you that. I'm gonna need a new tab now. And this is one of the situations where I would then go into PromLens and see, because with the filter, right, you

1:06:03 typically don't see anything anymore as the final output of the expression. So then you can still click on the subnode that shows you the current trend of the disk usage and see, okay, will it even get close to my filter condition soon, or do I have to change my filter, and so on? Okay. Yeah. Let's trash the disk. I'm sure there's a way to specify how big I want this to write, and I can't remember what it is. Yeah. That's block size and the count. Yeah. Let's just look it up. So It's

1:06:40 just count and bs. Is it as simple as that? Not yet. Oh, there we go. So yeah. Okay. The problem is if it creates sparse files. Yeah. /dev/random will not create sparse files. That's good. Better than /dev/zero then. Yeah. Okay. Otherwise, we won't get the disk usage we're expecting. So I should be able to run this. That should create 1,001 meg files. Right? Yeah. Which is not a lot, but let's just should be one meg, basically. Right? With the current 1,003 The bs is already in kilobytes. Right? Or is

1:07:32 it in bytes? I forgot. Okay. Kilobytes. Okay. Cool. I think. Alright. So let's just see. Right? We have this query here, so I guess the best thing to do is just to duplicate this tab? Right? Some bytes. Okay. Yeah. I guess that actually makes sense. Way more probably, but yeah. So you see what you're doing now? You have to manually edit away parts of the query? Yeah. And in PromLens, you just click on that subnode and see, okay, what's actually returned by this? That's exactly why I

1:08:11 don't work in that normal interface anymore. Okay. So what we're saying is we could actually just use our node_filesystem_avail_bytes here. And then what we had was a predict_linear. Yeah. Uh-huh. We had a one minute and a three hundred. Yep. So you're saying I could just filter the query to just one section? Oh, and we had the filter condition as well, like, less than zero. Yep. And now you can just click on the nodes and see what each subexpression is actually doing. The thing is, with a range vector, you cannot

1:08:46 graph it because at every point in time, it would have multiple values. You can table it. You can see, yeah, this is the raw disk usage over multiple points in time, but you can see what the linear prediction of that is, for example. So I can filter this using the label selector, right, to only get the root disk? Yes. And is that just a case of adding Yep. Mhmm. Yep. Exactly. So now you see, okay, the total query doesn't return anything, but these subqueries do return useful

1:09:24 data. Okay. So now I do wanna start filling up this disk. So let me just confirm what the hell I wrote there. Slash random. That's not the size I was expecting. Or maybe the count isn't what I want. Let's just write, right now. Yeah. I mean, yes, it's fine. So it only would have created a tiny file. Oh, this is even tinier. Yeah. It was smaller than I expected, you know. So I'm just gonna keep writing with dd. So I guess what I should see now is if I You go

1:10:10 to the graph of predict_linear, you will see the linear prediction of where this will end up in five minutes should go up and up, very soon now, at least. Yeah. So we're still waiting on that fifteen-second scrape interval, and we need a couple of values in the window. Two samples to actually see the change, and yeah. So if you see, at this moment it looks like it's going down a lot, but it's not. If you look at the y axis, it's basically staying the same. If you kind of just

1:10:42 yeah. I guess the only real way to refresh this graph is to either change one of the settings or select... I should put a refresh button there directly. Yeah. For the subnode, you know, if you wanna graph the predict_linear directly. Oh, no. Actually, it should go down because we're predicting the available bytes, not the used bytes. But currently, it's still not doing much. So is the actual... is it doing much? Zero bytes copied, so it's not doing anything. Oh, you probably wanna use /dev/urandom.

1:11:24 Right? Or because it's faster. Random bytes. Just Let's just check if that's working. Okay. That's right. That's right. Yeah. Yeah. So now we have to, we could put a clock on there. We have to visualize this before I fill up that disk. Yeah. Before everything crashes and Prometheus runs into problems. It's predict_linear there. I'm using my nice autocomplete here and digging it. You want predict_linear over one minute with 300, less than zero. Can click on this. I am. I'm liking this. I'm liking this. Nice. Now we do okay. So now we started

1:12:01 writing into that disk. We just have to wait for those two samples to collect now, which is gonna take ten seconds. Mhmm. Zoom in a bit, like, do over the last five minutes or so. Yep. And I'm just gonna split this again so I can actually track the disk usage. Yeah. That's a good idea. Yeah. And we wanna make sure that we're really tracking the right one. So you're writing into root. Yeah. Oh, no. Wrong one. There we go. So s t p three, let's just grab. I don't know what. Our disk is filling up. We went

1:12:48 from 6%. We're at 7%. It's gone up to 6.34. Okay. Right. Yeah. Yeah. So this is gonna I think it was a hundred gigs that said we had three. Yep. But now that this is slowly filling up, we should see this graph start to take shape. Right? Yeah. Yeah. So I guess the workaround for not being able to directly refresh this graph for a particular subnode is to select the other subnode and then switch back to predict_linear. Or you can just, like, change the graph setting or something. Yeah. Why is it

1:13:22 not going down yet? We should probably try can you add another query? Press that. Yeah. Add another query and go directly to node_filesystem_avail_bytes. Oh, yeah. We should also add our root mount point filter again, I guess. Oh, yeah. I saw that point. Yep. Mhmm. And so that is not doing much yet. It's not really is it x oh, it's /dev/vda. Are you sure you're filling up the right machine's root disk? Because down there in the legend, if you scroll further down, it says and I think you had

1:14:12 something. Right? Oh, am I writing into the wrong disk? Yeah. I think so. Or am I on the wrong machine? Yeah. I think you're Alright. So can we filter by the series? Like, how do I Like, are we monitoring the right thing? CZedJ8CZedJ8. Will I get the host name from here? Where is give me a prompt. Yeah. CZGA. So we are on the same machine. It's weird that the disk labels are different. But, I mean, this is going down. Right? This is correct. Not really. It's going down, like, a couple of bytes or something. You know? So if you look

1:15:03 at the y axis and it's a different device name, so it doesn't look like the right device. So we went from 6% to 13% usage here. But we're seeing a drop of around, what, six gig? So I'm a bit confused. Can we just make sure we configured Prometheus correctly to scrape from the right host? Oh, wait. At the very top? Oh, I think I know what's going on. Okay. Yeah. Yeah. Yeah. So we reloaded the page. Yeah. Yeah. Yeah. So okay. There we go. Okay. This makes way more sense. I'm like, what

1:15:49 what's going on? Okay. So yeah. Okay. We see the disk usage going down or go So I can remove this query, add in my label filter again where we only wanna see the mount point for the root disk. We've got a one minute aggregate with a five minute prediction. Yeah. And now if we yeah. So then okay. So it will still take some time until it thinks we will be. So if you increase the prediction time to, yeah, longer and longer, eventually, we will get to the point. Boom. Yeah. Now But it's telling me my

1:16:26 disk is gonna fill up. In oh, I don't know how to read this. Well, basically so yeah. What this does is for every point in time in the past five minutes, it calculates the value of that PromQL expression for you. You probably only want to look at the current value, like, what so if you go maybe it's better to look at this in the table view in this case. So which tells, like, at the current point in time, what is the current prediction based on the last one minute, fifty thousand seconds into the future?
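
As a rough illustration of what a linear prediction over a window of samples does, here is a minimal Python sketch. It is not Prometheus's implementation: the function names, the `Sample` type, and the choice to extrapolate from the newest sample (rather than from the query evaluation time) are assumptions made for this example.

```python
from typing import Optional, Sequence, Tuple

# One scraped sample: (unix timestamp in seconds, value such as available bytes).
Sample = Tuple[float, float]


def fit_line(samples: Sequence[Sample]) -> Tuple[float, float]:
    """Least-squares fit; returns (slope per second, value at the newest sample)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to fit a line")
    t_ref = samples[-1][0]  # assumption: measure time relative to the newest sample
    xs = [t - t_ref for t, _ in samples]
    ys = [v for _, v in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("samples must have distinct timestamps")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_linear(samples: Sequence[Sample], seconds_ahead: float) -> float:
    """Extrapolate the fitted line seconds_ahead past the newest sample."""
    slope, intercept = fit_line(samples)
    return intercept + slope * seconds_ahead


def seconds_until_zero(samples: Sequence[Sample]) -> Optional[float]:
    """Seconds until the fitted line reaches zero, or None if it is not falling."""
    slope, intercept = fit_line(samples)
    if slope >= 0:
        return None
    return max(0.0, intercept / -slope)
```

With a helper like `seconds_until_zero`, the trial and error over the look-ahead value becomes a direct calculation: the alert condition `predict_linear(...) < 0` holds once the look-ahead exceeds that number of seconds.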

1:17:13 Right. Okay. And then what I'm saying is that fifty thousand seconds into the future, I'm actually gonna have negative space on my disk. Exactly. Exactly. So we could, like, fiddle with that number to see when exactly it reaches zero. So maybe Trial and error. Yeah. 3,000. It's for Emma. Oh, it's filled up now. Is that what it's telling me? Right. No. Of course. Yeah. Yeah. Good enough. Good enough. So, you know, like, in half an hour or something, if you still had the look only five minutes in the future, then it would alert you in, I don't know,

1:17:57 half an hour or something. Yeah. I guess what is important here? Like, for this scenario that I just kind of randomly threw at you there is that if we say, you know, in twelve hours, what's that in seconds? Twelve hours. I should know that. Right? Right. 43,000. Like, within the next twelve hours, will my disk be at a negative value? That's what's really important for me there. I know that I should be raising an alert during office hours to say, look, you may wanna and that's only for predictable growth based on the linear prediction algorithm. You know,

1:18:27 it's not gonna detect random bursts in disk usage, but still, like, a good alert to have, I would suggest, probably. Yeah. Yeah. And maybe just to give a bit of an idea of how to use this in a more sophisticated alert because this one alone is maybe sometimes not enough. You wanna combine it with, like, oh, the trend, but also combine it with the total amount of disk space. Like, certain disks have different trends or limits. If you search for kube-prometheus, which is a project initially by oh, sorry.

1:18:34 Real-World Alerting Example (Kube-Prometheus Disk Alert)

1:19:07 No. Like, Google. Yeah. Yeah. Kube dash Prometheus. Yep. That's fine. And then press t, and then you press rules. And then you go to the manifests slash Prometheus rules. No. Like the Yep. Watch one. Mhmm. And I think you search for, like, node_filesystem_avail, like, this metric name that we had. Yep. So this is one example. For example, like, NodeFilesystemSpaceFillingUp. You could actually paste that over into PromLens, that expression. Visualize it. Oh, so The text field at the very top, you could just, like, paste it in. Got it.

1:19:56 And so the first thing that does is it calculates the ratio of available bytes to the total size for each of the partitions. It multiplies it by a hundred to get from a ratio to a percentage. Then it filters, so it only gives you those partitions that have less than 40% available space. So this is an alert that, with just this top stuff, would alert you for everything that has less than 40% available space. But now it adds an extra condition and says, okay. And for those same label combinations, it has to be true

1:20:38 that the linear prediction over the last six hours, projected one day into the future, this 24 times 60 times 60, has to be also negative, less than zero. Like, that's what we did. So if a file system currently has less than 40% free space and in one day is projected to have zero, then we want to alert. And there's actually more conditions further down. If you scroll, there's, like, one more "and", which is like, okay. If it's a read-only file system, we wanna ignore it. So, like, in practice, there's a few more conditions

1:21:17 you wanna add. And it's always tricky also to get this predict_linear to work just right for your workload because, like, what exact amount of history do you want to look over and how far into the future? You have some workloads that generate, like, a sawtooth, and then they do cleanup and go up again and do a cleanup. And if you look only over a couple of minutes and predict too far into the future, then you might think, like, oh my god. The disk is gonna fill up. But, actually, they,

1:21:53 you know, they do a cleanup way before they actually fill up the disk. So you kinda need to know the workload on your machine and adjust these alerts accordingly. So it is a bit tricky still, but such is life. It's messy. Exactly. Yeah. I think a lot of this stuff is really just you gotta start collecting the data first and then start to understand how it changes and evolves over time and write your queries. I mean, I think Prometheus is a really good project. It provides some really sensible defaults, but, you know, teams probably only understand their own infrastructure,

1:22:25 their own use cases, and adapt them wherever possible for sure. Yeah. Kubernetes is great. Yeah. I mean, we've not even looked at the Kubernetes integrations for Prometheus yet. And, you know, we're just about out of time. So maybe that'll be something for another day where we can because as said, if we just summarize it, I guess, in a few minutes, but where we have been specifying our targets here in the configuration file itself, it's not something we would necessarily have to do with Kubernetes there. Is that correct?
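
The combined disk alert walked through in this chapter (less than 40% free, predicted to go below zero within a day, and not read-only) can be sketched as plain boolean logic. This Python snippet is only an illustration of how the conditions combine; the function and parameter names are hypothetical, and the predicted value is assumed to come from a six-hour linear prediction projected 24 * 60 * 60 seconds ahead, as described above.

```python
def filesystem_filling_up(
    avail_bytes: float,
    size_bytes: float,
    predicted_avail_bytes: float,
    read_only: bool,
    min_free_percent: float = 40.0,
) -> bool:
    """Return True when all three alert conditions hold for one file system."""
    # Condition 1: ratio of available to total bytes, as a percentage, under the threshold.
    free_percent = avail_bytes / size_bytes * 100
    low_on_space = free_percent < min_free_percent
    # Condition 2: the linear prediction says available bytes drop below zero.
    predicted_full = predicted_avail_bytes < 0
    # Condition 3: read-only file systems are ignored.
    return low_on_space and predicted_full and not read_only
```

Requiring both the threshold and the prediction is what keeps a nearly empty disk with a steep short-term trend from paging anyone.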

1:22:30 Other Prometheus Topics (Service Discovery, Alertmanager, Exporters, Remote Storage)

1:23:00 Yeah. Yeah. You basically can have in the extreme case, you could only have one scrape config where you point it at the API server and map everything over in the correct ways, and then Prometheus will always have an up-to-date view of what is in your cluster. Now in real use cases, you still have multiple scrape configs to handle certain things a bit differently, and you can do blackbox probing and, like, this external probing and so on. There's, like, different things you can do. And I guess, like, what are other areas we haven't mentioned yet? We haven't even touched on

1:23:35 the alerts. Yeah. We haven't set up Alertmanager. Like, it's a component that aggregates the alerts of all your different Prometheus servers, can correlate them, bunch them into one notification instead of, like, a thousand, give you nice human-readable alert snippets on Slack, PagerDuty, Opsgenie, etcetera, etcetera. There's, like, hundreds of open source integrations to get metrics out of all the things you care about. These are called exporters. There are people building and also sometimes offering as SaaS services remote, durable, scalable storage. So if you wanna keep your data forever in a durable system, Prometheus

1:24:00 Final thoughts

1:24:17 can send it there. Yeah. There's so much more. Yeah. I think what I'll say to the people that are watching this, I will definitely run more sessions on Prometheus. That's a guarantee, especially with Kubernetes as a service discovery and looking at how we monitor Kubernetes. I won't make you commit right here and right now that you'll join me in those sessions. Oh, got you. Yeah. For sure. I was trying to give you a get-out-of-jail-free card there, but you threw it away. So now Julius has just committed to joining me on more exploration

1:24:47 of Prometheus within a Kubernetes context. We just have a comment from hacker saying very interested in the lights and thank you both. So Definitely, Julius. This has been an absolute pleasure. It's been great to see Prometheus and, you know, play with the exporter and really get a nice look at PromLens. It looks like a really exciting project, and I'm hoping that there's a lot more to come there as well. So thank you. Yeah. Thank you. That's been fun. Oh, we got one comment popping in at the end there. Let me see if I can read this

1:25:15 Q&A: Monitoring Web Applications (RED Metrics)

1:25:18 quickly. For example, I'm gonna have to work on a fresh Laravel project. What would be the first metrics you would like to monitor? Give some examples, please. Yeah. Okay. We can do this in three minutes. Right? So Laravel, don't know if you're familiar with it, Julius, is a PHP framework. It delivers web applications. So do you have any opinions? I certainly got a few I can throw out as well on what metrics are important for a web application. Yeah. For web applications in general. So PHP can sometimes be a bit challenging if it's

1:25:47 not a long-running server that can actually track metrics over time, like these counters that, you know, are supposed to only go up. If you have such a situation, just a general comment, you will immediately need to, like, externalize your events to some system that can aggregate them into counters. For example, send them as StatsD-style metrics and then use the StatsD gateway that can aggregate them over time and have Prometheus scrape that. Just one thing. Not too familiar with Laravel itself, but, generally, with web applications, the first things you start tracking are the

1:26:20 so-called RED metrics, R, E, and D. So R stands for requests, which is, like, basically counting how many requests you've had and of which different types. Like, maybe you wanna track each method or path or status code a bit separately. So a counter for that so then later on, you can graph with the rate, like, how many requests per second are you getting. The second one, E, would be the errors. So if you don't already have it broken up by status code or, you know, if you have errors in your application, you

1:26:55 wanna count how many of which type are happening. So then you can also alert on that and graph that. And the D stands for durations, basically a distribution of your latencies in your request handling. So you could use a summary for that if you never wanna aggregate over any dimensions. But, typically, probably, you would want to use, like, a histogram metric for that to track the request latencies into buckets and then later on calculate percentiles from that and so on. So those are the first ones, and then it depends very much on the internals of an application,

1:27:32 what other stuff is happening in there. Are there queues where you wanna expose the length or, like, different other stuff? But, yeah, those are the first metrics that are, like, the obvious ones to add. Yeah. Definitely. I mean, I think for the ones you've listed there, there are probably libraries. I know that, you know, PHP frameworks are very popular. They'll probably handle most of the RED metrics for you, handling how many requests come in, the response times, the histograms, and all that. And it's really important, like Julius just said there as well, to understand your own application and then add

1:27:44 Conclusion

1:28:01 your own instrumentation. You actually understand your own domain. So I think there's actually an OpenTelemetry exporter for Prometheus for PHP applications and PHP runtime as well. So there's definitely loads of things you can look at there, Robert. Alright. Awesome. We have no more questions. I will say thank you again, Julius. It was a really insightful and a great learning experience. So thank you for joining me, and I look forward to our next session. Thank you. That was a lot of fun. Be happy to join again. Alright. You have a nice day. I'll speak to you soon. Thanks. You too. See you.
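
To make the RED metrics from the Q&A concrete, here is a small, dependency-free Python sketch of the bookkeeping involved: a request counter split by labels, an error counter, and a latency histogram with cumulative buckets. The class name, the bucket boundaries, and the rule that status codes of 500 and above count as errors are all assumptions for illustration. A real application would normally use a Prometheus client library for its own language instead.

```python
from bisect import bisect_left
from collections import Counter
from typing import List, Sequence, Tuple


class RedMetrics:
    """Requests, errors, and durations for a web application (illustrative only)."""

    def __init__(self, buckets: Sequence[float] = (0.1, 0.5, 1.0, 5.0)) -> None:
        self.buckets = tuple(sorted(buckets))  # upper bounds in seconds
        self.requests: Counter = Counter()     # (method, path, status) -> count
        self.errors: Counter = Counter()       # (method, path) -> count
        # One slot per bucket plus a final slot for observations above all bounds.
        self._bucket_counts = [0] * (len(self.buckets) + 1)
        self.duration_sum = 0.0

    def observe(self, method: str, path: str, status: int, seconds: float) -> None:
        """Record one handled request."""
        self.requests[(method, path, status)] += 1
        if status >= 500:  # assumption: only server errors count as errors
            self.errors[(method, path)] += 1
        # First bucket whose upper bound is >= the observed duration.
        self._bucket_counts[bisect_left(self.buckets, seconds)] += 1
        self.duration_sum += seconds

    def cumulative_buckets(self) -> List[Tuple[float, int]]:
        """Histogram as (upper bound, count of observations <= bound) pairs."""
        bounds = list(self.buckets) + [float("inf")]
        running = 0
        result = []
        for bound, count in zip(bounds, self._bucket_counts):
            running += count
            result.append((bound, running))
        return result
```

Because the counters only ever go up, a query layer can derive requests per second and error ratios from them with rate-style calculations, and estimate percentiles from the cumulative buckets.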
