About this video
What You'll Learn
- Differentiate metrics from events using logs, cards, and football examples.
- Use Telegraf to collect metrics from systems and services.
- Apply downsampling and monitoring patterns for Kubernetes workloads.
Lecture one of the InfluxDB 2 course. David traces the history of encoding and sharding from ancient Rome, defines time series data (metrics versus events), shows collection with Telegraf, and previews downsampling and Kubernetes monitoring with InfluxDB.
Jump to a chapter
- 0:00 Introduction
- 0:16 Introduction and Course Overview
- 2:04 Beginning the Lecture
- 2:30 Encoding
- 2:32 Pop Quiz: History of Data Concepts
- 3:08 History Pop Quiz: Encoding
- 5:45 Sharding
- 5:53 History Pop Quiz: Sharding
- 7:45 History
- 7:49 Transition to Time Series & Course Goals
- 11:10 Lecture Topics & What Will Be Covered
- 11:56 Time Series Data
- 11:58 What is Time Series Data? (Definition)
- 12:15 The Value of Timestamped Events (Example)
- 15:52 Logs as Time Series Data (Events vs. Metrics)
- 17:05 Types of Time Series Data
- 17:07 Types of Time Series Data: Metrics vs. Events
- 18:36 Examples: Metrics & Events
- 19:40 Football Example: Identifying Metrics and Events
- 21:16 Collecting Time Series Data
- 21:31 Collecting Time Series Data
- 22:00 Introducing Telegraf for Collection
- 22:35 Data Collection: Push vs. Pull
- 23:27 Use Cases for Time Series Data
- 24:20 Time Series Databases vs. General Databases
- 25:31 Growth and Adoption of Time Series Databases
- 26:01 Poll Results
- 27:53 InfluxDB
- 27:55 Introducing InfluxDB
- 28:45 Time Series Data Vocabulary (Points, Tags, Fields, Measurements)
- 31:00 Value & Cost of Time Series Data
- 31:26 Frequency and Resolution
- 33:18 Demonstrating the Cost of High Resolution (Examples)
- 36:38 Managing Data Over Time: Downsampling / Rollup
- 39:46 Downsampling
- 40:13 Downsampling Example (InfluxQL)
- 41:23 Downsampling Events & Anomaly Detection
- 41:50 Advanced Use Cases & Course Plan
- 42:04 Monitoring Example: Simple Monolithic System
- 42:58 Alerting in Simple Systems
- 44:28 Monitoring Example: Cloud-Native Complexity
- 45:05 Course Focus: Kubernetes Monitoring with InfluxDB
- 45:48 Root Cause Analysis & Statistical Functions in InfluxDB
- 46:35 Practical Examples of Statistical Analysis
- 47:17 Using Histograms in Time Series Data
- 48:10 Proactive Operations & Prediction with Time Series
- 48:53 Outro
- 48:54 Lecture Summary & What's Next
- 49:28 Rawkode Academy & Future Courses
- 50:04 Closing Remarks
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
0:16 Introduction and Course Overview
0:16 Hello, and welcome to lecture one of the complete guide of InfluxDB two. My name is wow, confusing. My name is David Flanagan. I will be your guide to InfluxDB two. And I just want to start these sessions by talking a little bit about how they're gonna work and other future sessions that will be coming soon. So live courses on the Rawkode Academy are a mixture of live streams just like this, which will be driven by some live coding, some slides like we'll see today, and live q and a's. We'll be doing these multiple times per week to just try
0:59 and engage and make sure that the course material is being absorbed and consumed and, you know, we're all learning. There will also be prerecorded videos which accompany each of these live sessions, which will break down into smaller components each of the sections of the workshops that we'll be tackling and playing with over the next days and weeks. So they're multifaceted. I really hope you enjoy them. I would love any feedback that you as incubating members are happy to give to me. So jump into the Discord. Don't forget, rawkode.chat is the best place to come in. There is
1:35 the incubating lounge. If you're not already added to the incubating lounge, and there are a few of you that I know are not in there, then remember to connect your YouTube and your Discord together so you can join the other members and chat about these courses. And also please make suggestions for other courses you'd like to see coming along at the Rawkode Academy. I will do my best to make that happen, either guided by myself or guided by friends that I've got in the community, the cloud native community. Okay. That being said, we're gonna start today's
2:04 Beginning the Lecture
2:06 first live session. This should be around thirty to forty minutes long. We're gonna talk about the history of time series, and then I'll give you a little bit of information on what is coming next week. Let's get started. This is an introduction to time series, lecture one in this course. Let me get my mouse in the right place. Mouse mouse. There we go. Well, I'm gonna start with a little bit of a pop quiz. I don't just want to throw the time series at you and be like, hey, this is time series. Like, let's make it a little bit fun along the
2:32 Pop Quiz: History of Data Concepts
2:43 way. So we're gonna start with a little bit of a pop quiz just to talk about the history of computing and time series side by side, and we'll see how we got. Now I know I can't receive any of your answers because of the what is that annoying thing in the corner? I know I can't take any questions from you right now, but that's okay. Feel free to leave them in the comments. I will do my best to tackle them. Encoding. Now when I talk about encoding, I'm talking about as, you know, computer scientists and engineers,
3:08 History Pop Quiz: Encoding
3:14 programmers, whatever you want to call yourself, is our ability to transfer some piece of information in a format that isn't the raw format. You know, Base64 would be an example of encoding, our ability to translate messages into other formats. And when we think about this, you probably think that's relatively new. I'll give you two seconds to pick your numbers in your head and see if you're close. One, two. And the answer is no. Encoding, well, not specifically to do with computers, of course, but encoding goes back many, many, many years. First used, at least as far as I
3:55 could see, in April. An event is documented in The Lives of the Noble Grecians and Romans by Roman historian Plutarch, where Plutarch is telling us about the story of Alcibiades. Alcibiades was a mercenary who had a fleet of ships and people by his side, and he didn't really have allegiance to anyone except himself. So when Alcibiades would show up to a war in the seas, it wasn't until he raised the sign that people knew which side of that battle Alcibiades was gonna be on. So that's encoding a message of support through a flag on a ship.
4:39 Now, of course, it took a long time before this system really changed. You know, according to everything I can find online about flag systems, it wasn't until the fourteenth century that they actually evolved to have, like, two signals, and that was, like, one or two flags. And this is from the Black Book of the Admiralty. Although just a hundred years after that, things did evolve much quicker. I think people realized maybe there was a flaw in their one and two flag system. But by the fifteenth century, it had 15 flags and each flag had a different encoding
5:11 or message, symbol, etcetera on the flag, which all had different meanings. So you have to understand what each of the flags meant to know which message was being passed. And then a few hundred years later, we have the French system, which I will fail to pronounce, which has 10 coloured flags each representing zero to nine, and generally these ships would have three sets of those flags, being able to transfer tuples of information to the other ships, with a big book to look up what each signal means. Now I'll do one more of these. Of course, based on the encoding one, you know
5:45 Sharding
5:50 that I'm using the term encoding loosely. The same will be true for sharding. So I'll give you a few seconds to guess. Where do you think sharding or what was the oldest reference to anything I could find that resembled sharding? And I'm going back to January this time. The first documented example of sharding I found was actually described by Polybius, and this is talking about the way that the ancient Romans transferred messages on the battlefield. Common theme, but not important to today's conversation, is that ancient battles drive the innovation of their time. Now what the ancient Romans
5:53 History Pop Quiz: Sharding
6:32 were doing was splitting their alphabet into five parts and using tablets. So they would have five tablets and they would use these to transfer messages really, really quickly or at least yeah, they would be used as a translation tool for them to translate the message. And each tablet would have five letters on it. Here's a photo from ancient Rome, so they were using fire to actually send these messages many, many distances. But what this system would do, it'd say, okay, we've got two flames on the left, that means we want to look at the second tablet, and then the five flames on
7:12 the right means use the fifth letter on that tablet. And you can imagine the reason: these fire flames are super, super quick, they can translate and send those messages hundreds of miles depending on what they're using to burn and what kind of smoke they're getting. But this was the way that they won wars. Very cool. A lot of these tidbits came from The Early History of Data Networks. You can go and buy this book. It is quite expensive, but from start to finish, it is just such a fun read and I couldn't recommend
7:43 it more to people. Okay. So let's get on track. This is the talk about InfluxDB, and we wanna understand use cases for InfluxDB. And the best way to do that is to quickly take a look at the history of time series before we break down what time series actually is. So let's see. I guess, you know, having that pop quiz at the start, we see that, you know, the ancient Romans did drive a lot of this or at least some familiarity of these concepts that we have in modern times is that when it comes to when it comes to
7:49 Transition to Time Series & Course Goals
8:21 time series, the Romans did that first too. And in fact, there's this great paragraph that says, I won't read it verbatim, but, you know, the things that are highlighted here are legal bodies sold to public investors and traded, and the values of these shares fluctuated over time. Now it doesn't say time series here, but of course the Romans must have had a way of tracking the fluctuations in the value of these commodities over time to understand: is the organization or the legal body doing well, is it going down, etcetera, and being able to pay out those dividends to the public investors.
9:06 So yeah, time series is old even though it seems to be like something we can consider pretty modern. Some little facts for you: the first ever IPO was the Dutch East India Company in 1602, the first US IPO wasn't until 1873, and that was the Bank of North America, and then it wasn't until 1884 that someone asked a question about the price of wheat. Why is this important? Well, this is the first documented usage of the actual term time series that I could find. So this is the first time anyone had ever put those two words together that was
9:46 in some sort of logged and written form that could travel the years to today. In 1884, this paper was published in the Journal of the Statistical Society of London, which was building comparisons and looking at the fluctuations of the price of wheat correlated with the value or the import price of cotton and silk into Great Britain. We're trying to work out, if we are importing more cotton and silk, does the price of wheat go up or down, etcetera, etcetera. Now it's not a paper I would encourage you to go and read. Of course, it's not that long, so feel
10:22 free, but that was a very nice find to see time series mentioned in this way. And, of course, just having that paper, at least the first one I could find that was applying the statistical mathematics to the dimension of time, I found really interesting. Okay, so my former boss, Paul Dix, the CTO of InfluxDB, once said that most data is best understood in the dimension of time. And I think that is one of the truest things that he ever said. And I'm looking forward to taking you on this journey of time series and InfluxDB 2,
11:03 and helping you understand your time series, your data and your systems. Okay. With that being said, what is time series data? Well, we're going to cover that. We're going to take a look at what time series databases are. Some of you may be familiar with them, some of you may not be. I will get you acquainted with the vocabulary of InfluxDB. And I really want to talk about the value of time series data. When we talk about time series data, it's very simple to talk about the collecting and storing and querying of that data. But there's something much more
11:10 Lecture Topics & What Will Be Covered
11:39 important to understand, and it's grasping the value and longevity of that data and being able to work with it accordingly. And then we'll talk a little bit about those more advanced use cases for time series, and this is some of the stuff that we will be covering over the next couple of weeks on this course. Okay. So what is time series data? I'm gonna keep it simple, and it is any piece of data with a timestamp. That is it. It's really, really that simple. If you have a value and a timestamp, you can track the change
11:58 What is Time Series Data? (Definition)
12:09 of that value over time and that is time series data. Gonna try and do this through an example, and I'll try and move my face out the way a little bit. What we have here are events, things that could happen in your infrastructure. Here we can see that the memory is 100%. We can see a health check failed. We can see a database migration has been run and a whole bunch of other things. But they don't mean much in this form. Get my eyes back. But what we can do is try to identify what these events potentially represent.
12:15 The Value of Timestamped Events (Example)
12:52 And what we can represent, or at least what we can try to infer from these events, in the reddy-pinky color, is that these are cause for concern. These are events that maybe I would want some alerting on. Like if a health check failed, I probably wanna know about it, depending on how many of them happened and in what space of time. More on that next week. Memory hit 100%. Yeah. I probably definitely want to know about that. That seems quite dangerous. And pods being killed by the OOM killer? Yeah. Of course. Right? These are important events
13:26 I wanna understand. Now in the yellow color, what we see here is potential causality. So a database migration ran, pod restarted, a new version of our container was deployed, and we have CI passed and started. Now these events are not particularly malicious themselves but they do have the ability to mutate state. Something in the system has changed And change is always the cause of one of the red ones. Something has to happen for bad things to happen. Purple, what do we see in purple? In purple, see nothing. Really, these are events that we probably just
14:10 want to discard. We would not consider these things that could cause too much change or too many problems within our system. And of course, the red herring in pink, it's Scotland qualifying for the World Cup. Never gonna happen. Now these events, while we can infer what they mean and we can try to guess what is happening in the system and its current state and its current visualization, we just have no idea. We are flying blind. However, if we apply the dimension of time, move me back down. If we apply the dimension of time to these events, we actually
14:49 get a really strong understanding of what happened in the system. We can actually see that the memory hit 100%. We then see that the OOM killer killed the pod. We were probably using the latest tag, which triggered a new deployment of our application, maybe one we weren't ready to deploy just yet. That caused a migration to run in our system again, which we were not expecting, and now our health check is failing. We now know what happened and all we did was replay these events in the order that they happened, or at least visualize them in the
15:26 order that they happened to build the understanding. So most data, and data as events with points in time, are always best understood in this fashion. And that is why time series is so important and why we're putting this course together on InfluxDB 2. There we go. Now, you may already be familiar and have lots and lots of time series data. This is a screen where the words may not be familiar right away, nor even probably visible to you depending on your screen, but we're all really familiar with this, right? This is some log being tailed.
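The replay described above, putting events back in the order they happened, can be sketched in a few lines of Python. This is only an illustration: the event names follow the slide, while the timestamps and the `replay` helper are assumptions made up for the example.

```python
from datetime import datetime

# Event names follow the lecture; the timestamps are hypothetical.
events = [
    ("2021-01-01T10:04:00", "health check failed"),
    ("2021-01-01T10:00:00", "memory hit 100%"),
    ("2021-01-01T10:03:00", "database migration ran"),
    ("2021-01-01T10:01:00", "pod killed"),
    ("2021-01-01T10:02:00", "new container version deployed"),
]


def replay(events):
    """Return the event names ordered by their timestamps."""
    ordered = sorted(events, key=lambda e: datetime.fromisoformat(e[0]))
    return [name for _, name in ordered]


timeline = replay(events)
for name in timeline:
    print(name)
```

Without the timestamps the list is just five unrelated facts; sorted by time it reads as a chain of cause and effect.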
15:52 Logs as Time Series Data (Events vs. Metrics)
16:07 Now, what is the first thing that we always get at the start of a logline? It is typically a timestamp. Logs are classical time series data. These are events that happen over time that can be aggregated into metrics. And I want to make that last little point really clear. All metrics, every single metric in the world, and you can bring me all the metrics you want to the Discord. Every metric is an aggregation of some series of raw events. Challenge accepted. Bring me a metric that isn't. I will concede, but it's not gonna happen. What we really want to understand here
16:48 is that metrics are always aggregations of events, the raw form, and we have to make trade-offs and understand when to store which. That will also be covered either next week or the following week as we dive into InfluxDB more. Okay. So what is time series data? Well, there are two different types of time series data, as I just kind of covered in that log back there, but I really wanna make sure that we understand the difference. Regular time series, these are aggregated events or metrics, are predictable and evenly distributed. Now what I mean by that is that
17:07 Types of Time Series Data: Metrics vs. Events
17:28 I should be able to get the value for a metric at any given interval and it should be consistently available. If we think about the CPU load average of a Linux system, I can request that value from the kernel every one second, every five seconds, every minute, entirely up to me, but there will always be a value. Much like the temperature, I can always use the thermometer to get a temperature. Irregular time series, or the raw form, events, are unpredictable and very inconsistent. An example here would be if I am working on a ticketing system for a football
18:06 stadium, I don't know when the next person is gonna scan their ticket, but I can tell you how many people are in the stadium through an aggregation of the ticket scans over a certain window. We'll dive into more, and from here on in we will always reference them as metrics and events rather than regular and irregular, because it's the vocabulary that you're probably more familiar with and that we all use every day in the software and technology world. Now, good examples of metrics are CPUs. So you know I cover this and I mention that all the time because it's the one that
18:36 Examples: Metrics & Events
18:42 we're all really familiar with. Memory usage is another one. I can always get a value for how much memory is being consumed. I can always get a ping time for google.com even if it's indefinite because it's not up. And I can always get the number of processes from a kernel as well. So these are metrics. These are values that change over time. Great examples of events. You cannot predict when a user is going to click the login button no matter how hard you try. You cannot predict when someone is going to get their password or username wrong to log into your
19:16 system. You have no idea when the next CI build is going to be published and you definitely have no idea when a network engineer is going to trip over a rack and pull out a cable. Now, these are still important events and we can build aggregated metrics from them, but they are irregular and unpredictable. Now I'm not gonna sit on the screen for too long, but I always like people to really try and cement their knowledge of what are metrics and what are events by looking at a very common example. If you're a sports fan, I don't even
19:40 Football Example: Identifying Metrics and Events
19:49 watch football, I don't know why I keep using this, but a lot of people do watch football. And just by looking at this, there are so many examples of metrics and events just in this screenshot. So I'll pick a couple, but if you want to try it yourself, feel free to pause the video now and see how many you can come up with. Now the metrics that we have available to us, we can kind of already see. We have a score. Right? We can see that Liverpool are beating Barcelona two nil, and that is three two on aggregate. Those are
20:25 two different metrics. But we also have the raw form of that, the events that contributed to the score. We can see that those two players scored at seven minutes and fifty four minutes, and that gives us an aggregated score of two nil. If we had, well, I'm sure someone has the event data, but we can also see the other three goals were from the first leg of the match. So those are the events. If there were any corners, free kicks, red cards, yellow cards, those are more events that happened in the game that could be tracked.
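As a rough sketch of how the score metric falls out of the goal events, the aggregation can be written as a small Python function. The minutes 7 and 54 are taken from the example above; the record layout and the `score_at` helper are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Goal events as (minute, team). The minutes 7 and 54 come from the lecture;
# the tuple layout is an assumption for this sketch.
goal_events = [(7, "Liverpool"), (54, "Liverpool")]


def score_at(events, minute, teams=("Liverpool", "Barcelona")):
    """Aggregate the goal events up to `minute` into a score metric."""
    goals = Counter(team for m, team in events if m <= minute)
    return {team: goals[team] for team in teams}


print(score_at(goal_events, 5))   # before the first goal
print(score_at(goal_events, 60))  # after both goals
```

The events arrive at unpredictable moments, but the score can be asked for at any minute of the match and always has a value, which is what makes it a metric.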
20:57 The number of people in the stadium is a metric. The events are the people scanning in and out of the stadium and so forth. But there's a lot of time series data all around you, everywhere you go, every day of life. You cannot unsee time series data when you know that it's there. So again, I think it's just super important: all metrics are aggregations of events. And that should hopefully help your understanding of this as we move forward. So the next thing that we want to get comfortable with when we're doing this in action on the next video is collecting
21:31 Collecting Time Series Data
21:36 time series data. Now you can do this through Prometheus exporters, which you may be familiar with if you're already working with Kubernetes, or you can use a tool like Telegraf. It's entirely up to you. They're all great. They all do a wonderful job. It just depends on where you're going to be storing them. Because this course is going to focus on InfluxDB 2, we will be using Telegraf for the majority of our metric collection. Telegraf is a really cool tool because it has inputs for almost everything. It's got at least a few hundred input plugins
22:00 Introducing Telegraf for Collection
22:11 where it can read metrics from, you know, Kafka, Kubernetes, Linux, Puppet, TLS certificates is a really cool one. It has remote support for HTTP endpoints, gRPC endpoints, etcetera. It can write metrics to anywhere. Again, we'll be using InfluxDB two for this course. And from Prometheus, there's always or generally always an exporter equivalent to the plugin, which always brings us to the wonderful debate, push versus pull. Now for metrics, pulling them is fine. Why? Because they're always available. They are consistently and predictably available. If I want to pull metrics on a ten second, twenty second, one minute, hour
22:35 Data Collection: Push vs. Pull
22:53 basis, then that works. You cannot pull events. You cannot pull the raw form because they're unpredictable. So you actually need a combination of both, which is why I always push people in favor of tools like Telegraf, because it can handle both scenarios. So you definitely need both. Understand that for metrics, it's okay to pull, it's encouraged to pull. For events, you definitely need a push-based system regardless of what you choose. Now, use cases for time series data: why is this course going to be important to you? Well, if you're coming from a cloud native or Kubernetes background, you probably want to
23:27 Use Cases for Time Series Data
23:35 do infrastructure monitoring. You're probably writing applications. You're probably consuming third party services. You need to understand how well they're functioning and where problems arise from, with good, strong root cause analysis. If you're into IoT and sensorification of the real world, then you maybe have too many Zigbee devices, Raspberry Pis, and other devices around your home, and you can do so much cool stuff with all of the metrics that those emit day in and day out. And if, you know, you have your own website, your own blog, you track your Twitter performance, your YouTube performance, things that are important to me,
24:11 then real time analytics from all of these services provide really fantastic time series data. Okay. Now when it comes to the TSDBs of choice, this course is focused on InfluxDB 2, but there are others. But the thing that I think is important for right now is that a lot of people do store their time series data in general purpose databases. And there are a few reasons that that's not gonna work. Firstly, time series data is generally high frequency. Right? Especially with IoT and sensorification, we are writing these real world metrics at millisecond, sometimes nanosecond
24:20 Time Series Databases vs. General Databases
24:51 frequencies or resolution, particularly with high velocity, high frequency trading in the financial market. The way that we read data from a time series database is also wildly different from a general purpose database. We're generally scanning large chunks of data over a particular time window, so you really want a database that is built for that purpose. And then the last part of this session today is going to be talking about the value of time series data, and time to live and life cycle management and time sensitivity of that data is really, really important and something that you
25:26 wouldn't get through a general purpose database. So, again, why is this course important? Well, we take a look at the trend over the last, I think, twenty four months. In fact, that was this is last year's graph. I really need to update this. We can see that time series databases are the fastest growing database category. And that's because of this migration to cloud native and Kubernetes. As we build more and more distributed systems, we need the tools, the knowledge and the understanding to be able to operate them efficiently. Time series databases do that. Now, there was a really cool
26:01 Poll Results
26:05 poll that was done by The New Stack that was asking, do you store your time series data in a time series database? And only 12% of those people said yes. So I'm here to see if that is true. And if not, I want to show you the truth. Plus, I just really like Rick and Morty. Now, you're probably all familiar with New Relic. It's expensive, so if you work for a larger organisation, you may be familiar with it, otherwise you may not. And if you're not in a large organisation, you're probably more familiar with Datadog
26:35 and even to some degree, Google Analytics. These are all time series databases as well. Sometimes we don't really think of them in that way, but they are tracking metrics and events that change or happen over time. So I'm not entirely sure that The New Stack survey took that into account, and I think more than 12% of people are using time series databases for their time series data. Here's a question I threw on Twitter a long time ago, but it still has resonated and
27:08 stuck with me ever since. But I asked, you know, I run Kubernetes in production and I monitor it with. And there's like a thousand responses to this. And 74% of people said Prometheus, which I thought was great. 3% used InfluxDB. Yep. That's all good too. There was some New Relic and Datadog, but 13% of these people said nothing. 13% of people are not monitoring their Kubernetes. That is really scary. So I hope if you're in that 13%, you're watching this course, and you're gonna learn how to monitor Kubernetes with InfluxDB 2. That is our goal. That
27:46 is one of the end achievements of this course, is you will have pitch perfect monitoring of your Kubernetes system. Okay. So it's not too late. Let's take a look at InfluxDB. Now, none of this is unique to InfluxDB. This time series introduction thing will probably be the same introduction to time series that I give as there is a Prometheus course down the line. So this is widely translatable to any time series database. So please, even if you're not that interested in using InfluxDB, you want to use Prometheus or you want to use M3 or you want to
27:55 Introducing InfluxDB
28:19 use Thanos, all of this is still completely relevant, at least this one particular episode. So as far as the introductions go, it's a time series database. It is completely open source. They're currently on their second version, and InfluxDB considers itself full stack: it has a UI, it has the ability to work on the stream of data, and a whole bunch of other things. The vocabulary when working with time series data is typically that we talk about points. So at any point in time, this thing in context was value n. And if we look at an example of
28:45 Time Series Data Vocabulary (Points, Tags, Fields, Measurements)
28:57 this, then we can see that the load average on the machine VM1 was 6.32 at one minute (as you know, if we're talking about load average, it would be 1, 5, and 15), 8.2 on five, and 9.55 on 15. We also have a timestamp in orange. We have to track the time this value was recorded, otherwise it's not time series data. And as far as InfluxDB is concerned, values are fields and the context, the tags, the series, is the bit in green. Typically we would call load the measurement, host would be a tag key,
29:33 and VM one would be a tag value, and we break that down here too. So we have the name or the measurement name at least in yellow, tag keys in green, and tag values in blue. Now tags are indexed, so it's really important to make sure you get that distinction as correct as soon as possible when you start storing your time series data. Here's another example where we could see that the series changes. Even though the market is the same, the ticker is different. The series is actually a triple of the measurement name and the tag keys and the tag values,
30:10 and that's important too. We'll touch on it a little bit next week, but more so the following week when we start to really understand how to use Flux to query and build dashboards with this time series data; understanding what a series is is particularly important. Now, when you're trying to decide what to do with your time series data, what to store as tags and fields, remember tags are indexed and always strings because of that constraint, whereas fields are not indexed and you can use multiple data types. Now I think the value of time series is the
30:47 most important part of today's session. Yes, everything we've covered so far, from the fun to the intricacies of time series, is hopefully really relevant and I hope you enjoy it. But I think this part here is super important to understand because as you come from your development, engineering, or SRE background, and if you haven't worked with time series before, you may fall into the trap of storing all data forever, and it's really important that you don't do that because it's really expensive. Very, very, very, very, very expensive. Maybe my cursor back. There we go. So when we talk about time series data,
31:26 Frequency and Resolution
31:27 we'll talk about frequency and resolution. The language is almost interchangeable, but I prefer to use high and low resolution. When we talk about resolution, we're talking about the interval, that predictable interval of the metrics and how often we collect that data. So a ten second resolution means we collect the values every ten seconds. Ten nanosecond resolution, then we collect that every ten nanoseconds, and so forth. A one hour resolution is a lower resolution than ten seconds, which is higher resolution because we collect the value more frequently. The value of the time series data that we collect is directly correlated
32:10 with the resolution at which the data is available or collected. Meaning, storing time series data at ten second resolution is potentially, or definitely, more valuable to you than collecting it every hour because you will have a much finer understanding of those values changing over time to be able to do predictive analytics, anomaly detection, and a whole bunch of other things on it. If you only collect the value every hour or even every day, it's not as valuable to you anymore. Think about the weather patterns in your region. If you check the value every day at
32:49 noon and see that it's 25 degrees Celsius, it doesn't tell you anything else about that day. If you collect the temperature every ten minutes, every fifteen minutes, you then have a much better picture of the daily cadence or patterns of that temperature over months, years, decades, and centuries. So the cost of time series data is in storing that resolution. I'm going to do this through example. Let me drag my face again. So, cursor. Very professional, this course. So here's an example. We're gonna use load averages. I'm hoping that you're all familiar with this. Over here we have a machine,
33:18 Demonstrating the Cost of High Resolution (Examples)
33:40 we have a single tag, machine equals ABC one. We're collecting the CPU measurement. We're not gonna call it load average, we're not gonna do one, five, and 15, we're just doing a single usage value for now, and then we have a timestamp. Now for us to monitor this machine with a single measurement, the CPU, one series, which means one machine, at one second resolution, which means collecting a CPU value every one second, that is 86,400 points per day. Now as we run through this scenario, please, please try and think about using a general purpose database, your database of choice, whether it be MariaDB,
34:19 Postgres, MongoDB, Cassandra, whatever, and try to think about the burden or the tax you would have to pay to store these values. So 86,400 probably doesn't make you uncomfortable. You're probably okay with that every day. Now if we have two machines on our infrastructure, because it's unlikely we have one machine, so we have two series, we've still got one measurement, we're only tracking the CPU, and we're still doing one second resolution, we double that value to 172,800 points per day. We're probably still okay with this regardless of which database you're using. Okay. Let's assume we have 10 machines,
34:58 still one second resolution, only this time we're a bit smarter. We know that we're not just worried about the CPU of our machine. We're worried about the memory, and we care about disks, and we care about network, and maybe there's something else on there. So, five measurements. We're now at a much larger number because we've got 10 machines and five measurements at one second resolution. So we're now tracking millions of points per day. And, you know, infrastructure is not typically 10 machines. Right? We probably have a lot more than that, especially for
35:37 production. And we definitely have a lot more than five measurements. We're not just tracking five superficial metrics from a Linux host. Our applications emit hundreds, if not thousands, of metrics. But I don't know what your infrastructure looks like, so I'm gonna use the Nasdaq as an example instead. And we're just gonna track one measurement, which is the cost or price, the share price of a ticker on the Nasdaq. The Nasdaq has roughly 3,300 companies, and we're gonna do one millisecond resolution. Because, you know, financial trading, we need these values. But look at that number. And that's
36:13 I don't even know what that number is. I'm not gonna try to guess. Let's just say billions. That is definitely billions of points per day. Imagine storing this in your database of choice. But with a time series database, yes, it's a lot, but it's manageable. It's definitely doable because time series databases are built to handle this particular use case. Now if we drop the resolution, and this is the most important part now, we're talking about changing resolution in time series data. So we've gone from one millisecond to one minute. We're down to a very manageable number of
36:38 Managing Data Over Time: Downsampling / Rollup
36:50 4,000,000 again. And if we drop it again to one hour, we're into the thousands and we're looking at this and we're comfortable, regardless of database, storing these values again. And that concept of changing the resolution of data is super important in time series. At six hour resolution, we could store this in any database. We're not gonna be worried about 13,000 points per day. Here's my wonderful diagram trying to explain to you the value of time series data. And I drew this myself, shocker, very, very cool. Now what we have is that for a certain point in time,
37:31 data at a ten second resolution has value x. However, there's a certain point in time, and who knows what that is, right? It's very specific to the types of data you're working with, and as we work through this course, we will spend a lot of time talking about the value of the metrics that we collect. But for now, we're gonna keep it superficial. But the value of this data at ten second resolution is important until a certain point when it isn't. Now there's normally a rather large cliff, where the value just drops and then it's valuable
38:08 for a little bit more time and drops, and then valuable for a little bit more time and drops, and so forth, until the data is just not viable at all anymore. And one of the things that we really need to get good at and understand, as more and more metrics and raw events and time series data enter our system, is being able to understand that if we lower the resolution, we minimize the drop in value. The value versus the resolution of the data
38:38 is gonna continue to be, it's gonna be valuable for longer. Wow. So if we can actually take those values that we store at ten second resolution and drop them to hourly resolution, calculating a mean, as an example, over the hour, then because we're storing fewer values, the value increases, and the resolution, yes, we change it, but it's still valuable. Then we do it again. After a certain amount of time, we say, okay, we don't need an hour resolution. Let's calculate the mean over six hours. We're storing less data, reducing the cost and the burden of storing that data and increasing
39:17 the value of that data. Even though the value will decrease a little bit because we lose resolution, as we go further away from the point in time the value is stored, it actually works really well. Eventually, we'll have this really nice cascading downsampling system where we take all of our time series data from ten seconds to one hour, to six hours, to one day, to a week, to a month, whatever. It comes down to the types of data you're storing. Again, we will explore this in a lot of detail over the next episodes, the next lectures.
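As a rough check on the numbers quoted in the resolution examples, here is a minimal Python sketch of the points-per-day arithmetic. The helper name `points_per_day` and its parameters are illustrative assumptions, not part of InfluxDB, and the sketch assumes every series reports one point per measurement at a fixed interval.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in a day


def points_per_day(series: int, measurements: int, interval_ms: int) -> int:
    """Points written per day, assuming one point per series per
    measurement at a fixed collection interval (hypothetical helper)."""
    return series * measurements * MS_PER_DAY // interval_ms


SECOND, MINUTE, HOUR = 1_000, 60_000, 3_600_000

# Scenarios from the lecture: (label, series, measurements, interval in ms)
scenarios = [
    ("1 machine, CPU, 1s", 1, 1, SECOND),
    ("2 machines, CPU, 1s", 2, 1, SECOND),
    ("10 machines, 5 measurements, 1s", 10, 5, SECOND),
    ("3,300 tickers, price, 1ms", 3_300, 1, 1),
    ("3,300 tickers, price, 1m", 3_300, 1, MINUTE),
    ("3,300 tickers, price, 1h", 3_300, 1, HOUR),
    ("3,300 tickers, price, 6h", 3_300, 1, 6 * HOUR),
]

results = {label: points_per_day(s, m, i) for label, s, m, i in scenarios}

for label, count in results.items():
    print(f"{label:35} {count:>18,} points/day")
```

Under these assumptions, and taking the approximate figure of 3,300 tickers, the one millisecond case works out to roughly 285 billion points per day and the six hour case to 13,200, in line with the "billions" and "13,000" figures mentioned in the talk.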
39:45 Downsampling is really important, and we have to understand when to lower the resolution of each type of data that we store in a system and understand the value that we need from it. Some data, you will need to keep at the higher resolution. Some data, you will just bin because it's no longer relevant at all, but really for long term retrospective analysis of that data, you will want it in some form. Now, this is what a continuous query would look like in InfluxDB 1. It uses a SQL-like language. This is a lot easier to understand than
40:13 Downsampling Example (InfluxQL)
40:21 Flux. We haven't really touched on Flux yet, so I'm not gonna show you that, but we will be diving into Flux and you will see how to do this through Flux tasks. But really what I want you to understand is how the semantics or the pseudo code of a downsample or a rollup would look. So you can see here that we create a roll up that runs on a certain bucket or table or measurement. We basically tell it how to calculate the mean, and that could be a min or max, mean, it could be whatever calculation you
40:50 want, and we tell it how to group that value. So if we have ten second resolution, we can group by time of one hour and calculate the mean. And we just let that downsample run forever doing the thing we need. We chain these together to go from one hour to six hours, from six hours to a day and so forth. So I'm hoping that this InfluxQL version of it helps cement what a downsample will look like, and we'll tackle the Flux version as we move into subsequent lectures. Now you cannot downsample events. Feel free to Google anomaly detection with InfluxDB.
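The downsampling semantics described above (group points into a time window, apply an aggregate, then chain the results into a coarser window) can be sketched in plain Python. This is not InfluxQL or Flux and uses no InfluxDB API; the `downsample` function and the sample data are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, window_seconds, aggregate=mean):
    """Group (unix_timestamp_seconds, value) pairs into fixed windows and
    reduce each window to a single value with `aggregate`.

    Hypothetical helper: returns (window_start, aggregated_value) pairs
    sorted by window start.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        window_start = ts - (ts % window_seconds)  # align to window boundary
        buckets[window_start].append(value)
    return [(start, aggregate(values)) for start, values in sorted(buckets.items())]


# Made-up data: two hours of ten second samples, 0.0 in hour one, 1.0 in hour two.
raw = [(t, float(t // 3600)) for t in range(0, 7200, 10)]

hourly = downsample(raw, 3600)              # ten seconds -> one hour
six_hourly = downsample(hourly, 6 * 3600)   # one hour -> six hours (chained)

print(len(raw), "raw points ->", len(hourly), "hourly ->", len(six_hourly), "six-hourly")
```

Passing `min` or `max` as `aggregate` swaps the calculation, as the lecture suggests. One caveat of chaining: a mean of means only equals the mean of the raw points when every window holds the same number of points.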
41:23 Downsampling Events & Anomaly Detection
41:26 We will be doing a little bit of it in this course as we try to use HTTP queries to pull out, you know, five hundreds, four hundreds, ones that we didn't expect. So we'll do a little bit of this, but there is a lot of prior research available on Google if this is one of your main drivers for this course. We'll do a little, but not a lot. Excuse me. Okay. So now that we have our time series data, we're downsampling it. We've got it in our InfluxDB database. We're starting to put things together.
41:50 Advanced Use Cases & Course Plan
41:56 What are some of the things we're going to cover on this course? Well, let's do this by example again. Let's assume you have an application that speaks to a database. I call this the simple days, the monolithic days. You know, back in the early two thousands where this is mostly how we built applications, and if you're still building applications like this, you know what? For some scenarios, it works great, good on you. Your monitoring is a lot easier than cloud native monitoring. Now, if we want to understand what to monitor in this system, it's really,
42:04 Monitoring Example: Simple Monolithic System
42:27 really, really simple. We typically wanna monitor for the CPU going above a certain threshold. That means bad things are gonna happen. We wanna monitor the memory consumption. Bad things are gonna happen. If our customers have a response time greater than three hundred milliseconds, bad things are happening. We need to fix it. We may also have predictable problems. Black Friday and other holiday and seasonal items and cadence and things that happen that you will understand within your domain. Okay, so how do we know when to send an alert in this system? Well, we can have an application health check
42:58 Alerting in Simple Systems
43:06 and if that begins to fail, we send an alert, and if any of those symptoms that we saw in the previous slide happen, we can send an alert. Really, really simple architecture, really simple to work with. Time series data does not have to be terribly complex. As we start to bring in horizontal scalability, things get a little bit difficult, and this is where you need to start manipulating time series data and working with it in a more sophisticated fashion, grouping and windowing and other things. And we're gonna cover all of this. But the question here is how do I
43:35 know when to send an alert with this system? You know, we can't just use a health check on a single node because we may have two, seven, a hundred others that are returning a good health check. Well, my face is always getting in the way. Please stand by. So what we could do is an aggregation of our metrics, and we can actually look to see, you know, using service level indicators or having a service level objective that we have to meet, whether we have x number of 500 exceptions within n window periods. So
44:13 whether I have more than 100 HTTP 500 exceptions within a five minute period, to me that is a problem that goes against our service level objective. Yes, we definitely have to send an alert. All right. What about cloud native hell? So this is the system we have now, right? We've got service A that speaks to database A, we've got service B that speaks to database B, we've got another service B that speaks to database B because it's horizontally scaled. We've got service C, which speaks to database C, which also has canary deployments and progressive deployments. Of course, all of our networking is virtualized
44:28 Monitoring Example: Cloud-Native Complexity
44:48 because it's on Kubernetes and we're running through the service mesh. Help me. How do I know when this system is healthy and how do I know when to page someone? Time series data is the answer. So we're going to specifically work on this example. We will have a Kubernetes cluster. We're going to have InfluxDB 2. We're going to collect data from that cluster and we're gonna break the system, get our alerting into place, and really try to build as much understanding through time series as we can. Now this is the cloud native architecture, convenience versus cost. So yes, you can still follow
45:05 Course Focus: Kubernetes Monitoring with InfluxDB
45:30 along with this course and learn a lot if you've got monolithic architectures. There's still a lot of great knowledge here to understand. We will be looking at microservices, monitoring them using the Prometheus exposition format with InfluxDB 2. Now, one of the things that we want this course to help you do is to do root cause analysis and understand causality within your system. So we will be exploring the statistical functions provided by InfluxDB. We'll be able to analyse weeks, months and years of data using tags to build correlation, looking at how we can use linear prediction,
45:48 Root Cause Analysis & Statistical Functions in InfluxDB
46:10 derivatives, median absolute deviations, moving averages, Holt-Winters. Not machine learning, but statistical learning or statistical predictions based on previous data and cyclic data structures. So if you want to know how to understand the data, this is a really good way to do it. We'll be covering all of these functions in InfluxDB across the next couple of weeks. And some examples here, I like to lean on my own previous life as an SRE here, but I have been paged at 4AM because the disk usage of a machine went above a certain threshold, which would trigger a PagerDuty alert and wake
46:35 Practical Examples of Statistical Analysis
46:48 me up. One of the things we'll explore in this course is how we can actually try and predict these outages during office hours. That is not always possible, of course. Bad things happen and spikes happen and those are very difficult to predict. But when we have linear growth of a disk, yes, we want to be able to learn that before I go to sleep. So we can do that. We also have distributed applications, distributed HTTP requests. We wanna be able to use histograms. Now there are two ways to do histograms
47:17 Using Histograms in Time Series Data
47:23 in time series data. You can use histograms of metrics where you preassign the buckets within your code. There's also the slightly better format, but more expensive, and we'll cover that trade off as well in the coming weeks, where I can store the raw events and how long every request took and build dynamic histograms with dynamic bins to understand my application. And in some use cases, you may wish to go down the route of just the pre-allocated bins hardcoded in your code. Again, trade offs. You have to understand them. You have to understand the risk and the cost
47:53 with both, and we'll talk about that. Here's an example of the Prometheus one. I'm saying beware because it's really difficult to change these buckets retroactively, and we'll have examples of this in the next week or two. Proactive operations. One of the really great things I want us to try and take away from this is some of those really cool predictive things that InfluxDB offers, being able to understand and make predictions from previous events. Now we'll probably use an example like Black Friday in this course where we take a look at artificial data of three years of a store
48:10 Proactive Operations & Prediction with Time Series
48:30 and see if we can predict what the utilisation of infrastructure will be the following year. And we're going to do that through applying Holt-Winters to the time series data that we have in our system. Alright, let me pop back over here. So that is our introduction to time series data. Now we haven't looked at InfluxDB 2 yet. We haven't collected any time series data yet. We haven't stored it, we haven't queried it, we haven't built dashboards, we haven't built alerting, and we haven't done downsampling and analysis of our data. This is the introduction. This is lecture one.
48:54 Lecture Summary & What's Next
49:17 There will be many lectures in this course. We will be doing thirty minute to sixty minute videos multiple times per week as we explore time series data via InfluxDB 2. I hope this has been a pleasant and enjoyable introduction to this course. Please remember, if you haven't already, to sign up and become a Rawkode Academy incubating member to see all future episodes of this course. We also have new courses launching in August. We have a course on eBPF as we look at building trace points into the Linux kernel to do really cool stuff. BPF is a very awesome technology, and
49:28 Rawkode Academy & Future Courses
49:55 I'm very excited to share that with you. We also have other courses coming, and they will all be announced to incubating members first. I hope you will join us on this journey. Thank you for supporting this channel. I hope you have a wonderful day, night, morning, whatever time you're watching this. And remember to stay tuned. There will be Q and A sessions and prerecorded videos coming that will guide you through the workshop course. If you do not have access to the workshop course yet, it is available on GitHub. The link will be in the show notes
50:04 Closing Remarks
50:29 as soon as possible, again, for incubating members. Have a wonderful day. That's BT Elson.