About this video
What You'll Learn
- Set up a distributed Thanos architecture using multiple Prometheus instances with sidecars and a querier for global querying.
- Upload Prometheus data blocks from sidecars to S3-compatible object storage and query long-term history.
- Run live demos for Store Gateway and Compactor features, including downsampling, HA deduplication behavior, and troubleshooting connectivity issues.
Bartek Plotka walks through Thanos, the CNCF project that extends Prometheus with global query, HA deduplication, and long-term storage. Covers the Sidecar, Querier, Store Gateway, and Compactor, plus a live demo running three Prometheus servers against MinIO.
Jump to a chapter
- 0:00 Holding screen
- 1:00 Introductions
- 1:07 Introduction to Thanos
- 2:13 Guest Introduction (Bartek Plotka)
- 4:50 What is Thanos?
- 5:24 What is Thanos? (Problem it Solves)
- 7:19 Observability Concepts and Signals
- 8:41 Thanos: A Distributed Prometheus
- 10:13 Thanos Architecture and Components
- 13:01 The Thanos gRPC Store API
- 14:13 Thanos UI and Community/Mentoring
- 15:29 Transition to Live Demo (Katacoda)
- 20:00 Generating some fake time series data with thanosbench
- 20:48 Demo Setup: Generating Sample Data (thanosbench)
- 26:00 Running three Prometheus servers
- 33:20 Accessing Individual Prometheus Instances (UI)
- 40:00 Running the Thanos sidecars
- 40:02 Introduction to Thanos Sidecar
- 40:29 Demo: Running Thanos Sidecars
- 43:01 Debugging: Sidecar Connectivity Issues
- 45:20 Running the Thanos querier
- 45:29 Introduction to Thanos Querier
- 45:40 Demo: Running Thanos Querier
- 47:07 Querying with Thanos Querier & Deduplication
- 50:01 Debugging: Querier Connectivity Issues
- 53:29 Successful Global Query Demonstration
- 57:40 Long-Term Storage with Object Storage
- 58:10 Connecting Thanos to S3 / MinIO
- 58:15 Demo Setup: Running MinIO (S3 Compatible Storage)
- 59:19 Configuring Sidecars for Object Storage Upload
- 1:06:36 Introduction to Thanos Store Gateway
- 1:08:30 Demo: Running Thanos Store Gateway
- 1:10:08 Configuring Querier to include Store Gateway
- 1:12:32 Querying Data from Multiple Sources
- 1:13:14 Exploring the Thanos UI & Store Filtering
- 1:14:00 Enabling compaction and downsampling
- 1:14:38 Introduction to Thanos Compactor
- 1:15:05 Demo: Running Thanos Compactor
- 1:17:08 Thanos Bucket Viewer & Downsampling Explained
- 1:18:53 Recap of Thanos Components & Deployment
- 1:22:22 Host Summary & Thank You
- 1:24:14 Community, Contribution, and CNCF SIG Observability
- 1:26:15 Final Wrap-up
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
1:07 Introduction to Thanos
1:07 Hello and welcome to today's episode of Rawkode Live. I am Rawkode, your host, and today we are gonna take a look at Thanos. Now before we move on to that, I just wanna take ten seconds to say thank you to Equinix Metal, my employer, for giving me the time and the resources to be able to invest in producing these materials so that we can all learn together. If you wanna play with Equinix Metal, the bare metal cloud, you can use a code, Rawkode live, for $50 in credit. There's a whole array of configurations of bare
1:38 metal machines that you can use and play with. You can spend that money very, very quickly, or you can be a little bit more conservative and make it last up to one hundred hours with our smallest instances. If you're not watching this live, or you just like Discord, feel free to join the chat. We have loads of people in there. We always talk about the episodes and what's happening, and it's a really good way for you to suggest potential new episodes too. Please remember to subscribe to our channel. This helps the content reach more people. More
2:07 people can learn, and that makes it more fun for me too. And today, we're gonna take a look at Thanos. Now I am joined by Bartek, who is a contributor and maintainer of the Thanos project. Hello. How are you? Hello. Hello. I'm fine. Thank you for having me here. Awesome. It's gonna be really good fun. I'm looking forward to understanding more about Thanos, what problem it's trying to solve, and how it's gonna help me in my monitoring, observability, and Kubernetes journey. Do you want to just take a little moment to tell us a little bit about
2:13 Guest Introduction (Bartek Plotka)
2:43 you, what you do, and your involvement in the project? Sure. So my full name is actually Bartłomiej Płotka, and that's how you pronounce it in Polish. I'm from Poland, but you can always call me Bartek. It's just simpler. Currently, I am a principal software engineer at Red Hat, responsible for some observability things for Red Hat OpenShift and beyond. And to be honest, my career was kind of always around observability, more or less. In the beginning, it was at Intel. I was working on noisy neighbor problems and scheduling for OpenStack,
3:19 Mesos, Kubernetes, whatever was, you know, the most popular back then. Then a few years at a London startup, where we were actually using Prometheus a lot. This is where I started to use Prometheus. We actually used it so much that we wanted to scale it out and we couldn't, so we created Thanos on the side, let's say, which has grown a little bit. And that's how I started to be a Thanos co-author and maintainer as well, and got into this community. And right now, we are continuing this journey.
3:58 And, also, there is a new special interest group in the CNCF space, which is called Observability, which I'm tech leading. So please join us there as well if you want to contribute more to those projects around observability. So that's me. Yeah. And I live in London, if we have time and opportunity for some meetups at some point. Hopefully, we can continue doing that. Yeah. It sounds like you keep yourself busy. There's a lot of things that you're involved with there. You know, tech lead for the CNCF SIG. You've contributed to multiple
4:39 projects. Like, how do you find the time for all of this? Scheduling, you know. No time to waste. Well, I guess it's 2020 and we can't leave the house, so that must be... Exactly. Yeah. That would have contributed to that. Yeah. Okay. So I believe you've got a few slides. You're gonna run us through a quick introduction to Thanos. So while you just get your screen ready, we have a few comments. So Vikas says he is excited. Vanessa, hello. Hi there. Namaste. Oh, you got a hello as well. Hi there. Hello. And there we go. That's all of our
4:50 What is Thanos?
5:17 introductions. So your screen is now visible. So what is Thanos? Cool. So don't be mad if I go through the slides maybe quicker. I don't want to spend too much time on details. I remember, David, you also had a previous talk about Cortex, which really described the problem with Prometheus and has similar goals to Thanos. So I won't go very much into details. But in the end, it all starts with your application. You are running applications somewhere, on some infrastructure, and you want to, you know, have some
5:24 What is Thanos? (Problem it Solves)
5:59 information about that. You want to take some data and know how it is working. Is it actually working for everyone? How fast? What is it doing, what resources is it using? And also some things that try to predict the future. Is it maybe going to be saturated soon? Maybe it doesn't have enough disk, and stuff like that. How it performs long term versus how it performed at some point before. And all of this is much more complex to gather and learn once
6:34 you are in Kubernetes. Right? You essentially have all those layers that you need to monitor as well on the way. Lots of sidecars, lots of proxies, and so on. So this is why observability and monitoring is crucial. It should be kind of your default, and it goes even further, to saying there is no application, no product, if you have no monitoring, because it's like not running the product at all: you don't know if it's running, right, ever. So we can, you know, kind of
7:06 it was mentioned a couple of times in the SRE book as well. Monitoring and observability are kind of the foundation of running infrastructure: to detect issues, to automate stuff, really. Right? And going further, we know that in observability we have different signals, we call them. And this is kind of, let's say, the formal description of this. We have traces, logs, and metrics. But that's not the end, by the way. We have a couple of new signals coming, like continuous profiling, and all of these are kind of part of
7:19 Observability Concepts and Signals
7:44 the observability that we are responsible for in the CNCF SIG Observability, for example. And we have lots of different projects supporting those. And one use case of observability, again, is the troubleshooting journey. Right? You've probably seen that graph a lot. You first get an alert for your app being broken or something going on. You go to dashboarding. Maybe you query your metrics first. Then you go to log aggregation if you don't know yet what's wrong, and then to distributed tracing if that fails. Right? And, hopefully, you have some solution at the end.
8:24 And today we will focus on the first part: alerting, metrics, aggregation, the numeric values that you can gather over time using Prometheus and Thanos systems. Right? And what is Thanos? Thanos is essentially a horizontally scalable metric monitoring system which solves a couple of limitations that a single Prometheus has. And, really, as Ganesh mentioned in the Cortex session here, it's really exactly the same. It's really about running multiple Prometheus servers. Right? Because it is just designed as a single binary that is very, very powerful, Prometheus,
8:41 Thanos: A Distributed Prometheus
9:13 I mean. But at some point, you need to scale out. Right? You need to run multiple clusters maybe. You need to run maybe some high availability of Prometheus. Maybe you need to functionally shard it. And maybe you want to have a totally other system that just pushes and backfills metrics from other sources. Right? And all of this was the reason why Thanos was created as well, to solve all those issues. And on top of that, Prometheus was not really designed for long term metrics storage and querying and
9:52 using. That's why we designed something like downsampling, which allows much, much faster use of this long, like, one year or two years' worth of data, for example, and an overall cheap and nice experience of using it. So very quickly, you can think about Thanos as Prometheus, but distributed. Right? So we pick all of those components and just scale them out horizontally. So from a high level view, Prometheus is nothing more than a single box with, you know, some rule and alerting engine, query engine, compaction,
10:13 Thanos Architecture and Components
10:37 scrape engine, storage, much more probably, but those are the core things. And what we did with Thanos is we really split that, sometimes really using the same code, into separate services, separate components that you can configure and flexibly deploy in different kinds of models. So you can run them in different clusters, different spots, and horizontally scale them. So the query engine becomes Thanos Querier. You can put it anywhere. You can fan out the queries and have a global view.
11:17 The scrape engine becomes really just Prometheus. You can use Prometheus as a scrape engine, and in a very lightweight way as well, because usually Prometheus is beefy. But with Thanos, you can just have super short retention of your data on the Prometheus side, keep it very light, and store your data in the object storage of your choice, and allow Prometheus to be really just a collector. Right? The alerting and rules engine becomes Thanos Ruler, and compaction becomes a global compactor, because now it has to compact and downsample and do this managing
12:03 maintenance stuff on top of the bucket. And we have some object storage that Prometheus uploads the data to, so we need some component that fetches that data from the buckets. So that's really it, in a very high level view. And, hopefully, during the session, we can learn more hands on, really, by running those components and playing with them. But in the end, it's really a monitoring system that is flexible enough to be composed as you want with those components: querier, store, rule, compact, sidecar, receiver.
12:43 Receiver is the one that probably we won't have time to cover, but, essentially, receiver allows you to remote write the data from Prometheus to a centralized Thanos. Maybe we'll have time to cover this. Let's see. But at the end, I want to highlight one important flexibility point here. It's all possible because we have a consistent API across those components. So you can think of it as those core components of Prometheus being split into horizontally scalable services, but they all communicate with the same consistent gRPC interface,
13:01 The Thanos gRPC Store API
13:27 which really looks like this. It has an info method for healthiness and metadata. It grabs series. It also grabs label names and values, and it scales a lot because now every component speaks this API. So you can integrate Thanos Querier, for example, to query from totally separate systems like OpenTSDB, or some other remote read component, or a totally different database, as long as it speaks this API, which tries to be very efficient by doing streaming and compression and so on. So, really, that's most of it.
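To make the Store API idea concrete, here is a minimal Python sketch of the shape described above: an info call, a series call, and label name and value lookups, with a querier that fans out to anything implementing them. It is an illustration only, not the real gRPC or protobuf definitions; the names `StoreAPI`, `InMemoryStore`, and `fan_out`, and the simple equality matchers, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Protocol, Tuple

Labels = Tuple[Tuple[str, str], ...]  # (name, value) pairs


@dataclass(frozen=True)
class Series:
    labels: Labels
    samples: Tuple[Tuple[int, float], ...]  # (timestamp_ms, value)


class StoreAPI(Protocol):
    """Hypothetical, simplified stand-in for the interface each component exposes."""

    def info(self) -> Dict[str, object]: ...
    def series(self, matchers: Dict[str, str], min_t: int, max_t: int) -> Iterator[Series]: ...
    def label_names(self) -> List[str]: ...
    def label_values(self, name: str) -> List[str]: ...


class InMemoryStore:
    """Toy store; a sidecar or store gateway would play this role in the talk."""

    def __init__(self, name: str, data: List[Series]) -> None:
        self._name = name
        self._data = data

    def info(self) -> Dict[str, object]:
        timestamps = [t for s in self._data for t, _ in s.samples]
        return {
            "name": self._name,
            "min_time": min(timestamps, default=None),
            "max_time": max(timestamps, default=None),
        }

    def series(self, matchers: Dict[str, str], min_t: int, max_t: int) -> Iterator[Series]:
        # Stream matching series one by one instead of building a full list.
        for s in self._data:
            labels = dict(s.labels)
            if all(labels.get(k) == v for k, v in matchers.items()):
                kept = tuple((t, v) for t, v in s.samples if min_t <= t <= max_t)
                if kept:
                    yield Series(s.labels, kept)

    def label_names(self) -> List[str]:
        return sorted({n for s in self._data for n, _ in s.labels})

    def label_values(self, name: str) -> List[str]:
        return sorted({v for s in self._data for n, v in s.labels if n == name})


def fan_out(stores: Iterable[StoreAPI], matchers: Dict[str, str], min_t: int, max_t: int) -> List[Series]:
    """Querier-style global view: ask every store and concatenate the answers."""
    out: List[Series] = []
    for store in stores:
        out.extend(store.series(matchers, min_t, max_t))
    return out


eu = InMemoryStore("sidecar-eu1", [
    Series((("__name__", "up"), ("cluster", "eu1")), ((1000, 1.0), (2000, 1.0))),
])
us = InMemoryStore("sidecar-us1", [
    Series((("__name__", "up"), ("cluster", "us1")), ((1500, 1.0),)),
])
result = fan_out([eu, us], {"__name__": "up"}, 0, 3000)
```

Because the querier only depends on the four methods, any backend that implements them can be added to the list of stores, which is the flexibility point made in the talk.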
14:13 Thanos UI and Community/Mentoring
14:16 Obviously, Thanos has its own UI. Actually, recently, we rewrote this in React, which is a bit more powerful than the Prometheus UI. But in the end, everyone just uses Grafana dashboards. Still, this is very useful for ad hoc querying. There is this bucket viewer, another interesting UI that Thanos adds on top of Prometheus. We will hopefully play with that today as well. And last but not least, I want to mention another aspect of this Thanos community.
14:51 On top of contributors, we are super proud to have strong mentorship programs, thanks to the CNCF as well and other mentors. And so we have really diverse, amazing people as mentees. Some of them end up being maintainers. So for all the projects out there, open source projects, please mentor, because it's actually a good way to give these people an opportunity to have a good career and contribute to Thanos at the same time, or to your projects. So
15:24 it's pretty amazing. And, yeah, I think we might be ready for some action. I think, yeah, let's go for that. I think that's it. Yeah. Alright. Awesome. That was a great introduction there. I really like that. I wasn't aware of that mentoring program. That's awesome as well, that you're helping people break into open source and development, and then they're even becoming maintainers or contributors. So really cool program. Yep. It's really hard in the beginning to jump over this first barrier of the time and sacrifice,
15:29 Transition to Live Demo (Katacoda)
16:06 but it's super worth it long term. Like, really, open source is really the same. Right? Yeah. For sure. So let me pop up our screen here. So if anyone is curious and wants to go to the Thanos website, it's available at thanos.io. There are guides that we're gonna ignore for today, because we're living life slightly on the edge with a demo that you have prepared that you think works. So we're gonna see how we get on with that. And if people want to play with this demo as well, it is available on katacoda.com/bwplotka.
16:45 If I click on start course here, there's actually a few, which I'm assuming people can use as well, but we are gonna play with the secret Rawkode demo resources. Do we wanna just talk about why we're going down this road rather than the documentation? Like, what is different about the approach that we're gonna take today? Yeah. There is a very good reason behind that. And by the way, maybe let's start with this: there is officially katacoda.com/thanos, where we are publishing production ready, let's say, tutorials and demos. We
17:23 have a large base of those demos which are in progress. It takes a bit of time to prepare them. And so all of those that you've seen on bwplotka will be published soon as well. So if you want a nice experience, just wait for them, or help us and contribute to them. But in the end, we prefer doing those Katacoda tutorials versus documentation, guides and whatever, because it's super interactive. Right? You don't need to have your own machines, really. You can just run
18:00 all your commands, all your Docker machines and ports, even against some virtual Kubernetes that is running, and you have just the interface and a terminal connected to that through the browser in Katacoda. So it's extremely amazing. And we are abusing this a lot, because every time we have a conference and we need to create a demo, we used to do it on our laptop, but this doesn't scale. You cannot play with this demo easily and maintain it, because your laptop is different than macOS running somewhere, or the laptop that someone
18:36 else is using, or a Windows machine, or anything like that. We totally break this barrier: every time we do a conference demo, we do it on Katacoda. And, actually, we're literally streaming the Katacoda browser and how we do this, and explaining it maybe with our own words. But after this, we just describe what has to be done in the tutorial, maybe cleaning it up and publishing it as a proper tutorial. So it's not wasted time. That's our goal. So really, we are doing the same here
19:12 now today. Yeah. I just echo everything you said there. I think Katacoda is such a much more tangible learning experience for people, being able to actually run the commands and get hands on rather than just reading the code in a tutorial. I think it provides a really good platform for learning. And you're getting a lot of love for the mentoring program as well. So firstly, you got a hello, followed by "Thanos has the best mentoring community of open source", and then a wee plus one. So, you know, whatever you're
19:42 doing there is working. People love it. Keep it up. That is awesome to see. Alright. So the difficulty on this is YOLO. So I'm excited. I like it when things work, but I like it better when things break. So we'll see how that goes. It's worth mentioning that we want to take it a level higher. I mean, we will not use Katacoda to run the commands. We just take them and copy them into your VM, which makes it more interesting, I guess. Yeah. I mean, if you want to up
20:00 Generating some fake time series data with thanosbench
20:14 the difficulty a little bit more, I can go spin up an ARM machine on Equinix Metal, and we can see how the ARM compatibility is. Yeah. The only problem: do we have a connection to that? Can we spin up the UI and see those things as well? We'll keep it simple. We'll do it on my machine. Oh, yeah. No, if you wanna up the difficulty, we... No. No. No. Let's start with the machine. The amount of times I've had people on this show, and I said, oh,
20:41 should we do it on our machine? And there's just this fear. They're like, no. Not yet. Alright. So step one is we are... sorry, on you go. Yeah. How to do it properly? I think I can describe essentially what is happening here. So first of all, I think it will be best to have some data, some metrics data that we can play with and maybe query against and stuff. So let's generate that. That's what the commands here do. I think you need to first have this variable here
20:48 Demo Setup: Generating Sample Data (thanosbench)
21:18 on your laptop or something. You can indeed copy this. Yeah. Yeah. It's YOLO, by the way. Yeah. Oh, that's the wrong command. Yeah. Yeah. So actually this is not copying. This is executing. So I'm sneakily... yeah. You need to just copy it manually this time. The rest should be fine. Okay. So the first command is to pull down and run a container with the thanosbench image. What's thanosbench? Correct. thanosbench, you can actually go to the GitHub page for that. So it's github.com/thanos-io. Thanos, actually, with hyphen io. Thanos hyphen
22:04 io. Yep. Yep. Click it. So it's our benchmarking slash testing library, or kind of client CLI. So you can generate yourself Prometheus blocks, or even Prometheus WAL, or things like that. You can generate some load. So we will be using that to generate tons of blocks because, you know, we want to test unlimited retention, so we're gonna generate one year of data. Nice. Okay. Cool. And only five series, so we don't need to wait too much. So we created some profile. Yeah. I recommend you copy this and just run it on your machine.
22:47 And this is essentially showing how the planning works. So this is the command block plan, and block plan will just generate, essentially, a configuration for a further thanosbench command that will actually generate blocks. So you can actually cat this block, yeah, block spec dot YAML, which should be generated in your directory right now. Yeah. And this is just raw YAML that specifies how the blocks should look. And this is the input for the further commands. So this is really a raw meta JSON. How the... just, like, low
23:26 level format of the TSDB block, which is exactly the same block... well, this is the Prometheus format, but Thanos uses exactly the same format as well. Actually, Cortex uses that as well. So you specify some meta JSON metadata of the block. So what's the retention of the block? And then if you scroll a little bit down, you define series in a very low level way, and this is your load generation stuff, where you just provide some names, some type, you know, gauge or counter, and maybe the variance.
24:00 And this is, like, whatever is easy for generating somehow realistic data. Right? So we're gonna generate a couple of blocks with five series. So there is no need to scroll it more. Yeah. It should be in your data. So now if you copy another command, this will generate the same plan and just pipe it. So it pipes the same JSON into block gen, another thanosbench command, and outputs this in a directory, which will be called, yeah, prom EU1 replica zero, because
24:40 I should mention, it would be nice to showcase the benefits of Thanos. So we need more than one Prometheus. Right? So the idea is to create three Prometheuses and simulate two clusters: one cluster EU1 and one US1. And the first one, EU1, will have two replicas of Prometheus. So we can imagine they scrape the same applications from your cluster, and then US1 will be just one replica scraping, you know, the
25:16 US1 cluster. Let's imagine. So what we want to do is backfill data artificially for all three of those. So that's why you need to copy three times, and it will just be generating data into three different directories. Right? Yep. Okay. So you should have three directories, and all of them should have one year of data, hopefully. Yep. We got prom US1, prom EU1, and the prom EU1 replicas. Okay. Super cool. Okay. And next, I think we will try to get the configuration for the Prometheuses.
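The load generation idea described above (a series is specified by a name, a type such as gauge or counter, and a variance, and the same spec is generated once per Prometheus replica) can be sketched in plain Python. This is not how thanosbench is implemented; `SeriesSpec`, `generate`, the series names, and the directory naming are assumptions made for illustration, and the example generates one hour of data rather than one year so it runs quickly.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SeriesSpec:
    name: str
    kind: str        # "gauge" or "counter"
    variance: float  # maximum change between two consecutive samples


def generate(spec: SeriesSpec, start_ms: int, end_ms: int, step_ms: int, seed: int = 0) -> List[Tuple[int, float]]:
    """Produce (timestamp_ms, value) samples that look roughly realistic."""
    rng = random.Random(seed)
    value = 0.0
    samples: List[Tuple[int, float]] = []
    for t in range(start_ms, end_ms, step_ms):
        if spec.kind == "counter":
            value += rng.uniform(0.0, spec.variance)             # counters only go up
        elif spec.kind == "gauge":
            value += rng.uniform(-spec.variance, spec.variance)  # gauges wander
        else:
            raise ValueError(f"unknown series kind: {spec.kind}")
        samples.append((t, value))
    return samples


# Two EU1 replicas and one US1 instance, as in the demo topology.
targets = [("eu1", 0), ("eu1", 1), ("us1", 0)]
specs = [SeriesSpec("demo_requests_total", "counter", 5.0),
         SeriesSpec("demo_queue_length", "gauge", 2.0)]

ONE_HOUR_MS = 60 * 60 * 1000
data: Dict[str, Dict[str, List[Tuple[int, float]]]] = {}
for cluster, replica in targets:
    directory = f"prom-{cluster}-replica{replica}"  # hypothetical directory name
    data[directory] = {s.name: generate(s, 0, ONE_HOUR_MS, 15_000) for s in specs}
```

With the default seed, both EU1 replicas get identical samples, which mirrors the idea of two replicas scraping the same applications.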
26:00 Running three Prometheus servers
26:02 Now you can also copy this file, the whole thing, and it will produce a configuration file into, yeah, Prometheus EU1 replica zero config, and three of those, essentially. The interesting part of this Prometheus configuration, if we look at it, is really what it is doing and what external labels it has. So we can talk about that a bit here. Yeah. Perfect. So this is the basic configuration you need to have. Global, so the global settings for all scrape configurations: you need an interval. We just make it faster, five seconds, so it collects
26:46 the data, you know, faster. And we need external labels. And this is a very important point, because Prometheus has this as an optional thing. You can just identify a Prometheus as something using external labels. So whenever you query this Prometheus from outside, like federation or remote read or something, those labels will be attached. But in Thanos, we use it extensively to identify Prometheuses among the humongous amount of Prometheuses you're running. So it's really recommended to specify the cluster, maybe replica, maybe zone, and tenant, because, you know,
27:35 we have a tenancy model based on those labels as well. And what this Prometheus is doing is essentially scraping itself, because it's running on 9093, so it's scraping itself. So, yeah, that's the whole thing that Prometheus is doing right now. Okay. Awesome. I really like the thanosbench tool. I can see loads of different applications for that. Yeah. You can actually, yeah, it's not Thanos only. You can generate stuff for Prometheus only. Yeah. Because how else can you test downsampling, right, or stuff that you write for long term duration metrics? Right? Cool. It's a challenge.
28:12 Like, I used to work for InfluxData, the company behind InfluxDB, and trying to give people datasets where they can walk through these contrived situations for downsampling, alerting, notifications, or anomaly detection is actually a really difficult problem, and thanosbench, you know, I wish I had known that existed a while ago. So we've got our data written for three different Prometheuses, Prometheis. We've got our configs, and now we are going to expose some port configurations for each of these. Well, actually, I like that you've got Prometheuses, Prometheis, and Prometheis.
28:48 You can choose how you call it. Actually, it was trimmed. There is "Prometheus instances" as well. There are three ways you can call multiple Prometheuses. Yeah. Okay. Okay. So we have some environment variables for ports. You run locally, so please copy the second one. It's essentially some variables useful for the UIs, just to not copy them around always. Yeah. Yeah. My computer was playing up. There we go. My old tabs are broken. Let's see if it crashed. There we go. Right. Alright. So we have exposed those Prometheus addresses.
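As a rough illustration of the per-instance configuration described above, the following Python sketch renders a minimal Prometheus config for each of the three instances: a five-second scrape interval, `cluster` and `replica` external labels, and a single job that scrapes the instance itself. The helper name, the job name, the `127.0.0.1` target address, and the mapping of ports 9091 to 9093 onto instances are assumptions for the example rather than the exact files used in the demo.

```python
from typing import Dict, Tuple


def render_prometheus_config(cluster: str, replica: int, port: int) -> str:
    """Return a minimal prometheus.yml as text (hypothetical helper)."""
    return (
        "global:\n"
        "  scrape_interval: 5s\n"
        "  external_labels:\n"
        f"    cluster: {cluster}\n"
        f"    replica: \"{replica}\"\n"  # quoted so the label value stays a string
        "scrape_configs:\n"
        "  - job_name: prometheus\n"
        "    static_configs:\n"
        f"      - targets: ['127.0.0.1:{port}']\n"  # the instance scrapes itself
    )


# (cluster, replica) -> port; the port assignment is assumed for illustration.
instances: Dict[Tuple[str, int], int] = {
    ("eu1", 0): 9091,
    ("eu1", 1): 9092,
    ("us1", 0): 9093,
}

configs = {
    f"prometheus-{cluster}-replica{replica}.yml": render_prometheus_config(cluster, replica, port)
    for (cluster, replica), port in instances.items()
}

print(configs["prometheus-us1-replica0.yml"])
```

The two EU1 configs differ only in the `replica` label and the port, which is what lets a querier treat them as an HA pair for the same cluster.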
29:43 And now I'm assuming, yeah, we're gonna run the Prometheus. We can just show the help if you want, or we can just go through and run Prometheus. I think this is not a Prometheus deep dive, where you could do that as well. Up to you. No, you choose. You're the one guiding me through this. What makes more sense? What do you want to do? So let's copy the docker run with help so we can see what's going on. This is the official image, version 2.20. If you do it, you should see some help, and it will actually pull it.
30:17 It will probably take some time. And this is how you run Prometheus. There are lots of configuration flags, and Thanos follows the same path. You mostly have flags to configure stuff. So okay. Let's not dive into this too much. We can dive in on maybe the first Prometheus that we run. So let's run Prometheus first and talk about the flags that we actually use. So we have three of those. Right? Again, you can actually run three of those, and then we can talk about flags if you want. Yeah. I'll get them running,
30:53 and then we'll... Mhmm. One, two, three. Alright. So let's quickly talk about this configuration. Right? So for this one, this is the US1 Prometheus, because it gets the configuration from the US1 replica. We need to share this configuration file from the local machine to Docker, and the same with the data directory that we generated the data to, again, from the US1 replica zero. We name it this, okay. But the flags are the most interesting. So this is really a standard command, let's say, and how you deploy it. The tricky part is the retention time,
31:50 I increased that to, like, a thousand days, because we have one year of data. And by default, Prometheus starts with two weeks. So when I started this by default, I just saw two weeks, because it just deleted the rest, you know, immediately. And there is this tricky one, max block duration and min block duration. This is essentially disabling local compaction. What it means is that Prometheus compacts those blocks periodically to optimize for Prometheus use. However, because I arranged that and generated those blocks as I wanted, with whatever YOLO durations, it would just start compacting for no good
32:31 reason, and we just don't want this. But, ideally, if you have a Prometheus with such long retention, like one year, you want to remove those flags, because you want compaction to be enabled, even with Thanos. So this is only for demo purposes. And really nothing else, like those enable lifecycle and admin API flags. I don't think this is really needed. It's really to get nice APIs like delete series and snapshot for Prometheus. So this is it. This is just running Prometheus with longer retention. Yeah.
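The flag choices discussed above can be collected into a small Python helper that builds the `docker run` argument list for one instance. This is a sketch under assumptions, not the exact command from the demo: the image tag `v2.20.0`, the container paths, the `2h` block durations (equal min and max to switch off local compaction), and the use of `-p` to publish the port are all illustrative choices, and the helper name is hypothetical.

```python
import shlex
from typing import List


def prometheus_run_args(name: str, config_path: str, data_dir: str, port: int,
                        image: str = "quay.io/prometheus/prometheus:v2.20.0") -> List[str]:
    """Build a docker command line for one Prometheus instance (hypothetical helper)."""
    return [
        "docker", "run", "-d", "--rm",
        "-p", f"{port}:{port}",                                # publish the UI/API port
        "-v", f"{config_path}:/etc/prometheus/prometheus.yml",
        "-v", f"{data_dir}:/prometheus",
        "--name", name,
        image,
        "--config.file=/etc/prometheus/prometheus.yml",
        "--storage.tsdb.path=/prometheus",
        "--storage.tsdb.retention.time=1000d",                 # keep the generated year of data
        "--storage.tsdb.min-block-duration=2h",                # min == max disables local
        "--storage.tsdb.max-block-duration=2h",                # compaction (demo only)
        f"--web.listen-address=:{port}",
        "--web.enable-lifecycle",                              # optional: reload/quit endpoints
        "--web.enable-admin-api",                              # optional: delete series, snapshot
    ]


args = prometheus_run_args(
    name="prometheus-us1-replica0",
    config_path="/tmp/demo/prometheus-us1-replica0.yml",  # assumed local paths
    data_dir="/tmp/demo/prom-us1-replica0",
    port=9093,
)
print(shlex.join(args))
```

The command is only printed here, not executed, so it can be reviewed or adapted before running. As noted in the talk, the block-duration flags should be dropped outside a demo so that compaction stays enabled.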
33:12 Okay. We can actually play with it. Just go, you don't need to click this because it's hard to open on your machine, but just go to localhost 9091, which is the first Prometheus. We can see if we can access our data, and we don't. This is... Yeah. There's no ports exposed. I think we know what's happening, because this is Mac. Right? macOS. But there's no dash p on the docker. There's dash dash net equals host, but probably on macOS that doesn't work.
33:20 Accessing Individual Prometheus Instances (UI)
33:55 Ah. That's the problem. Okay. That's fine. We we Oh, we we sorry. It's okay. We can do docker for Mac. I think it tries to resolve that for us. That would be nice. There is a thing. Yeah. Let's either we can do this or we can modify with minus piece everywhere. Let's Yeah. I run it with a minus piece, so that might be quicker. I can't remember what the Docker for Mac name is to resolve to the let's see. Or Mac domain Mac. About host.docker.internal? I'm not familiar with that flag. That's no. No. Sorry. That you have, like,
34:49 Docker for Mac. It's like host.docker.internal, as far as I can read, but, you know, I'm not a Mac user. Neither am I. Sorry. I have that in the chat as well. Oh. Let's try this out. Sorry. Yeah. Alright. Okay. host.docker.internal 9091. Okay. So let's pull up code. I'm just gonna do this. And this is the 01, so we'll do 9091 to 9091. Copy. Was it just 9091, 9092, 9093? Yeah. Yeah. Replica one. Replica one. And then I'll just copy the US one so that I only need to change the
35:58 network. Let's try gateway one more time. That's the last thing I found on the Internet: gateway.docker.internal. Because this would help us a lot. No. Okay. Let's put minus p. Well, it can't be magic, right? Docker must write something to the hosts file. Nope. Kubernetes dot Docker. But this is weird. Yeah. There's nothing else. Chat, please help us. It doesn't work like this. I should know what that is. Maybe it's in the docs. We'll just spend a minute
36:53 on this and then I'll rerun the containers. Yeah. Exactly. So settings. Network. Subnet. Docker for Mac DNS name. Yeah. Let's just say goodbye to DuckDuckGo and use Google. host.docker.internal, we tried. Gateway, we tried. And none of those are working. Alright. Let's just rerun the containers. Did I copy the US one? I did. So we take off net host. P 9093. Alright. There is, like, a reverse tunnel thingy that we can start. Actually, that might work. Do you want me to go with the tunnel approach before I blow these away?
38:08 No. No. Let's start with minus p right now. Alright. Okay. So let's paste all that in. Because the tunnel would have to be for all ports. Okay. Alright. So we should now have those all exposed on our local machine. Alright. So localhost 9091. There we go. Cool. Alright. Let's check any of the metrics. With insert metric at cursor, there is, like, a yeah. Or I can start typing. Yeah. So we have, like, five series, as I mentioned. And then if you go to the Graph tab here, and just go one year instead
38:48 of one hour, because why not? Usually, this is impossible, but we generated the data and there are not many series, so it's totally fine. So we have five series for one year, and this is what we can do for now. So let's try to solve the problem of maintaining those Prometheuses at scale. We want to have a single query endpoint that will span over those Prometheuses and get all those five series, but from each of them. And at the same time, it should allow us to deduplicate the replicated ones, because, as you remember, 9091
39:31 and 9092 are replicas. So if you go to 9092, it will have exactly the same data. It will look exactly the same. So you can actually, yeah, just do this. Yeah. It will be exactly the same, because this simulates the situation where those two Prometheuses are scraping the same applications. Right? Cool. Let's go back to our tutorial. Alright. So now is the time when we feel this horizontal scalability, adding different services around it, a microservice architecture, really. So we start with the sidecar to
40:02 Introduction to Thanos Sidecar
40:17 expose our, you know, powerful gRPC Store API, as I mentioned. So let's add the sidecar to each of those Prometheuses to expose this gRPC endpoint. And, actually, before that, maybe let's go through the help, dash dash help, and we can spend a little bit more time here, because Thanos was always, from the very beginning, just a single binary. So we have all components, all microservices here. So you can see you have sidecar, store gateway (that's store), querier, ruler, compactor, and many, many tools. We are extensively adding more tools so that you can
40:29 Demo: Running Thanos Sidecars
40:57 maintain your data in an easy way. And then at the end, we added new components. If you scroll down: receiver, and query frontend for caching. It's not even properly described yet. Alright. But this is how you run all commands, all services, by just specifying a subcommand. Right? Okay. Nice. So a single binary, but each of those can be run in isolation. Correct. Okay. So right now, we will deploy sidecars to each of those. And you probably need to copy and make sure we add minus p. Yep. So
41:41 dash p and this is... Well, we want the gRPC endpoint, to be honest. So just one port is fine. Nine one nine one nine. Correct. So we don't need the HTTP port at all? I mean, exposed. So what's the difference? What are the responsibilities of those two ports? Yep. So first, HTTP is responsible for any UIs you have because, unfortunately, our browsers don't speak gRPC yet. However, it is also useful for debugging endpoints. Right? So, like, metrics, traces, debug, whatever. The gRPC endpoint is for our Store APIs.
42:29 And, actually, not only Store APIs. We have other gRPC APIs that other components can expose. And because we want to communicate over gRPC, we need two ports. Let me just grab this last one. Okay. I can see one issue we need to solve. The sidecar has to speak over HTTP with Prometheus. You know? So you can see we specify the Prometheus URL. So we need to find a way to actually make Prometheus routable. Yeah. Okay. So we can do an alias.
43:01 Debugging: Sidecar Connectivity Issues
43:16 It's actually okay. You need to do this above, I guess, not below, because sidecar is a subcommand already. Yeah? Yeah. Here. Alright. So this one has to speak to this container here. Correct. And we'll call it just prom, and then that just means we can use prom here. No. No. Not here. This is exposing. This is exposed, so let's expose properly. Super nice. Alright. And then we need the container ID for this one. Prom. I like to talk to myself when I
44:01 type. It keeps me right. That's good. And finally, we'll use this one. Oh. Well, that was that one. Okay. Let me make sure I didn't break the last one. Yeah. I skipped one. Okay. Cool. Prom. And we need to replace prom here. Alright. So done, done, and done. Let's see what happens. Double check. Yes. Looks fine. It's not alias. It's link, isn't it? It's link, not alias. Link. Take two. There we go. I'm assuming if they did not exit, that's a good sign. No. I mean, they will exit in the
45:09 background. So we will never know until we look. Do you want me to put logs on one of them, or are we confident? No. I mean, we will see in the next step. Right? Then we will start debugging. So the last point is to have our query engine put in a separate microservice. So this is the querier, so the query subcommand. Yeah. And with the ports exposed. And the sidecar configuration is pretty basic. I think we didn't touch on that yet, but
45:40 Demo: Running Thanos Querier
45:44 it is actually capable of reloading the Prometheus configuration and evaluating environment variables in that configuration as well. That's super useful in Kubernetes. And then you can even compress your Prometheus configuration using gzip. Sometimes that's really important, especially if you want to use Prometheus Operator. And the sidecar will, you know, uncompress it, evaluate environment variables, and poke Prometheus to reload the configuration. So this is a nice, small, tiny feature of the sidecar as well. And for the querier, we need to expose two ports. Yep. And
46:28 we've exposed them properly, and we need to link to the sidecars. Yes. Correct. So this gets linked to the sidecars, not to the Prometheuses. Right? Sidecars. Correct. To the gRPC of the sidecars. Yeah. So this is prom zero zero. We need prom zero one and then prom US. I think that's not the right place you are putting it, but that's fine. I mean, you probably need to fix it. It should be in the link, no? Ah, you are correct. So that should be oh, undo. Okay. Alright. So prom zero
47:07 Querying with Thanos Querier & Deduplication
47:21 it's prom EU1 0. Yep. That's the first one. So we call this prom EU1, and that becomes prom EU1 here. Yeah. Correct. So the thing that you specify with dash dash store is the endpoint of a Store API that the query engine should target in order to fetch metrics for query evaluation. Right? In a more complex environment, you can also put DNS endpoints so those are resolved automatically. You don't need to list them manually. So that's a choice as well.
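As a rough sketch of what that querier command line looks like, the helper below assembles the `query` subcommand with one `--store` flag per Store API endpoint. The endpoint names, ports and the function name are hypothetical. The `dns+` prefix is the Thanos convention for letting the querier resolve a DNS name into endpoints. Newer Thanos releases may prefer a different flag for registering endpoints, so treat the flag spelling as the one used in this session rather than as current guidance.

```python
def querier_args(
    store_endpoints: list[str],
    replica_label: str = "replica",   # label used for HA deduplication
    http_port: int = 9090,            # assumed port for the UI and debug endpoints
    grpc_port: int = 10901,           # assumed port for gRPC
) -> list[str]:
    """Build arguments for `thanos query` targeting the given Store APIs."""
    args = [
        "query",
        f"--http-address=0.0.0.0:{http_port}",
        f"--grpc-address=0.0.0.0:{grpc_port}",
        f"--query.replica-label={replica_label}",
    ]
    # One --store flag per Store API (sidecar, store gateway, ...).
    args += [f"--store={endpoint}" for endpoint in store_endpoints]
    return args


# Hypothetical endpoints: two static sidecar addresses and one DNS-resolved name.
example = querier_args([
    "sidecar-eu1-0:10901",
    "sidecar-eu1-1:10901",
    "dns+sidecars.example.internal:10901",
])
```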
48:03 Okay. Let me run this and see if it yells at me. It looks okay. I think that's okay. Alright. So the querier actually exposes the UI, so we can look on that port. 9090. Yep. Okay. And if we click the Stores page in the header. Yep. We can see that we have three Store APIs and only one works. I mean, I assume so because we have announced label sets only on one. And those are actually
48:44 the external labels that we set, and they are correct. But for some reason, we don't have them from the others. But okay. Let's query. Let's see how the query behaves, what we see there. Yeah. Let's fetch this metric again. Alright. Looks like what is happening is that probably... So I guess I link to the Prometheus as well as the sidecars? No. It's the sidecar that proxies to Prometheus. So this link doesn't work for those two sidecars, and for one sidecar, it works. I didn't get which one works and which
49:37 not. Yeah. So would I mention prom 9093 anywhere here? That was only in the sidecar. That's okay, because the sidecar is what matters here. So the sidecar somehow doesn't work in this configuration. So let's see. Okay. Let's attack US. This prom US port, this looks fine. Actually, we can also look at the logs of the sidecar, if you want; that might tell us something. Maybe this ID is wrong. Oh, yeah. It's wrong because you took it as yeah. That's fine. We don't actually need the Prometheus logs. It's the sidecar that is broken. Okay. So what is happening
50:01 Debugging: Querier Connectivity Issues
50:41 probably is that you took these in order, but, actually, the order was switched. So the container you pointed to is the wrong one, so it's exposed on a different port. So what is happening is that the sidecar ending with three points to one, and so on. So you need to replace... Okay. Well, you added names to these containers, so let's just use them instead of the IDs. So prom US dash 1. Okay. Let's just not use the IDs at all. Prom EU1 1. And this one, prom EU1 0. And just so the names are consistent and I don't trip myself up,
51:29 EU1 1 sidecar. Let's call this prom. You can just copy, maybe. Yeah. I should have copied. Okay. So prom EU 1 and prom EU 0. Fine. Yep. Perfect. Prom EU, prom EU, prom US. And this one here, prom US1 0. Yeah. And there are hyphens as well that you need to fix: prom hyphen EU hyphen zero. But down below, you have it without hyphens. Perfect. Yeah. Should be fine, of course. Yeah. See, every time something goes wrong in this show, it's always my fault. Always me. I should learn to be more
52:40 careful. Alright. So it's macOS's fault this time. Yeah. We'll blame macOS. Okay. So we'll delete the querier. We're deleting all the sidecars. Mhmm. And that's it. Right? Yep. 55 is the wrong ID. Oh, yep. Because I meant to type 66. Which means we should only be left with three Prometheuses running. Okay. Correct. Now we've got three new sidecars and a new querier. They seem happy. So the naming switch worked. Go to Stores. Oh, we have labels now. Give them a little bit of time to
53:29 Successful Global Query Demonstration
53:32 so just refresh, maybe. So I broke the US one. Okay. Let's focus on this US one then. Oh, that's not got sidecar in it. Okay. So I can just delete the querier and run that again. Yeah. I will fix my code. Don't you worry. Alright. It's gonna work. Ta da. Yes. Super nice. Okay. So this page shows you, as I mentioned, what Store APIs we have in the system. We have only sidecars. You can see that in the header. It's a sidecar. You exposed every replica. Sorry. What are the labels announced from each
54:25 of those Prometheuses, so you can identify your Store API. You have min time and max time. Those give the time range of the data you have in those Prometheuses, and then, yeah, the health status. So if you go to Graph and try to graph those things, we have right now two results. Without going to the graph yet, let's talk about this. So we have one series, you know, just the up metric, but from the EU and US clusters. But as you remember, we
55:13 created three Prometheuses. However, I mentioned that, sorry, the EU1 cluster Prometheuses are in HA mode. So they are scraping the same data. So the querier in Thanos is deduplicating this at runtime. And we can see that if you click the deduplication button. So turn deduplication off and then query again, execute. So now we can see that those were masked. Like, the replicas are masked, thanks to this replica label. So because one was stated as replica zero
55:56 in the external labels of Prometheus, and one as replica one, this was reduced, essentially, and deduplicated. So this is how you define HA mode. Right? And if you go to the Thanos querier command that you ran, the docker run thing, you can see that we specify query replica label equals replica. It's in the query subcommand arguments, and this is how you tell it what label it should use to deduplicate by. And, also, you can specify many of those. Right? It doesn't need to be just one
56:36 label. Nice. So this is a global view for querying. We are planning to add even more kinds of resources you can federate, because this is a kind of federation. Let's check if we can actually retrieve one year of data as well, so we can graph it. Cool. So we still have one year, and it is super nicely, transparently deduplicated, so we don't need to worry about that from the user-experience side. So it's pretty accurate. That's what I'm saying. And it also fills the gaps, potential gaps that a Prometheus would have. So
57:21 if you killed one of those Prometheuses, the user would not spot it. Right? So you can totally do rolling restarts of your Prometheus. You can upgrade them. You can change configuration, whatever, and this will not disrupt your query availability. Right? Very cool. Cool. And that was the easy part. That was the easy part. Because this is just the global view, and you can run Thanos in this mode. You can just add sidecars, have a global federated endpoint, and keep it like this for your system or for part of your system.
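To make the deduplication idea concrete, here is a deliberately simplified sketch. It is not the algorithm Thanos itself implements (the real one chooses between replicas more carefully). It only shows the principle: drop the replica label, treat series whose remaining labels match as one series, and let samples from one replica fill gaps left by the other. All names and sample data are hypothetical.

```python
from collections import defaultdict

# A series is a label set plus a list of (timestamp, value) samples.
Series = tuple[dict[str, str], list[tuple[int, float]]]


def deduplicate(series: list[Series], replica_label: str = "replica") -> list[Series]:
    """Merge series that differ only by the replica label."""
    merged: dict[tuple, dict[int, float]] = defaultdict(dict)
    for labels, samples in series:
        # Identity of the series once the replica label is ignored.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        for ts, value in samples:
            # Keep the first value seen for a timestamp; later replicas fill gaps.
            merged[key].setdefault(ts, value)
    return [(dict(key), sorted(samples.items())) for key, samples in merged.items()]
```

With two EU replicas that each missed a scrape and one US instance, the result is two series, and the EU series contains the union of both replicas' timestamps.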
57:40 Long-Term Storage with Object Storage
57:58 However, we can go further and talk about, you know, cheap storage and long retention of those metrics as well, in a cheap, yeah, space. Cool. So are we going back to the tutorial? Yes. And continue. So the second page is about long retention. So now let's spin up some object storage. Thanos supports, you know, pretty much any kind of object storage you want, very many clients. But Minio is an object storage with an S3-compatible API that you can run locally on your disk. So let's run it.
58:15 Demo Setup: Running Minio (S3 Compatible Storage)
58:42 Cool. Cool. Cool. You can create a Thanos bucket, or you should, actually. And we can actually check if Minio is running. Yeah. Don't click that. You probably need to go to localhost 9000. Yeah. And the access is Rawkode and Rawkode something. Rawkode loves OBS. There we go. Correct. So, yeah, we should have a bucket ready, but it's empty, right, because we just created the bucket. So let's go and check how we can back up the data that Prometheus collected into this object storage. What we need to do is reconfigure our sidecars.
59:19 Configuring Sidecars for Object Storage Upload
59:29 And first, we need to create this configuration file. It doesn't need to be a file. It can be passed through a flag as well, but let's configure a file. This will be easier. You specify, you know, the type as S3, maybe GCS, maybe Azure, whatever, and some configuration dependent on the object storage. Okay. So Minio has a name of minio. So that means I probably wanna change that configuration file. Yes. Yes. And then we need to make sure we have minio everywhere. Actually, there will be only a couple of components that will access object storage,
1:00:05 but then we'll refer to this as minio. Perfect. Yeah. Okay. Cool. So we wanna stop the three sidecars, and then we're gonna reconfigure. So run them again, but with a couple of flags. And shipper... I mean, how are we gonna do it? So maybe let's... Just those two flags? Exactly. Actually, there will be one more, but let's copy them first. Okay. So those two flags are passing the object storage configuration to the sidecar. So now it knows there is object storage, and it will upload every block that appears in the
1:00:48 Prometheus local directory into the object storage. And then you need the no. No. Sorry. Actually, both of them. So there are two. Did you copy all of them? Alright. Because right now, we need to... without the name, but that's fine. Yep. Yes. Correct. Perfect. So what is happening here: we added the configuration for Minio, and, also, the sidecar has to access the Prometheus directory that is mounted. Right? Because it needs to know whenever Prometheus has actually dumped a block, to have it uploaded to the object storage.
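The two pieces described here, the object storage configuration and the extra sidecar flags, can be sketched as follows. The field names follow the Thanos S3 client configuration and the flags are standard sidecar flags, but every value (bucket, endpoint, credentials, paths, URL) is a placeholder assumption, and the helper names are hypothetical. Note that the TSDB path given to the sidecar has to match wherever the Prometheus data directory is actually mounted inside the sidecar container.

```python
def objstore_config(bucket: str, endpoint: str, access_key: str,
                    secret_key: str, insecure: bool = True) -> str:
    """Render a minimal S3-type object storage config as YAML text."""
    return "\n".join([
        "type: S3",
        "config:",
        f"  bucket: {bucket}",
        f"  endpoint: {endpoint}",
        f"  access_key: {access_key}",
        f"  secret_key: {secret_key}",
        # Plain HTTP is typical for a local Minio, not for a real cloud bucket.
        f"  insecure: {str(insecure).lower()}",
    ]) + "\n"


def sidecar_upload_args(prometheus_url: str, tsdb_path: str,
                        objstore_file: str) -> list[str]:
    """Build `thanos sidecar` arguments that enable block upload."""
    return [
        "sidecar",
        f"--prometheus.url={prometheus_url}",
        f"--tsdb.path={tsdb_path}",              # must match the mounted data dir
        f"--objstore.config-file={objstore_file}",
    ]


# Placeholder values only; replace with your own bucket and credentials.
yaml_text = objstore_config("thanos", "minio:9000", "<access-key>", "<secret-key>")
```

These arguments belong after the `sidecar` subcommand, not among the `docker run` options.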
1:01:35 Okay. Let me just check the US one. US replica. Okay. That's fine. So I just need to add one more link, which is minio. Yes. Minio. Easy. Mhmm. I shouldn't say that before it works, should I? Okay. Alright. So we wanna stop the three sidecars. In fact, let's just do docker container ls, grep sidecar. 30817. 30CC. Alright. Good. And run. Yeah. Oh, wrong. The objstore flag needs to be after the sidecar subcommand. Yeah. Those two flags are Thanos flags, not Docker flags. Yep.
1:02:56 Alright. Let's try that again. Okay. So it's maybe working. Correct. They've not crashed. So how do we confirm? Can we just look in the bucket? Yeah. Maybe. I think you need to click on the Thanos bucket, sorry. So if you go to the Minio browser again, just click on thanos. Yeah. Okay. Something is not exactly working. So let's check the sidecar logs. Thanos ship... No. So the shipper message is just a side effect. I think the problem is: read open dot data, no such file or directory. And this is because it's not data. It's
1:04:09 Prometheus. Right? This is because you copied and manually added the flag, but that's fine. This is not the problem. The problem is that the sidecar has the wrong directory. So we mounted the host directory, the prom EU1 replica zero directory, into the Prometheus directory, but let's make it slash data. Sure. Because I altered how yeah. And it looks like the default is slash data, and we need to do that for all sidecars. Yeah. Because if not, we need to manually change the flag to point it
1:04:53 to Prometheus, but this should work. Yeah. I hope so. Uh-huh. 306. B999. Cool. It looks like we should see that in the browser now. Yep. There we go. So we're using Minio as an S3-compatible object store, and Thanos is writing data to it. Right? Yeah. Normally, it's recommended to use S3 or GCS or Azure, whatever. It's actually quite cheap, and it's reliable. So they do backups. They scale it for you, so you don't need to maintain your own hardware.
1:06:03 It's very cool. I guess there's a performance trade-off for storing my data here. Right? On the query side, I guess it's slower? Yeah. You may need to add more caching and essentially use more memory and CPU than you normally would, just to have those caching mechanisms and to avoid rate limits. And this whole complexity is in another component that we will run right now, which is the store gateway. So this is the component, the microservice, that understands the object storage and the
1:06:36 Introduction to Thanos Store Gateway
1:06:42 format inside, and exposes that as a Store API. So, with the same gRPC API that allows you to query this data, and this is where the complexity of trying to avoid this latency penalty is. The whole complexity is there, but you are right. Like, Prometheus or Thanos, these are not low-latency APIs. Right? You can wait seconds for your query. Over ten seconds, like, ten seconds may be painful for the experience, so you want to, you know, scale it accordingly. But you can totally have sub-second or
1:07:22 sub-five-second results. I guess the best practice here would be to layer your data. Like, would you use the SSD disks for all data that's, like, up to two weeks old? And then, if it's older than two weeks, you could store it in an object store where maybe you can accept that extra latency on the read side. Is that... Correct. And you can totally do that. And, actually, this is naturally what is happening by default. Right? Because, as we will see, the uploaded data is, you know, older data that was stored and collected by Prometheus. Well,
1:07:57 we generated it, but normally, it will be collected by Prometheus. The current data of Prometheus, so the first three hours of data, is stored in memory in a totally different format, not in a TSDB block. So you only have data in the object storage after three hours. So that's why, and we will see that, you still need to query Prometheus for this fresh data. At least in this deployment model, where you have just the sidecar uploading the data from Prometheus. Right? So, yeah, let's actually try to fetch, to use
1:08:30 Demo: Running Thanos Store Gateway
1:08:32 this data. Right? So we have two more components to go. One is to actually do the querying against the object storage, which is the store gateway. Yeah. And this guy needs to have access to Minio, just Minio, and expose a gRPC port. Okay. So that's exposing. One yeah. Correct. And it needs link minio minio. Mhmm. This file is fine. It's correct. Yeah. Oh, that's it. Yep. Okay. So this is a component you can scale accordingly. You can have a store gateway per interval of time. You can have
1:09:33 a store gateway per different set of blocks. You can shard the store gateway and tell it, hey, use some consistent sharding and let, say, five store gateways choose which blocks each store gateway is responsible for, so you can, you know, horizontally scale like this. But let's actually adjust the querier, and you have that in the tutorial as well. Now we need to adjust the querier to know about this new Store API that just appeared. So here, we just added a new dash dash store. You can just copy this one, just the last flag,
1:10:08 Configuring Querier to include Store Gateway
1:10:13 and then we need to add the link to this one as well. Yeah. So this is the gRPC endpoint of the store gateway. So let's do that. Okay. So we are doing this. Oh, yeah. I'll just type it. Store. Yeah. Good. Fine. Correct? A store. And then we called that... And we need the link as well. Yeah. Link. Store gateway as store. Mhmm. And let's, yeah, let's run the store gateway. I think we ran it already. Yeah. We did. Querier, and run again. Yeah. Usually, it's just easier in Kubernetes, where
1:11:05 you define this infrastructure using code, unfortunately, or, like, some YAML templates, whatever. But there is lots of manual work if you are running Docker. Yeah. We have the store. Right? And you can see that there are now three announced label sets, because in one object storage we store, as we call them, streams of data. One stream per unique label set. Right? And the store actually properly says how old your data is. You have data from a year ago until today. Right? Cool. So let's play with it. Okay. Let's go to Graph, and let's graph again one of the
1:11:55 metrics. Cool. And now what is happening is that okay. Well, this is actually expected. Do graph. Yeah. And I will tell you why. Because you are checking the recent data. However, Prometheus is not collecting this artificial series. Right? It's collecting its own. So this is why, you know, this is only in the blocks. Right? Yeah. Okay. I understand that. So if I change that to a year, we're gonna get data which is coming from the object store, but that one hour failed because that data now lives in the Prometheus memory.
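The way a query ends up at the sidecar, the store gateway, or both can be pictured with the min time and max time each Store API announces. The sketch below is an illustration of that overlap check, not the querier's real code. The store names and the time ranges (in hours, for a hypothetical deployment where the sidecar covers two weeks and the bucket lags three hours behind) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Store:
    name: str
    min_time: int  # oldest data the store can serve
    max_time: int  # newest data the store can serve


def stores_for_range(stores: list[Store], start: int, end: int) -> list[str]:
    """Return the stores whose announced time range overlaps [start, end]."""
    return [s.name for s in stores if s.min_time <= end and s.max_time >= start]


NOW = 365 * 24  # "now", expressed in hours since the start of the generated year
STORES = [
    Store("sidecar", min_time=NOW - 14 * 24, max_time=NOW),   # recent data only
    Store("store-gateway", min_time=0, max_time=NOW - 3),     # uploaded blocks only
]
```

A query over the last hour would only be sent to the sidecar here, while a one-year query touches both stores and the querier merges the results.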
1:12:32 Querying Data from Multiple Sources
1:12:45 Well, to be honest, it's not collected anymore, because we artificially created the series and created artificial blocks. Whatever is collected is actually Prometheus's own metrics. It scrapes itself. That's why we don't have this particular one. So there are two reasons why we don't have recent data. However, let's go further and make sure this comes from the bucket. Right? Because right now, this querier fans out the query to all of these sources. However, let's quickly showcase the debug way of doing this. So if you go to the new UI,
1:13:14 Exploring the Thanos UI & Store Filtering
1:13:21 this is the new React UI, which is pretty similar. Unfortunately, you need to type the metric again. Continue. Yep. You need to go to Graph again. Because it's recent data, you won't see it. Correct. And now you need one year, and Enter doesn't work to execute. Well, I think we have this fixed. Okay. And now we can see the same data. Now let's do this. Let's scroll up a little bit. There is, like, enable store filtering. This allows us to filter stores. So let's
1:13:58 pick any of those. Like, four is the store gateway, or we can pick one yeah. Pick this one, just to show that we'll only route to the first Prometheus. And as you can see, the series is EU1. Right? Mhmm. This will route only to one of those Store APIs, if you want to debug what is coming from where. Right? And if I change this to this one? Then it's only bucket data. Right? It's fine. Okay. And we have twelve minutes, I guess. Right? Yes. And we could quickly, and it'll be super nice, add
1:14:38 Introduction to Thanos Compactor
1:14:44 a last component, which is interesting here for long-term maintenance: compaction, offline deduplication as well, and downsampling, lots of features. So this is the compactor. So, you know, the local compaction of Prometheus, just taken into a microservice again. It has to be a singleton. It has only an HTTP address, because there is no gRPC; it doesn't serve any metrics over gRPC. It just works on the bucket. And it's like a batch job, but periodic, that performs those operations. Now what we need to do: we probably need to link to Minio.
1:15:05 Demo: Running Thanos Compactor
1:15:24 That's the only thing, because it operates on the bucket, and we should be fine. Okay. Before we start, let's quickly talk about the flags, because they are not trivial. Compact is the subcommand, then we have wait. That makes sure it runs as a long-running service. It's not just a quick batch of, hey, do the operation and done. And then we make it run more often. So the interval is pretty short. It's half a minute for all those compactions and downsampling. We change the consistency delay. I could talk,
1:16:04 you know, for a long time about this, but, essentially, this is a feature for buckets that are eventually consistent, not strongly consistent, because not all of them are. Right? So to resolve this without strong coordination between components and stuff like that, which makes things super complex in the distributed systems world, we introduce other mechanisms, like delays, to wait until consistency is reached. Right? So we disabled that. We essentially use the blocks no matter when they were uploaded. So we assume that the bucket is consistent, because it's a file system. Right? Almost.
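The effect of the consistency delay can be illustrated with a small filter: a block only becomes eligible for the compactor once it is at least as old as the delay. This is a conceptual sketch with hypothetical names and made-up timestamps, not the compactor's implementation. In the demo the delay is set to zero, which in this model makes every block eligible immediately.

```python
from datetime import datetime, timedelta


def blocks_ready(uploaded_at: dict[str, datetime], now: datetime,
                 consistency_delay: timedelta) -> list[str]:
    """Return IDs of blocks old enough to be trusted as fully uploaded."""
    return sorted(
        block_id
        for block_id, ts in uploaded_at.items()
        if now - ts >= consistency_delay
    )


# Hypothetical block IDs and upload times.
NOW = datetime(2020, 1, 1, 12, 0)
BLOCKS = {
    "block-old": datetime(2020, 1, 1, 10, 0),
    "block-fresh": datetime(2020, 1, 1, 11, 59),
}
```

With a 30-minute delay only `block-old` is picked up. With a zero delay both blocks are, which is reasonable only when the bucket behaves like a strongly consistent store, as a local Minio on a file system mostly does.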
1:16:47 And that's it. Let's try to run this. It will try to compact things, but it should actually quickly downsample the data, hopefully. And I want to show one more feature of it. Cool. Looks like... How do we see the compaction? I guess this is what... Okay. No. Yes. We can see the operations, but there is cool stuff. Go to the HTTP port of the compactor, which is oh, I don't remember. 19095. Yeah. And, sorry, slash loaded.
1:17:08 Thanos Bucket Viewer & Downsampling Explained
1:17:34 Perfect. So this is a bucket view, but from the perspective of being block-aware. Actually, go to the new UI. It's much nicer. We are in a transition period. So you can click on each of those blocks. Those are the blocks. You can click on them, and it will show you all the metadata. So its ID, duration, how many series it has, how many chunks, samples, what resolution it has, and labels. So you can see, already within one stream, we have three things. One is the raw data, but what you
1:18:13 clicked is actually a downsampled resolution. And this is one-hour resolution. I can see that by resolution equals, those are seconds, something something seconds. Right? And so we can see it already downsampled everything because, yeah, the blocks were super small, so it was quick. So this is how you debug stuff on your object storage. It's super cool. It's super easy. We want to extend this view as well. We want to make sure you can even download a block from this view, or maybe do some actions like delete series, delete block. But it's essentially a bucket
1:18:49 viewer like you have on AWS, but block-aware. Right? Awesome. I'm trying to understand this. What do the colors represent here? Different resolutions. Right? So this is resolution zero. So raw data, whatever the scrape interval was. This is raw resolution, which means that you have all the samples. When you go to a lower resolution, it is still accurate. However, it only stores the aggregates. Right? So it has all the averages and maximums and minimums to make sure your queries are accurate. However, it doesn't keep all the samples. Like, you don't know exactly
1:18:53 Recap of Thanos Components & Deployment
1:19:39 at what moment this happened. At five-minute resolution, you only have one sample per five minutes. And then the blue one is one-hour resolution. You only have one sample per hour. And this is super crucial when you are querying long time ranges. So, actually, let's go to the querier, and I will show you. So okay. Now this is good. We are actually only querying the bucket, because only in the bucket do we have downsampled data. So that's okay. Let's keep this filter on. And now we are querying only raw data, because you have this
1:20:19 "only raw data" selection. You see? Yep. Try to execute it a couple of times first, and then you'll see the load time. Let's remember this. This is, like, six hundred milliseconds. Right? Mhmm. It's pretty quick because, you know, the blocks are small. Now switch it from only raw data to max five-minute downsampling. So this will use five-minute resolution. Execute it a couple of times as well. It should actually be quicker. For me, it was quicker. But okay. Six times quicker. Right? Because, actually, there are probably around six times fewer samples. Right?
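The idea of keeping only aggregates per window, as described above, can be sketched roughly as follows. This is a simplified illustration, not Thanos code: the `Aggregate` class and `downsample` function are hypothetical names, the 30-second scrape interval is an assumption, and only count, sum, min, and max are tracked.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Aggregate:
    """Summary of all raw samples that fall into one downsampling window."""
    window_start: int  # start of the window, in seconds
    count: int
    total: float
    minimum: float
    maximum: float

    @property
    def average(self) -> float:
        # The average is derived from sum and count rather than stored.
        return self.total / self.count


def downsample(samples: Iterable[Tuple[int, float]], window_seconds: int) -> List[Aggregate]:
    """Group (timestamp, value) samples into fixed windows and keep only aggregates."""
    buckets: Dict[int, Aggregate] = {}
    for timestamp, value in samples:
        start = timestamp - timestamp % window_seconds
        agg = buckets.get(start)
        if agg is None:
            buckets[start] = Aggregate(start, 1, value, value, value)
        else:
            agg.count += 1
            agg.total += value
            agg.minimum = min(agg.minimum, value)
            agg.maximum = max(agg.maximum, value)
    return [buckets[start] for start in sorted(buckets)]


# One hour of artificial raw samples, assuming a 30-second scrape interval.
raw = [(t, float(t % 600)) for t in range(0, 3600, 30)]
five_min = downsample(raw, 300)

print(len(raw), "raw samples ->", len(five_min), "five-minute aggregates")
```

Min, max, and average queries over a window stay accurate with this representation, but the exact moment of each raw sample inside the window is lost, which matches the trade-off described in the demo.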
1:21:00 Yep. And if you go to one hour, it will be broken, and this is the same. Somehow, this artificial data doesn't work well with one-hour downsampling, and I need to debug this at some point. Okay. I'll go back to five. Oh, look. Yeah. Yeah. That would be cool. Let's cut it. Anyway, this is what I wanted to show you, really. And as you can see, whether you go through raw data or downsampling, within one year, it's accurate. You don't need so many samples. It cannot even render more. Right?
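To put rough numbers on the one-year case, here is a back-of-the-envelope calculation of how many samples a single series holds at each resolution. The 30-second raw scrape interval is an assumption; the demo's actual interval is not stated.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Step between samples, in seconds, at each resolution.
# The raw step is an assumed scrape interval, not taken from the demo.
RESOLUTIONS = {
    "raw (assumed 30s scrape)": 30,
    "5m": 5 * 60,
    "1h": 60 * 60,
}


def samples_per_series(range_seconds: int, step_seconds: int) -> int:
    """Number of samples one series holds over the range at the given step."""
    return range_seconds // step_seconds


counts = {
    name: samples_per_series(SECONDS_PER_YEAR, step)
    for name, step in RESOLUTIONS.items()
}

for name, count in counts.items():
    print(f"{name}: {count:,} samples per series per year")
```

Under that assumption, a one-year query touches roughly a million raw samples per series, but fewer than nine thousand at one-hour resolution.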
1:21:37 So there is no need to fetch and select so many samples from the back end. That's why it's much more efficient and actually cheaper as well. Right? Okay. So one more thing I want to mention is that, you know, this is like a global-view Thanos deployment. There is also the approach of configuring Prometheus to push data, remote-write data, like Cortex is doing, to a component called the receiver, which then uploads to the back end and connects to the querier and all of this. So Thanos is also flexible on that, but this
1:22:18 is kind of the default mode you can run Thanos with. Yeah. Okay. Is that everything? Looks like it. Yeah. Well, if you want, we can play more. We can break things. We can... Oh, no. No. No. No. I was being entirely sarcastic. We have covered so much. That is a lot of awesome information, and I love the Katacoda work that you provided here to walk us through each of those steps. I think that's gonna be infinitely valuable to people. Let's try and see if we can
1:22:22 Host Summary & Thank You
1:22:53 summarize this in a couple of minutes. So what your Katacoda tutorial has walked us through is, one, using thanosbench to generate artificial time series data at any scale that we want, based on a YAML definition. We've seen the flags there; we could have tweaked them and played around with it. That was really cool. We then started three different Prometheus servers: two in the European region, replica zero and replica one, for high availability of Prometheus in that region, and another Prometheus for our US region data. The tutorial then guides us through deploying a
1:23:29 sidecar, another container in this example, which acts as a proxy to each of those Prometheus servers, followed by a deployment of the Thanos querier, which speaks to the sidecars and fetches the data, doing deduplication based on the label configuration that we provide. See, that alone would have been awesome. But then you just kept it coming. We then looked at how we could use the new UI. We got block storage through MinIO or any other S3-compatible store, and then you even threw in some awesome downsampling at the end. Like, there is so
1:24:04 much valuable information here that I'm gonna have to rewatch this episode a few times, I think, just to consume it all. That was awesome. Thank you so much for that. If anyone has questions, you've got a couple of minutes to get them in before we wrap up, and we'll do our best to tackle them. I will make sure the show notes have all these links as well, links to the docs. That was really, really cool. Thank you for taking the time to put that together and then not only that but walk me through
1:24:14 Community, Contribution, and CNCF SIG Observability
1:24:34 it and live-debug our container setup for a Mac as well, which just made it all a little bit more fun. Yeah. Yeah. You're welcome. All of this should be part of the official Katacoda tutorials for Thanos, hopefully. So this work was not a waste of time and not only a one-time thing. It will be useful for people because we went through that. Right? So that's pretty useful long term as well. Yeah. I can immediately see the problems that, you know, Thanos is solving for a Prometheus setup. I think,
1:25:11 you know, everyone running Kubernetes is in this situation where they want to improve their observability and their monitoring. And, you know, the tools that are coming out to help us do that, I think, are just phenomenal. So, you know, thank you for your work on these projects as well, and for making this really consumable to anybody that just wants to improve their setup. So that's great. Yeah. Yeah. But, you know, there are still lots of things to do. So please, you are welcome to contribute and give feedback. We are exploring ways to get those metrics
1:25:42 data into analytics platforms, because since you have long-term retention of the metrics, you want to leverage that for reporting, billing, or really just machine learning, AI, whatever. So we are doing lots of work on that. And we do all of that in the SIG Observability meetings, which are announced on the CNCF page. So please feel free to join us as well and learn more about those. But yeah. And, yeah, thank you, David, for having me. I think it's super, super useful for everyone to keep doing this.
1:26:15 Final Wrap-up
1:26:20 I can learn about CNCF projects I've never seen as well. So that's cool. I know. I love that I've been focusing so much effort on the show the last few months, because there are just so many awesome projects in the CNCF landscape. And, you know, getting people like yourself, the experts that are doing this day in and day out, and having them not just do the technical walkthroughs, but bring their knowledge and experience and share that with people, I think is incredibly valuable as well. So... Yeah. Totally. And I think the problem is
1:26:48 that those projects and maintainers, I mean, you know, there's really no communication between those projects. Everyone has their own echo chamber, and this is really bad. This is really sad as well. And it's not that they do that on purpose. It's just that people are busy. Right? So... Yeah. It would be amazing just to share the problems and solutions we have, because we solve the same problems. So your work, actually, yeah, I'm pretty sure it helps a lot on this point as well. Okay. Well, thank you for your time. We've been
1:27:21 a little bit over, but I think it was definitely worth it. Remember, people, check out SIG Observability; the invites are in the calendar. Get contributing there. Keep an eye on the Thanos mentoring program if you're new to open source or to Thanos and want to go through that program. And again, thank you for joining me. Have a great day, and I will speak to you again soon. Thank you. Have a great day as well. Cheers. Bye. Bye bye.