About this video
What You'll Learn
- How Parca scrapes pprof endpoints to ingest continuous profiles from applications.
- How to compare two profiles and spot regressions across application versions.
- How flame graphs and icicle graphs reveal hot paths in profiles.
Frederic Branczyk of Polar Signals joins to install Parca on a fresh Kubernetes cluster, walk through the eBPF agent and server, and explore flame graphs, profile comparisons, and pprof scraping for continuous profiling at roughly one percent overhead.
Jump to a chapter
- 0:00 Intro
- 0:48 Introduction
- 1:28 Meet Frederic
- 1:42 Guest Introduction & Background (Frederic Branczyk)
- 4:20 What is continuous profiling
- 4:25 What is Parca? & Continuous Profiling Explained
- 6:30 Benefits of Continuous Profiling (Cost Savings, Performance)
- 7:18 Profiling Overhead & Sampling
- 10:25 Running Parca in production
- 10:45 Running Parca in Production
- 12:23 Parca Components (Server & Agent)
- 14:15 Why Parca
- 16:15 Audience Question
- 17:37 Hands-on Demo: Installing Parca on Kubernetes
- 18:45 Parca Agent
- 21:14 Accessing the Parca UI (Port Forward)
- 21:25 Port Forward
- 22:05 Dark Mode
- 23:00 Exploring the Parca UI
- 24:19 Viewing Profile Time Series Data
- 25:47 Filtering Profiles by Labels
- 26:05 Understanding the Flame/Icicle Graph
- 26:25 Compare
- 26:51 Comparing Profiles Feature
- 28:32 Merging Profiles (Holistic View)
- 30:25 Compare Versions
- 31:01 Comparing Different Application Versions
- 32:05 Flame Graph 101
- 34:23 Other Use Cases: Latency & Incident Response
- 37:54 Audience Q&A (Metrics, eBPF, Compare)
- 37:55 Does Parca profile everything
- 39:25 Can Parca compare
- 40:15 Do we only get CPU profiles
- 40:18 Audience Q&A (Other Profile Types, pprof)
- 41:35 Ingest data into Parca
- 42:50 Embedded entrepreneur
- 42:51 Demo: Scraping from Go pprof Endpoints
- 46:10 Scoping
- 47:35 Runtime
- 48:25 Memory Usage
- 50:30 TLDR
- 50:31 Explanation of Different Go Profile Types
- 55:00 Autocomplete
- 55:20 Parca UI Feature: Query Autocomplete
- 57:30 Microservices
- 1:07:37 Debugging Microservice Profiles (Symbolization Issue)
- 1:10:48 Understanding Symbolization & Debugging Info
- 1:13:18 Kernel vs User Space Profiling (eBPF Benefits)
- 1:16:14 DWARF Debugging Information & Compile Flags
- 1:18:21 Summary of Parca's Value
- 1:19:44 The Future of Parca (Storage & Scaling)
- 1:21:45 Future Integrations (Autoscaling, PGO)
- 1:24:35 Conclusion & Call to Action
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
0:48 Introduction
0:48 Hello, and welcome back to the Rawkode Academy. This is Rawkode Live and I am your host Rawkode. I've now said my name three times in ten seconds and I hate it. Today, we are taking a look at an open source project to bring continuous profiling to everyone. The project is called Parca and we are joined by the CEO. Hey, I turned that little video off. Alright. I'm just gonna ignore it. I'm gonna pop over here where I've just put myself off now. I'm joined by the CEO and founder of Polar Signals, Frederic Branczyk. Hey, man. How's it going?
1:26 Hey. Pretty good. See, you know, I say at the start, don't worry if things go wrong. It happens all the time. It's usually me. In fact, it's always me. I just break stuff every single episode. So thank you for joining us today. For anyone who's not familiar with yourself, can you give us just a little bit of a hello and an introduction, and then we'll take it from there? Of course. Yeah. First of all, thank you for having me. So, yeah, I'm Frederic. I founded Polar Signals. I've been part of the, like, cloud native
1:42 Guest Introduction & Background (Frederic Branczyk)
1:58 ecosystem for, gosh, six, seven years. I joined the Prometheus team in 2016 as a maintainer. Basically, like, I think a few months after Prometheus had joined as the second CNCF project of all time. So, yeah, I've been around for a while. And as I said, Prometheus was kind of my entry point, but I kind of was in charge of all things that connected Prometheus and Kubernetes. And so to this day, I'm actually still a maintainer for the Kubernetes integrations in Prometheus. And over time, I also co-created, but
2:45 also eventually became the tech lead of the special interest group for instrumentation in Kubernetes. After, I think, four or five years of service, I recently handed off that that position to make room for some new blood. But, yeah, kind of that that that intersection of Prometheus or more generally observability and Kubernetes has been has been my thing for a very long time. I originally started doing all of this when I joined CoreOS. I don't know if people still remember CoreOS, but after the CoreOS acquisition by Red Hat, I kind of stuck around at Red Hat for
3:30 a while. I became architect for all things observability. And kind of long story short is that I eventually quit Red Hat to found Polar Signals because I felt like there was something about continuous profiling that was worth pursuing. Awesome. Yeah. That's a really good overview of your history. You've got one of those usernames and faces, because you were so active in the Prometheus ecosystem and community, that, you know, I see your little avatar on GitHub all the time, and then, yeah, you started showing up doing a lot of KubeCon talks and other conference
4:06 talks talking about observability instrumentation. I think the Prometheus operator as well. It's great to have you on the show and I'm really looking forward to taking a look at Parca and what you've been working on for the last little while. So I guess, I've said that Parca brings continuous profiling, but could we dive into that a little bit more and just understand what is Parca? What are its main components? And what do people get out of deploying Parca to their infrastructure? Yeah. Maybe before we dive into Parca as a project itself, let's talk a
4:25 What is Parca? & Continuous Profiling Explained
4:40 little bit about continuous profiling because I think it's still an emerging methodology. So while hyperscalers have been doing this strategy for almost a decade, it was pretty inaccessible to most other companies, and that is what we're trying to fix with the Parca project. But continuous profiling in its essence is basically that we always profile absolutely everything in our infrastructure. And this is useful for the same reasons why we generally do profiling of our software. Right? We want to understand where our resources are spent, where is my CPU time spent, where is my memory
5:30 being spent, where are my memory allocations happening, or, really, you can think of profiling data as anything that you can possibly think of where you have a stack trace, and you attach a number to it. That's basically what a profile is. Right? Just a collection of many stack traces with a number attached to it. In the case of a CPU profile, it's how often have we seen this particular stack trace, and we translate that to time spent in that function. And so as I said, CPU profiles are something that is really interesting because CPU tends
6:12 to be the most expensive thing in cloud environments, but memory is certainly a next contender or network IO or, like, disk IO. All of these are things that contribute to cost. And this is kind of a good segue to why do people want to do this? And the most common answer to that is people want to optimize their cloud bills. And that's exactly why companies like Google have been doing this kind of methodology for over a decade. They kind of identified, and I think it's obvious once you say it. But once
6:30 Benefits of Continuous Profiling (Cost Savings, Performance)
6:53 you have that data about where your resources are being spent down to the line number, right, like, that's how granular profiling data is, you can actually do something about it. If you don't have that data, you just don't know where to start. And continuous profiling, as I said, is the methodology of just always doing profiling. And there are a couple of things involved in actually making that possible because maybe maybe this is, like, something that's already, you know, a a question that was queued up in your head. When typically, when we talk about talk about or think about profiling, people immediately
7:18 Profiling Overhead & Sampling
7:32 think about overhead of profiling. Right? And they wanna know, like, how much is doing this actually going to cost me. Right? And the the answer to that is essentially we we we've developed a couple of strategies to reduce profiling overhead so that we can always do this. And there are there are kind of two main things that are that are contributors to this. One is the type of profiling that we do, and what we do is called sampling profiling. And that essentially we can think of it as how often are we looking at this the current stack
8:22 traces, let's say, for a CPU profile. And the obviously, the more often we sample those stack traces, the higher the overhead is going to be. But we found that for, like, looking a hundred times per second is still like, offers enough granularity, at low very low overhead, something less than a percent of overhead, of a CPU core. And the reality is, once you have this type of data, almost everyone we find has easy very easy optimization potential of, like, 10 to 30% of their infrastructure. So the trade off is, am I spending 1% overhead to gain 10 to 30% optimization? I think
9:12 that's an easy calculation to make. Right? Yep. Definitely. There there was a lot of information there, so thank you for for getting into that. And you're right, one of the things that I had on the back of my head was like, what am I paying for this for my own infrastructure? And you said the overhead is I don't think you said negligible, but I'm gonna throw that word out there. Just a certain point where it's not that much because of your sampling strategy, which I think is is really interesting because I think the more visibility people can have
9:43 into what's running in their clusters, it can answer two of the hardest questions I think people have in the Kubernetes space. Like, I don't think there's been a conference or a user group that I went to where someone hasn't asked me, how do I set resource constraints on my pod? I'm like, ah, yeah, you have to profile your application and understand what normal looks like. And it's just, it's just quite a scary thing. I don't think a lot of people have profiled their applications before, well, probably ever. And the second question is, how do I
10:16 debug garbage collection and memory leaks and other things? And again, you're back to the profile and I was like, I'm sorry, you're gonna have to learn this stuff eventually. So it's nice to see that Parca can come into this space and do the continuous thing in a way that isn't gonna cause too much overhead and just increase that visibility without it being too daunting, too scary, too nerve-racking. Like, we're all very careful, I hope, about what we deploy to production. And there's my next question. Can I run Parca in my production infrastructure, and
10:45 Running Parca in Production
10:48 is that something that is encouraged? Yeah. I mean, it's really just like any other observability data. You can certainly run it in preproduction environments or something, and we encourage that as well. But on a similar note as, like, your metrics are just never gonna behave the same way in your local dev environment or even a staging cluster. It's always gonna be different in production, and so you're always gonna have to measure these things in production. Yeah. Definitely. If you're trying to work out what normal
11:27 looks like in the staging environment, it's never never gonna correlate to what you have in your your real production environment. I couldn't agree more. Yeah. And in in terms of, like, how production ready is this this this, like, open source project, let's say, like, I think it should be at least a little bit taken with a grain of salt because it is a very young project. Right? Like, it's existed for maybe a couple of months. But it's every week. We we have an entire company working on this and improving it. Right? Like, every week, it improves drastically
12:02 from literally every dimension. Like, on overhead, it continues to improve. On the storage, it continues to improve. Yeah. Just on on every really really on every dimension, which kind of, I think, is also a good point where we can talk about what you said earlier, what makes up the Parca project. Right? Because there are actually a couple of moving components that I think are good to understand. So, the Parca project, largely, you can think of it as having two components, the server and the agent or the collector, you could say as well. The the agent, the collector is the thing that
12:23 Parca Components (Server & Agent)
12:44 actually does the profiling part. It's the thing that captures the raw data, and then batches it up essentially and then sends that to the server, which then stores it and where you can also query it through its API and UI that is all integrated. It's all very inspired. At least the server part is very inspired by Prometheus. That's probably not much of a surprise given my background. Like, we really focused on having this extremely simple, really great first experience where it's a single statically linked binary. You know, all the container images exist. There
13:27 are, like, manifests available for Kubernetes. Hopefully, you know, your your first interaction with the project is very, very smooth and very simple. Sweet. Well, we're gonna take a look at it in just one second. Before we do that, I'll throw one more question at you. Like, it's it's an easy question, you know, but no stress or anything like that. But I feel like we're in a really great position these days in the cloud native ecosystem, you know, where Kubernetes is almost ubiquitous with people. Maybe that's it's a good thing depending on who you're talking to. But we're at
14:02 a stage where it's ubiquitous. Almost everybody has a metric server. Everybody's got a Prometheus. Everybody is doing some monitoring. We're seeing a bit of a push with observability and people looking at distributed tracing, etcetera. Do you feel that Parca is one of these components that if you have a production cluster, then Parca should just be installed? Like, are we looking at a project here where we're saying to people, if you have production infrastructure that is Kubernetes-based, yes, you must have this installed, like, next to your Prometheus. Absolutely. The whole point of or the
14:15 Why Parca
14:38 the the reason that actually motivated me to found an entire company around this is entirely based on the premise that I believe that this type of data is completely complementary to your existing observability stack. As a matter of fact, kind of I didn't know it back then, but I think the seed for all of this was planted in a, like, an invited keynote that I gave at KubeCon in Barcelona in 2019 where Tom Wilkie and I kind of gave, he's the VP of product at Grafana now. He and I kind of gave an a like, some guesses, let's say, what the
15:23 future holds for observability. And part of that was where where I predicted that I thought that continuous profiling was going to become a major player in the observability world simply because it just like any other observability type of data, it gives you a unique kind of insight into your running application. It's unlike metrics. It's unlike tracing. It's just just like metrics and tracing are complementary to each other, continuous profiling data is also complementary to to those. Sweet. I love that. You make a prediction and then go out on a multiyear journey to prove the prediction correct yourself.
16:10 It's it's as they say. Like, if you wanna predict the future, you gotta build it yourself. Yeah. Perfect. That's awesome. Okay. So we're gonna jump to the screen share in just one second. We do have a question from the audience, so I'll pop that up on the screen and read that out. But Ivan Ash is asking, can Parca also collect metrics and be used for dashboards and stuff, or do we need separate monitoring agent? So it it's not exactly like metrics, but you can think of it as a super high, highly granular CPU metric, for example. Right now, you probably
16:15 Audience Question
16:49 only have, CPU being measured on a per process basis. Maybe maybe if you're talking about Kubernetes, maybe on a per container basis or on a per per pod basis. And in this this case, we would actually be getting data down to the line number and not just to the process, but also with process metadata. And we'll see in a second just how similar the experience actually is to to Prometheus because the entire, like, labeling strategy of of Parca is very intentionally based on what Prometheus does. And, hopefully, it'll all feel natural to to someone who's already used Prometheus before.
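The description of a profile as "a collection of many stack traces with a number attached," sampled about a hundred times per second, can be sketched roughly as follows. This is an illustrative model only, not Parca's storage format or agent code; the function names (`build_profile`, `cpu_seconds`) and the sample stacks are hypothetical, and only the 100 Hz rate comes from the conversation.

```python
from collections import Counter

SAMPLE_HZ = 100  # samples per second, the rate mentioned in the conversation


def build_profile(samples):
    """Count how often each stack trace was observed (hypothetical helper)."""
    return Counter(tuple(stack) for stack in samples)


def cpu_seconds(profile, hz=SAMPLE_HZ):
    """Translate sample counts into estimated CPU seconds per stack."""
    return {stack: count / hz for stack, count in profile.items()}


# Made-up samples: each one is a call stack, root frame first.
samples = (
    [["main", "handle_request", "parse_json"]] * 30
    + [["main", "handle_request", "query_db"]] * 60
    + [["main", "gc"]] * 10
)

profile = build_profile(samples)
seconds = cpu_seconds(profile)

for stack, secs in sorted(seconds.items(), key=lambda kv: -kv[1]):
    print(" > ".join(stack), f"{secs:.2f}s")
```

Each distinct stack keeps only a counter, which is part of why sampling stays cheap: the cost grows with the sampling rate, not with how much work the program does between samples.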
17:34 Alright. Let's get straight over to the screen share and get Parca deployed then. Here we go. So this is parca.dev. And we have the website, we've got some basic instructions. We have prepared upfront a Civo K3s cluster. This has nothing installed. It wasn't installed out the box by Civo, so we're starting completely afresh. We'll jump back over here. So, I mean, I could ask what the next step is, but I'm going to assume it is to install Parca. Yes. Yeah. Step one's always the easy one. After that, I get a little bit I'm not as
17:37 Hands-on Demo: Installing Parca on Kubernetes
18:14 confident with my predictions. But so that's just let me zoom in on this actually. Just because there are some really simple instructions here for Kubernetes. We're not using minikube. I can skip this. We're deploying Parca to its own namespace. I assume that's the preferred approach. And then we deploy the two components that you mentioned just a few moments ago, the Parca server and a Parca agent. I'm assuming this just runs as a DaemonSet. This is a Deployment. And by magic, the thing just starts working. Yeah. What I am curious about is, like, when I deploy these, has the profiling started
18:45 Parca Agent
18:48 already? Yeah. So that's a great question. The Parca agent, as we said previously, is the thing that does the profiling, and it actually starts to automatically profile absolutely every container in your Kubernetes cluster. It kind of discovers everything that's there, and it uses eBPF to attach an eBPF program to each of those containers. And that's essentially how the entire construct of the profilers is built. You said the magic words. Like, I can't believe we're nineteen minutes in and eBPF has just popped up. Yeah. We're seeing such a crazy adoption of that right now. It
19:30 just seems to be solving a lot of these observability and performance. Not problems, but, you know, questions that people have. It it it's it's funny because when we when we actually founded Polar Signals, we didn't really want to concern ourselves with the collection side of the problem. We felt like the the majority of the value was in storing and querying this data in a useful way. But we pretty quickly realized that we were gonna run into kind of two problems. The first one was the overhead problem, which we kind of already talked about, but eBPF kind of adds an additional benefit
20:10 here because everything runs in kernel. And because we can collect exactly the amount of data that we need and in exactly the format that we want it in, it allows us to do the profiling at actually an order of magnitude less overhead than, quote, unquote, traditional profiling techniques. And there are a couple of things that maybe we can get into later that kind of make up this strategy. But, yeah, eBPF has definitely been a really amazing fit. And just the nature of how eBPF works makes it really simple for us to attach these profilers to every container on a host
20:54 because eBPF is literally able to do anything or inspect anything on the host. Right? Yeah. We got that question from Moz, Ray, and just as we started talking about eBPF. But Moz was curious if it used eBPF, and I hope that we've answered that for you there, Moz. Yeah. Sweet. So we've deployed the agent. We've deployed the the daemon set, the no. The server and the daemon set, which is the agent. Like, what is the the way that we interact with Parca? Should I go back to the docs or should I just start guessing? Like
21:25 Port Forward
21:28 I mean, you can hit the, like, Parca in Kubernetes tutorial five minutes button, and you'll get the entire, basically, manuscript of what we're doing here. But you can check out here, you see the, like, a port forward. We don't wanna take an opinion on how people are gonna expose this. You know, you can put an ingress here or Yeah. I don't know. Whatever your preferred way is. Right? So here, we're just telling people to do port forward. Alright. Well, we have a three node cluster.
22:05 Dark Mode
22:05 We have three agents. We've got the server. We could do a port forward to the service on 7070. And let's see. Hey. We have Parca. It worked. The first feature that's always the most important to people is, on the top right, you can see the little toggle for dark mode. It was actually one of the first feature requests that we got. So, yeah, definitely important. But no. It's always nice when you go to a blog, though. Right? And you see that little sun and moon in the top corner, and you're like, oh, good. I can
22:51 set my own preference. Like, I think as developers, we get a lot of pleasure from the simple things like that. So it's good to see. Yeah. So when when you when you start and this is actually something that we're actively trying to improve that you don't need to know how to use this just yet, that you immediately see some data. But for now, this is how how it works, and you start by selecting the type of profile that you wanna see. And right now, the the Parca agent only supports CPU profiling. We're we're going to be adding more of
23:00 Exploring the Parca UI
23:26 this, but the entire Parca project is actually based around an open standard called pprof. This was a format that Google had originally created for representing profiling data, essentially. And anything that is pprof-formatted can be written to the Parca storage. And so in this case, we're only sending CPU samples, but there are integrations. For example, the Go runtime itself has support for various types of profiling, like memory profiling, goroutine profiling, and you can also ingest those profiles into Parca. In this case, we are only running the Parca agent, though. Okay. So I just select on CPU samples
24:17 here? Yep. And then You could already hit search at this point, and you'll start to see some data. Yeah. Yeah. Like a graph. Nice. So now we can already see some similarities to Prometheus here. If you hover over a dot, for example, you'll see that there are, like, labels attached to this series of profiles. So here we have the Parca container itself and the ID of it, its namespace, the node that it's running on, and the pod. And you can hover over each of them, and whenever you're closest to a
24:19 Viewing Profile Time Series Data
24:59 point, it'll tell you what container we're looking at. You can also while you hover over something, maybe it's a little small for the audience to see, but it says hold shift and click a label to add to query. So if you hold shift while you're hovering, you can now click one of these, let's say, the Parca namespace, for example. Then it filters everything down, and you can see that our query was updated in the query bar. And now we're only seeing processes from the Parca namespace. Cool. And so so far, we're only seeing
25:43 metrics, basically. Right? But we actually wanna see profiling data. So what we could do now is any of the dots that you were hovering over, once you click them, you'll see and you need to scroll down, then we'll see Oh, yeah. Flame graph. So this is showing me I should have paid more attention to what I was clicking on. I think I clicked on this. So this is the Parca container. So this is actually the Parca server container. Okay. And this is all the server itself. All the functions being called within the application?
26:25 Compare
26:25 Yes. Exactly. At that point in time. So At that point in time. The profiles are always taken over a ten second interval. And so within those ten seconds from the profile's timestamp, these are the things that were being executed during that time. And there's something that I've been incredibly excited about ever since we first implemented this. If you scroll up, you can see that there's a compare button next to the search button in the bar. Yeah. So what this now does is it should open
26:51 Comparing Profiles Feature
27:07 maybe oh, yeah. No. There we go. So what we can now do is we can compare two profiles at their points in time. So for example, on the left hand side, select one of the Parca profiles that, you know, isn't at the peak, Maybe from the green one. Yeah. That one. And then on the other one, select the one at the peak just after that or that peak. It doesn't really matter. And now you can see exactly what was worse in the comparison of these two profiles. In this case, the comparison was quite drastic,
27:48 so we're seeing a lot of red. Right? But if it's more nuanced, we can see exactly what was different. And, again, in this case, it's quite drastic, so all of the red is pretty dark, but it's actually shades of red that are telling you what got how much worse. So I'm curious. Like, I'm, you know, an application developer. I'm shipping multiple times per day. Would this compare feature allow me to compare a standard profile from one version against the profile in the new version I've shipped to see if I've got performance gains or
28:25 or maybe things have gotten worse since I went ahead? I'm glad you asked. For answering that question, let's take a step back and talk about one more thing that kind of makes continuous profiling so great. And for that, if you just hit profiles at the top, this kind of just resets our query. So if you select CPU profiles again and then maybe filter just down to the Parca container, if you want, you can hit search and then click the label, or you can do a label search. Yeah.
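The comparison just shown, where redder frames mean "got worse," boils down to subtracting one profile's per-stack values from another's. The sketch below assumes profiles are plain dictionaries from stack trace to sample count; `diff_profiles` and the stack names are hypothetical, and this is not a description of how Parca computes or renders its diff.

```python
def diff_profiles(before, after):
    """Return the per-stack change in sample count (after minus before)."""
    stacks = set(before) | set(after)
    return {s: after.get(s, 0) - before.get(s, 0) for s in stacks}


# Made-up profiles taken at two points in time; keys are stacks, root first.
before = {
    ("main", "serve", "encode"): 20,
    ("main", "serve", "compress"): 5,
}
after = {
    ("main", "serve", "encode"): 22,
    ("main", "serve", "compress"): 40,
    ("main", "gc"): 3,
}

delta = diff_profiles(before, after)
worst = max(delta, key=delta.get)  # the stack that regressed the most

for stack, change in sorted(delta.items(), key=lambda kv: -kv[1]):
    print(f"{change:+d}", " > ".join(stack))
```

A positive delta corresponds to a frame a diff view would shade red, and the size of the delta to how dark the shade is. Stacks that appear in only one profile are treated as zero in the other.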
28:32 Merging Profiles (Holistic View)
29:12 So now other than, compare and search, what we're seeing in the in the search bar, there's also merge. And what this does is it takes all the points in time that we've collected over the entire time of this process, and it compiles everything into a single report. And this is really, really cool because not only does it tell us where CPU time was spent at a single point in time, but actually it tells us statistically over the entire lifetime of this process where CPU was CPU time was being spent. And so the consequence of that is that once we
29:56 look at that report and we can we can optimize something in that holistic report of our entire process's lifetime, if we can optimize something from that, we will actually reduce CPU time of the entire process process's lifetime. Right? So that's how we'll actually do significant cost savings. K. So should I click the merge button? Yep. Hit it. So now now we we see all of the data of the entire time that the Parca process has been alive, and we can we can explore where that CPU time was being spent as opposed to just a single ten second profiling
30:25 Compare Versions
30:43 time. Cool. I like that. That's very interesting. And now now coming back to the question that you had Well, can I ask you a and I really Your your question was to to to come back to your question, which was, can I essentially compare two versions after I kind of deploy them? Right? Yeah. Yeah. Yeah. And the answer the answer to that is yes. And but not only a single point in time, but you can actually compare the entirety of two versions. So not just this this process at this time and this process at that time, but, no, you can actually
31:01 Comparing Different Application Versions
31:29 merge all of the data of a single version of a program and compare those two versions. So finally, we'll be able to answer down to the line number what was different. Why is this thing using more CPU now after I deployed it compared to previously? Or why did it get better? Right? All of these things that previously, as engineers, we were kind of guessing, Now we actually have the answers. Alright. Awesome. This is really cool. I think this is an absolute game changer. Now we've got one question from Russell in the chat, which we'll tackle in a minute. But
32:05 Flame Graph 101
32:12 I also wanna tackle something, like, just one one of the really simple things here that not everyone may be familiar with this graph and what this is what this is displaying and what it means and why things are are duplicated. Could you give us the the one zero one for this flame graph and what that means? Yeah. So as you already said, this visualization is called a flame graph. And the way that we read it is that from actually, this is called an icicle graph, which is the upside down version of the flame graph. The flame graph is built from
32:47 bottom up like a flame. Right? But the icicle graph is icicles hanging from the ceiling just like the one that we're seeing here. Today, love Yeah. I, you know, I I created the brand Polar Signals without even knowing this, and then afterwards was, like, like, a happy little accident. But, yeah, coming back to how to read this, essentially, the very top bar, the root describes all CPU time that we're looking at. And then every span that we're looking at underneath it is relative to that total. And so if there is a span in here that we can optimize,
33:31 it means that the width of it will be the effect that it will have on the entirety of our program. So let's say one of the larger spans that we see in the middle here, if we're able to optimize that, we can actually and you can hover over it. It tells you the percentage as well. If we can optimize that away in this case, it's 90%, but, right, the further we go down, the less it gets because it's kind of cumulative. But if we can optimize this one away, for example, we will be saving 36%
34:08 of our of our entire process. And the more we do this, the more effect we see on on our cloud bill. But there are actually a couple of other use cases that I think are also good to talk about when we when we think about profiling or continuous profiling. Obviously, cost savings is one that we've talked about several times by now, but I think there are kind of two more that a lot of people find very compelling. The the the kind of helping you with latency optimizations because latency tends to be we we tend to be
34:23 Other Use Cases: Latency & Incident Response
34:46 able to if we're this advanced, we may be able to detect where latency problems exist in our in our system with tracing data, but tracing data is actually very high level. Right? It doesn't tell us anything about the lines of code that we need to optimize. Yep. Combining that with profiling data is actually incredibly powerful, and we've worked with, like, ecommerce companies that are optimizing their latency through this type of data. And if you might know, ecommerce companies have this kind of target to have every interaction in their system be less than a hundred milliseconds because that
35:32 drives conversion. Because we humans like things to feel instant, and that means that we're more likely to purchase something on a website. So that's kind of in the in the grand scheme, that's use case number two, and it tends to be actually a much bigger motivator for companies to make more money than just to save money. Right? But go ahead. Would it be fair to say you know, we're talking about cloud native and microservices. You know? Hopefully, the people watching have a little bit of experience with distributed tracing. And like you said, it gives you the latency
36:08 response time from service to service communication. Is it fair to think of profiling as tracing at the function level, exposing individually within services what is actually happening on that call stack? Yeah. Yeah. That is entirely accurate. The difference is that it's not across services. It's only within a process. But we're actually working on a couple of correlation techniques so that we can tell you, for example, all of the CPU time that was spent with a particular trace, so that we can jump from a trace to profiling data. This is not something that works yet, but
36:47 we have a couple of strategies that we're exploring to to make something like that work. Right. But yeah. And then then the the last use case that I was talking about is kind of I I think we sort of touched on it already, but I think it can be generalized as incident response. Right? Because we have this data of what our processes are executing down to the line number, we can actually answer some of the questions that we as engineers have kind of or at least I myself have been asking myself ever since before I I had a tool like
37:26 this, which is, like, why was my process spending CPU time or, like, had a CPU spike here and not here? And almost always, as you said earlier, it's GC, garbage collection. But, yeah, sometimes it's more surprising than that. Or with memory. Right? Like, why did my process use more memory at this point in time versus this other point in time? Yes. Definitely. Alright. Let's tackle the questions we have in the chat, and then we'll jump back to our demo here. And we got a hello, just done with the Parca office hours from Mathias.
37:55 Does Parca profile everything
38:06 We got okay. Question from Russell: Did you say it profiles everything, i.e. the API server? And if so, does that work on cloud-hosted Kubernetes instances where the control plane is managed by the cloud vendor? So probably not for most of the cloud vendors that you're thinking of, because we can't get the Parca agent deployed wherever the API server runs. Basically, as long as we can get the Parca agent to be on the same host as something, it will be able to profile it. As a matter of fact, we're actually thinking
38:51 about a couple of strategies where we can expand the things that we're profiling to not just be the containers on a system, but actually every process on that system. Because I think it's kind of obvious once we say it, but there can be other interactions with processes on a host that are not running in containers. Right? Maybe they're systemd units, or maybe they're not running in cgroups at all. So, yeah, hopefully, that answers the question. Yeah. I think so. We also got another one from Mozz. Compare the CPU time consumed by a specific function
39:25 Can Parca compare
39:31 run on two different containers or different nodes. Yes. Absolutely. That's that's exactly what the compare functionality does. We're also working on being able to filter down the stack traces to not only compare by container or by process, but actually down to the stack trace so that you can say, I only wanna see data about this one function, and then you can compare it exactly the way that it was phrased in the question. Right now, you would compare two whole profiles, and you would need to find the specific function that you're looking for, but we're absolutely
40:09 working on this specific use case already to support it even better. Awesome. Thank you. And we'll tackle this last one that came in from. Do we only get CPU profiles there? So great question. So I think I mentioned it earlier very briefly, but the entire Parca project is based around the open standard pprof. And any pprof-formatted profile can be written to the Parca storage and can be visualized and analyzed in the same way as we're doing with this data. So as a matter of fact, the Parca agent also produces pprof compatible profiles
40:18 Audience Q&A (Other Profile Types, pprof)
40:53 and then sends those to the Parca storage. And so as long as you can produce profiling data in the pprof format, you can write it continuously to the Parca storage. So there are lots of profilers out there. As I said earlier, the Go runtime has support for memory profiling, for allocation profiling, for threads being created, for goroutines, for mutex contention, for a whole lot of things. And all of those can be written to the Parca storage and can be analyzed in the same way. So these are profiles that you plan to support in
41:35 Ingest data into Parca
41:37 Parca. They're just not quite there yet? So there are kind of two ways to ingest data today into Parca. The first one is pushing the data, and that is exactly what the Parca agent does. Whenever it has a profile ready, it just sends it to the Parca storage. And the Parca server actually also supports scraping profiles. You can think of it very similar to what Prometheus does, where Prometheus has its metrics endpoint, and it scrapes the metrics from that endpoint every fifteen seconds. Parca can actually do exactly the same thing
42:16 with pprof-formatted profiles. Ah. And so I think, actually, this might be a great time to have a look at that. What you can do is you could go back to the parca.dev website, and we can switch from the Kubernetes deployment to the binary because then we can just have Parca scrape itself because Parca is instrumented with a whole lot of profiling, and then we can have a look at those profiles as well. Yeah. I think that's quick. I don't wanna make too many assumptions, but I think it's quite normal in the Go ecosystem, right, to
42:51 Demo: Scraping from Go pprof Endpoints
42:56 have pprof embedded in your binary. Right? I remember when I was working on InfluxDB, we had the /debug/pprof endpoint, which exposed a whole bunch of information. And you're saying that Parca can just consume that right away if you tell it where it is? Exactly. Yep. Okay. Alright. So let's jump back over here. So if I go back to the start and the binary one Yep. Is this gonna work on an M1 Mac? That is a great question. Let's find out is the answer. I do not think so. I think it is. That looks like
43:42 it works. Amazing. Nice. Alright. Great. So now we get the basic configuration and just run Yeah. Parca. Do you mind if I just quickly take a look at this? Yeah. Go ahead. Alright. So It looks like a Prometheus config. Yeah. It literally reuses Prometheus code. Alright. So you just define the scrape configs. So I'm assuming these targets could be anywhere, you know, whether it's a container or a pod, an application that exposes pprof endpoints, and Parca just pulls them in. Right? Okay. And assuming I just go to localhost:7070 and Yep. You still have that open, so let's just
44:30 refresh. Is this my old one? I think maybe. Not seeing what I expected to see. Yeah. That that still looks like our Cboe cluster. Yeah. Yeah. This is too much data. We we that's definitely not But I'm not port forwarding anymore. I'm confused. Am I? No. Oh, I am. So did this Is this maybe a virtual machine of some sort? No. Uh-uh. Well, my port forward was still running, and so I actually would have expected an error message Yeah. I I I did as well. Oh, there we go. Okay. Yeah. Now it works. Yep. So now you can see these are
45:35 all the types of profiles that Parca scrapes from itself. Yeah. So I think, like, memory in use is a in use bytes is a nice one to look at. Search. Let's get let's give it a couple of seconds because it's not writing a whole lot of data. Right? Like, it's only about itself. Yeah. So what what was the do we define a scrape interval in that config? I wasn't Yes. Yeah. Okay. I don't have my history there. So that's just I can tell it to scrape profiles on a more regular cadence that kinda just just start up. Is that worth doing, or
46:10 Scoping
46:16 do you think we just wait ten seconds there? You can just hit search again, and there might already be Yeah. There we go. So now you could do the compare, and we could figure out exactly why this jump happened. Right? Yeah. And we can see Badger created a bunch of skip lists. Yep. New skip list. Well, I'm not sure why that there. This so you're scoping essentially the there this is essentially interacting with the with the flame graph. And once you found something that you're interested in, you can see on the left hand side, for example, there are a couple of
47:03 spans that are too small to read. You can click on one of those, and it'll expand, and then you'll see Ah. All of the rest of its child nodes. Okay. And, essentially, everything that's, like, faded out is the part that we're kind of zooming in on. Right? And from there on, the root essentially becomes whatever we clicked on. Okay. Like, this is a runtime.doInit function. Can I find out how many times this was called? Like, if I had maybe a recursive function that was going wild in
47:35 Runtime
47:44 my application, would Parca be able to show that in some way, or would I just see it called lots on the call graph? Yeah. So this depends a little bit on the profiler itself. Typically, profilers have some sort of limit on the stack depth. This tends to be 64 frames, something like that. So that's how often you would see it repeated, and then at some point, it would stop if it's, like Okay. Endlessly recursive. Yeah. Okay. So I wanna make sure that
48:25 Memory Usage
48:25 I understand this right. So we're using a different profile because it's been scraped over HTTP. We're looking at the memory in-use bytes. We can see the graph, which has shown us how much memory was allocated at two different points, which is great. We have these red things, which I'm assuming is telling me that these functions are consuming memory Yeah. On the stack. Mhmm. Alright. Okay. I mean, is there a size? Like, I can see we have a figure here. I know that we're using a 20 meg, but does it show me more specifically which one of these functions is the
48:57 bad actor here if that was, again, a memory leak or something else? Yeah. So you you always wanna look at the deepest span that doesn't have any children anymore. So we can actually tell here that new skip list is actually the thing that allocated the memory here. So if we optimize new skip list in this case, we're intentionally preallocating memory, so it's not actually a bad thing. But let's say it was. If we optimize new skip list to if we if it was truly possible to just not do this at all, we would actually save
49:36 70% of our memory almost, or 67.81% of our memory. Right? Okay. So yeah. And I think maybe this is a bug, or we need to update the version. But we see the cumulative and the difference number here. We know actually that this data that we're looking at is bytes, so it should be formatting this as, I wanna say this is, like, 87 megabytes or something, but maybe I'm wrong. No. I think that's right based on what we've seen here. 40 to a 20. So, yeah, you're pretty much spot on. So, yeah, it looks like in the
50:22 in the graph, it's working. Maybe we have some bug in the in the front end about the hover state. Okay. Don't mean to put you on the spot, but would you wanna give us, like, the TLDR for these different profiles and and maybe what they're useful for to people? Yeah. Absolutely. So these are these are very specific to Go. Right? These are profilers that are built directly into the Go runtime. So these are definitely going to differ for different languages, different run times, but I think they're still really useful and interesting to know. So block contention,
50:31 Explanation of Different Go Profile Types
51:02 basically, hopefully, the descriptions of each of the profiles underneath are descriptive enough. But, essentially, whenever we have a blocking primitive being called, that's when we record that stack trace so that we understand in what paths of our program is our program being stopped and can no longer do something. And in the Go case, this is actually particularly interesting because the Go runtime tries to move goroutines around to prevent goroutines, or threads, from actually blocking. So this is a really useful one to know if for some reason,
51:54 goroutines keep being scheduled onto different threads all of the time, which would create a performance problem. Goroutines being created, definitely a very easy mistake to make. And, certainly, in the Prometheus project, we've, like, fallen into that trap many times where we leak goroutines. We just keep creating them. They never end. And a goroutine means we create a new stack every time, which ends up costing memory, and then we'll eventually see it in both our goroutine profile as well as our memory in-use bytes profile that something is using
52:40 a bunch of memory, and it could end up being the goroutines being created. The next one is memory allocations. So we need to differentiate here. The one that we were looking at previously was the heap size. So how much memory is currently in use. Right? Like, when we look at top or something, how much memory are we using currently? And then allocations are how much memory has been allocated over the entire lifetime of our process. And, again, this is a really useful thing to understand because allocations
53:23 tend to be where we spend a lot of our CPU time. It's definitely a generalization, but a lot of the optimizations that we see are simply preallocating some memory instead of allocating it in small pieces, like, one after another. Then memory in use, we kind of looked at already, and I talked about just now as well. Yep. Mutex contention is definitely an interesting one as well. In case it's not well known, I think, like, outside of the Go community, maybe C++ engineers understand this, but, like, single threaded
54:09 environments like Node.js or often in Ruby, these things aren't used either, maybe in Python. But mutexes are essentially a synchronization primitive where we can say someone has the exclusive right to some piece of memory. And this is great in terms of writing safe code, but it can also lead to contention when multiple threads, or goroutines in the Go case, want to write to the same thing, and that means that they're blocking each other and, you know, creating latency potentially in the system or worse, maybe even deadlocks if it never gets
54:49 freed, the mutex. And then the last one that we see is CPU time, which is something that we already looked at earlier. So, literally, where was CPU time being spent in that process? Awesome. Thank you for that. You are a wealth of knowledge. Alright. So, Matthias, in the comments, has double checked the release for 0.7.1 and confirmed that the cumulative value formatting is indeed broken and will open an issue. There you go. Great. Alright. Is there anything else that you want to look at before we deploy the microservice demo thing? Is there anything we've missed from the from
55:20 Parca UI Feature: Query Autocomplete
55:26 the UI or functionality from Parca that you wanna show off before we do that? I think the only thing that we kind of just glanced over, but something that I was quite excited about because this was something that we didn't have in Prometheus for the very longest time, is autocomplete of the queries. So if you go back and click on profiles again, we can see that we have autocompletion for the label names. And, actually, behind all of this is an entire language of writing these queries, and it's very similar to the Prometheus
56:07 query language. But we kind of have a real parser in the front end here that takes the query apart and does suggestions based on where your cursor is. Right now, the suggestions only work for label names as well as, like, constants, and you'll see what I mean by constants. If you, let's say, choose instance and then equals or whatever, really, and then you do two quotes after the equals. And, yeah, now you can see that that comma was being suggested, right, because that is logically the next thing that should follow after
56:51 this. So, yeah, I was pretty excited about this primarily because I haven't written a parser since, like, my compilers class in university, I think. But, yeah, I think this is pretty cool. And, obviously, the next thing that we're building here is that we have label value completion as well. Yeah. That would be awesome. I just appreciated the autocomplete when we started clicking onto the boxes and stuff. Yeah. Being able to do that on the values themselves would be a really, really nice touch. Yep. Yep. It's in progress. A lot of the plumbing actually already exists.
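The cursor-aware suggestion behaviour described above can be sketched as a tiny state machine. This is a minimal illustration only, not Parca's actual front-end parser (which is not shown in the session). The function name `nextExpected` and the simplified grammar (only `name="value"` matchers, separated by commas inside braces) are assumptions made for the example.

```go
package main

import "fmt"

// nextExpected scans the query text to the left of the cursor and reports
// which kind of token would be valid next. Hypothetical helper with a
// deliberately simplified grammar: {name="value", name="value"}
func nextExpected(beforeCursor string) string {
	state := "label name" // what is expected at the current position
	inQuotes := false
	for _, r := range beforeCursor {
		switch {
		case inQuotes:
			// Inside a quoted value only the closing quote changes state.
			if r == '"' {
				inQuotes = false
				state = "comma or closing brace"
			}
		case r == '"':
			inQuotes = true
			state = "label value"
		case r == '=':
			state = "opening quote"
		case r == ',' || r == '{':
			state = "label name"
		case r == '}':
			state = "end of selector"
		}
	}
	return state
}

func main() {
	for _, q := range []string{`{`, `{instance=`, `{instance="`, `{instance=""`} {
		fmt.Printf("%-14q -> %s\n", q, nextExpected(q))
	}
}
```

With the cursor placed after `instance=""`, the sketch reports that a comma (or closing brace) comes next, which mirrors the comma suggestion seen in the demo. Suggesting actual label values would additionally need the list of values known to the server.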
57:27 Nice. Alright. So I prepared a small kind of microservice thing in Go that I I found on GitHub. So this is by the user Harlow, and it's just, like, for Go based microservices that display some hotels on a map. They use gRPC to do communication and I thought it'd be good to see profiles across that. Before I knew anything about Parca, in fact, well, I was even pronouncing Parca wrong, a whole hour ago, is that I didn't know the compare feature existed. And now I'm kicking myself because we missed an opportunity to actually use it as part
57:30 Microservices
58:07 of the demo that I wanted to show. But I'm thinking that we're trying to be brave here, which never works out well for me. So the only thing I added to this project was a whole bunch of YAML to do the deployment here. I really naively used the latest tag, but I'm hoping if I do an image list, we'll still have the content-addressable one that I pushed earlier. So I think I'm gonna update the YAML to deploy the old one before I added the bug for you to find, to see if
58:50 that can help us do Yeah. Okay. So I've injected like a little bit of CPU intensive code into one of the services and we wanna see if we can debug and discover that. So this is the fun bit though. The first time I built this, I built it without the platform argument. So this should be the one that won't work at all. Because it's an M1 Mac, it doesn't deploy to my Kubernetes cluster, which is not ARM64. So I think we need to deploy this one. How do I get the actual SHA from that? You might be able to check out the
59:26 registry where you pushed it to, and we might be able to find just enough of the SHA. I think if you click on the tag, there's, like, a history or something sometimes. Maybe that maybe only does that. Alright. Well, there must be a way to show the SHA here. Yeah. Digest. There we go. And we'll make sure this matches up. So the current one is 944, which is this one, which means this one here should be mine from before I broke the code. So if I go into my Kubernetes folder, replace latest with the SHA,
1:00:17 do you need the at symbol? I can't remember. Yes. You need the at symbol, and you actually need to write sha256 colon as well. Yeah. Okay. And we wanna do that in star dot YAML, I guess. I'm failing at how to sed now. Dash i, s, search, replace. Yeah. It should just be filename. Right? Let's just do front end on its own. I think there might be a thing where, you know, maybe the at symbol or the colon does something with the regex. I'm never too
1:01:03 sure either. No. Not in the replacement side. If it was on the matching side, definitely. I see. Yeah. Yeah. You're right. But there, I think it's alright. So this is just me not remembering that which just put the failure. So -i is inline. Maybe we need the -e. There's the -i -e. There we go. Welcome to the sed beginner's guide for idiots. I'm the idiot. But now we have the colon for the app. Alright. I'm gonna open it if you ask. Good. Okay. So we just need that. Right? Yeah. Yeah. I don't need to do
1:01:55 sed failure. Alright. At this profile service. At this. I'm waiting for someone in the chat to tell me what my mistake was. Sometimes it's just quicker to do it the boring way. So let's do a kubectl apply star. What? Okay. Dot. There we go. No. I can't even remember kubectl. Okay. Maybe I got that wrong. Well, it works for front end. Oh, no. Because I alright. See, I told you whenever anything goes wrong, just blame me. Always me. I forgot the 256 on the other ones. I see. Alright. I wish I had just deployed that straight
1:03:03 up now. But if this works and we see the difference, I'll be very excited and apply. Okay. That is going to work. Yeah. We can already see them starting to run. Yeah. Nice. Now because we have the server and the agent already deployed, if I do a port forward, we should be able to see k. Port. There we go. I'm not that the front end. Well, I guess we probably no. We don't need traffic yet. We can just Parca. We could just go to here and we should see data already. Wonder if I need to shut down the
1:03:51 other one. Cool. Looks good. So let's take a look at the default namespace. I got a new keyboard and I can't type. Cool. Is that right? Yeah. Yeah. That looks right. Yeah. So we can see Jaeger and front end. I expected to see a few more containers. Now it could just be that this oh, no. I'm on the wrong cluster. There we go. There we go. So it could just be the sampling hasn't kicked in for the other ones which started a little bit late. So if we hit search again, we get more colors.
1:04:42 Yeah. Cool. This application is nice and simple. We could browse to local host on port 8080. It could also be so if if there's absolutely no CPU time being spent, obviously, there's nothing to measure. So that could be what we're seeing as well. It depends on what these services do. Ah, okay. Well, every time I zoom around, this is the the back end of this application is talking to each other and checking and searching for new spaces. When I click on these, that's the profile service. So we should see in a little bit of time that we're
1:05:22 gonna get more information. However, I think if we just go straight ahead and deploy the broken one now and interact with it, we should see a, hopefully, spike in the CPU. Yep. So try and keep those two running just now and go back to here and undo. Cool. I dash f go. Is that the wrong cluster again? Yeah. I swear I know what I'm doing. I promise. Alright. So we got some new containers coming up. We'll let that switch over. We'll let Parca do it saying, get some profiles and and then we'll see if you can just
1:06:35 guide me through the process of debugging which service has given us some problems. And hopefully, hopefully it works. Cool. Alright. So we got a comment from Matthias. This is where the fun begins. I certainly wasn't finding my sed skills particularly fun, but hopefully, this works. Moz is telling me to use the hash symbol to escape the at char. I still don't think we have to escape the at char because it was in the replacement, but then I've just shown them my sed skills are not exactly up to date. So we'll see. It might be that
1:07:10 at can be used for selecting a grouping or something of the first Yeah. I mean, you could be right. I can't remember. There are a lot of different I know dollar sign allows me to pull things from the match, but, yeah, you could be right. Okay. Let's click search. Okay, we're getting more data. So I need to port forward again. So this is the new version. If I zoom in a few times, click a few buttons, and we I guess we just cannot oh, there we go. That may be related potentially.
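The image-pinning step that caused the sed trouble above (swapping the `latest` tag for an `@sha256:` digest in the manifests) can also be sketched with plain string replacement, which sidesteps the escaping question because no regular expression is involved. This is an illustrative sketch only: the helper name `pinImage`, the image name `example.com/demo/profile`, and the shortened digest are placeholders, not values from the session.

```go
package main

import (
	"fmt"
	"strings"
)

// pinImage rewrites "<image>:latest" to "<image>@sha256:<hexDigest>".
// It uses literal string replacement, so "@" and ":" need no escaping.
// Hypothetical helper for illustration.
func pinImage(manifest, image, hexDigest string) string {
	return strings.ReplaceAll(manifest, image+":latest", image+"@sha256:"+hexDigest)
}

func main() {
	// Placeholder manifest fragment; a real digest is 64 hex characters.
	manifest := "containers:\n- name: profile\n  image: example.com/demo/profile:latest\n"
	fmt.Print(pinImage(manifest, "example.com/demo/profile", "0123abcd"))
}
```

Pinning by digest rather than by tag is what makes it possible to redeploy the exact image that was pushed before the bug was added, even though both builds were tagged `latest`.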
1:07:37 Debugging Microservice Profiles (Symbolization Issue)
1:08:05 I think we should get a refresh every ten seconds button. Is that is that a good feature request? Yeah. Actually, it's already in progress. Nice. Okay. So we potentially have enough information to debug here. Do you wanna kinda guide us what you would do if you were actually trying to work out what was wrong in this system? So in in this case, I wanna say we're probably like, we see this spike that we didn't see before. Right? And I I think I saw that the profile container is what was the contender here. Mhmm. So let's
1:08:43 filter down by that container. So if you hit shift again, you can click that label again. Yep. So now, basically, you can do you can go to your compare view. This is and I just wanna point this out because if this is are these colors to show different deployments of the same container? Yes. Exactly. Sweet. So I can actually already see exactly when that's swapped over and then the contention happened on the CPU. So that's cool. Yeah. And if if there was a particular label that also kind of distinguished the two deployments, in this case, we don't really have that.
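What a comparison between two profiles boils down to can be sketched as a per-function difference of cumulative values. This is a simplified assumption for illustration (a flat map from function name to value), not a description of how Parca implements its compare view; the function names and numbers below are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// diffProfiles returns, per function, the change in cumulative value
// (for example CPU samples) between a baseline and a comparison profile.
func diffProfiles(before, after map[string]int64) map[string]int64 {
	diff := make(map[string]int64)
	for fn, v := range after {
		diff[fn] = v - before[fn] // missing in baseline counts as 0
	}
	for fn, v := range before {
		if _, ok := after[fn]; !ok {
			diff[fn] = -v // function disappeared entirely
		}
	}
	return diff
}

func main() {
	// Illustrative numbers only.
	before := map[string]int64{"main.handle": 40, "runtime.gcBgMarkWorker": 10}
	after := map[string]int64{"main.handle": 45, "main.busyLoop": 120, "runtime.gcBgMarkWorker": 10}

	diff := diffProfiles(before, after)
	names := make([]string, 0, len(diff))
	for fn := range diff {
		names = append(names, fn)
	}
	// Largest increase first, so the likely regression is at the top.
	sort.Slice(names, func(i, j int) bool { return diff[names[i]] > diff[names[j]] })
	for _, fn := range names {
		fmt.Printf("%+5d  %s\n", diff[fn], fn)
	}
}
```

Selecting one point in time before the deployment and one after it, as done in the demo, supplies the two inputs; a function that only shows up on the "after" side stands out with a large positive difference.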
1:09:28 We have the container ID, which we could, you know, abuse for the same purpose. But, ideally, we would have a or something that is means something to us as the developer. Then we could, once we go into our compare view, we could actually filter down to to seeing only those, points of data. But in this case, the the the just selecting a point in time will do the same will have the same effect. So what you can do is you can, on the left hand side, select one of the profiles, maybe the one that has a couple of
1:10:13 I'm just I'm just trying to see how back I can get the graph. Okay. Give that ten seconds or something. Sorry. What would you like me to click on there? On the on the left hand side, let's click where wherever that what that spike was in the on the blue line, and then we can compare that to a spike on the the right hand side. Alright. We'll use what we have just now. Interesting. Okay. So in this case, we're actually seeing something interesting. We can pause here for a second. We're seeing a bunch of memory addresses. Right? And
1:10:48 Understanding Symbolization & Debugging Info
1:10:53 what we've just caught here is that Parca hasn't actually been able to keep up with all of the new data that we've been sending to it, and it does a process called symbolization. And one one of the employees at Polar Signals came out. He just wrote a fantastic blog post about what's happening deep down in in in this process. But, essentially, the way that you can think of symbolization on a very high level is all of these memory addresses is what we actually captured from eBPF. Right? So eBPF or the kernel really only sees a stack being built.
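At a very high level, turning captured addresses into function names can be sketched as a lookup in a sorted symbol table. This is a minimal sketch under stated assumptions: the `symbol` type, the `symbolize` helper, and the addresses are hypothetical, and the sketch ignores function end addresses, load offsets, and inlining, all of which a real symbolizer has to handle.

```go
package main

import (
	"fmt"
	"sort"
)

// symbol is one entry of a hypothetical symbol table: the function name
// and the address at which its code starts.
type symbol struct {
	start uint64
	name  string
}

// symbolize returns the name of the function containing addr, given symbols
// sorted by start address. Addresses below the first symbol are returned
// as hex, which is what an unsymbolized frame looks like in the UI.
func symbolize(table []symbol, addr uint64) string {
	// Find the first symbol that starts after addr; the one before it
	// is the function that contains addr.
	i := sort.Search(len(table), func(i int) bool { return table[i].start > addr })
	if i == 0 {
		return fmt.Sprintf("0x%x", addr)
	}
	return table[i-1].name
}

func main() {
	table := []symbol{
		{0x1000, "main.main"},
		{0x1040, "main.busyLoop"},
		{0x1100, "runtime.mallocgc"},
	}
	for _, addr := range []uint64{0x0fff, 0x1010, 0x1048, 0x1100} {
		fmt.Printf("0x%04x -> %s\n", addr, symbolize(table, addr))
	}
}
```

If the table is empty, for instance because the binary was stripped, every frame stays a raw address.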
1:11:35 And with eBPF, we're reading that entire stack. But we're from from that kind of perspective, we're not actually reading the function names or something. We're only reading the offsets of that code in our binary, and that's basically what we're looking at here. And so, hopefully, once we refresh, Parca has had enough time to to get to to these memory addresses that have not been symbolized yet and can put some meaningful names for us humans behind that. Okay. Looks like something's wrong. I I I can't can't say yet what. Maybe we can jump over to not doing the comparison,
1:12:20 and we can see if we can find any of the if we can find any any symbols for the new container. Okay. So if I just do this That's just the Parca one, I think. Quite hard. Yeah. You you you probably wanna filter it down to Name. The default namespace. Yeah. Looks like looks like something about symbolization is wrong. Maybe we can check out the the logs of the Parca server to figure out what's going on. Yeah. Some things appear to be working. We could see that the kernel space was actually correctly symbolized. If you I scroll down to here. We
1:13:18 Kernel vs User Space Profiling (eBPF Benefits)
1:13:19 can Yeah. We can see this is actually something that also is incredibly cool about using eBPF for this. If we're using a user space profiler only like the ones that are built into the Go runtime, we would never be able to have this depth of detail. Because from user space, we would never be able to tell the kernel stack that we we're currently in. But eBPF gives us that kind of insight, which I think is really, really exciting because sometimes it is things that are that far deep in in our in our infrastructure that we need to understand
1:13:57 that there's some sort of maybe you know, one of the most expensive things on Linux machines is this context switch from user space to kernel space or vice versa. And if we can detect that this is happening a lot, then that's also incredibly valuable performance information. So we were very excited when we got that kind of detail of knowledge for the first time. Okay. There's no logs on the Parca server. Do we wanna look at the agent, or do we wanna get Yeah. Yeah. We could have a look at at the agent. Is this gonna work?
1:14:39 I think it's the, like, canonical Kubernetes app label, so it should be app.kubernetes.io. Okay. Alright. I'll just do a bot. No. Alright. I don't think we're gonna get anything from there. May as well do the last ones since we've done two and three. Yep. Okay. So what happens here then is when we do the search, we're trying to look at this, and what we get by default is this object notation for something that happened, and then Parca has to resolve that to a name or a symbol, and that process isn't happening here. Yeah. So this is something that happens asynchronously
1:15:29 in the background. When we upload the profiling data, actually all we're uploading are these memory addresses, and then Parca tries to symbolize them with all of the new data that we're seeing. And it seems like, for some reason, this is stuck or it doesn't have the appropriate information that it needs for this. Again, I think in this format we can talk probably for six hours about how the symbolization stuff works. Here, it appears to work. Not sure why. It could be the container image I've built. This is the Jaeger one, and the
1:16:05 symbols are definitely coming through okay. But if we go to the container images that I built, now is that is there a way when I do a go build that I am disabling? And they say There there are there are ways where you could potentially disable something like this, and that that's very closely related to how how this process works. So on a on a on a high level, if we think about, what our binaries are made up of, they're they're in a format called called ELF. And an an ELF binary has various sections that describe parts of the binary, basically.
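Whether a binary still carries the sections that symbolization relies on can be checked with Go's standard `debug/elf` package. The sketch below is illustrative: the `missingSections` helper and the list of section names are assumptions for the example (section names can vary by toolchain), and it says nothing about which sections Parca itself reads.

```go
package main

import (
	"debug/elf"
	"fmt"
	"os"
)

// debugSections lists sections a symbolizer would typically look for.
// The exact set is an assumption for this sketch.
var debugSections = []string{".symtab", ".debug_info", ".debug_line"}

// missingSections reports which of the wanted sections are absent. It takes
// a lookup function so the logic can be exercised without a real file.
func missingSections(has func(name string) bool, wanted []string) []string {
	var missing []string
	for _, name := range wanted {
		if !has(name) {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: elfcheck <path-to-elf-binary>")
		return
	}
	f, err := elf.Open(os.Args[1])
	if err != nil {
		fmt.Println("not a readable ELF file:", err)
		return
	}
	defer f.Close()

	// f.Section returns nil when no section with that name exists.
	has := func(name string) bool { return f.Section(name) != nil }
	fmt.Println("missing sections:", missingSections(has, debugSections))
}
```

Running it against a binary built with and without stripping flags is a quick way to see the difference; for the Go linker, `-s` omits the symbol table and debug information and `-w` omits the DWARF symbol table.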
1:16:14 DWARF Debugging Information & Compile Flags
1:16:45 Obviously, parts of it are executable code, but there are various other sections, and one of them contains something that's called DWARF debugging information. And this section, sometimes people strip out of their binaries in order to save space in the size of the binary because the executable code will still execute fine without DWARF information, but you may be missing out on the ability to debug. And there are various case studies that all of the hyperscalers essentially have done, and they always ended up coming to the conclusion that it's not worth
1:17:34 throwing away this information because, yeah, being able to debug is just always going to be worth it. I see you did add a couple of compiler flags, so this might be the culprit, but I don't know off the top of my head what dash s and dash w do. Yeah. These just came with the Dockerfile for the project. In fact, if I just go here, may as well just do the spoiler now and show you. I haven't modified the Dockerfile. I don't know what those flags do either. They could very well, that dash s being a
1:18:07 strip could be what's happening there. But I just decided it's a logical routine that that a nice big blip. So that's why I've ever seen that thing. Yeah. Oh, wow. Nice. So I think we've seen a lot from Parca. I think we can all see the value that it brings to our infrastructure. I definitely do believe that if you have a Kubernetes cluster in production, that Parca should be part of that stack because you just don't want to lose the type of information that we're getting here, especially when things go wrong and you need as much information as possible.
1:18:21 Summary of Parca's Value
1:18:41 And as you're deploying in a cloud native way frequently, multiple times per day, that compare feature is just golden. I had no idea we were gonna see that, and I'm so impressed and so excited to get that running for my own customers now. So thank you for that. Yeah. My pleasure. I think, maybe Matthias remembers this, he's in the comments, the first time we implemented this, we were so mesmerized by it. I think we spent, like, an hour just clicking around understanding just Parca itself. You know? Like, what
1:19:14 was the memory difference here and why? Why was there a CPU spike? And just, like, understanding that depth down to the line number is just something we've never seen before. Yeah. So, yeah, I share the enthusiasm. Sweet. Matthias is in the comments saying that dash s is probably the culprit. So I'm sure we'll do more demos, and if you return, we can take a look at that. I'm curious. You know, we'll finish up now in the next few minutes, but kinda what your thoughts are on what's coming
1:19:44 The Future of Parca (Storage & Scaling)
1:19:46 next for Parca, what it takes for Parca to reach this next milestone, whether that be, like, a 1.0 or something else, and other crazy ideas that you've maybe got. So let's start with what's next for Parca. Yeah. So I think the most important thing for the community that we're working on with Parca is actually on the storage side. As you can imagine, we're dealing with a whole lot of data, and it's not easy to deal with that amount of data, making it queryable as dynamically as we do with Parca.
1:20:29 And, of course, the larger the scale, the harder it gets. And so we're working on kind of a new storage iteration that's going to allow us to scale this to the size that we intend at the query latency that we want. So as we saw, the demo worked great, but once we go into the hundreds or thousands of nodes, with thousands of containers, it gets more difficult to query that data efficiently. And so we're actually building an entirely custom columnar store just to solve this problem.
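As a rough illustration of what "columnar" means here, the sketch below stores the same profiling samples once as rows and once as columns, then aggregates using only the columns the query needs. This is a generic toy in Go, not Parca's actual storage design; the sample and columns types, their fields, and the function names are all assumptions made up for the example.

```go
package main

import "fmt"

// sample is a hypothetical row-oriented record: one struct per stack sample.
type sample struct {
	Timestamp int64
	Container string
	Function  string
	Value     int64 // e.g. CPU time attributed to this sample
}

// columns holds the same data column by column: one slice per field.
type columns struct {
	Timestamp []int64
	Container []string
	Function  []string
	Value     []int64
}

// toColumns converts row-oriented samples into the columnar layout.
func toColumns(rows []sample) columns {
	var c columns
	for _, r := range rows {
		c.Timestamp = append(c.Timestamp, r.Timestamp)
		c.Container = append(c.Container, r.Container)
		c.Function = append(c.Function, r.Function)
		c.Value = append(c.Value, r.Value)
	}
	return c
}

// sumByFunction touches only the Function and Value columns, which is the
// property that makes columnar layouts attractive for aggregation queries.
func sumByFunction(c columns) map[string]int64 {
	totals := make(map[string]int64)
	for i, fn := range c.Function {
		totals[fn] += c.Value[i]
	}
	return totals
}

func main() {
	rows := []sample{
		{Timestamp: 1, Container: "api", Function: "main.handle", Value: 10},
		{Timestamp: 2, Container: "api", Function: "main.parse", Value: 4},
		{Timestamp: 3, Container: "worker", Function: "main.handle", Value: 6},
	}
	totals := sumByFunction(toColumns(rows))
	fmt.Println("main.handle:", totals["main.handle"])
	fmt.Println("main.parse:", totals["main.parse"])
}
```

A real columnar store adds encoding, compression, and indexing on top of this layout; the sketch only shows why a query over a few fields can skip the rest of the data.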
1:21:11 So that's something that I think is really important to the success of this project so that we can kind of scale the entire idea. So, yeah, that's something that I'm also personally very excited about, and the design is open. I think it might have only been discussed in the Parca office hours so far, but it's open. Nice. Do you see a future where Parca integrates with, you know, autoscalers on Kubernetes to be able to, you know, I can see, like, the vertical
1:21:45 Future Integrations (Autoscaling, PGO)
1:21:52 pod autoscaler being able to modify those limits based on what Parca sees, and for that to evolve over time across versions, even maybe to the point of hooking into remediation systems like Keptn and actually reverting a deploy based on bad behaviors or performance. Yeah. A hundred percent. So as I said, the hyperscalers have been doing methodologies like this for almost a decade. And, actually, eBPF is kind of the reason why this is accessible to us now. Hyperscalers have been maintaining their operating system versions themselves for a long time, and that's why they were able to do integrations like this at
1:22:33 a very low level that the rest of the world just wasn't able to do. So, like, eBPF was really kind of a godsend for this. But what I was trying to get to: Google has actually written a couple of papers where they described exactly what you just said, that they use this type of data for scheduling decisions in their orchestration system, Borg. And they do various other optimization techniques. One that I'm particularly excited about is something called profile-guided optimizations. And this is kind of a technique that has existed, I believe, in my
1:23:13 research, I found the first kind of mentions in the 1970s. And the premise is really simple: because we understand how our programs are going to be executed in real life, we can make opinionated compiler optimizations for exactly those situations. Optimizations that may not generally be a good idea, but because we know how this code is going to be executed, we can make opinionated optimizations. And the great news kind of is these things do already exist in all major compiler toolchains. And as a matter of
1:23:53 fact, the very next version of Go is going to ship with profile-guided optimization possibilities as well. So the great news for the Parca project is kind of that we don't even need to build these mechanisms into compilers. They already exist. We just need to be able to export this data in the format that these compilers expect. And then the hope that we have from all of this is that, just by giving your compiler some of this data that you're collecting, hopefully, anyways now, your programs just magically get faster, better, and use fewer resources.
1:24:32 Awesome. That is all extremely exciting, and I can't wait to follow along with the Parca journey and just see what cool things come out of this over the next six months and longer. So thank you so much for joining us and guiding us through this. It's been really cool. I think it's awesome to be able to see these new techniques and tools come out and, hopefully, make all of our lives that little bit easier. Any final words before we say goodbye for today? Just, my final words are: please try out Parca and
1:24:35 Conclusion & Call to Action
1:25:03 join us on our Discord server. If you have any issues or any questions, feedback, anything, we're more than happy to, you know, help you through it. We're always happy to have any kind of feedback and, you know, hope to improve Parca together with the community. Awesome. Well, thanks again for joining me, Frederic. Have a wonderful day, and I'll speak to you again soon. My pleasure. Thank you for having me.