About this video
What You'll Learn
- Understand how Linkerd 2 installs on Kubernetes and verifies control-plane health with check, dashboard, and CLI validation commands.
- Explore service traffic behavior in EmojiVoto and Books demos using linkerd stat, tap, and top to observe requests.
- Configure Service Profiles, fault injection, and mTLS defaults in Linkerd while applying retries, timeouts, and traffic splits.
Thomas Rampelberg from Buoyant walks through installing Linkerd 2 on Kubernetes, exploring the dashboard and CLI (tap, stat, top), running the EmojiVoto and Books demos, configuring service profiles for retries and timeouts, traffic split fault injection, mTLS, and multi-cluster.
Jump to a chapter
- 0:00 Holding screen
- 1:25 Introductions
- 1:26 Introduction & Sponsor Thanks
- 2:21 Introducing Linkerd & Guest (Thomas Rampelberg)
- 3:10 What is a service mesh?
- 3:14 What is a Service Mesh? (Responsibilities & History)
- 5:15 Microservices: A Human Problem
- 6:47 Demo Environment Setup (Pre-provisioned Clusters)
- 6:50 What are we working with?
- 7:30 Installing Linkerd
- 7:36 Linkerd Installation Process (CLI & Edge Version)
- 11:56 Verifying Linkerd Control Plane Installation
- 12:50 Linkerd dashboard
- 13:00 Exploring the Linkerd Web Dashboard
- 15:28 Introduction to Linkerd CLI Tools (Top)
- 15:40 Linkerd top
- 18:10 Deploying the demo app
- 20:40 Injecting the Linkerd sidecar
- 24:15 Stat command
- 28:20 Tap command
- 31:30 Fault injection / TrafficSplit / Canary Deploys
- 40:50 Time outs and retries
- 49:20 mTLS
- 55:30 Multi-cluster
- 1:09:00 Closing
- 1:18:15 Deploying the EmojiVoto Demo Application
- 1:19:13 Exploring the Demo App & Finding the Bug
- 1:20:41 Injecting Sidecars into the Demo App
- 1:22:46 Verifying Sidecar Injection (linkerd check --proxy)
- 1:24:20 Viewing Application Metrics (linkerd stat deploy)
- 1:28:26 Tapping Live Traffic (linkerd tap)
- 1:30:12 Exploring More Linkerd Features
- 1:31:58 Setting up Books App for Feature Demos
- 1:35:16 Creating a Faulty Backend Service
- 1:35:54 Configuring Fault Injection with Traffic Split (SMI)
- 1:40:53 Retries & Timeouts: Introduction to Service Profiles
- 1:41:42 Understanding Service Profiles (for Metrics & Policy)
- 1:42:38 Generating and Applying Service Profiles
- 1:45:58 Configuring Retries via Service Profiles
- 1:48:38 Configuring Timeouts (Similar Process)
- 1:49:20 Exploring MTLS (Mutual TLS)
- 1:49:52 MTLS is Enabled by Default in Linkerd
- 1:51:03 Validating MTLS (Check & Tap Commands)
- 1:52:19 In-depth MTLS Validation (Using tshark Debug Container)
- 1:55:25 Attempting Multi-Cluster Setup
- 1:56:01 Multi-Cluster Setup: Common Trust Anchor
- 1:59:30 Generating Trust Anchor Certificates
- 2:01:59 Installing Linkerd for Multi-Cluster
- 2:03:55 Linking Clusters (linkerd multicluster link)
- 2:05:46 Troubleshooting Multi-Cluster (Load Balancer Issue)
- 2:09:09 Multi-Cluster Demo Stopped (Network Limitation)
- 2:09:49 Conclusion & Thanks
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
1:26 Introduction & Sponsor Thanks
1:26 Hello and welcome to today's episode of Rawkode Live. I am Rawkode, your host, also known as David McKay, but all the good handles were taken, unfortunately. I wanna thank my employer, Equinix Metal. They afford me time to invest in producing these shows and getting, you know, experts around the world to come and introduce us to all of their fantastic software. So thank you, Equinix Metal. If you wanna try out Equinix Metal, you can use the code Rawkode live. This will get you $50 of compute time. You can splash that in a few hours with 400 gig of RAM and 96 cores
2:01 or more, or you can be a little bit more conservative and have that run for up to ten hours or a couple of days, depending on the machine and hardware that you use. So check that out. It's good to have fun. We also have a Discord channel if you want to continue this conversation or you're not watching this live and have questions. You can find that at Rawkodelive/chat. Now today we're gonna be taking a look at Linkerd, a cloud native service mesh by Buoyant. And we are very lucky that today we have Thomas, Thomas Rampelberg from Buoyant himself, who's gonna
2:21 Introducing Linkerd & Guest (Thomas Rampelberg)
2:35 guide us through our Linkerd journey. Hello, Thomas. How are you? I am doing fantastic. So excited to be here. Awesome. Do you wanna just tell us a little bit about what you do at Buoyant? Yeah. You bet. I'm a software engineer at Buoyant. Buoyant's kind of the creators of the service mesh. Linkerd version one was the first original one. We went through a pretty big rewrite based off of everything we've learned from version one, and we've got Linkerd two, which is the product that we're gonna be talking about today. And I do software engineering and help maintain
3:10 What is a service mesh?
3:12 that. Awesome. Now service mesh might be a new term to some of the people that we have watching. So can we maybe just break that down a little bit and say what are some of the responsibilities for a service mesh? Yeah. You bet. So let's talk a little bit about how it all got started. Back in the day at Twitter, they kinda started out as a monolith giant Ruby on Rails application. And as the organization scaled and folks got more and more frustrated with the fail whale, they needed to go split out into microservices. Once you start splitting out into microservices,
3:14 What is a Service Mesh? (Responsibilities & History)
3:49 there's these concerns that pretty much everybody needs to go and implement. You need to have a common observability stack. You need to have metrics that are the same across all the microservices, whatever language they're written in. You need to have reliability concerns: retries, timeouts, routing decisions, smart load balancing. And you need to have security, so mTLS. Those three things, at Twitter, they created a library called Finagle to implement, because they were writing all of their microservices in a single language, Scala. But most organizations who do microservices, one of the big wins there is that you get
4:29 to write in whatever language you want. And so, Linkerd, the first service mesh, was kind of created as a way to go and pull all of the we're gonna call them operational concerns, like security and reliability and, metrics out of developers' apps and into a proxy that the operators can go and deploy. And so Linkerd two basically does that. It is a sidecar proxy that gets deployed alongside your application and that goes and provides all of those benefits. The sidecar proxy is what we call a data plane, and then there's a control plane that goes and manages all of the service discovery
5:08 and policy enforcement. And then all of that runs on top of the Kubernetes platform itself. Wow. It does a lot. So that's awesome. I always think like when, you know, I'm talking to people about why they should be cloud native or why they should adopt micro services and you know, one of the common themes that comes up is people say that it simplifies the code that they write. Like they get to write smaller services that do very small things, but what they don't realize is that all they're doing is moving that complexity down to the infrastructure layer,
5:15 Microservices: A Human Problem
5:39 down to the network layer and that's why tools like Kubernetes and Linkerd are just so important when it comes to these new these new microservice architectures or cloud native architectures. You're totally getting me set up for a rant. Microservices in my opinion don't actually solve a technology problem. They solve a human problem. The whole purpose of microservices is to go and give teams inside organizations ownership of their own destiny. So it's not about, like, writing little tiny pieces of code because you're exactly right. That little tiny piece of code that runs as a single service may be easy, but
6:13 now you need to think about all the interconnections there. And if you go and write a thousand microservices, you've got a thousand x the complexity of just shoving it into one place. Yep. Exactly. I love that, the human problem line. I'm gonna steal that. I borrow it, of course. I credit you. Please do. I think more folks need to hear it. And it's a very important problem to solve. It it's very powerful. It's the problem, to your point, that all of the cloud native solutions solve is giving teams and developers control over their own destiny, but it's a human problem. Yeah. Definitely.
6:46 So, you know, this stream is all about getting hands-on and showing the technology and how it works. So let me cover, there we go, let me cover where we are. So with our streams I do very little up front, but of course we do need a few things in advance for this. We want to focus on the Linkerd aspects and nothing else. So in order for that, I have used Equinix Metal and the Cluster API and I provisioned us two Linkerd clusters, Linkerd one and Linkerd two, not very imaginative naming, but hopefully
6:50 What are we working with?
7:20 something I wouldn't forget. That's it. That's all I've done. Everything else we're gonna do live and we're gonna do it now. Now I also have the Linkerd homepage here and I'm assuming, like everyone always does, they tell me let's go to the docs. So what is step one for getting started with Linkerd on Kubernetes? Let's go hit that getting started link there and hopefully it will be self-explanatory. I knew you were gonna say that, but I always ask it. I always ask. Alright. So I'm not gonna run kubectl version. I'm pretty confident with that.
7:36 Linkerd Installation Process (CLI & Edge Version)
7:56 In order to... Actually, is your cluster 1.19 or 1.18? It's 1.19.3. Oh, I should've asked this before. We'll have you use the edge. The TL;DR there is that Kubernetes 1.19 has some interesting upgrades because of Golang, and certificates don't work quite the way that they should. So if you see that step one install command there, instead of doing run.linkerd.io/install, add a -edge to the end of the install, and we'll get you an edge version that will work. For those who are paying attention here, our
8:39 edges are unstable versions. But as of probably this week, we'll have a new stable version coming out based off of our unstable. So you'll be seeing the latest and greatest here. Well, we do like to live life on edge here, literally. So, oh, I need to change this. So you want me to add... Yep. -edge just on the URL? Nope. On the install. install-edge. Fix that. Yep. That's it. Okay. So let's see if I can run linkerd version. This is Cobra. So yep. Alright. edge-20.10.6. So I think we're over the first hurdle.
9:27 We're now running the edge one. Okay. So now it's recommending that I run linkerd check --pre. So what does this do? What's it looking for? So we're just gonna validate that your cluster is all ready to get Linkerd installed on it. A lot of that's going to be RBAC and making sure that you've not got any crazy settings on your cluster. We feel pretty strongly about our check infrastructure. As I'm sure you've probably noticed, Kubernetes clusters tend to be a little bit of snowflakes. Everyone is different. Even though we say that it's all cattle,
10:08 it's not. Clusters are pets. And so we like to make sure that folks know that Linkerd can get running on their cluster right away. So if that comes back and tells us that our cluster is not applicable, we're just packing up and going to the pub. Right? I could go for the pub. Looks like we're getting ticks. I'm sorry, mate. We're stuck here. Okay. So status checks are all green. So this is just querying the Kubernetes API, checking the Kubernetes versions, and making sure that RBAC permissions are in place so that it can
10:42 create everything it needs. Yep. And I see NET_ADMIN and NET_RAW, which I guess makes sense for a service mesh. We've got some networking stuff that has to happen. Yep. Okay. So then we go through the linkerd install and we're just gonna apply that straight to Kubernetes, which is going to install, I'm assuming, our Linkerd control plane and data plane? Just the control plane, though. The control plane uses the data plane. Okay. And then that command, linkerd install, just outputs YAML. So if you so felt like auditing or taking a look at that, you could.
11:17 And this is kind of meant for kicking the tires and getting started. As you asked before the stream got started, there is a Helm chart that you can use. But like most things Helm, it takes a while to go figure out what settings you want and how to configure it. So we kinda recommend getting started with this and then figuring out where you want to go for a production install. Yeah. I can see there's obviously quite a lot going on here with this kind of quick start approach. Grafana, there's proxy injectors, Linkerd web, so I'm assuming we've
11:49 got some sort of dashboard or UI. Yep. Lots of stuff for us to poke and play with, which is good. Alright. Let's see. So we now have this command. So this is no longer the pre-check, but it's now gonna make sure that what we've just applied works. Exactly. And we're gonna sit and wait while the containers come up. It's pretty frustrating to go and try and run through the getting started guide and not have everything ready yet. It'll be alright. I mean, it's good to see that stuff running. So, I mean, it's not exciting television,
11:56 Verifying Linkerd Control Plane Installation
12:26 but people have to see what happens here and you know what, I love it when it fails. Don't get me wrong, see when the docs are great and it works and we just we go through it, cool. I love intuitive docs, I love it when I don't have to think but at the same time when things break, it's also good to be able to dig in and see what's actually happening under the hood so. Totally. But we're out of luck even on our Linkerd Edge release here, we're still just getting this little green tick everywhere we go.
12:50 Linkerd dashboard
12:50 Alright. Let's keep going. I'm done. Right? Yep. You have a working install. So if I just run, I won't bother with the ampersand. This is gonna do some sort of port forward, I presume. Yep. It's opened in the wrong... That is actually literally just a wrapper around kubectl port-forward. We actually use that code. So... Alright. Okay. Yeah. It keeps it nice and simple for people with the Linkerd CLI. So I get that. Yep. So this is a dashboard. It's just showing me my... is this like a Kubernetes dashboard? I see that it showed me cron jobs and
13:00 Exploring the Linkerd Web Dashboard
13:32 daemon sets, deployments. I mean, is this Linkerd specific or what's going on here? Yeah. So this is Linkerd specific. Because the control plane dogfoods the data plane itself, if you go click on that Linkerd link there, you'll actually see the full health of what's happening in the control plane right now, including the service topology and all of the deployments. And so you've switched over to the Linkerd namespace now. You could go click on deployments. We keep it super close to the Kubernetes dashboard so that it's easy to understand what's going on, but it is Linkerd specific
14:08 functionality. And then, while this is super Linkerd specific, we also have Grafana links, if you notice that link right there, that you can go to for prebuilt dashboards. So if you wanted to put this up on a TV, something like that, you can do that as well. Alright. There's a lot then. Right? So this quick start installer has given me Linkerd running on my cluster as a service mesh. Yep. But it's also got a Grafana, which we've seen installed. I'm assuming there's then Prometheus being deployed also. That's how the metrics are getting from Linkerd into
14:48 the Grafana, and you're deploying all of these pre-canned dashboards. Yep. I guess, to give me confidence that Linkerd is doing what I expect it to do. Right? Exactly. I believe pretty strongly in letting folks validate what they just got done doing and making sure that it's all healthy. And so that's kinda how we get to this point here. Okay. I mean, this is pretty sweet. I'm liking this. So I'm assuming that by looking at this, I mean, there's not really anything running in my cluster yet. So... Yeah. Right. This is just Linkerd
15:21 sat idle, to a certain degree? Mhmm. Okay. So I'm just gonna guess that maybe there's a demo app. Oh, there is a demo app. So... Sure enough. No. I kinda skipped over a step there. I'm gonna assume it's maybe not important. This is just... That's pretty fun to take a look at. Pull it up, and we can chat a little bit about it. Okay. Can I close the port forward, or should I leave that open? Or... Nah. So we come with a tool... what is this? Top? This is top. So we come with a
15:40 Linkerd top
15:57 tool called top, and this is watching all of the communications in the cluster live and then showing you the top of what's happening in the cluster. And so you'll see each path, the source, the destination. And so this is a fun way to kinda take a look at what's going on inside the cluster. Here, you'll see that the Linkerd-installed Prometheus is going and scraping a whole bunch of deployments. Those IP addresses are actually kubelets going and doing readiness and liveness checks, as you can see by ready and ping. And so as the scrapes and the rest of that
16:34 happens, you'll see these go up, with success rate and all the data. So this is kind of a fun thing to pull up if you are doing a deployment or something like that. You can kinda watch the health of what's going on as the deployment goes out in real time. Awesome. So let me throw a few questions at you then. Because I'm curious, like, right now, what role Linkerd is playing in this idle state. Like, we're not injecting a sidecar into anything that is preexisting in my cluster. Is that
17:05 right? Correct. Your cluster is working the way it should be. Specifically, because the service mesh goes and becomes part of the data path and, let's be honest, service meshes aren't perfect software, stuff is gonna break. And we really, really encourage folks to do things incrementally. And so kind of where you're going with this, I think, is just add it to one workload and then slowly roll it out on your cluster. So we try pretty hard not to touch or modify anything until you are ready to opt into it. Yeah. I think you've read my mind to a certain
17:42 degree there. What I'm thinking is, you know, not a lot of people are in a position like this where they have this fresh cluster and they're deploying their workloads with the service mesh already preexisting. So, like, I can take a very heavy production-based cluster, add Linkerd to it with probably not too much of an overhead, and then slowly start to roll out those sidecars wherever it maybe makes sense, or wherever the priority for retries and all the other features comes in, and just... Exactly. Take my time with it. Right? Yep. Okay. So let's get this demo app deployed then.
18:10 Deploying the demo app
18:15 So EmojiVoto, and we're gonna curl and apply. And this is just a gRPC app that's got a couple of microservices in it, and we'll be able to walk through how to go and diagnose an issue, because we've actually got a bug in the application. Intentionally? Or... Intentionally. Yep. Yeah. Okay. It's pretty fun to go and see how easy it is to track down issues that would have taken you forever before Linkerd. We had one Linkerd user who rolled out a service, and it was an order of magnitude slower. No one could figure out what was going on, and they just
18:59 ran top. And sure enough, there was the endpoint that was going slow. And so that, you know, got them to the exact piece of code they needed to change within, you know, five or ten minutes. Awesome. Alright. So let's do the port forward and see what we're playing with then. Okay. Let's see. So... Yep. I get to vote on my favorite emoji. Yep. And the spoiler here is that the bug is the donut emoji. If you try and vote for that, it will give you, I think it's a 404. Yep. Yeah. Okay. But any other emoji, let's do, I mean,
19:50 everybody likes the monkey. There we go. Where's the... So we've got votes in here. So we've got a load generator that is going and voting pretty much randomly so that you can go and see the data that's going through instead of having to generate one or two requests yourself. Okay. That makes sense. Okay. So these are just gonna arbitrarily go up then as that load generator is applying the fake votes. Okay. So we've identified the bug in the donut. Right? Is that the next step? Are you gonna show me how we resolve this?
20:29 I think so. Let's go back to the docs. It should walk us through what's going on there. We don't need docs. We'll make... Ah, that's right. So we added EmojiVoto, but we haven't actually added Linkerd to it yet. So what you're using is the unmeshed service. And so the next step here is for us to add Linkerd in. And to do that, we're just going to get the YAML directly from the cluster. Obviously, this is a demo. I wouldn't do this in production or anything like that. And then we're gonna run the linkerd inject.
20:40 Injecting the Linkerd sidecar
21:01 And, yeah, why don't you just cut and paste that, because what this is gonna show off is the only thing that inject does by default is add that annotation right there, that linkerd.io/inject: enabled, which is then going to go and hit the proxy injector, and when the pods are created on the Kubernetes cluster, the proxy and the init container will get added on. Okay. So it's just that one annotation? It's not an NF analysis? Yep. And so you can add that yourself if you so felt like it. Inject will do a manual inject as well.
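As a rough illustration of what the default inject step described above amounts to, here is a minimal Python sketch that sets the annotation on a Deployment manifest held as a dictionary. The helper name `add_inject_annotation` and the trimmed sample manifest are hypothetical, made up for this example; the annotation key `linkerd.io/inject` with the value `enabled` is the one the proxy injector looks for. This is not how the `linkerd inject` command is implemented, only a picture of its effect on the YAML.

```python
import copy

INJECT_ANNOTATION = "linkerd.io/inject"


def add_inject_annotation(deployment: dict) -> dict:
    """Return a copy of a Deployment manifest with the inject annotation set
    on its pod template (hypothetical helper, for illustration only)."""
    patched = copy.deepcopy(deployment)
    template = patched.setdefault("spec", {}).setdefault("template", {})
    metadata = template.setdefault("metadata", {})
    annotations = metadata.setdefault("annotations", {})
    annotations[INJECT_ANNOTATION] = "enabled"
    return patched


# Hypothetical, heavily trimmed manifest for one of the demo deployments.
web = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web", "namespace": "emojivoto"},
    "spec": {"template": {"metadata": {"labels": {"app": "web-svc"}}}},
}

patched_web = add_inject_annotation(web)
print(patched_web["spec"]["template"]["metadata"]["annotations"])
```

The annotation goes on the pod template rather than on the Deployment's own metadata, because the injector acts when pods are created. The input manifest is copied rather than modified in place.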
21:37 If you're in an environment where mutating webhooks don't work for you, you can go and add it directly into the pod spec there. But for most folks, we recommend the proxy injector kinda process there. Alright. So we'll just go ahead and apply that then. And cool. So I injected the... okay. So there's four different... no. Yeah. Four different deployments and they've each been injected. So if I run get deploy. EmojiVoto namespace. emojivoto. And I'm just gonna describe a pod and grab the first one. Let me just see, the sidecar should be
22:32 there. Is that right? Yep. Linkerd proxy. Okay. So now, let's see traffic. If I do the... I'm gonna assume it's gonna tell me to port-forward again. Or not. It's telling me to... oh, we've got a check. Alright. Let's do that. Even more checks. So you've checked the control plane. Now let's check the data plane and make sure that that is phoning home and everything's working there, which it is now. So that's great. So these check commands, right, they obviously give me a little bit of confidence that the changes I'm making are doing what I expect. Do these have other applications in
23:08 production workloads to check that things are going on or is it really just for this this walk through? That's a really great question. They were kind of originally made for interactive, someone setting the cluster up and working around and doing it. But quite a few folks, I'm sorry. Let me fiddle with this for a second here. Quite a few folks use it in production as health checks as well. And so it can be used for both. Yeah. That was kind of my my my vibe is I ran that. It's like it would obviously be quite nice. But, I mean,
23:51 these check things obviously have a lot of logic to understand that Linkerd is healthy. So maybe I could continuously run them every half an hour or hour, make sure my cluster is Linkerd happy to a certain point. We actually have an open issue at the moment to add a JSON endpoint on the control plane to let you kind of automate this whole part of the process, which I'm pretty excited to see land. Okay. Cool. So what does this linkerd stat deploy command do? Run it, let's see what gets output. So that is the golden signals,
24:15 Stat command
24:28 of your deployments. The Google SRE handbook says that a golden signal is kinda the most important thing to pay attention to. If you've gotta monitor something, you should monitor the golden signals. And those are success rate, RPS, and latency. In particular, p99 is kind of the latency that you wanna measure. P99 means the ninety-ninth percentile of latency. So if you take the average, which is p50, what you're gonna do is end up missing quite a bit of latency. Your p99 is gonna be pretty sensitive, and so if users run into that,
25:06 you know, ninety-ninth percentile slow kind of latency, you'll know about it immediately. But as you can see here, we've actually got success rate issues because of our donut emoji. Yeah. I see that. The 83% success on voting and the 91 on the web. So this is our error. Now... And without Linkerd, it would kind of be up to a user giving you a phone call and saying, hey, this isn't working. What's going on? And because it's a user, they probably wouldn't tell you that it was the donut emoji or that the web app was
25:39 broken or, you know, all kinds of stuff. And this gives you a pretty easy way, again, to go look and immediately figure out what's going on there. Okay. That's nice. So I'm gonna throw something out there. I don't know if it's gonna come up in another part and I'm just being impatient. But right now I'm thinking about the load generator that the quick start has deployed here. Can we ramp that up? Does that come later? What do you mean? Like, it's generating load, but not a lot. Can we, like, have it... Oh, sure.
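To make the golden signals discussion a little more concrete, here is a small Python sketch that computes a success rate and latency percentiles from made-up samples. The function names, the sample data, and the rule that any non-5xx status counts as a success are all assumptions for illustration, not Linkerd's implementation. Strictly speaking, p50 is the median rather than the arithmetic mean, so the sketch prints both next to p99.

```python
def success_rate(statuses):
    """Fraction of responses that are not 5xx (assumed definition of success)."""
    if not statuses:
        raise ValueError("no responses")
    ok = sum(1 for status in statuses if status < 500)
    return ok / len(statuses)


def percentile(latencies_ms, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    # Integer form of ceil(pct * n / 100), clamped so the rank is at least 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Made-up sample: 98 fast requests and 2 very slow ones.
latencies_ms = [10] * 98 + [1000] * 2
# Made-up statuses: 83 successes and 17 server errors.
statuses = [200] * 83 + [500] * 17

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"success rate: {success_rate(statuses):.2f}")
print(f"mean: {mean_ms:.1f} ms, p50: {percentile(latencies_ms, 50)} ms, "
      f"p99: {percentile(latencies_ms, 99)} ms")
```

With this sample, the two slow requests barely move the mean and do not move p50 at all, while p99 lands on the slow value. That is the sense in which p99 is the more sensitive signal.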
26:09 I mean, we could scale that deployment up. Alright. Okay. Should we do that now, or will we stick to the script? We can do it now if you wanna take a look at that. I don't see any reason why not. I'm just curious. So if I do all pods... It's in vote-bot. So if you just do a, you know, scale replicas on vote-bot to... Okay. Okay. EmojiVoto. I just added a deployment. I can never remember the scale command. What is it? Scale, emojivoto, deploy, or deploy/vote-bot. vote-bot,
26:51 replicas equals. A dash dash replicas. Yeah. Yeah. Like I said, I never remember the scale command. I always end up just going and modifying the spec. But, alright. Let's see. Well, have we got one just now? Yep. Yeah. We've got one pod, two containers. How brave are we feeling? Let's do 10. And I'm gonna run, emojivoto, get pods. Let's see. So have I run our stat command? So we'll see the RPS go up here, but the stat command is based off of Prometheus. And so there's about a sixty second lag between when you do something and when you actually
27:38 see it. If you were to do a top here, you would see the rate go up pretty much immediately. What was that top... Just do a top deploy. And so you can see here that there's quite a few requests coming in now. Okay. Well, I can be patient and wait for our stat command to update. So... But there we go. We've got it up to almost 20 RPS now. Alright. Remind me before we finish to scale up to, like, a thousand or something, just for... Nothing good. Yeah. Okay. So we ran the top deploy
28:20 Tap command
28:20 and then we're going to run a linkerd tap. Mhmm. Let's... wrong copy. So tap is like Wireshark for your cluster. So this is the live requests as they're going through the service mesh. And the big use there would be that you can actually get all of the details that you wanted on what's failing pretty quickly. Tap will also do full headers, and so you could go and look into, you know, the actual headers that are going through for these requests, and go do even more debugging. Ah, okay. Cool. So we can actually see every single thing that's happening on the... Mhmm.
29:02 The network. Nice. Okay. Yep. And it's also available in the dashboard. So... Exactly. Yeah. We can probably pop open the dashboard in Grafana once we've... wonderful. We've done it. Yep. That's it. That's pretty much it. So how do we identify the donut problem? So you could do a tap and just grep for a status that's, like, 404. That might be a 500. Try a 500. I was wondering. Do we also have... no. I was gonna say... There it goes. We have to wait for a donut vote, but we have 500s coming through. Okay. Yep. And so,
29:50 you know, that pretty much tells you right there that those requests are failing both on the outside and the inside on the proxy. And you could then look at the logs here, which have the paths, and figure out what's happening there. Alright. Nice. So that's our demo application finished. Yep. So can we use this demo application to explore some of the other features that Linkerd brings to the table? We might need to install something different depending on which way we wanna go. We can go look at... let's go through the smorgasbord, and I'll let you pick your own
30:38 adventure here. We could do, take a look at tracing, which is kind of interesting. We could go look at retries and timeouts. Maybe the easiest thing to do would be a look at TLS and how that's been automatically set up. I don't think we'll need a new demo app for that. Some other fun stuff to look at would be retries and timeouts, or perhaps the most interesting one of all would be chatting a little bit about canary rollouts, why they're important, and how Linkerd helps you out there. But those are all kinda good. Yeah. There's there's pretty much we could just
31:21 go through every feature if we had the time. But let's cherry-pick them. Why don't we, because I think, you know, we have access to this Grafana and a dashboard setup, and I'd like us to be able to see how we can use that. So what if we start with something hopefully easy, I'm making some assumptions here. But what if we just do fault injection and start making things break and then see what happens in our dashboard? Sure. Okay. I don't know what happened there. Okay. So I should probably click on the feature. Is
31:30 Fault injection / TrafficSplit / Canary Deploys
31:52 that gonna take me to the docs? Yeah, and we've got a tutorial for that. Alright. But this is a different demo application that we'll need to install. That's okay, we can do that. So this is just gonna create a booksapp namespace and deploy the Books app. So let's just let that go. And if you scroll up, actually, you'll be able to see a picture of what the architecture looks like a little bit more. So there's the architecture for this. We've got a traffic generator. There's a web app front end. There's an
32:24 authors service and a books service. And so we will be injecting errors into the Books app calls themselves. This is kind of an interesting tutorial because it uses a primitive called traffic split, which is also how we do the canary rollouts. A traffic split basically allows you to weight traffic. And so the traffic coming out of any service to books will get split: 90% will go directly to books, and 10% will get sent to this fault injector that we're gonna spin up. Okay. So we're ticking off a few of the features then with this tutorial,
33:17 which is... Yep. Good. So we can probably skip the stats, right? Or should we just take a look at them? So you're gonna wanna do that command just above. The Books app also comes with a failure in it by default, and so let's patch it and make it so that it's not failing anymore, so we can see what the fault injection is actually doing. Right. Okay. So we're fixing it so that we can break it again, right? Exactly. Nice. So what we should see when we run the stat here, this is just giving us a little bit of confidence that
33:52 we're gonna get that success rate back up to 100%? Eventually. Eventually. Again, stat is based on Prometheus, and so it'll take a while for that all to fall off. Okay. Well, books and web app are going down, but authors went up. So maybe we just need to give that a little bit of time. Yep. Alright. Let's see what we're supposed to do next. So we wanna create a faulty back end. To do this, we're creating an NGINX config that returns 500s and deploying it. Yep. Right. Got it. So let's get this copied. It's very complicated.
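As a rough illustration of what such a faulty back end does, here is a minimal Python sketch. It is not the NGINX config from the tutorial (that config is not reproduced in this transcript); it is a stand-in that behaves the same way from the caller's point of view. The names error_injector, call, and serve, and the port number, are hypothetical and exist only for this example.

```python
from wsgiref.simple_server import make_server


def error_injector(environ, start_response):
    """WSGI app that answers every request, whatever the method or path, with a 500."""
    start_response("500 Internal Server Error", [("Content-Length", "0")])
    return [b""]


def call(app, method="GET", path="/"):
    """Invoke a WSGI app in-process and return (status line, body)."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status

    environ = {"REQUEST_METHOD": method, "PATH_INFO": path}
    body = b"".join(app(environ, start_response))
    return captured["status"], body


def serve(port=8080):
    """Serve the app over HTTP until interrupted (the port is an arbitrary choice)."""
    with make_server("127.0.0.1", port, error_injector) as httpd:
        httpd.serve_forever()
```

Calling call(error_injector, "GET", "/books.json") exercises the handler without opening a socket; serve() would expose it over HTTP. In the demo, a back end like this only becomes a fault injector once a traffic split routes a share of requests to it.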
34:37 And this is just going to always return a 500 for any request that comes through it. Yeah. So the NGINX config here, I mean, it's just saying always return a 500. Right. And that deployment and service are just there to deploy that NGINX config, right? I mean, there's nothing particularly interesting there. Right. But you could imagine making this NGINX config so that it returns malformed responses or error codes other than 500 or, you know, randomly returns responses. And so the testing that you could do with this kind of a
35:15 pattern is pretty extensive. It composes really well. Ah, okay. This traffic split is the Service Mesh Interface component that allows us to say: whenever someone requests the books service, we're going to send 90% of the requests to the working one and 10% through the error injector. Yep. I can read YAML. Okay. So the fact that this is an SMI spec just means that it should work on any service mesh. This is not Linkerd-specific at this moment in time. Is that right? As long as your service mesh implements this CRD,
35:54 yes. And let's see: Consul, Istio, and Traefik all implement that spec along with Linkerd, though sometimes it requires extra components. Like with Istio, you would need to install an extra component. Oh, and Open Service Mesh from Microsoft also implements traffic splits. Okay. I think we're being too nice here. So let's make things properly fail. So we'll just bump that up. Okay. So that means... let me go back to this instead of guessing all the time. Okay. What does this do? I'm not sure I'm understanding that from first sight.
36:47 So along with being able to look at the metrics on a per-service basis, we also allow you to look at them on a per-route basis. So if you tell us what routes you have, we will collect metrics on those. This is just kind of showing off that you can do that. We haven't told the control plane about what those routes are, so we won't see a ton. But this will give us the --to. So what this is saying on some level is: the web app deployment, when it sends requests to the books service.
37:23 Let's take a look at those metrics. Ah, okay. So the Linkerd CLI is providing a lot of really cool visibility features here into our applications. This one might be interesting. Why don't we do a stat on deploy web app? Oh, but that actually shows it right there. We can see that the books service has got a 50% success rate, and the web app has a 25% success rate. So that actually gives us pretty good details right there. So why did the routes command not give us anything there? I would need to go dig into
38:10 it. We're on the edge, and I'm wondering if there's a unique configuration issue that we need to look into there. Okay. I wonder if I just set the error rate too high or if I'm messing with it, but we'll move on. So yeah. Okay. Let me run that stat again. Okay. So yeah, we can see the 50%, give or take, error rate then within the books service. Mhmm. Excellent. Okay. I mean, so that was it. This traffic split was our fault injection. Mhmm. So the fault injection, that's not necessarily
38:52 something specifically that Linkerd is providing, like, you know, to mock HTTP responses. It's something where I can just configure my own use case, my own faults, and then split the traffic across that. There's quite a lot more flexibility there than Linkerd saying, oh, I'm gonna respond with a 400 for x percentage. Right. One of the big themes for us as a project is kind of doing composition, giving you the primitives that you need to go build higher-level concepts. I can write an NGINX config to do fault injection that does insane things,
39:28 and it's just a bunch of functionality we don't really ever want to implement in Linkerd because it adds complexity and makes it harder for everyone to use. Okay. With regards to this traffic split then, this is also how you would do the canary deploy, or is there something on top of that that would be provided? So that's the primitive that we provide. There's a CNCF project called Flagger that is absolutely fantastic that then goes and does the orchestration of the canary rollout, because a canary wouldn't just split the traffic ten ninety. It would split the traffic ten
40:04 ninety and then watch the metrics to make sure they're successful and then slowly increase the traffic split until it's 100% on the new version. Okay. And I'm assuming the back ends list on this traffic split, right now we're doing two, but there's no reason that it couldn't be more? Mhmm. Okay. Nice. Okay. I can see a lot of different use cases for that kind of feature then. Yep. We use traffic split pretty heavily on our multi-cluster stuff as well. Awesome. Hopefully, we have a little bit of time to take a look at that.
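The canary loop described above (split the traffic, watch the metrics, then gradually shift more weight) can be sketched as a small simulation. This is a toy model under stated assumptions, not Flagger's actual algorithm and not Linkerd code: the function names, the step size, the 99% threshold, and the simulated success probability are all made up for illustration.

```python
import random


def pick_backend(weights, rng):
    """Pick a backend name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def run_canary(canary_ok_prob, step=10, threshold=0.99, requests=500, seed=0):
    """Shift traffic to the canary step by step, rolling back if it looks unhealthy.

    Returns (outcome, history), where history lists each canary weight tried.
    """
    rng = random.Random(seed)
    canary_weight = step
    history = []
    while True:
        # The two weights play the role of the backends list in a traffic split.
        weights = {"stable": 100 - canary_weight, "canary": canary_weight}
        history.append(canary_weight)
        sent = ok = 0
        for _ in range(requests):
            if pick_backend(weights, rng) == "canary":
                sent += 1
                if rng.random() < canary_ok_prob:
                    ok += 1
        # Watch the metrics: abort as soon as the canary's success rate is too low.
        if sent and ok / sent < threshold:
            return "rolled back", history
        if canary_weight >= 100:
            return "promoted", history
        canary_weight = min(100, canary_weight + step)
```

A healthy canary, run_canary(1.0), walks the weight from 10 up to 100 and is promoted; a canary that always fails, run_canary(0.0), is rolled back at the first step. The fault injection demo is the same split with the weights held fixed and the "canary" replaced by a back end that always errors.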
40:42 Okay. So we've got this amazing cloud native application. We've got the ability to do canary deploys and traffic splitting, traffic shaping, whatever we wanna call that. What about the retries and the timeouts then? So we've got a tutorial for that. There should be a link at the top there if we wanna go and configure the retries. Ah, there we go. So this is... See if I can work this out. Ah, so if you see the link there right above retries, check out the retries section of the Books demo. So we've already got the Books app running
40:50 Time outs and retries
41:31 there. If you scroll up, what we're gonna need to do is add the service profile. So scroll down a little bit. Keep on going. It's a section titled service profile. Oh, wait. Sorry. Nope. Scroll up. It's literally that section right there. Okay. Okay. So there we go. So, the service profiles: I kind of hinted that you need to tell us what routes you have, and the reason for that is that cardinality is the cluster killer. The more cardinality you have, the bigger your Prometheus gets. The bigger your Prometheus gets, the slower your cluster gets. And so we need
42:19 you to tell us what routes you've got. We can consume Swagger or OpenAPI specifications and gRPC protobufs. We've also got a couple other commands to go and help you build out those service profiles. But why don't we grab the Swagger profile here for the Books app and generate a service profile for that and then apply it to the cluster? Okay. I'm just gonna take a look at what this outputs. Okay. So a service profile. Okay. It's just a list of the routes then. Yep. That are regexes. Yeah. That's about it. I always expect to, like, be bewildered by
43:06 some massive complexity thing and then you're like, oh, it's just YAML. Always YAML. Okay. It's YAML all the way down. Yeah. It really is. In fact, the service profile was right there in the docs and I just wasn't looking at it. Okay. Perfect. So now that we have a service profile, do I need to run this as well? Yeah. Let's add profiles for the other two services, the authors service and the books service. And then we can verify that with the tap command. So what do we have here? That's showing us the route. So if you see the path there.
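To make the cardinality point concrete, here is a small Python sketch of what a list of named routes with regexes buys you: many distinct request paths collapse into a handful of route labels. The route table below is a hypothetical one in the spirit of the Books app, not the generated service profile from the demo, and the full-match behaviour and the catch-all label are assumptions of this sketch.

```python
import re

# Hypothetical route table: (route name, HTTP method, path regex).
ROUTES = [
    ("GET /books.json", "GET", r"/books\.json"),
    ("POST /books.json", "POST", r"/books\.json"),
    ("GET /books/{id}.json", "GET", r"/books/[^/]*\.json"),
    ("POST /books/{id}/edit", "POST", r"/books/[^/]*/edit"),
]

# Label used here for requests that match no declared route (an assumption).
CATCH_ALL = "[DEFAULT]"


def route_name(method, path):
    """Return the name of the first route whose method and regex match the request."""
    for name, route_method, pattern in ROUTES:
        if method == route_method and re.fullmatch(pattern, path):
            return name
    return CATCH_ALL


def distinct_labels(requests):
    """Collapse a stream of (method, path) pairs into the set of route labels seen."""
    return {route_name(method, path) for method, path in requests}
```

For example, a thousand requests to /books/0.json through /books/999.json all map to the single label GET /books/{id}.json, so a metrics store keeps one series for them instead of a thousand.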
43:57 Yes. There we go. Alright. Path. Is that right? Mhmm. Is there an rt_route in there somewhere? I'm not seeing it. We might be running into some bleeding-edge issues. Oh, no. It's there. I see it. Oh, there. Yep. Okay. Cool. So the name of each of the routes in the service profile shows up there. And so if you wanted to put a fancy name in instead of just POST /books/{id}/edit, you could do that. Okay. So we've told Linkerd that each of our services has a Swagger file. We created
44:50 a service profile based on that Swagger file. Now how do I tell it to do retries? Straight to the retries. So if you see, we've got an edit command down a couple steps there. So if we run that edit there. Okay. And then we're gonna make an edit to the route. Okay. So let me see if I understand this correctly then. When we enable the sidecar injection on our pods, we still have to apply service profiles and mark routes as retryable. So it's not something that the sidecar just does by default when it sees
45:33 some sort of HTTP error; we have to be very explicit about the way that we want that to function. Exactly. And that is because not all routes are ones you want to retry. Imagine having a credit card charge getting retried a thousand times. That would be pretty bad. With retries, with great power comes great responsibility. Retries really are a great tool, but it's something you need to be very conscious about turning on. And so that's kind of where we are here. You really can only do them for idempotent requests, so anything that's not got
46:12 a body: GET, HEAD, that kind of thing. And you really only wanna do it when you're 100% confident that that's the thing that you want to do. So that's why we have you attach it on a per-route basis. And now, obviously, because these service profiles are regexes, you could write a dot-star regex for your service profile and then retry everything. I wouldn't recommend that, but it's something that you could do. Would it be possible to... I'm assuming with the regex, it would be, right? I could just match all GET requests and assume
46:44 that a GET request is not gonna have any sort of mutation on the server side, and that's probably okay to be retried. Exactly. Okay. Again, I would strongly not recommend that, but it's doable. Okay. So let's... it keeps opening in my other window. So let's just... alright. And we want to just add this as retryable here. Specifically on that authors line. Yeah, the authors HEAD request. Okay. This one here. Yep. Done. Yep. That was painless. So let's see what this tells us now. So this might be our issue with the edge. I don't understand why we're not getting
47:47 stats here. But if you go back and take a look at the docs, we'll see what the output is supposed to look like at least. And so here, we're showing you the effective success and the actual success, which is pretty cool. This is the difference: the retries get your effective success up from the actual because we're retrying, but it has the downside of increasing latency. Now, is this edge bug that we're running into something that's likely only to be affecting this command? Like, if I port-forward to the dashboard or to the
48:21 Grafana, would we be able to get a more visual look at that? This would be the right command to see this data. And it's all getting pulled from Prometheus, so I don't think we'll have a more visual way to take a look at that. Alright. Okay. No worries. So, okay. I think we understand retries there. I'm gonna assume timeouts are pretty much in the exact same vein. We find the path that we want to apply our rule to. Yeah. And we add on the timeout. Yep. We won't run through that. Let's see what else we can quickly do
48:53 then. So is there anything that you're particularly drawn to that you'd like to move on to next? That's a good question. Just for fun, why don't we go and do the... let me make sure I can find the docs here. Let's go to the securing your service doc. Alright. Let's go. It's under tasks. So if you click the tasks link there, down towards the bottom, there is a securing your service. Okay. And so... sorry, on you go. Oh, yeah. So, we do mTLS by default, out of the box, with Linkerd. We actually make it really hard for you
49:20 mTLS
49:54 to turn off. Encryption is great, but it's a little hard to validate. And so this doc will kind of walk us through using pretty much all the Linkerd tools and a couple lower-level ones to go and validate that mTLS is really working between all of the nodes. So mTLS is already running on the pods that I've enabled the sidecar on? Yeah. Ah, okay. mTLS, like I said, is enabled out of the box. In fact, tap, the command we've been talking about, that Wireshark for your cluster, requires mTLS because we need the
50:32 identity to validate that some rando isn't making calls on your cluster to get all of the traffic going through it. Okay. So that just means that all of the requests that leave my pods are being encrypted for me and I've not had to do anything then. Exactly. I mean, I'm sold already on that feature. I thought we were gonna have to go enable something, but the fact that that's just on because I installed Linkerd and injected the sidecar, that's pretty awesome. So you said it's hard for us to confirm that that works but not impossible.
51:03 Right? I mean, how do I know? Do I just trust it? Well, that was my point. Let's go and take the steps. I don't trust it. Alright. Let's see what this command is gonna return. Alright. So that's okay. All of those are secured. We have our favorite little green tick here. So, yes, I see the output. Awesome. And we can run the tap again. Is there anything at the top that would suggest it's encrypted? Yeah. So you see the tls= there? So there's tls=not_provided_by_remote. That's
51:50 not encrypted. But if there's tls=true, then it is encrypted. Ah. And so you'll see some requests there that aren't encrypted. Those are ones we can't encrypt. If it's coming from the kubelet, that's not injected with the Linkerd proxy. We don't control both sides of the connection. But anything that's a request between two services, we definitely can. That's where you see the tls=true. Okay. That makes sense. So we can validate mTLS with tshark. And this is the, like, don't just take our word for it with all of the tools that Linkerd has put
52:25 together. This actually adds a debugging container to your pod that you can then go and run debugging tools on. We use it a lot to go and debug the service mesh itself, but this is a great way to see the, like, raw bytes as they go through and make sure that they are actually encrypted. Awesome. Okay. Let's just reapply that. So it's injected that. Okay. So it's updated the sidecar with the debug option then on each of those deployments. Specifically, it's added an annotation
53:05 that will go and add another container to the pod, which is that debug container. Alright. Let's inspect the deployment. Yeah. Okay. So it just set enable-debug-sidecar. Good. Because that means it's quite easy for me to add that then without the actual Linkerd inject aspect. Exactly. So then we can get a remote shell by jumping inside of that debug container in our emojivoto voting service. Mhmm. And tshark. And tshark is just a command-line Wireshark. We're gonna be looking at port 8080 and
53:52 taking a look at SSL and then looking at what's going on on localhost. Ah, not localhost, right? Right. Okay. So it's looking specifically for requests which are leaving the service to another service, which is where the mTLS boundaries are, I guess. So this is actually coming in, but it's coming in on the proxy, i.e., the exterior IP address, not the interior IP address. Because the way that service meshes work is that we have some iptables games that go and redirect all of the incoming traffic into the proxy that then forwards it to
54:27 localhost. And so if we were to look at localhost, we would actually see everything in clear text. But as you can see here, we've got nothing but TLS data going through except for that ready request, which is, again, the kubelet calling in and validating that our service is up. Yeah. The SYN/ACK packets, right? Mhmm. Yeah. Okay. That's pretty sweet. I like that. Nice. And that's it. So we've covered retries, timeouts (which we skimmed), mTLS. I mean, it's safe to say that was all pretty trivial. Mhmm. That does not really require too much
55:14 effort on my part, which is good, otherwise it would all have been wrong. So I'm really happy with that. That's great. Now, on the home page, there's this big deal about the multi-cluster Kubernetes thing. Mhmm. Is that something we're gonna be able to look at with the 1.19, or do you think we may run into some issues with that? We can definitely give it a try. We changed the way that multicluster works, so we're actually going to need to go and take a look at the pull requests on the docs website if we
55:30 Multi-cluster
55:52 want to get that one up and running. But we can definitely give it a try and see what happens there. Sure. Why not? Let's see what happens. So if you go to... let me find those docs too. So I'm assuming I'll need to go through the standard install Linkerd command on my other cluster, right? Oh, so, actually, what you're gonna wanna do is uninstall Linkerd from your current cluster and then do a reinstall. And the reason for that is that to get multicluster up and running, we need to have a common trust anchor.
56:37 And the reason for that is that the trust anchor is how we do authentication, so that randos can't be requesting services inside your cluster. And so, actually, what you're gonna wanna do is... I'm gonna send you a link because it's probably going to be easiest that way. Give me just a second here. Yeah. Of course. That is gonna be the docs for that. Okay. So should I first do the Linkerd install, but pipe it to delete, to get rid of all this? So I would do an uninstall. Oh, that's easy.
57:32 Uninstall and then pipe that into kubectl delete. And since we've got those two other namespaces, you probably want to clean up the emojivoto and booksapp namespaces. The reason why we need to do an uninstall here is that we install a couple API services, which are Kubernetes resources. And if you delete a namespace and have an API service running, Kubernetes loses its little mind. And so you definitely wanna do that, but you can just delete the namespaces for Books app and emojivoto. Was thinking about completing it. Maybe I should just detect it. Yeah. Okay.
58:23 Delete it. Let's pull up these docs. Well, it'll eventually delete. I have confidence. So we're going way off script then. Way off script. Oh, in fact, so far off script that the command that I thought was in there to cut and paste is not there. Let me give you that command real quick. So is that a command I have to run before the multicluster install, or can I go ahead and just start it? Well, so the multicluster install is only gonna install the multicluster components. What you're gonna wanna do is... if you scroll up on
59:16 that doc a little bit, you'll see that you require two clusters and a control plane installation that shares a common trust anchor. So click on the trust anchor link there, and we're going to create a trust anchor certificate. I would recommend doing a brew install step here. But if you are a crazy OpenSSL kung fu artist, I won't stop you. No. Brew install step-cli. I can always quickly look that up. It's just step, I think. Cool. And I'm assuming step is some sort of utility for generating X.509 certificates or something similar.
1:00:17 Yeah. It's actually a great CLI that does a lot more than that, but we use it just for generating the certificates. Okay. So that created a trust anchor. And then you're gonna wanna go and do the step certificate and create an issuer certificate. So there is a trust anchor that is a CA and maintains the trust across your two clusters. And then, theoretically, though we don't need to do it here, you would have an issuer cert that you rotate on a regular basis that's actually what the proxies get their certificates from.
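The relationship between the trust anchor, the issuer certificates, and the proxy certificates can be pictured with a toy model. The Python sketch below involves no real cryptography and is not how Linkerd or step validate certificates; it only models "who signed whom". The Cert class, chains_to, and every certificate name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cert:
    """A certificate reduced to a name and its signer (no keys, no signatures)."""

    name: str
    signed_by: Optional["Cert"] = None  # None models a self-signed root


def chains_to(cert, trust_anchor):
    """Walk the signer chain and report whether it reaches the given trust anchor."""
    current = cert
    while current is not None:
        if current is trust_anchor:
            return True
        current = current.signed_by
    return False


# One shared trust anchor, one issuer per cluster, one proxy certificate each.
anchor = Cert("shared-trust-anchor")
issuer_one = Cert("issuer-cluster-one", signed_by=anchor)
issuer_two = Cert("issuer-cluster-two", signed_by=anchor)
proxy_one = Cert("webapp-proxy", signed_by=issuer_one)
proxy_two = Cert("books-proxy", signed_by=issuer_two)

# A cluster installed with its own, unrelated anchor.
other_anchor = Cert("unrelated-trust-anchor")
stranger = Cert("stranger-proxy", signed_by=Cert("stranger-issuer", signed_by=other_anchor))
```

In this model, proxies in both clusters chain to the same anchor, so each side can accept the other, while the stranger does not. Rotating an issuer means signing a new issuer with the same anchor, which leaves that relationship intact.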
1:01:02 Okay. This is making sense to me now. The fact that we're gonna do multi-cluster means we need a CA that is trusted on both for the communication to be effectively encrypted across that boundary. I mean, do I have to use this here? Could this work with tools like Vault to provide that top-level CA? Well, so this is an unfortunate collision of terms. There is a type of certificate that is a CA certificate, and then there is a type of service that is an online CA. So you can use Vault to go and
1:01:38 get the certificates if you wanted to, but this is just a way to add them yourself. While the trust anchor needs to be generated, we also allow you to do automatic cert rotation with cert-manager if you so wanted to do that. Oh, sweet. Okay. So is my next step just to keep following through with this? Yep. Alright. And so that'll install it on one cluster, and so you'll wanna then just select the context that you want for the other cluster. All the things. All the things all over again. Alright. So KUBECONFIG equals Linkerd
1:02:21 two dot JSON. Let's see if that works. We'll see if we got created or configured. Alright. Let's just do this. And then reapply that command. Okay. So we have Linkerd configured with our trust anchor deployed to cluster one and cluster two. I'm assuming we can skip this Helm and jump... That was just an alternative for Helm. Actually, just go back, and we should be able to continue the docs as they are laid out. Okay. So now we're doing the multicluster install to each cluster. Mhmm. Alright. So let's just do cluster two first. I should just split this, shouldn't I? That'd
1:03:11 be a bit easier. Linkerd two, then do... and then do Linkerd one. Alright. So we have a check command. I love these check commands. And we got all the green ticks. Good. So we've not linked the clusters yet. See if I can work with that. So what's the context, this context? So you probably won't need to worry about the context here. This is kind of from a blog post that we've got that talks about two different clusters, the East cluster and the West cluster, and kind of walks you through things. You'll wanna delete the context
1:04:07 from this one, but the idea here is that you're gonna run the Linkerd CLI on your Linkerd one cluster. The multicluster link is going from the... excuse me. Let me come at this from a different angle. When you run linkerd multicluster link on Linkerd cluster one, it's going to output all of the required configuration for Linkerd cluster two to talk to Linkerd cluster one. And so what you're gonna wanna do is run linkerd multicluster link on cluster one, get the YAML, and then apply that YAML to cluster two. Got it. Okay. So I can modify this
1:04:47 in line. So let me pull this along. So instead of context east here, which is just a Kubernetes config, I can leave that blank, which is just cluster one, and we give this a name. So can I just call this Linkerd one? Yep. And then here, we're actually just gonna point it to the other kubeconfig, I guess. Yep. Linkerd two dot JSON. And that's gonna link cluster one to cluster two. Right. What did I get wrong? It says no objects passed to apply. There might have been an error there. K. What's the
1:05:44 No ingress. So the problem there is that we need to have a load balancer with a public IP address so these two clusters can talk to each other. So this is saying that we don't have a gateway address there yet. So why don't we run kubectl on the linkerd-multicluster namespace and look at the services, figure out what's going on there? So we're still waiting for it to get an external IP address. Which will never work. Uh-oh. Yeah. So this is a bare metal cluster that needs a very special
1:06:38 type of load balancer, either kube-vip or MetalLB. Let me quickly grab that and see if I can get that working in under a minute. And if not, we will abort. Cool. Okay. So let's see. So while you're doing that, let me kind of explain what's going on here. We've actually got a blog post that talks a lot about it. In our experience, when you start talking about multiple clusters of Kubernetes, it is very unlikely that they can route to each other. Imagine having clusters in two different cloud providers, for example. Having a flat routable network is basically impossible.
1:07:25 And, I mean, you can do it. It just requires a whole bunch of effort. And so what we do is rely on services with load balancer IPs that are public and routable, and then we can just pass traffic through those gateways into your cluster. And so you don't need to think about any low-level networking here. It just gets an external IP address in most cloud providers, and you're good to go. So I did install kube-vip. But to be honest, I am not sure if I'm gonna need to set that up with anything.
1:08:03 Let's see if I can see anything immediately in the logs, and if not... But data like leader. And we should have a CCM. Cloud controller running. I think we'll give it thirty seconds. May get an IP address. In fact, I could probably take a look here and see if we have anything. Yeah. Potentially. So normally there would be one elastic IP. Oh no. Because I have two clusters. Okay. They're maybe not gonna get one then. Drat. Bummer. Failed. Oh, well. It doesn't matter. Like, that would have been cool to show off, but I didn't think
1:09:00 Closing
1:09:04 about the ingress aspect of it. So next time, I'll definitely get that working. Cool. Yeah. I'm sorry. I didn't even think about giving you a heads up on that one. Well, I guess you wouldn't have to, right? Normally, not a problem. Cloud providers have this little add-on that works with the load balancer service type. Unfortunately, bare metal is just a little bit different and does require a little bit of upfront work. I could have installed MetalLB or kube-vip. I think with kube-vip, actually, I need to tell it the elastic IP that it has
1:09:32 control of so it can pass that on. And I just don't have that config, and I keep randomly hitting that going, maybe it'll just magically come to fruition. But it's not gonna happen. Anyway, I think that's great. Is there anything else you quickly wanna go over before we wrap up that you think would be cool to show off? No. I think that's pretty good. We've kind of covered most of the feature set, to be honest. Yeah. I mean, there was a lot going on there. And again, it was really trivial, really easy to play with. I just
1:10:01 like how intuitive the CLI was itself and how easy it was to work through the documentation. Like, it was a pleasure to work with, actually. Good. Thank you. Well, I wanted to say thank you very much for joining me, taking the time out of your day to come and show us Linkerd. I'm looking forward to exploring and playing with it more. It's a really cool tool, and service mesh, like, you know, we touched a little bit on this: why would I want this thing in my cluster, you know, with the complexities that we're
1:10:30 pushing down from the software layer? And I think Linkerd plays a big part in that. Thank you for joining me. Have a great day and I will hopefully speak to you again soon. Awesome. Alright. Bye.