Microservice Troubleshooting, Built for Developers

Overview

About this video

What You'll Learn

Deploy Lumigo with Helm to auto-instrument Kubernetes services without changing application code.
Use the system map, transactions, and timeline traces to trace failing checkout requests.
Filter issues by service or endpoint, then assign, mute, or open Jira tickets.

Deploy the Lumigo operator to Kubernetes with Helm, auto-instrument the OpenTelemetry demo app, then walk the system map, transactions, timeline traces, and issues view to debug a failing checkout.
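The filtering shown in Explore near the end of the video stacks conditions (namespace, endpoint, duration) with AND and can invert them with NOT. The Python sketch below models that idea as composable predicates over a hypothetical, simplified trace record. The TraceSummary type, the helper names, and the sample data are all assumptions for illustration; this is not Lumigo's query API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class TraceSummary:
    """Hypothetical, simplified trace record (not Lumigo's data model)."""
    service: str
    endpoint: str
    namespace: str
    duration_s: float


Filter = Callable[[TraceSummary], bool]


def in_namespace(ns: str) -> Filter:
    return lambda t: t.namespace == ns


def from_service(name: str) -> Filter:
    return lambda t: t.service == name


def on_endpoint(path: str) -> Filter:
    return lambda t: t.endpoint == path


def slower_than(seconds: float) -> Filter:
    return lambda t: t.duration_s > seconds


def all_of(*filters: Filter) -> Filter:
    """AND: a trace matches only if every filter matches."""
    return lambda t: all(f(t) for f in filters)


def negate(f: Filter) -> Filter:
    """NOT: invert a single filter."""
    return lambda t: not f(t)


def explore(traces: Iterable[TraceSummary], query: Filter) -> List[TraceSummary]:
    return [t for t in traces if query(t)]


# Made-up sample data for illustration only.
SAMPLE = [
    TraceSummary("checkoutservice", "/checkout", "app", 4.8),
    TraceSummary("checkoutservice", "/checkout", "app", 0.3),
    TraceSummary("redis", "GET", "app", 0.01),
    TraceSummary("frontend", "/checkout", "default", 5.2),
]

if __name__ == "__main__":
    # namespace = app AND endpoint = /checkout AND duration > 4 s
    print(explore(SAMPLE, all_of(in_namespace("app"), on_endpoint("/checkout"), slower_than(4.0))))
    # namespace = app AND NOT service = redis
    print(explore(SAMPLE, all_of(in_namespace("app"), negate(from_service("redis")))))
```

Keeping each filter as a tiny predicate makes it trivial to test on its own, and all_of and negate provide AND and NOT without needing a query parser.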

Transcript

Generated from the English captions.

0:01 Introduction to Microservice Troubleshooting

0:01 Welcome back to the Rawkode Academy. I'm your host, David Flanagan, also known across the Internet as Rawkode. Today, we're gonna take a look at how to make troubleshooting your microservices a little bit easier. And let's face it, this is something we all need. Microservices are difficult. Whether you're adopting a monorepo, building out completely new CI/CD pipelines, or trying to observe, understand, and react to problems in production across your new plethora of microservices, you're constantly facing challenge after challenge after challenge. So it's nice when a tool can come along and make some of these challenges disappear.

0:48 And Lumigo is the perfect tool to come in and make distributed tracing and observability almost child's play. Let's take a look. On the surface, Lumigo is a distributed tracing tool. But when you go deeper, it surfaces so much more. Let's start at the top. Do we need another distributed tracing tool? Well, yes. Lumigo works with and supports OpenTelemetry. It also supports zero-touch automatic instrumentation of your services. That means you can deploy Lumigo to your cluster, like I am about to do now, and you will get distributed tracing without changing a single line of code.
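As we understand it, the "zero-touch" part comes from the operator injecting instrumentation into workloads in namespaces you opt in, rather than from code changes. Later in the video, that opt-in is a secret holding the Lumigo token plus a namespace-level Lumigo resource. A minimal Python sketch that emits both as JSON manifests follows. The resource's apiVersion and field names are taken from our reading of the Lumigo operator README and should be checked against the current docs; the LUMIGO_TOKEN environment variable is an assumption.

```python
import json
import os
from typing import Any, Dict, List


def lumigo_namespace_manifests(
    namespace: str, token: str, secret_name: str = "lumigo-credentials"
) -> List[Dict[str, Any]]:
    """Return a Secret with the Lumigo token and a Lumigo resource that references it.

    Field names follow our reading of the Lumigo operator README; verify them.
    """
    secret = {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": secret_name, "namespace": namespace},
        "type": "Opaque",
        # stringData takes plain text; the API server stores it base64-encoded in data.
        "stringData": {"token": token},
    }
    lumigo = {
        "apiVersion": "operator.lumigo.io/v1alpha1",
        "kind": "Lumigo",
        "metadata": {"name": "lumigo", "namespace": namespace},
        # Points the operator at the secret and key that hold the token.
        "spec": {"lumigoToken": {"secretRef": {"name": secret_name, "key": "token"}}},
    }
    return [secret, lumigo]


if __name__ == "__main__":
    # LUMIGO_TOKEN is an assumed variable name; never hard-code a real token.
    items = lumigo_namespace_manifests("app", os.environ.get("LUMIGO_TOKEN", "<your-token>"))
    # Pipe this into `kubectl apply -f -` (kubectl accepts JSON manifests).
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

The app namespace has to exist before applying (kubectl create namespace app, as in the video). Generating both objects from one function keeps the secret name and the secretRef in sync.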

1:14 Lumigo Features & Automatic Instrumentation

1:42 As we explore the Lumigo dashboard, you'll start to see that it is much more than automagic instrumentation. It lets you browse and understand these distributed traces in a way that reveals much more information than I've seen from other tools, allowing you to introspect the entire request object, view service maps, and a whole bunch more. Enough said. Let's deploy Lumigo to our cluster. Alright. Let's log in to Lumigo. I've already signed up for an account, but it just takes a second. You can sign up with a username and password, or like

2:10 Installation

2:21 I did, with Google authentication. I've already explored Lumigo in preparation for this video, so I'm gonna create a new project. I'll call this one "the actual demo" because I have many other similar names. Once this project is created, we get instructions to deploy Lumigo to our infrastructure. Lumigo doesn't just support Kubernetes, although that is the focus of my video today; it also works with ECS and AWS Lambda. We're gonna click on Kubernetes and use Helm to deploy the Lumigo operator, which gives us the automatic instrumentation.
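The Helm step can be sketched as two commands. The snippet builds them as argument lists, prints them, and only runs them when dry_run is False. The chart repository URL and chart name (lumigo/lumigo-operator) come from the Lumigo operator README as we recall it, and docker-desktop mirrors the cluster name used in the video; verify all three against the instructions Lumigo shows for your project.

```python
import shlex
import subprocess
from typing import List

# Chart repository URL as published in the Lumigo operator README (verify before use).
LUMIGO_REPO_URL = "https://lumigo-io.github.io/lumigo-kubernetes-operator"


def lumigo_helm_commands(cluster_name: str, namespace: str = "lumigo-system") -> List[List[str]]:
    """Return the Helm invocations as argv lists."""
    if not cluster_name:
        raise ValueError("cluster_name must be non-empty")
    return [
        ["helm", "repo", "add", "lumigo", LUMIGO_REPO_URL],
        [
            "helm", "install", "lumigo", "lumigo/lumigo-operator",
            "--namespace", namespace,
            "--create-namespace",  # create the namespace if it does not exist yet
            "--set", f"cluster.name={cluster_name}",  # any name you like
        ],
    ]


def run(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run(lumigo_helm_commands("docker-desktop"))
```

Passing argv lists to subprocess.run avoids shell quoting problems; shlex.join only reconstructs a copy-pasteable command line for display.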

3:11 From here, we can copy the helm repo add command followed by helm install, which just wants our cluster name, which can be anything we wish. Then we need to create a secret inside a namespace so that the operator can consume it and write to our Lumigo instance. So let's start with step one, deploying the operator with Helm. Let's jump over to our terminal, and I'll pop open VS Code to give us an empty buffer. Instead of chaining these commands together, we're going to split them up. And I like to format them in a

3:52 way that they make sense when I look back at them later. So step one, let's add the repository, like so. I have already done this, so it's already there. Next, we're going to do a helm install. We'll use the default namespace of lumigo-system, which is a very weird sentence now that I think about it. If this namespace doesn't exist, we're going to create it. I'm going to call this cluster Docker Desktop because I'm using Docker for Mac for this Kubernetes cluster. We can copy the installation command and paste it into our terminal. Easy. Next,

4:41 we want to create a secret with our token. To create the secret, we need to specify a namespace, and this is going to be the namespace where your application is running. So I'm just going to call it app, and we'll call the secret lumigo-credentials, as suggested, created from a literal value. And this is my real Lumigo token. Hopefully, it won't exist by the time you see this video. We can jump back to the terminal, where we can run kubectl create namespace app, followed by our secret creation. If we run kubectl get secrets in our app namespace,

5:29 we'll see it now exists. Lastly, we copy the code to configure our namespace for Lumigo. Simple. Okay. Let's deploy our sample application. We're using the OpenTelemetry demo application, which comes with a single manifest for deploying to a Kubernetes cluster. I have updated said manifest to deploy to the app namespace, as we configured when we deployed Lumigo to the cluster. We can now run kubectl -n app get pods. Now, this may take a couple of moments. So by the power of magic, let's come back. God love magic. Now, all of our demo services are running on our

5:42 Deploying the Demo Application

6:21 Kubernetes cluster, in a namespace configured for Lumigo. So I think now would be a good time to jump back to our Lumigo dashboard and validate our installation. Awesome. Alright. So let's give Lumigo a little bit of data to work with. Let's run a port forward to the front end proxy, which makes our application available on localhost port 8080. From here, we can see the OpenTelemetry demo application. Let's go shopping for some binoculars and a lens cleaning kit. Now, we place our order like so. Let's jump back to Lumigo and see what

6:31 Validating Installation & Generating Traffic

7:26 we can see. Requests are showing up, which tells us our application is sending traces to Lumigo. We're currently on the Live Tail page, and we can see these requests streaming in in real time. I like to start off by taking a look at the system map. Here, we have a visual representation of all of the services within our cluster, or at least the ones that have ambient traffic running through them or that we hit by clicking around the shop. We can see the payment service, the ad service, the cart service, currency service, front end, feature flags, and so forth.
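A service map like this can be derived from the spans themselves: whenever a span's parent belongs to a different service, that is a caller-to-callee edge. The sketch below illustrates the idea and is not Lumigo's implementation; the Span type and the sample span IDs and service names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str


def service_edges(spans: List[Span]) -> Set[Tuple[str, str]]:
    """Derive (caller, callee) service edges from parent/child span links."""
    by_id: Dict[str, Span] = {s.span_id: s for s in spans}
    edges: Set[Tuple[str, str]] = set()
    for span in spans:
        parent = by_id.get(span.parent_id) if span.parent_id is not None else None
        # Calls inside one service are not edges on a service map.
        if parent is not None and parent.service != span.service:
            edges.add((parent.service, span.service))
    return edges


# Hypothetical spans from one add-to-cart trace.
SPANS = [
    Span("a", None, "frontend"),
    Span("b", "a", "cartservice"),
    Span("c", "b", "cartservice"),  # internal span within the same service
    Span("d", "c", "redis"),
    Span("e", "a", "currencyservice"),
]

if __name__ == "__main__":
    for caller, callee in sorted(service_edges(SPANS)):
        print(f"{caller} -> {callee}")
```

A real map would union these edges across many traces, which is consistent with only services that have received traffic appearing on it.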

7:42 Exploring the System Map

8:07 We can see that it speaks to a Redis instance and to something on port 443, which I believe is our device. So let's go to the dashboard. Because this application has only been running for a few minutes, we have a couple of failures, some requests, and some information on the duration of these functions or services. We can even see the p99 and p95 of these services based on the limited information we have thus far. So now let's pop over to transactions. From here, we can begin to understand the

8:42 Analyzing Transactions & Trace Details

8:46 requests within our application. On this one request, we can see we started off at the front end. It made a request to the cart service, which spoke to Redis, and the cart service is written in .NET. Our front end is in JavaScript, and well, Redis is Redis. We can click on our front end and start to get more information about the service. We can see the route that it came in on here: this is when we requested the cart, and it tells us the currency was GBP. We see all the environment variables configured on

9:30 our pod. And if we scroll down, we can start to see the span information. It gives us the start and end time as well as all the attributes that are important. We get to see the information on the URL, which we already kind of knew, but now we can see the host name that was used. We see the protocol. We can see the browser. All the stuff you would expect to see from a trace within a microservice architecture. Now this view is great for debugging because we can see the services that were involved. We can see information about the pod

10:07 and the span. Then we can click on the requests themselves, where we can see that a POST request was made to the cart service on this path, GetCart. If we click on Redis, we can even see the request, or the query, that was sent to Redis, too. The statement was a GET for a UUID. If we pop over to this side, we can see that this was actually a request to add an item. And if we pop down to this Redis call, we'll see that we have a SET of a UUID. So this transaction from a single request ID

10:53 propagated through the system, and it shows what actually happened when we added something to the cart, which also fetched the updated cart and returned it to our client. Now this is great. We can understand a whole bunch of information about the system, but sometimes it is nice to have the classic distributed tracing view, the timeline. Here we can see this transaction as it happened, in a linear progression. We can see that the front end is responsible for pretty much everything. We already knew that. We have our cart service, which speaks to Redis. We can see that the first request to

11:11 Timeline Trace View

11:36 Redis was the add item call, which actually made multiple calls to Redis. Now that we can see the order these requests happened in, we can click on them. Alright. There was a GET, followed by a SET, followed by an EXPIRE. Nice. Once the item was added to the cart, the cart service made one more request to Redis, this time a GET. Now we get to understand how the request propagated through our system over time. So let's take a look at something else. Down here, we have a request that took almost five seconds. So let's take a look at it.
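The timeline view is essentially the trace's spans sorted by start time, each drawn at its offset from the first span. A minimal sketch follows; the span names (including the /api/cart route) and all timings are made up for the add-to-cart trace and are deliberately listed out of order.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TimedSpan:
    name: str
    start_ms: int
    end_ms: int


def timeline(spans: List[TimedSpan]) -> List[Tuple[int, int, str]]:
    """Return (offset from trace start, duration, name), ordered by start time."""
    if not spans:
        return []
    t0 = min(s.start_ms for s in spans)
    ordered = sorted(spans, key=lambda s: s.start_ms)
    return [(s.start_ms - t0, s.end_ms - s.start_ms, s.name) for s in ordered]


# Hypothetical timings for the add-to-cart trace.
ADD_TO_CART = [
    TimedSpan("redis GET (read back cart)", 1030, 1032),
    TimedSpan("redis SET", 1012, 1015),
    TimedSpan("frontend POST /api/cart", 1000, 1040),
    TimedSpan("redis EXPIRE", 1016, 1017),
    TimedSpan("cart AddItem", 1005, 1020),
    TimedSpan("redis GET", 1008, 1010),
    TimedSpan("cart GetCart", 1025, 1035),
]

if __name__ == "__main__":
    for offset, duration, name in timeline(ADD_TO_CART):
        print(f"+{offset:>4}ms {duration:>4}ms  {name}")
```

Sorting by start time is what turns an unordered bag of spans into the GET, SET, EXPIRE sequence seen in the video.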

12:16 Identifying and Analyzing Slow Requests

12:28 This request goes through a whole bunch more services. We can see here and here and here. We've got a whole bunch of gRPC and a whole bunch of Kubernetes services. Let's take a look at the timeline. We can see the front end speaking to the recommendation service: when we viewed a product, it wanted to show us more products just like it. This requires a whole bunch of gRPC calls, the first one to get a list of recommendations followed by subsequent requests to get the product information. Eventually, this is all returned, and it even

13:10 requires a lookup to the feature flag service. We can click on the feature flag service and see that it was checking whether the "product catalog failure" feature flag is enabled. In this case, it was false, so we didn't get an error. And right away, we're seeing the core value of a tool like Lumigo. You don't need to understand an application as a whole, at a macro level, when it is built from lots of smaller microservices. We could focus purely on debugging and observability, and we are going to dive into how that works with Lumigo.
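The dashboard shown earlier reports p95 and p99 durations, and one rough way to decide whether a request like the five-second one is an outlier is to compare it against those percentiles. This sketch uses the nearest-rank percentile method; which method Lumigo actually uses is not stated in the video, and the durations are invented sample values.

```python
import math
from typing import List, Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def above_percentile(values: Sequence[float], p: float = 95) -> List[float]:
    """Return the samples strictly slower than the p-th percentile."""
    cutoff = percentile(values, p)
    return [v for v in values if v > cutoff]


# Made-up request durations in milliseconds; one request took about five seconds.
DURATIONS_MS = [120, 80, 95, 110, 130, 90, 100, 105, 115, 85,
                125, 140, 98, 102, 118, 93, 88, 135, 150, 4900]

if __name__ == "__main__":
    print("p95:", percentile(DURATIONS_MS, 95), "p99:", percentile(DURATIONS_MS, 99))
    print("slower than p95:", above_percentile(DURATIONS_MS))
```

With only 20 samples, p99 is simply the slowest request, which is one reason percentiles from a freshly deployed app, like the one in the video, are noisy.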

13:55 But as a way to onboard new people to your team, instrumentation is key. Much like testing is invaluable for learning how code should work, good observability tools can reveal the same at the architecture level, not just the code level. Here, we can see that we have feature flags. We have Redis. We're using gRPC. We have all of these services working together, and we have a good understanding of how that happens. So thank you, Lumigo. Now, let's go back to debugging, one of my favorite topics. So I've jumped over to the Live Tail page so we can see things in real

14:32 Debugging Application Failures

14:39 time. Let's jump back to our demo application, go shopping, and add something to our cart. Right now, we have three items, but that's not really important. This application has a cool edge case where changing the credit card number causes an error. So we'll hit place order and jump back to our Live Tail. Now we can see all of the requests that were happening behind the scenes as we added to the cart and browsed around the website. And now we can see some checkout failures. So let's pause the Live Tail. I clicked checkout a bunch of times

15:22 so we can see them coming in over and over. Here, we can see that we tried to check out again and again. These failures happened because I changed that number. Let's take a look at one. And now we have issues. These can also be discovered via the Issues page, and even on the dashboard we can see an occurrence of the issues here. So, we go back to Issues. And now we can see here we have "failed to charge card". Let's click on that and take a look. Now, we get an overview of what went

15:43 Exploring the Issues Page

16:06 wrong during this transaction. We can see the front end try to place an order over gRPC with the checkout service. And we can see the actual error message, or the exception thrown by the application. We can see here that it failed to charge the card, and that's because the card number didn't work. And just to show you that this definitely does work, let's change it back to the correct number, and it works. So we get more information as things happen, even when they are exceptions, with logs integrated right next to the trace. Now there are some really nice value-add

16:44 Issue Management & Collaboration

16:50 features here too. One, I can assign this to somebody. I can create a Jira ticket if I have it integrated. And if we know about this issue, maybe we wanna mute it for a few days while we work on a fix. Now there are a ton more features we could take a look at in Lumigo, but we wanna keep this video short and sweet. But before we go, let's click on Kubernetes. Here, we get a good visual representation of our Kubernetes cluster. We can see all of the workloads deployed, the cluster name, the kind, namespace,

17:19 Exploring Kubernetes Workloads

17:30 when they were modified, and any relevant issues that happened in the last hour, because that is what we have the time window set to. If we click on one of the workloads, we can see the Kubernetes events, from deleting pods and scaling replica sets to creating the deployments, as well as any application issues at the bottom. We have a very nice visual representation of that timeline too. Let's go back and click on our front end service. As well as the visual timeline, we have the application issues at the bottom

18:10 too. So we can understand how our services are behaving on a per-service basis also. The last thing I want to show you before we wrap this up is Explore. It's one of my favorite features, but it would require much more time to go into in depth. Here we can filter by service. If we just want to see Redis issues, we can take a look at Redis. Here we can see all the requests propagated through the system. If we want to take a look at Kubernetes clusters or Lambda functions or any of the other supported resources, these are filterable here

18:18 Using Explore for Data Filtering

18:46 too. Now, let's say that we just want to take a look at our app namespace. Well, we can do that too. Now we have all of our services and we can drill down even further. What if we want to take a look at one single endpoint? Well, let's take a look at that checkout endpoint. Here we can see the request and the results. Quite a lot of failures. Whoops. What if we just wanna filter by long requests? What about anything that takes more than four seconds? Well, we can do that too. And you can see that we are actually

19:23 joining these filters together with AND statements. We could also use NOT. How you query and understand the traces across your system is up to you. Lumigo is such a cool tool, and here is the good news: Lumigo has a free tier. They don't charge you per seat or user. You can have unlimited users on the free plan, but you will be limited to 50,000 traces per month. And even if you do need more traces, up to a million is only a hundred bucks a month. Awesome. I hope you enjoyed this quick demo of

20:04 Conclusion & Call to Action

20:06 Lumigo. Go check out their website at lumigo.io. And don't forget to tell them Rawkode sent you. Thank you for your time. Have an awesome day.

Documentation

Lumigo Documentation

Demos

OpenTelemetry Demo Application
