About this video
What You'll Learn
- How Ambient Mesh removes sidecars while keeping Istio's control plane
- How ztunnel and HBONE provide layer 4 mTLS between nodes
- How waypoint proxies enforce layer 7 policy, tracing, and Gateway API traffic rules
Marino Wijay and Matt Turner unpack Istio Ambient Mesh, walking through ztunnel, HBONE, and waypoint proxies, the trade-offs against sidecars, and how eBPF, CNI plugins, and the Gateway API fit into Istio's post-graduation future.
Jump to a chapter
- 0:00 Introductions
- 0:16 Welcome & Guest Introductions
- 1:43 What is Istio Ambient Mesh?
- 1:50 What Ambient Mesh?
- 2:01 Ambient Mesh: A New Operational Mode
- 3:31 Addressing Sidecar Challenges (Race Conditions, Privileges)
- 4:15 Why Ambient Mesh?
- 5:14 The Role of Istio CNI
- 6:43 Ambient Mesh Components: ztunnel & Waypoint Proxy
- 7:08 ztunnel: Layer 4 Policy, mTLS, and HBONE Tunnels
- 8:17 Waypoint Proxy: Handling Layer 7 Policy
- 9:09 Debating Sidecar vs. Ambient Implementation Details
- 12:07 Running Sidecar and Ambient Modes Together
- 12:28 Ambient Mesh: Avoiding Rollout Disruption
- 14:18 Challenging Ambient's Deployment & Resource Claims
- 16:29 Recap and Waypoint Proxy Functionality
- 18:20 Waypoint Proxy
- 18:30 Waypoint Proxy and Layer 7 Enforcement
- 21:40 Security Trade-offs and Shared Components
- 24:29 Clarifying Networking Layers (OSI Model)
- 24:55 Identity, Tenancy, and Node Compromise in Ambient Mesh
- 25:00 Trade Offs
- 27:23 Node-Local Traffic and Threat Modeling
- 31:48 Historical Parallels: Service Mesh vs Traditional Networking
- 34:20 Why Not eBPF?
- 34:32 eBPF, Cilium, and Istio's Approach
- 38:30 Istio Graduation and Ecosystem Impact
- 39:50 Istio Graduation!
- 41:01 More Choice, Real-World Testing, and Standards
- 42:05 Evolution of Cloud Native Networking
- 44:00 Guest Plugs
- 47:33 Warning about Business Logic in the Network
- 48:51 Conclusion & Outro
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
0:00 Introductions
0:00 Hi. Just a quick message before the next episode. My audio is unfortunately terrible. However, my cohost Laura and our wonderful guests' audio is perfect. So listen to them. Don't listen to me. Enjoy the episode. Welcome to Cloud Native Compass, a podcast to help you navigate the vast landscape of the cloud native ecosystem. We're your hosts. I'm David Flanagan, a technology magpie that can't stop playing with new shiny things. I'm Laura Santamaria, a forever learner who is constantly breaking production. Nervous about cloud native networking? Intrigued by Istio? Today, we're chatting with Matt Turner and Marino
0:16 Welcome & Guest Introductions
0:39 Wijay, both contributors to the Istio project and advocates for the new ambient mesh architecture provided by Istio. With the graduation of the Istio project and the alpha release of the ambient mesh architecture, we have a lot of questions to ask that Matt and Marino can't wait to answer. Alright. Hello, everyone. Welcome to this little session here on Cloud Native Compass. My name is Marino. I'm a platform slash developer advocate at Solo, and I focus in on network technologies spanning from the lower levels all the way up to service mesh. And I'm here,
1:13 I'm very excited to talk about ambient mesh with y'all. So excited. So excited. Tell. Where's the excitement in Marino? Like, come on. Let's go. So I'm Matt Turner. I'm a software engineer at Tetrate. I focus on fairly similar stuff to Marino, I guess, looking at service mesh. I've been into Kubernetes, containers, cloud native stuff for quite a long time now. Yeah. Also super excited to talk about ambient mesh. It's a really interesting development. Yeah. Lots to cover. Alright. Well, I mean, let's just kick it off. You both said ambient mesh now. So
1:43 What is Istio Ambient Mesh?
1:46 I have to say it before we can actually properly start. But for anyone who is not aware of what is happening in the service mesh space, particularly around Istio, can we give them the thirty-second overview of what ambient mesh is? Sure. Yeah. So ambient mesh is a new mode of operation in the world of Istio, and it was developed because there were a lot of different interesting patterns that we were seeing with workloads that just never really saw a benefit to using a sidecar. A service mesh normally would deploy a sidecar
2:01 Ambient Mesh: A New Operational Mode
2:20 alongside your main application container, but we realized that there are certain patterns that just don't fit with the sidecar model. So enter ambient mesh, where we're effectively removing the sidecar but still presenting the service-mesh-like capabilities, but from a different lens and in a different architecture. At the end of the day, ambient mesh is still service mesh. It just creates operational simplicity and allows you to cut back on resources while still retaining all the benefits of what a service mesh offers. I guess that's probably what we'll get into. I mean, yeah. So, absolutely, ambient mesh is a new
2:54 mode. So a new, you know, install option, deployment topology for Istio. It's still Istio. Yeah. That's the service mesh that it's been done in. There's no reason this couldn't be applied to other meshes, but this is an Istio project we're talking about. I probably have some, you know, some comments on pretty much all of the sort of, you know, features that were listed there. Like, we can debate those. But, yeah, the 50,000 foot view is that it's a new deployment mode, a new architecture for Istio that, like
3:23 Marino says, removes sidecars and replaces them with a different way of doing things on the data plane. It's still the same Istio control plane. Okay. So ambient mesh manages to remove the sidecar. I would hope that this brings substantial, maybe not directly performance, but at least compute and memory optimizations across the cluster because, you know, people are running this throw out the wonderful 10 number. Right? A 10 pods and everyone else have nodes, and they've got 10 nodes. We're talking about, oh my god. I have to do math. It's only multiples of 10. 12 hundred
3:31 Addressing Sidecar Challenges (Race Conditions, Privileges)
3:58 12 hundred pods, which could be potentially 2,400 containers in sidecar mode. Right? So while I'm sure these sidecars have historically been optimized to be very lightweight, they're still containers at the end of the day, which means they still do have a memory footprint, a compute footprint, and so forth. I mean, is that directly the benefit of ambient mesh, or is there more to it than that? There's much more to it. So let's think of the classic example of the race condition. So you have certain applications that need to respond almost immediately, but when you inject a sidecar into that
4:15 Why Ambient Mesh?
4:35 model, one of the challenges that you have is that now you have this race condition of the sidecar versus the application container that have to come online. Who comes online first? There might be situations where you have to set a, what they call, a hold application timer in the configuration to prevent your main application container coming online before your sidecar. In that situation, what ends up happening is the Istio init container cannot redirect iptables so that traffic can go through the sidecar first before it gets to the application container. Actually, that doesn't even happen. Your application container
5:09 is directly interfacing with other services, and it's not even part of the mesh at that point. So one of the approaches that ambient mesh takes is you could still inject services into a mesh without having to deploy this sidecar. And what ends up happening is there are resources that sit outside of the application itself or the pod itself that set up the redirection of traffic. Now there are some components that are involved that make this possible. For example, are you all familiar with the Istio CNI? The Istio CNI is is more of a it's not a CNI replacement, just to be
5:14 The Role of Istio CNI
5:43 clear. It's not a container networking interface plugin. It's actually an add-on that you associate with Istio because there are certain situations where the security teams do not want you to hop into your pod manifest and specify privileges for those application containers from a security standpoint. So you would use the Istio CNI to do a lot of this, give it access to be able to make those changes at the host level, iptables rewrites, to allow for pods or applications to communicate in ambient mesh. Even outside of ambient mesh, when we're talking
6:18 about having sidecars present with that particular security constraint, the Istio CNI will help with the redirection as well without having to modify any sort of pod manifests. Now when you think about that for a second, when you actually think about what that truly means, it just means that we're just redirecting traffic no differently than we would do inside of the actual pod itself if a sidecar was there because that's the operation that goes on anyways. Now here's the other consideration, though. So you don't have a sidecar anymore in ambient mesh. Who's actually doing the mesh stuff? Where do
6:43 Ambient Mesh Components: ztunnel & Waypoint Proxy
6:52 things like mTLS show up? How do we gain our observability? What happens to things like authorization policies? Now that's what's been addressed in ambient mesh through some artifacts, two new artifacts. One is called the ztunnel, and the other one is called the waypoint proxy. Now the ztunnel is a node-level proxy, you could say, which actually used to be Envoy a long time ago, and they've rewritten it in Rust just to focus in on things like layer four policy and connectivity, providing the mTLS, and even providing authentication to other services. But what really interestingly happens here is that
7:08 ztunnel: Layer 4 Policy, mTLS, and HBONE Tunnels
7:28 when you have traffic that has to go between nodes and you have, you know, service to service communication between those nodes, your traffic actually traverses these ztunnel pods that exist at the node level. And these ztunnel pods, again, because they exist across all the nodes as a DaemonSet, they are forming tunnels to each other. Now are any of you familiar with tunneling technologies like IPsec or VXLAN or, okay. So it's a very, very, okay. So these tunneling technologies basically create a virtual private laneway or network, VPN, basically, to allow for a dedicated lane for your traffic
8:05 to move about without folks being able to inspect and sniff that traffic and know what's going on behind the scenes. They won't be able to understand what payloads are being sent because all of that is jumbled and encrypted anyways. Now we wanna maintain the encryption because we wanna maintain that security posture altogether. But the creative way that we approached this, or that the open source community approached this, is by creating a brand new tunneling protocol called HBONE, or the HTTP-based overlay network encapsulation protocol, which is another fancy way of saying, hey. Let's just deploy VXLAN slash Geneve,
8:17 Waypoint Proxy: Handling Layer 7 Policy
8:39 for ztunnels to form the tunnels to each other and create that laneway, which is all that's going on. But here's the kicker. So all of this is layer four. When you start to realize that, hey, I need to inject some layer seven policy. I don't wanna run GET operations against the service. I need to provide some sort of mechanism to prevent that. This is where I deploy a waypoint proxy, which is still Envoy based. Now I'll stop there because there's a lot more technical details around it, and, you know, maybe Matt might wanna expand on that a
9:07 little bit more. Yeah. I mean, I'd probably like to get right back to the beginning. Right? So, I mean, the race condition that was mentioned is, I mean, I'd say that's an implementation detail. It's definitely real. It's definitely a concern. At the moment, yeah, the solution to that is to hold the application back. There's the Monzo open source, like, a wrapper that does that, like an init wrapper that does that. You know, there's various sort of ways to fix it at the moment, but there's a, you know, there's a KEP out to
9:09 Debating Sidecar vs. Ambient Implementation Details
9:35 fix this in Kubernetes. So this is kinda well known. It's understood. A network sidecar is not the only kind of sidecar container that has this issue. So there's a KEP in progress to identify, like, the main application container and the sidecars and ensure that ordering. Right? So to me, that's an implementation detail. Like, it's being fixed. I'm not sure the thing about the init containers is actually true. Like, they are guaranteed to run before any of the sidecar or any of the sort of main runtime containers run, even the sidecar. So it's the init
10:06 container that sets up the interception, the iptables interception. And that is guaranteed to run to completion before the application container runs or the sidecar. So, again, I'm not quite sure about, you know, what's maybe going on there with the security. Like, those iptables rules would always be in place. Sure. The sidecar might not be ready by the time the app has come up, so the traffic will get black-holed, which is an availability concern, but it's not a security concern because we know those interception rules will already be in place.
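The distinction drawn here can be sketched as a tiny Python model. It assumes only the two facts stated in the conversation: the interception rules are either in place or not when the app starts, and the sidecar is either ready or not. The names `Outcome` and `first_request_outcome` are hypothetical and exist only for illustration; this is not Istio code.

```python
from enum import Enum


class Outcome(Enum):
    PROXIED = "traffic flows through the sidecar"
    BLACK_HOLED = "traffic is redirected but nothing is listening yet"
    BYPASSED = "traffic leaves the pod without touching the mesh"


def first_request_outcome(rules_installed: bool, sidecar_ready: bool) -> Outcome:
    """Classify what happens to an app's first outbound request (illustrative model)."""
    if not rules_installed:
        # No redirect in place: the app talks to its peers directly,
        # which is the security concern raised earlier.
        return Outcome.BYPASSED
    if not sidecar_ready:
        # Redirect in place but the proxy is not up: requests fail,
        # which is an availability concern only.
        return Outcome.BLACK_HOLED
    return Outcome.PROXIED


if __name__ == "__main__":
    # The case argued above: rules are guaranteed, sidecar may lag behind.
    print(first_request_outcome(rules_installed=True, sidecar_ready=False).name)
```

Under the guarantee described above (the init container or the Istio CNI always runs first), `rules_installed` is always true, so the `BYPASSED` branch is the one the speakers disagree about.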
10:35 What Istio CNI is, yeah, like Marino says, it's not another, like, CNI, like another actual overlay network, but it's a CNI wrapper because you can stack these CNI plugins. So what it does is it basically runs... so the job of an actual, like, CNI plugin is to just provide a network interface into the pod. Right? When the pod's being made, your CNI plugin will make a TAP or a veth or whatever, or do some eBPF, whatever your underlying network
11:04 wants. And you can stack them. So if you put the Istio CNI in first, it'll just run first. It doesn't actually make a network interface, but what it does is set up those iptables interception rules so that you don't need an init container. And the argument for that is folks don't want to give the init... so the reason the init container exists is firstly to fix that race condition, and secondly, so the init container can have the high privileges you need, like the CAP_NET_ADMIN that you need to set up iptables. Some folks aren't even happy with that, with,
11:31 like, with that init container having those permissions. You can use the CNI just to run that privileged code, like, somewhere else, at the host before the pod comes up. So, again, kinda implementation detail, but both of those, like, get rid of that race condition. And, like, the pod manifest, you know, never needs to be modified for either of those solutions. Like, it's modified at runtime with the init container, but that's done transparently by istiod. So, yeah, there's, like, a lot of implementation details you have to
12:01 understand to really kinda grok what's going on here and then why you might choose between the two different modes. So it's clear that it's going to be a choice between specifically. That's what you're mentioning. Like, you don't really run them together on one cluster. Correct? Like a sidecar implementation and an ambient mesh implementation. You can, actually, yeah. Namespace by namespace. Is that right, Marino? Is that the, yeah. I guess. Yeah. That's right. I mean, the way you go about enabling sidecar versus non-sidecar is just a label that you affix to the namespace itself.
12:28 Ambient Mesh: Avoiding Rollout Disruption
12:33 So, mhmm. In the case of sidecars, you do the istio-injection=enabled label, whereas for ambient mesh, right, in another namespace, you actually specify a data plane mode equals ambient. Can't remember the actual label, but there's actually a very interesting difference between the two. So the race condition was one particular example of an issue, very corner case, but there's another issue as well. If you're already running workloads that are running in production today, to inject the sidecar means you have to do a rollout of your deployment so that the sidecars can be injected into
13:09 new pods that get deployed, which actually can be a little bit of a hit in terms of being online. So in the situation where you're not deploying sidecars, you're using ambient mesh, none of that ever happens because all we're really doing at this point is, and I'm gonna use an example from way back in the day of IPsec. In IPsec, right, when you form tunnels, one thing you had to define was interesting traffic. That interesting traffic would be defined through access control lists. And as long as your, whatever proxy or gateway that you're using, can identify
13:44 that this traffic has to be matched to be encrypted and tunneled, then that traffic would proceed to go through encrypted. This is very much how ambient mesh works: you have a namespace that has been labeled for direction towards ambient mesh, ambient mesh enabled, but also direction towards the ztunnels. So traffic will get redirected, but this doesn't happen inside of the main application container at all. In fact, this is where the Istio CNI comes in to help with that. So it's not intrusive, especially when it comes to production workloads.
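The per-namespace choice described above can be sketched in Python as two namespace manifests that differ only in their label. The sidecar label `istio-injection: enabled` is the one named in the conversation. The ambient label key `istio.io/dataplane-mode` comes from Istio's documentation rather than from the speaker, who did not recall it exactly, so verify it against the Istio version in use. The namespace names and the helper `namespace_manifest` are hypothetical.

```python
import json

# Label that opts a namespace into each data plane mode.
# The ambient key is an assumption taken from Istio's docs, not from the talk.
MODE_LABELS = {
    "sidecar": {"istio-injection": "enabled"},
    "ambient": {"istio.io/dataplane-mode": "ambient"},
}


def namespace_manifest(name: str, mode: str) -> dict:
    """Build a Kubernetes Namespace manifest labeled for the given mesh mode."""
    if mode not in MODE_LABELS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODE_LABELS)}")
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": name, "labels": dict(MODE_LABELS[mode])},
    }


if __name__ == "__main__":
    # Hypothetical namespaces: one per mode, living side by side in one cluster.
    for ns, mode in [("legacy-apps", "sidecar"), ("new-apps", "ambient")]:
        print(json.dumps(namespace_manifest(ns, mode), indent=2))
```

The printed JSON is a valid manifest format for the Kubernetes API. The operational difference discussed here is that pods in the sidecar namespace must be restarted to pick up the proxy, while the ambient label takes effect without touching the pods.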
14:17 You can just deploy ambient mesh. Hold on. There's a caveat here because ambient mesh is still experimental at this point. I wouldn't say just run it in production. But having said that, the goal and the mindset here is to prevent production outages or downtime as you're deploying services into your mesh. So, I mean, yeah, the, just to clarify, I guess, the, like, sidecar is never intrusive either. Like, you never need a change to the application code or even to the application, like, deployment YAML that you submit to the API server because the mutating webhook admission controller
14:18 Challenging Ambient's Deployment & Resource Claims
14:50 will go and alter it for you. So, yes, those iptables rules exist, but they don't affect, like, the container. They exist within the pod because a pod is a set of cgroups and a set of namespaces. iptables rules are scoped to a network namespace, so they exist in that namespace. And they live because they've been set up either by the Istio CNI or the init container, you know, they persist within that namespace. And, actually, when the application comes up, it has no idea that they're there. The other thing about, you know, rolling it
15:18 out, like retrofitting at runtime, like, yeah, you know, if you're running an application and you're in production, and you want to retrofit sidecars to something that's running, I would probably say as an ops person, that's maybe a bit of a bad move. You should, you know, try it in staging first. But, yeah, if it gets to the point in, you know, in production when you wanna roll this out, I guess I'd kinda challenge, you know, the idea that that's gonna be disruptive because you've got a
15:42 Kubernetes deployment. So if you just if you change the manifest, you just do a rolling update. It's gonna honor your your min unavailable, your max unavailable. It's gonna honor your pod disruption budget. Like, if you're in a position where you can't, you know, one by one in a controlled manner, restart these pods. I mean, remember Kubernetes builds in a surge to keep your, like, your availability, your your load carrying capacity. If you're in a position where you can't do that, then you can't do any disaster recovery. You can't release any new versions. You you can't do
16:08 any upgrades. So I think, you know, anybody who's deployed into Kube, anybody who's cloud native is not kind of staring at, like, that Windows box in the corner that they really can't touch. So, yeah, like, Ambient does let you sort of retrofit the networking under the carpet while it's running, but I've not come across anybody for whom that's necessary, anybody who can't just do a Kubernetes rolling update. Yeah. Lots of interesting details there. And I'm gonna do my best to recap all of this in, like, thirty seconds because it helps me understand. Right?
16:29 Recap and Waypoint Proxy Functionality
16:36 I I wanna make sure I understand what's actually happening. Although I did first, in fact, there's two things I wanna cover first. Let the record show I was not the first person to say Rust. Yeah. We're literally there. Just throw it out. That's true. I was waiting for to interrupt, like, the whole conversation. I was like, yes, even as I I wasn't first. Anyway and secondly, obviously, the comment about the Windows box that we can't touch. Like, it just made me happy, like, a flashback to, like, when we used to work at offices, and I used to do a lot of operational
17:05 stuff myself. It's like, there was always this IP address you could ping, but nobody knew what it was. And I always felt like, you know, every office had one of those at one point. Anyway, so recap. It's just so that I understand. Right? We have Istio. We have sidecar-based container mesh. It uses init containers to do all of the iptables rules that are needed to satisfy the constraints for the pods to be able to properly be part of the mesh network and take advantage of all the mesh stuff. There's a whole bunch of reasons of
17:35 why that may have to change. There's always the compute memory things that we can optimize. There's some security concerns, and there's just application availability things that disappear if we don't need to worry about it. Of course, now ambient mesh wants to change that model. And while you can run in hybrid, is what we've heard, I suspect most people would just want to move towards ambient mesh. And we said don't ship it to prod, but I'm already thinking of doing it today anyway. So, I mean, maybe a little bit too late for my cluster. But it runs
18:03 a ztunnel, which is a new proxy written in Rust that is, I'm assuming, just a DaemonSet, which is part of the Istio deployment. And instead of injecting sidecars with admission controllers or otherwise, it handles all the iptables rules as the pods come up and come down, etcetera, and we still have a complete mesh network. Or maybe the ztunnel doesn't handle iptables rules. I'm not entirely sure at this point, but it does things. There was also that second component that you mentioned, which was... The waypoint. Yeah. The waypoint. Yeah. Can we get, what does that
18:30 Waypoint Proxy and Layer 7 Enforcement
18:34 do? So the waypoint is the one that actually processes a lot of the layer seven heavy lifting. So for example, you specify an authorization policy normally in Istio that prevents services from doing certain things to other services, and we're talking about HTTP. This is where the waypoint proxy comes in because a sidecar used to do that for us. So the thing about that waypoint proxy is it has to run Envoy to be able to achieve this. I mean, I'm sure in the future, they might change up the implementation, but I can't confirm that because I
19:06 don't even know what the road map looks like. But I will say that anytime that you have that requirement to deploy layer seven, what ends up happening is you actually deploy this at this at the destination workload. So for example, you have workloads that exist on two nodes, and you have workload a that needs to communicate with workload b, but you need to do some sort of HTTP request, some methods, whatever it is. But you also wanna prevent service a from, you know, deleting any sort of information off of service b. So you would implement a
19:37 policy, normally a layer seven auth policy. Now who enforces that actually ends up being the waypoint proxy, which gets deployed at the destination node, for that destination workload, and this is on a per-workload basis. Now here's the other interesting thing: that waypoint proxy actually leverages the latest Gateway API standard. So the Gateway API standard is one that allows us to effectively bring in a lot more of a cleaner ingress into our cluster, plus be able to do things like TLS termination and even things like, we could even do the releases that we
20:17 would want to through this Gateway API spec where previously, it was a little bit harder to achieve because you need ingress controllers, ingress resources, and a lot of interesting math. That being said, with that waypoint proxy, it's deployed on an as-needed basis. If you don't need it, if you're not running any sort of layer seven services or don't have that requirement, you drop it because it is one of those resource intensive artifacts that get deployed to the cluster. And when you get to that level of scale where everything needs layer seven proxy or layer seven authorization or
20:48 something, this is where you're evaluating your design. Like, maybe we should probably be using a combination of sidecar and sidecarless for certain situations. This is gonna be a design decision versus, you know, let's just do it. Now one other bit about the waypoint is that it's actually in line of traffic with the ztunnels, and it actually terminates a ztunnel at the local node. So for example, service a talking to service b between two nodes, service a would traverse one ztunnel on its local node, colocated with it, and service b would receive that traffic. But first,
21:23 that traffic has to pass through the destination node ztunnel, which then tunnels to the waypoint proxy. The waypoint proxy is what releases that traffic and then basically passes it along to the actual workload. Yeah. So the waypoint actually is Envoy. Right? So currently, in a sidecar model, all the sidecars are Envoys. With Ambient, you've got a ztunnel, which does layer four stuff, one per node, and then, yeah, there's a waypoint, per service account, right, for isolation. So there's a waypoint per service account, so kind of per
21:40 Security Trade-offs and Shared Components
21:58 workload, that you jump through. I guess the thing is you kinda always have to jump through it because, you know, okay. If you don't do layer seven policy, then fine. But this is an identity based mesh. Right? If you're only doing, you know, like, layer four policy, the reason you're running a service mesh is probably because you're finding that insufficient. And, also, if you want anything that, like, interacts with HTTP, if you want any observability of particular API endpoints, you wanna even understand what's a request response, what's a success
22:31 and a failure. If you want any kind of distributed tracing, you've gotta be involved with HTTP, parsing HTTP. So realistically, while the model is theoretically good, like, realistically, basically, everything is gonna go through a waypoint. So, like, those latency saving claims, those kind of resource saving claims maybe don't apply for real world workloads because you, you know, you kinda always want them. And then you get into the issue that the waypoints are shared. So all of the security guarantees around, you know, having a dedicated, like, firewall, a dedicated policy enforcement point agent per workload
23:05 go away. Like, it's not even really, it's not zero trust anymore. Right? Because a party is trusting something. It's trusting a shared ztunnel, which is actually shared across tenants because they're per node. And that thing is holding the keys and the certificates for, like, all of the workloads it represents. And they're also trusting a shared waypoint. And Envoy, remember, was designed as a sidecar proxy, so it, like, makes no attempt to mitigate noisy neighbors. It has no resource management, you know, of its own. It, again, represents a confused deputy, like a place for lateral movement. So
23:35 I think that's the big trade off that that I talk to folks about is you can get some some theoretical resource, you know, wins on on ambient. But, again, I, like, would kinda challenge that with this sort of sort of kernel internals to think about. Modern systems are quite complicated. But the trade off is that security posture. I just was going to say real quick, for people who are listening and are very, very confused by the references to layer seven, Just to note, we'll put a link in the podcast description about what the OSI model
24:09 is and what all these layers are. So don't worry, you can read more about that. I don't want to go deep down into some very, very basic stuff. But this is at the very, very top of your networking model on the application layer we're talking with layer seven. I don't know how far down the layers they go with sidecar and ambient mesh. So if you want to just I I I think I heard, like, there were some stuff at layer four and layer three, but just to double check. I mean, there are all the options. Right? The resource
24:29 Clarifying Networking Layers (OSI Model)
24:39 what? Layer three and four, pretty much, and layer seven. Nobody else even knows what the rest is. Right? Yeah. None of the others. Oh, well, depends on who you're talking to and what their history is. But regardless, just so that you know, we will put a link in the descriptions. Alright. Go ahead. Yeah. So one thing I was just going to add, and I agree with Matt. Like, there are definitely trade offs when considering something like ambient mesh. Right? And it really will come down to your own personal requirements or your business requirements, what
25:00 Trade Offs
25:07 you need, if you have the suitable amount of resources, if you don't need layer seven authorization. But the one thing I wanted to clarify is the element of identity. So even in that shared service model, there is tenancy that exists both at the ztunnel as well as the waypoint level. And this is derived through the whole, you know, Kubernetes service account token generation and how ztunnel and waypoints will assume the identity of those given workloads for that given traffic path. Even if, for example, you have that shared service model, the idea behind what ztunnel and
25:42 waypoint are offering up is that, that slice of that part of that laneway. So it still maintains separation and isolation, but here's the other consideration. Right? Like, this has come up a lot. You know, what happens if your node gets compromised? You know, now I can have access to ztunnel. Now I can, you know, impersonate ztunnel and then direct traffic elsewhere to another environment that you wouldn't even know of. If you're at that point, you haven't spent the time to lock down your Kubernetes environment, and you've just basically given access. Root access is probably the most dangerous thing
26:15 you can have inside of your cluster. So in those sorts of, you know, situations, it's not about service mesh at that point. It's about how you've built security practices and posture into how you run Kubernetes. So in the situation where you have workloads that exist within a node that are communicating, the fascinating part about it is they're never gonna pass through a ztunnel. Right? They're gonna communicate directly on the wire through the CNI. And even though they're a part of the mesh, I mean, there's no mTLS going on. There's plain text traffic going on.
26:49 But that's not the issue. The issue is, is your cluster secure? Have you prevented unauthorized access? Do you have the right controls in place? Are you using certificates? Who has access to your kubeconfig? In fact, almost no one should have access to that if you're using principles of GitOps at that point. So there are a lot of other considerations that fall well outside of what ambient mesh can control. Like, it's not gonna save you from your infrastructure challenges. It's gonna solve parts of the network challenges that you're trying to solve with service-to-service communication.
27:22 Yeah. No. I agree. Like Marino says, if you're at the point where, you know, your node, your host has been popped, then they'd be in the sidecar the same way they're in the waypoint and the ztunnel. I mean, there's different ways of popping things. Right? If they've got something like root access, like access to kind of the kernel, then all bets are off. I guess the thing about the ztunnel and the fact that, you know, node-local communication does bypass it, there's no mTLS. Like, if you get compromised in the sense
27:23 Node-Local Traffic and Threat Modeling
27:47 that somebody can break out of their container, get into the root namespace, like, maybe they don't have the root user, but they're in the root namespace or they've got privileges to it because you ran a privileged container or you ran in host namespace or something. Then, you know, sidecars will give you mutual TLS on the wire even locally on the node, whereas, yeah, the ztunnel doesn't. It gets bypassed. So there's, like, different ways of it. It's a complicated thing to threat model. And I
28:13 anyway, I feel like it gets more complicated to threat model with Ambient because the topology gets more complicated. You've got these two moving parts, and it's more difficult to reason about. But, yeah, absolutely. Like Marino says, there's so many other controls you should be putting in place. We maybe are getting into the weeds a little bit. Well, unless you're regulated, and the regulator, like, requires that you have TLS on the wire. It's the same with the resources. Like, you
28:40 know, I'd probably wanna challenge a bunch of the, like, numbers that have come out around resource usage. I think, actually, the kernel, like, helps you. I think sidecars can actually be equivalent to ambient mesh in almost all cases. But by the time you're looking at that, have you actually got the requests for your application right? Have you rewritten the Ruby in Rust? Have you sorted out your HPA and your cluster autoscaling? Are you using spot instances correctly? You know, what's the stat? Like, the average EC2 instance is 3% utilized.
29:09 There's so much more low-hanging fruit than looking at that resource question at all. You know, if you get there, we can talk about it. But, yeah. Yeah. Absolutely. Yeah. I agree with Marino that it's probably not, excuse me, the low-hanging fruit. Yeah. I don't know if I misheard something there or maybe it's something I don't know. Someone mentioned node-local doesn't go through the ztunnel. Is that one sentence correct? Yeah. So if you have services that exist within the same node, right, pod-to-pod communication won't pass through the ztunnel.
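The answer given here, that same-node pod-to-pod traffic skips the ztunnel, has the same shape as two hosts on one subnet never crossing the firewall in front of that subnet. A minimal Python sketch of that forwarding decision follows. The function name `crosses_firewall` and the example addresses are illustrative assumptions, and the code models only the classic subnet case, not ambient's actual datapath.

```python
import ipaddress


def crosses_firewall(src: str, dst: str, subnet: str) -> bool:
    """Return True when traffic from src to dst has to leave `subnet`
    and therefore traverse the firewall/gateway at its edge.

    Hypothetical helper for illustration only.
    """
    net = ipaddress.ip_network(subnet)
    src_inside = ipaddress.ip_address(src) in net
    dst_inside = ipaddress.ip_address(dst) in net
    # Two hosts inside the same subnet reach each other directly on the
    # local segment, so the firewall never sees that traffic.
    return not (src_inside and dst_inside)


if __name__ == "__main__":
    # Neighbours behind the same firewall: no firewall hop.
    print(crosses_firewall("10.0.1.5", "10.0.1.9", "10.0.1.0/24"))
    # Destination in another subnet: traffic goes via the firewall.
    print(crosses_firewall("10.0.1.5", "10.0.2.9", "10.0.1.0/24"))
```

Shrinking the subnet in this model sends more pairs of hosts through the firewall, which is the sizing trade-off that comes up later in the conversation.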
29:41 So think of a traditional firewall and you have a subnet behind that firewall. Do two computers ever pass through that firewall if they're gonna talk directly to each other on the network? No. And that's the same situation as to how it works inside of ambient mesh with resources that exist on the same node. Now there are ways to work around that too, but quite honestly, you know, this is where we're starting to get complex and we start to realize maybe we should start using sidecars again. So this is why, you know, it's situational. It's not gonna be either or. It's
30:15 gonna be and, you know, bits here, bits there, pieces all over, and now you have this, like, kludge of a mess, a messy mesh at this point. Right. Yeah. I mean, so, yeah, sidecars, you know, were a reaction, I guess. The idea probably came out of Google originally, right, with envelope and whatever, client-side load balancing. They were a reaction, like, firstly, to the middle proxies, which nobody, I think, has had a good experience with. Right? So you do your client-side load balancing instead in the same way that, like, a, you know, a Hystrix library or a,
30:43 like, gRPC client does. And also a reaction, like Marino says, to those kind of firewall setups, right, where if you're in a trusted subnet, then, you know, any workload in that subnet just talks to its neighbor and doesn't go through the firewall. And that's not sufficient in a bunch of cases. So that's where the whole zero trust thing came from. And then, see, there's non, you know, non cloud native, non mesh implementations of this, like Cisco ACI will sell you that kind of dream as well. But the idea is
31:07 taking that boundary and making it real small. You know, there's still a gateway. There's still a boundary, and there's still a border device. You know, with sidecars, with a service mesh, the boundary is the network namespace. The boundary is the one pod. It's the smallest it can be, and the border device is the Envoy. And that, I guess, with ztunnel, that's what you're kinda giving up, because, you know, yeah, node-local, you don't have any encryption at all. And even if it's between nodes,
31:33 then you're plain text from, you know, the application to the ztunnel. So you're leaving that network namespace. You're leaving that isolated network and those guarantees. So, yeah, you gotta understand how it works. It's something to consider when you're looking at the trade-offs. I'm smiling because you basically brought Cisco into the chat. And it's funny because And it couldn't it's really fun. No. No. It's great that you brought this up because back when I was working at VMware under the networking security business unit, we competed heavily against Cisco and ACI. Right?
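The boundary argument in this exchange (with sidecars, plaintext stays inside the pod's network namespace; with ztunnel, plaintext leaves it, and same-node traffic is not encrypted at all) can be written down as a toy model. The sketch below encodes only the claims as stated by the speakers. The `Hop` type, the hop names, and both functions are hypothetical, and the behaviour of any particular Istio release may differ, so treat it as a threat-modelling aid rather than a description of the implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hop:
    src: str
    dst: str
    encrypted: bool         # mTLS on this hop?
    leaves_pod_netns: bool  # does the traffic leave the pod's network namespace?


def path(mode: str, same_node: bool) -> List[Hop]:
    """Hops between two meshed pods, as characterised in the conversation."""
    if mode == "sidecar":
        # The sidecar lives in the pod's own network namespace, so the
        # plaintext hops never leave the pod; the wire hop is mTLS even
        # when both pods share a node.
        return [
            Hop("app-a", "sidecar-a", encrypted=False, leaves_pod_netns=False),
            Hop("sidecar-a", "sidecar-b", encrypted=True, leaves_pod_netns=True),
            Hop("sidecar-b", "app-b", encrypted=False, leaves_pod_netns=False),
        ]
    if mode == "ambient":
        if same_node:
            # Speakers' claim: node-local traffic bypasses the ztunnel.
            return [Hop("app-a", "app-b", encrypted=False, leaves_pod_netns=True)]
        return [
            Hop("app-a", "ztunnel-a", encrypted=False, leaves_pod_netns=True),
            Hop("ztunnel-a", "ztunnel-b", encrypted=True, leaves_pod_netns=True),
            Hop("ztunnel-b", "app-b", encrypted=False, leaves_pod_netns=True),
        ]
    raise ValueError(f"unknown mode: {mode!r}")


def plaintext_outside_pod(mode: str, same_node: bool) -> List[Hop]:
    """Hops where unencrypted traffic is visible outside the pod boundary."""
    return [h for h in path(mode, same_node)
            if not h.encrypted and h.leaves_pod_netns]


if __name__ == "__main__":
    for mode in ("sidecar", "ambient"):
        for same_node in (True, False):
            exposed = plaintext_outside_pod(mode, same_node)
            print(mode, "same_node" if same_node else "cross_node",
                  [(h.src, h.dst) for h in exposed])
```

Listing the exposed hops per mode makes it easier to check the model against a regulator's "TLS on the wire" requirement or against a specific node-compromise scenario.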
31:48 Historical Parallels: Service Mesh vs Traditional Networking
32:07 But they were talking about the exact same things that we're talking about today. Like, nothing, I'll be honest, nothing has changed in networking other than the fact that we're dealing with smaller artifacts. Nothing has changed. You go look at a networking namespace. What is it? How does it get constructed? We're just building, quote, unquote, a VLAN at this point, attaching endpoints to it, creating, you know, connections to the outside world, and there you go. Now your pods have access to the outside world or other pods. So nothing we've done is different, and it's interesting because I know Cisco
32:39 and what they're up to right now, and they're breaking into the whole cloud native space. They have, like, their own CNI. They have an Istio offering as well or a service mesh offering. But they haven't adopted their own mindset and methodology for those kinds of systems that they're developing for cloud native, which is funny because they're the ones that championed a lot of these architectures that we see today. We're just, like, repurposing them in cloud native. So it's interesting you brought that up. And, you know, quite honestly, like, when I think of companies like Cilium, oh,
33:13 sorry, Isovalent and what they're doing with Cilium and how they're bringing service mesh into their own offering, I feel, like, nostalgia, and I feel like deja vu in terms of, like, going back in time with networking all over again. Yes. Same. Yeah. The reason I brought up ACI is because I couldn't actually think of anything else that offered that. I may well be completely wrong, but I couldn't NSX. VMware NSX. Yeah. Of course. That's why I worked at Cisco. Right? So I guess yeah. That's why I thought of ACI. But, yeah, that idea of, like, of east
33:40 west controls without having to, like, hairpin through the firewall. And it was always a choice, like Marino says. It was always a choice back in the day. We're having the same, you know, the same discussion, like, how small do I make my broadcast domains, my VLANs? How small do I make those subnets? Because the smaller I make them, the more traffic goes through the firewall, but then I need to pay for more firewalls and the more administrative overhead I get. So it's the same kind of trade-off. But, like, I think the difference is with an
34:01 ACI or NSX or a sidecar-based service mesh, you can get to that, like, you know, minimal one workload, you know, one Unix process, one, you know, network domain, one subnet. And you can get to that proper zero trust. You know, ztunnel is trading that off for other things. But, yeah, same conversations we've been having, what, ten, twenty years. Even before that, even before the NSX stuff, like, folks were just buying a lot of firewalls with a lot of ports and making real small subnets. Right? Alright. So
34:32 eBPF, Cilium, and Istio's Approach
34:32 I love that you also mentioned Isovalent and Cilium. Right? Their CNI implementation is purely driven with eBPF and the iptables. Like, I guess the question just floating around these are right now. It's like, why is ztunnel in a gateway? Why is eBPF not the choice here? So I will say that from an innovation standpoint, I don't know how much is going on in the Istio world with eBPF. I know that there's some PRs that are open to use eBPF as an optimization for redirecting traffic. At Solo, we're doing it. So
35:04 if you decide, hey. You know, I'm done using open source. I need something a little bit more elaborate, and I wanna take advantage of eBPF optimizations. Well, there is an offering through Solo, through their Gloo platform. But having said that, like, even when you look at Cilium and Isovalent, they basically are the ones that pioneered sidecarless through their eBPF offering. Now having said that, right, when you're looking for completeness of a service mesh, maybe they don't have it all there. It's not already. They'll give you parts of it. And if you're already deeply ingrained in using Cilium CNI,
35:40 I mean, it doesn't hurt to turn on the service mesh functionality and test it out. But then when you're trying to leverage a lot of the capabilities, fault tolerance, resiliency that is built into what sidecars offer, what even sidecarless offers, that's gonna be a bit of a challenge working inside of, you know, look. I'm gonna say this, but this is just my opinion. A switch that does multilevel or multilayer networking versus a different control plane that handles specific elements of, you know, what a service mesh does. So the way I look at it is, okay.
36:15 It's nice that you have this single control plane that does it all, but then where do you compromise versus taking the other approach where you use a few control planes that work together to provide you that full stack network, and then you use something like GitOps to control your network or deploy and manage and scale your network. Yeah. No. Yeah. Yeah. I think I'd agree with that. Cilium's an interesting one. eBPF is great. Right? It's a great technology. It's got a lot of hype. It's not just the networking stuff. Right? So you have
36:42 Falco, for example, uses BPF to, like, hook a bunch of, you know, instrumentation points and do essentially EDR. But, yeah, like, BPF programs can call this XDP library, eXpress Data Path, and set up fast networking in the kernel. So Istio, yeah, Istio does use it. I think it's an option. It's maybe default now. Like, don't quote me on that, Marino. You might know. But for, yeah, for the acceleration. When I was saying iptables before, I mean, like, the interception of traffic, you know, comes into a pod, comes into a network namespace. It's gotta go through the sidecar back out again.
37:11 Can be iptables. It can also actually be BPF, you know, plus XDP, which is usually quicker. I think that might even be the default now. BPF also gets you, folks get confused, I think, because the same tech is used in a couple of places, but, yeah, Cilium uses it to implement, like, the CNI layer, to build that overlay network. It's when you're outside of the pod, when you're outside of the network namespace, to build that overlay network. You know, Weave Net sets up tunnels. Flannel relies on you setting your own routes in the host. There's various ways to do
37:40 it. Cilium uses BPF and gets a bunch of advantages from that, but they are kinda separate. I think to your original question, David, of, like, if BPF is so powerful and we have it sitting there implementing CNIs, like, why do we need a ztunnel? One could imagine that a sophisticated enough networking layer could do that. One could maybe imagine a cloud provider who had a sophisticated, you know, VPC technology might actually be able to replace all the ztunnel layer four functionality with their own SDN. I couldn't possibly comment, but, like, you can
38:16 see how that technology would plug together. Right? And then, absolutely, that's great because, you know, nothing wrong with that. So, yeah, I think that's gonna be an interesting development to watch. Yeah. So I just wanted to add to that. So there's a company called Merbridge that actually developed the eBPF solution for Istio. And I think the Istio community has decided to adopt that approach altogether. I don't know when it's gonna show up. I think there is a PR open for it right now. But that being said, like, it's an
38:30 Istio Graduation and Ecosystem Impact
38:47 option. If you are comfortable and are willing to use eBPF for sidecar optimizations, then absolutely go for it. But, you know, the other side to that is if you decide, hey. I wanna get fancy with eBPF programming, that might not be the best place to do it because it's pretty static in some ways, right, in terms of that whole configuration stack. The other side to it too is, you know, when you start off deploying a service mesh, you don't wanna just throw everything at your applications. Start small. Right? Maybe you realize that
39:22 you don't need eBPF to optimize because you don't have a lot of that cross traffic going on. Your latency is not as high as you think it is. So it will come down to a decision point as well. I mean, there's the other side to it too around stability. Are you going to take something that's been experimental, run that in production? Probably not. Right? Which is why ambient mesh is still experimental right now. But having said that, I'll plug something. So Istio has officially graduated. In fact, you should see the announcement very shortly, which means there's gonna be a lot more
39:50 Istio Graduation!
39:52 traction around the consumption of Istio. You're gonna see a lot more vendors pop in and say, hey. Maybe we want to contribute more to the ambient side because it fits our model very well. So that's a super exciting thing for us. I mean, the community has poured in a lot of work to make this possible. You see how service mesh and networking in the cloud native space have become prioritized for your workloads. It's become important much more so than it was four years ago. And it's only because of the manipulations you can do inside of both a mesh and
40:24 a CNI that you weren't able to do before. So it provides a lot of design flexibility in terms of whether or not you deploy ambient, whether you decide you're gonna use just the ingress gateway functionality, or maybe I just wanna turn on sidecars because I want that observability. I wanna be able to do distributed tracing very effectively and see all my different service requests and the paths and see, you know, how these services are tied together and where things are failing. Right? So it'll come down to the use cases. There is no one size fits all
40:54 model. There is no one solution or silver bullet that'll do it all. No. Absolutely. That's why I'm excited about Ambient. Right? It's, you know, it's more choice. Choice is always good. And how long has the Istio project been out? I was trying to think. Six years or something? Six years. Yeah. Six years. So and it's graduated. Right? It's in 1.x, 1.18 now. It's pretty mature, but that's still, like, this is a real change. There's real innovation. There's real dynamism. So, yeah, super excited to see that happening. Super excited to see the choice.
41:01 More Choice, Real-World Testing, and Standards
41:20 But like Marino says, you know, it's horses for courses. You gotta understand what, you know, what you're turning on. But I would say test it with your own, you know, workloads. Actually, look at your own numbers. You know, do your own threat models. Look at your own regulatory environment. Yeah. David, when you said you were gonna just push it to prod, like, it is alpha still. It is still you know? I mean, I missed the journalist episode. So Well, depends how long you take to edit it. Yeah. But I do feel like there's some obligatory xkcd
41:50 that we need to put in here, though. Like, I mean, I have to admit. Like, I listen to all this, and sometimes I just kinda think, we have 16 standards. We should have one standard that unifies everything. Couple months later, we have 17 standards. Yeah. So that's all I kinda think about sometimes when I hear, well, we're using this and that and the other. But I can tell you. I mean, Gateway API is that same. I mean, let's be honest. Like you said, networking hasn't really changed. We just changed the scale of it. So maybe there's an xkcd
42:05 Evolution of Cloud Native Networking
42:22 in there as well. Who knows? But I think, like Marino says, like, folks are, I personally feel like we're getting more mature. The actual cloud native thing. Right? Like, folks were going into cloud, and, you know, I was at various consultancies and whatever, and we were always saying, like, lifting and shifting, like, being in the cloud is not enough. You've gotta be cloud native. And I think the sort of lift and shift or the, like, the naive cloud architectures happened for a lot longer than I maybe
42:48 thought they would. And folks are now realizing that they've got all their microservices and all of this, and it's great. And they've built a distributed system, you know, what used to be, like, one monolith with dependency injection and, you know, an ORM. We got quite good at building monoliths, right, really, with, like, interface-driven design and stuff. And, like, if namespace a calls namespace b, it can't fail. You're just putting a program counter on the stack and calling jump. Right? Like, basically, instant, never fails. You've now built a distributed system. You've probably got distributed transactions. You've got
43:18 failure modes you never thought of. And the network is, yeah, is the way to fix that. Right? So I think we're finally becoming, like, actually cloud native in this, exactly like Marino says. We gotta realize we need to leverage, there's a whole bunch of stuff. We've gotta get that observability back that we had by just attaching a debugger to our massive process. We've gotta get the control back that we had with dependency injection systems and the sort of, you know, different routing they could do in test builds. And, like, a service mesh is,
43:45 I think, the way to do that. And we're yeah. So networking and meshes are coming to the forefront as people realize that they absolutely need this stuff, and you can't build a system properly without it. So it's yeah. So as somebody who's always been into networking stuff, it's fun to see. Alright. We're very, very close to the end of the ever. So I'll now give you both the opportunity just to share a plug, and if anything you want, feel free to take, you know, couple of minutes, whatever you need to share your Twitter, your website, your company, your products,
44:00 Guest Plugs
44:14 your OnlyFans. Go for it. Have fun. Marino, why don't you take it away first? Yeah. So if everyone can see the video, I've thrown my Twitter handle on there, so you're welcome to kinda see some of the shitposts that I put out there about networking and perspectives and even service mesh. But one thing that I noticed last year was that there was a lack of understanding of how networking worked inside of Kubernetes. And I, you know, set out to build a workshop, built the workshop, delivered it a few times, and it's become a thing. So if
44:45 you really wanna understand what's going on in the OSI model, I've built a workshop called Network Foundations. It's accessible at academy.solo.io. Head over there, register, and then you can take the workshop in two parts. You get the first bit that covers, you know, IP addressing, subnetting, understanding routing, DNS, and even working with a little bit of HTTP. And then the second part focuses in on why we use proxies, how to set them up so that you could start doing things like load balancing or filtration of packets or traffic all the way to
45:17 understanding how networking namespaces are built and why you wouldn't do this today. Instead, you'd use a CNI because it takes care of all the IP address management and the onboarding and offboarding of pods. And then we end at Kubernetes networking to understand how pods communicate on a network in Kubernetes and how we expose them to the outside world. So go check it out. academy.solo.io, Network Foundations. Alright. Matt, what do you have? Do you want me to screen share the OnlyFans? Or There we go. Let's do it. In this visual medium of a podcast, it's gonna be
45:49 great. Yeah. Maybe I won't get It's on YouTube too. It's fine. But if I also put the link to the course on the Twitter and it show up as well You can get fans back. But go for it, Matt. Yeah. Cool. Yeah. So I'm Matt Turner. If you don't wanna follow my shitposts, fairly similar flavor, I guess. I'm @mt165 on Twitter, and then there's links from there to the other nascent socials that are coming along. Yeah. I've done a, you know, fair
46:16 amount of talks. I've got a website, mt165.co.uk. That's got links to the videos of all of those. I talk about networking and service mesh and Kubernetes and stuff. I work at Tetrate, so we have, like, a management plane. We didn't get into it, but we've got a management plane over various Istio control planes. So all the stuff we were talking about, you know, which control plane do you use? Where do I configure layer four versus, you know, layer seven? How do I use Istio CNI policies, sorry, Kubernetes CNI, like, network policies in addition
46:45 to Istio features? You know, the management plane kinda transparently takes care of all of that. And we're also involved with a bunch of the open source projects. So we're big into the new Gateway API stuff, which I think is super exciting for everybody. And, hopefully, you know, it's the seventeenth standard that will actually replace them all because it is looking really good. Istio supports it. You know, lots of other stuff is supporting it. So that's exciting. And we're the main contributors to the reference implementation of that, which is, like, Envoy Gateway. Like, real
47:13 real simple ingress gateway based on Envoy that talks Gateway API. Like, it should hopefully become the de facto standard there. And we do Wasm stuff as well. So if you wanna plug Wasm into Go, if you wanna build Go into Wasm, you know, all of those toolchains are other things that we're working on. So, yeah, we're helping out the ecosystem in a bunch of different places there. Cool. David, considering that I just heard your keyword, Wasm. Yep. Wasm. Here we go. It's your turn. No. You plug in. No. I'm just saying. I know. I'm just
47:33 Warning about Business Logic in the Network
47:50 having fun teasing about Wasm. Yeah. I could just run an episode where I talk about a cool amplifier. Hey. You did really, really good. You didn't, like, get into it the whole time. It was good. Say something about it, but I did think that's a bit of a rabbit hole. But, yeah, Marino's probably right about the network becoming, like, powerful and a big lever, but please don't put business logic there. Like, we've done this before with the ESBs. We've done it before with, like, Lua scripts in your NGINX reverse proxy. Wasm gives you this amazingly powerful tool and
48:16 a great big gun to shoot yourself in the foot. Like, please don't use BPF or Wasm to, like, hook the network with business logic. That's all I'm gonna say on that. I mean, like, if we don't learn the history Right. What are we doomed to do? So Need a bunch of grumpy people on a call like this to, yeah, tell you how bad it was. Yeah. Exactly. I mean, they tried that without WebAssembly with cloud native network functions, like, where every network thing was supposed to be a container. I mean, that was the thing
48:42 for, like, two minutes. Right? There's many ways to do it. There's also people who trudged uphill in the snow both ways. You know, we can argue about that. That'd be fun. But thanks for joining us. If you wanna keep up with us, consider subscribing to the podcast on your favorite podcasting app or even go to cloudnativecompass.fm. And if you want us to talk with someone specific or cover a specific topic, reach out to us on any social media platform. Until next time when exploring the cloud native landscape on three. On 3. 1, 2, 3.
48:51 Conclusion & Outro
49:16 Don't forget your compass. Forget your compass.