From Kubernetes to Cloud Run: Chainguard's Journey

Overview

About this video

What You'll Learn

Identify when mostly stateless, spiky-traffic workloads are better moved from Kubernetes or Knative to Cloud Run.
Use Knative-compatible container patterns to move services while removing node and cluster management.
Implement least-privilege IAM, R2 blob storage, and BigQuery event logging for secure image-serving operations.

Jason Hall walks through Chainguard's migration of its image-serving infrastructure from Kubernetes and Knative to Google Cloud Run, covering multi-region Terraform modules, least-privilege IAM, R2 for blob storage, and BigQuery-backed event logging.

Transcript

Generated from the English captions.

0:00 Introduction

0:00 Welcome to Cloud Native Compass, a podcast to help you navigate the vast landscape of the cloud native ecosystem. We're your hosts. I'm David Flanagan, a technology magpie that can't stop playing with new shiny things. I'm Laura Santamaria, a forever learner who is constantly breaking production. So, apparently, Cloud Run is better than BigQuery. That is according to principal engineer at Chainguard, Jason Hall. In this episode, we get to chat with Jason about Chainguard's migration from Kubernetes and Knative to Cloud Run for a pillar of their infrastructure. And in this episode, we didn't even mention Rust

0:38 until we did because Laura brought it up. Well, you know, I had to do it. Enjoy this episode, and try to spot the deepfake, Jason. Welcome to the episode. We have a wonderful guest with us. Jason, would you like to introduce yourself? Yeah. Hi. I'm Jason Hall. I work at Chainguard. I'm an engineer. I do all kinds of stuff, generally back end related. Also, a lot of image building stuff. That's kind of what we do there. Yeah. I mean the place. I've known you for all of five minutes, and, clearly, you're an

0:51 Guest Introduction & Container Background

1:16 awesome person. So we're very excited to have you. Yeah. Five minutes. I can learn about some of you in five minutes? Yeah. I've known him for longer, so I know better. Right? Yeah. Yeah. Seven minutes. No. We've followed each other on socials for a long time, and I've always loved the work that you put out. And what prompted this conversation today is an article that you published to the Chainguard blog about a migration, some cloud adoption, containers, and Cloud Run. And we're gonna touch on all of these different

1:50 aspects. But I love your simple introduction. I'm an engineer that does stuff, back end stuff. I mean... Generally. They don't let me touch the front end. Oh, okay. I know that feeling. Usually, I touch prod, everything breaks on purpose. Yeah. It's what I tell everybody. You know? I did some Tailwind last week. Like, does that make me full stack? Yes. Yes, it does. Alright. Awesome. So I think just to provide a little bit of context before we dive into Kubernetes, containers, and Cloud Run: you have a career

2:26 behind you that required, or I don't know if it required, you had some time at Google along with other Chainguard engineers. Right? So the context here is that you've been working with containers in Kubernetes for a while. Is that a safe assumption? Yeah. While I was at Google, I worked on the Google Cloud Build product. I started that team with a couple other folks. So that was sort of my introduction to containers in the first place: how to build them as a service. I learned a lot. I made a lot

2:57 of mistakes as everyone does, and they hired me at Chainguard anyway. So a lot of folks from Google came over to Chainguard, and our focus there is also building containers, you know, as well as possible, as quickly as possible, serving them as well as possible. A bunch of folks were from the Google Container Registry team also. So, yeah, a bunch of folks there dealing with both building containers and running containers to serve them. Yeah. I mean, the reason I bring this up, right, and I don't wanna condense your article into one sentence, but I'm

3:03 Why Migrate from Kubernetes?

3:32 gonna do it anyway, is that Chainguard couldn't be arsed with Kubernetes anymore. I mean, that's what it boils down to. Right? I might temper that a little bit, but sure. Right? Like, we found that we were not taking full advantage of all of the features that Kubernetes provides such that it was worth the maintenance and upkeep and, you know, general stuff. You know, like, if we were using every feature of Kubernetes, it would have been worth it to stay there. We ended up mostly running

4:08 a bunch of stateless services talking to, you know, Cloud SQL and R2 for blob storage and a bunch of, like, off-cluster services. So we were basically just running a bunch of stateless services inside of Kubernetes, which can be nice, but we weren't, you know, finding it to be useful for all that Kubernetes brought to us. So... Yeah. Yeah. I mean, and you were also saying that you had a lot of spiky traffic, and you were running Knative, really. Right? So... Yeah. Yeah. Yeah. So the lineage of the engineering team is also

4:32 Knative and Handling Spiky Traffic

4:42 a lot of, you know, Matt and some other folks started Knative with many other folks, not just them. And so they were deeply, you know, involved in it and knew how it worked and knew where it worked well and where it didn't work well. And Knative wasn't really the problem. I think that, you know, generally, Knative is sort of, like, the nice deployment API. It's just a good deployment API. I mean, not just, but so Knative was never the problem. And Knative

5:11 Cloud Run for Stateless Services

5:19 actually made it easier for us to move to Cloud Run because they had the sort of same API and container contract and all of that. So, yeah, Knative was never really the problem, and Kubernetes wasn't really the problem. GKE wasn't really the problem. It was that we were not using all the stuff that Kubernetes would let us do. Mhmm. We were just running a bunch of stateless services. Yeah. And there's an easier way to do that, it turns out. Right. And I also saw in the article, as David eventually hopefully will return and rejoin

5:55 us. I also saw in the article, though, that you all were basically, instead of having spot instances, having to maintain servers running and maintain nodes running and things like that because you had a lot of spiky traffic, but you couldn't... Yeah. Yeah. Yeah. So you kinda were stuck with all this infrastructure that you weren't really using unless you had a major spike in something, and that really doesn't make much sense if you're doing a lot of stateless systems. So... Yeah. So our traffic is very spiky. And like I mentioned in

6:31 the article, a lot of that traffic just comes from us. Like, our main thing is we're just trying to build images as quickly as possible. Right? But in doing that, we also have to push them to the registry and then pull them again to scan them and pull them again to test them. And so when a new image is built, there's, like, a little mini flurry of activity around it on the back end, on the infrastructure. Multiply that by an ever increasing number of images, and, basically, we'd had, like, you know, over

7:02 time, these spikes got bigger. In order to absorb those spikes, we, you know, did the normal thing you do, which is to keep a warm pool of nodes and instances ready... Right. ...to absorb them, which, you know, you can turn the dial between, like, cost and reliability. We turned it towards cost and reliability. Or sorry. Raising cost and raising reliability. But we knew that if that trend was gonna keep going, we would need to figure out a cheaper way to do that sort of thing. Yeah. And,

7:39 again, we had a lot of folks that already, you know, knew Cloud Run, worked on Cloud Run, worked near Cloud Run. And so, from the start, we sort of thought that was a good idea to migrate toward. I don't know if you noticed, but I had to rejoin. My Linux crashed as it always fucking does. So if my question now seems to take us on a weird tangent, it's not intentional. It's just because I missed about forty, forty-five... No. No. I was gonna put my serious face on for a moment. Right?

8:11 Because I was very dismissive of Kubernetes at the start, saying that you couldn't be arsed, but, you know, something I feel very guilty of is I go to conferences and meet people and say, you know, Kubernetes gives you all this stuff for free, but there is a very real tax to running and operating a Kubernetes cluster. And I feel like that is meaning... yeah. If you're not using the whole thing, like you were saying, you know, sometimes that tax isn't worth the burden. And I think that's

8:37 an important lesson for people to walk away with is, like, you know, know what you're getting into. Even the people that have all of this experience in containers and Kubernetes sometimes just don't want to pay that tax for what they need. I don't know if it was mentioned, but Cloud Run is, you know, a fantastic product. I run a bunch of things on it myself, and then, you know, we can dive into that in more detail. But I think similarly, like, Kubernetes also gives you a lot of knobs to tune and optimize things,

9:05 which is great when you are at, you know, a level of scale that you need to tune and optimize those things and have the expertise and the sort of, you know, interest and time to prioritize that. We didn't take advantage of most of those knobs, and in fact, most of those knobs were, like, hard for us to, like, know how to use correctly, I think, or at least me. So we didn't really use a lot of them. We wanted a simpler just-run-the-stateless-service, scale-it-up-as-you-see-fit type of thing, and

9:36 that was exactly what Cloud Run was for us. Yeah. So I am curious. Right? Obviously, we're talking about the migration from your blog, and we'll put the link to this in the show notes for anyone who's listening, in the description for anyone watching on YouTube. But this is for your Chainguard public registry. Right? This is serving infrastructure that delivers images to people. Do you have other infrastructure that is still running Kubernetes, or is, you know, this, like, your main operational project? Yeah. I'd say there are, roughly speaking, as of right now, three

9:42 Infrastructure Pillars & Migration Scope

10:12 infrastructure-y type things we run. One is serving images and everything that it takes to serve images: auth, data store, a lot of, like, eventing and pub/sub, internal stuff, to do stuff when stuff happens. That's the first pillar, sort of, I'll call it serving, and that's what this blog post was about and our migration to Cloud Run was about. We also have a bundle of infrastructure for building packages. At Chainguard, we build every package from source ourselves. And so when a new, you know, Python release drops,

10:52 we build that Python release and build an APK that ends up in an image. That infrastructure does still run on Kubernetes, and I think we get a lot of benefit from Kubernetes. We use all those knobs. We use all of the features that are in there. So that second one, package builds, is still in Kubernetes. The third one is image builds. So when we build an image from these packages, you know, the Python image contains both the Python package and, you know, CA certs and libcrypto and things like that. We take

11:28 these distinct packages and put them into an image and push them. That infrastructure is mostly, at some part, just HTTP and tar. Like, we just fetch APKs, stitch them together into images, and then push them, and then test them and then scan them and do a bunch of stuff downstream. So that's the three of them, sort of. And then when it pushes it, it pushes it to the serving infrastructure. So all three are sort of living together next to each other, dependent on each other. But, yeah, the package build infrastructure still uses Kubernetes. It's very reliant on it. It

12:03 takes, you know, I think, full advantage of all of the stuff that Kubernetes gives us, but the serving infrastructure wasn't really. And so we moved that one to Cloud Run. Makes sense. Thanks. You did something with Cloud Run, I believe it now does that out of the box, but I'm assuming your efforts predate that, which is the ability to do multi-region deployment. So maybe you could go into that and tell us, you know, because you're using R2 as, like, the blob storage that powers the serving infrastructure. That is powered by Cloudflare, distributed around the world, lowering

12:15 Multi-Region Architecture on Cloud Run

12:37 latency to first byte for all your customers. So having multi-region deployments makes a lot of sense. So how about you kinda dive into how you did the architecture that deployed it? Yeah. Yeah. So the multi-region Cloud Run service module that we have is basically a wrapper around run this Cloud Run service in three regions, right, or in N regions. As far as I know, and that does predate Cloud Run providing this feature by itself. As far as I know, and if someone from the Cloud Run team is screaming

13:14 at this, telling me that I'm wrong, I think that the Cloud Run feature of a global service is basically, like, syntactic sugar around run three regional, you know, N regional versions of this, and you can stitch them together with GCLB or load balance across them. So our Terraform module for this and Cloud Run's feature, I think, are more or less the same architecture of, like, I wanna run this. I wanna run this in three places, and I want GCLB to figure out which one is closest to the user. A lot of our services,

13:53 so the registry, for instance, when you ask to pull an image, we route that to the closest region to you. So if you're on the West Coast, West Coast of the US, it will hit the West Coast replica. It will do some sort of auth check and say, oh, you're David. I know you. You're allowed to pull this image. And then it will, as quickly as possible, redirect you to R2 to get that blob. We want you off of our lawn as quickly as possible. We want you to be, you know, talking to R2, pulling that

14:21 blob without us doing anything. So as much as is possible, we'd like to handle that request as close to you as possible, do as little work as possible in as little time as possible, and then say, your blob's in R2. Go get it from them. Right. Yeah. But as far as I know, the Cloud Run global service feature is, you know, infrastructure and niceness around run the same thing in three places. Mhmm. Yeah. I actually deployed my first native multi-region Cloud Run a couple of weeks ago. And I was disappointed to see that

14:57 I got three different URLs back for each deployment, and then had to manually go and configure my own load balancing for it. So I think it's still very early. I mean, it's either alpha or beta. It's very early for what they're doing, and I don't think all of the niceties are quite there yet. But I think you're right. It's the same model that you've done with your Terraform module, only probably a little bit better because you have an open source Terraform module that other people can go steal and use themselves as

15:22 well. And, you know, as the Cloud Run product improves, there's nothing about our Terraform modules that requires them to be done this way. If there is a, you know, global Cloud Run service resource, we would just use that. You know? And I think for the most part, depending on a lot of details, like, things would just work. I think the next time you terraform apply, it would just become a global service or whatever. But yeah. Right. Well, I'm assuming the way that this works is, like, you get sort of the networking

15:57 layer. It's probably like an anycast IP or maybe it does DNS for the routing to the lowest latency. Right? I think the serving Cloud Run function is just a piece of Go code that signs an R2 URL and then you're handing off. Right? That's what you said. You want out of that as soon as possible with the caller. So... Yeah. Yeah. We do, you know, auth checks and whatever we need to do to make sure that you're allowed to get this R2 URL, and then we generate

16:24 a signed URL for you and redirect. That's also not just for latency, not just to get you to R2 as fast as possible. But if we serve from R2, then it's zero egress. Like, egress is free. So we don't want to, like, proxy that blob through our service because then we would pay, you know, egress for that. So we just say, you're good. We know you. Go talk to R2 and get your blob from them. Yeah. The zero egress fees are so appealing with R2. I feel like they're doing

16:57 a rug pull at any minute, and I hope not any minute soon because I just uploaded a terabyte video to it that I wanna start, you know, pitching on my website. But especially in today's day and age, right, where every major cloud provider is charging us through the nose for egress, and Cloudflare's just sitting there like, yeah. We don't care. You know, as long as it goes direct through R2 or their CDN, they don't care. That's why I definitely agree that it feels too good to be

17:25 true, and maybe it will turn out to be too good to be true. But for now, I'm very happy that it exists. And... Yeah. I'm happy to sit on this rug for now. Let's see how it goes. Yeah. It kinda makes you wonder what exactly is going on underneath the covers, or underneath the rug, I guess you could say, that's making this a good business decision somewhere down the line because, you know, is it just because someone really, really, really, really, really wants to twist the knife on, like, AWS or somebody, just saying, like,

17:57 see, this is what we can do that you can't do? Or is it, like, honestly, they're going to use it to push some other product out there? I don't know. It's just kind of an interesting question how they're doing it and why they're doing it. Because, yeah... I think when you're a Cloudflare, right, and you've got probably every enterprise company in the world paying the big money for the top end, I mean, the egress transfer on this and the storage means nothing. I'm assuming. Right? I don't know anything about Cloudflare internally, but I'm assuming

18:24 they're just like, yeah. You know? I mean, Google and AWS are probably paying them a ton of money for other stuff. So, like... Yeah. So if there's a Cloudflare... Go ahead. Oh, sorry. My understanding of how, like, network egress billing works anyway is that, like, Google and Amazon don't pay that much to be able to shift bits around. Right. Right? They just feel like charging you a bunch for it. Right. And Cloudflare also doesn't get charged that much to move the bits around. They just don't feel like charging you for it, which, you know, it's not

18:55 like they're losing money on it, I think. I hope. Right. There's competition on that as well. You know? Bunny.net is a CDN with zero egress fees or very, very competitive egress fees. Civo Cloud started doing zero egress fees. So maybe it will have a knock-on effect, but I don't want us to start going down that rabbit hole and talking about something else. I will just say, I want a picture started. If there is a Cloudflare person who works on this who's listening and feels like coming onto a podcast,

19:24 maybe reach out. We're kinda curious. Anyway, I kinda wanna switch a little bit because I noticed that you had been talking a lot about, basically, least access as you were building out these clus... no. Well, they're not clusters. Services, I guess, for lack of a better term. So your service account is very, very minimal, and then everything is working. And I'm kind of reading the article as we're talking. You're doing explicit authorization and doing all of the really, really good things that are good security practices. Was this a problem to kind

19:28 Security: Least Privilege and IAM

19:58 of figure out as you move off of Kubernetes systems that are a little more designed to work together, figuring out the architecture to make that service system work? I'm curious. Mostly as someone who hasn't really done a whole lot of going back and forth between Kubernetes clusters and serverless functions. Usually, I'm just running serverless stuff. I'm not doing both at the same time, so I'm curious how that works. Yeah. The first thing you mentioned, that, like, in our service module, the service account has no permissions by default. You have to explicitly grant it... Right.

20:35 ...you know, every permission. I totally understand why the regular Cloud Run service comes with a regular service account that has, I think, fairly broad permissions. Like, for people onboarding to your service, you don't want to say, like, in the five minutes I have your attention, first, figure out IAM and what exact, you know, permissions this thing needs. Oh, where are you going? Where are you leaving? You know? So they, by default, give you, I think, a very powerful service account to run with, which totally makes sense for onboarding and

21:11 new users. But then that's terrible when, generally, the route that goes is you onboard one service with a bunch of permissions. You onboard five more. You onboard 30 more. And then you realize, oh my god. Everything has permission to do everything, and, you know, somebody in security notices and makes you fix that after the fact. Right. Ask me how I know about this route of how things go. So we very early on decided, like, if we're gonna migrate to this thing, we wanna do it right from the start. We wanna, like,

21:43 you know, there is no way to get this super powerful service account on this service without explicitly saying, like, yeah, this should be able to do everything, which none of them ever do. Nothing ever needs all those things. So... Sure. In the list of modules in that repo, there are a couple of things. One is: authorize a service to call another service. So given two services, you can say, like, this service here is allowed to talk to this service here. And one really clever thing, which I think Matt did, was the only way to get

22:19 an address to the destination service is to call the "authorize this service" module. So if A needs to call B, there is a way around, you can, like, you know... the easiest way to get the address for B is to call the "authorize this service" module. And as a result, the output of that is the address for B. So if you're trying to call B and you're like, how do I find B? You find it by authorizing. And so therefore, like, you know, as a fun, like, side effect,

22:49 you have also explicitly authorized A to call B, which is very nice and something we take advantage of a lot. The other thing is, like, so that's, like, service-to-service communication and authorization. If there's a service that needs to talk to KMS or to get a cloud secret or to, you know, make a request to a cluster or anything like that: because when you create the service, you also have to create the service account that runs in there, you have the handle to say, like, okay. This thing that's running the service is going to also need permission to read

23:30 a secret, permission to, you know, sign with KMS, as opposed to create a service and then try to figure out what service account it had or, you know, like, find the auto-created one and twiddle its permissions. I think that's just sort of nice. A lot of things in there were done thoughtfully to make the default thing you do secure. Like, not just for secure defaults, but also so that, like, the easiest path to get your job done also happens to be the secure way to do it. Right. Which I think was really nice.
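
The "authorize to discover" idea described above can be sketched in a few lines of Go. This is only an illustration of the pattern, not Chainguard's Terraform modules: the `Registry` type, its methods, and the example addresses are all hypothetical. The point it shows is that a service starts with no grants, and the only call that returns a callee's address is the same call that records the grant.

```go
package main

import (
	"errors"
	"fmt"
)

// Registry is a hypothetical in-memory stand-in for the modules described in
// the conversation. The address map is unexported, so the only way for
// calling code to learn an address is through Authorize.
type Registry struct {
	addrs  map[string]string          // service name -> address
	grants map[string]map[string]bool // caller -> callees it may invoke
}

func NewRegistry() *Registry {
	return &Registry{
		addrs:  map[string]string{},
		grants: map[string]map[string]bool{},
	}
}

// Register adds a service. It starts with no grants at all (deny by default).
func (r *Registry) Register(name, addr string) {
	r.addrs[name] = addr
}

// Authorize records that caller may invoke callee and returns callee's
// address. Discovering the address and granting access are the same step.
func (r *Registry) Authorize(caller, callee string) (string, error) {
	if _, ok := r.addrs[caller]; !ok {
		return "", errors.New("unknown caller: " + caller)
	}
	addr, ok := r.addrs[callee]
	if !ok {
		return "", errors.New("unknown callee: " + callee)
	}
	if r.grants[caller] == nil {
		r.grants[caller] = map[string]bool{}
	}
	r.grants[caller][callee] = true
	return addr, nil
}

// Allowed reports whether caller was explicitly authorized to invoke callee.
func (r *Registry) Allowed(caller, callee string) bool {
	return r.grants[caller][callee]
}

func main() {
	reg := NewRegistry()
	reg.Register("a", "https://a.internal.example")
	reg.Register("b", "https://b.internal.example")

	fmt.Println(reg.Allowed("a", "b")) // nothing is granted by default

	addr, err := reg.Authorize("a", "b")
	if err != nil {
		panic(err)
	}
	// The grant is one-directional: a may call b, b may not call a.
	fmt.Println(addr, reg.Allowed("a", "b"), reg.Allowed("b", "a"))
}
```

Because there is no separate lookup function, code that wants B's address without authorizing has to go out of its way, which is the kind of detour a reviewer can spot.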

24:07 And we've definitely, like, taken advantage of that all over the place. Yeah. Yeah. When you make the easy path the secure path, that's always a good design decision. So... Well, in a code review, right, like, you can see that somebody's, you know, jumping through eight hoops to get the address to B instead of just authorizing it. And you can sort of tell, like, this is an awkward way to do this. Why didn't you just call the easy way? Why did you walk around the building to get there? You know? So

24:35 Yeah. Makes sense. Makes sense. So let's dig into the authorization stuff a little bit more. Now, I don't know the commercial side of the Chainguard business here. Right? Can anyone with a login pull any image as many times as they want, or does the authorization and authentication go a bit deeper than that, where certain organizations can pull certain images a certain number of times? Like, are you using SpiceDB, based on the Zanzibar paper, or is it, like, a flat authorization across the board? Great question. We

25:13 don't have rate limits for anything. For our public free tier, developer tier images, or paying customers, we have no rate limits. All of that comes with the caveat that every service has a rate limit imposed by physics. So you can't possibly, you know, send us 10,000,000,000,000 requests a second. Something will blow up. But we try not to impose artificial rate limits on anything. And a lot of that comes from, like we were talking before, using R2. It's not like you pulling our image, you know, a thousand times is gonna cost us a

25:48 thousand times more than the first time. So by all means, pull away. We don't use SpiceDB. We have our own IAM model that we base heavily on Kubernetes' and GCP's IAM model. It's a lot, I think, simpler, or at least simpler for now. We don't have hundreds of roles or hundreds of permissions. We have, I think, maybe two dozen, a dozen. But, you know, ask again in three years and see if we're up to the 350,000 IAM roles that GCP or AWS has. Okay. Cool. Interesting. Alright. So

26:36 Using Managed Cloud Services

26:36 now you were, I don't know if you were being polite or naive. Right? I think that's the two choices we have. But you talked about Google's default service account for, like, Cloud Run functions being permissive and, you know, you don't wanna scare people away in the first five minutes. I always thought it was a bit more sinister. That's just a terrible word. It's like they want you to be able to go and use all the other cloud products. Right? They want you to hook them to go to their scale. They want you to use BigQuery. They want you to

27:04 it's just like, yeah. Sure. Yeah. See, use whatever you want. The bill just keeps going up and up and up. Like, woo. Go nuts. You, and Chainguard as a whole, like, you know, the big you, you do hook into more of GCP's services. I've seen BigQuery in the article, and you mentioned Cloud SQL as well. So, like, is that, again, an operational concern where you don't want to take on the burden of running your own Postgres, etcetera? Or were you always just gonna use cloud products first? Like, I'm curious about that

27:34 decision process. Yeah. I think, in general, it makes sense to think about whether your expertise is database maintenance, whether the company, you know, the business differentiates on how well it can manage a SQL database. And if that is your competitive advantage, and there are definitely businesses and companies where that is their competitive advantage, then by all means, run it yourself. Right? Chainguard's not differentiating on how well it can run a database. So we just want one that works and is reliable and, you know, relatively

28:20 low touch to get the job done. I'm sure it's not optimized to the ends of the earth. I'm sure there's stuff we could do if we wanted to make it a little better. But, you know, ultimately, that's not where we differentiate from anybody else. On the other side, like, building the best images possible is where we differentiate, and so we have invested a lot on that side. Right? And if we had spent a bunch of time building our own database, building our own KMS, building our own secret store, built all

28:55 of this stuff, which, you know, as an infra nerd, I would love to. Right? Like, you know, please let me go do this. But instead, they tell me I have to, like, build images or something. Yeah. You know, it's about, like, really trying to think about where your time is most useful, and managing our own Postgres was not that. So we... Right. We luckily had Cloud SQL right there. We pulled it off the shelf. We use it. You know? If at some point our database becomes our bottleneck,

29:28 and sometimes it is. But if it becomes our biggest fire, like, we will invest in making that better and tuning it better. And if that means we run it ourselves, then that means we run it ourselves. But I don't think that would be our first choice. You did mention... Not mine anyway. Not to put you on the spot. Right? But you did mention at the start that some of this migration was also cost driven. And, like, maybe it's just the Scottish person in me. Right? But, you know, we're known globally as quite tight-fisted

29:56 people. We're cheap. I keep looking at the price of vCPUs and managed Postgres, and I'm like, nah. Nuh-uh. I'm gonna make this at home. And I often will run my databases on bare metal or a virtual machine even though the operational costs are astronomical because I have to do my own backups and stuff. But... Yeah. I mean... But you have the knowledge for it. Products. But, like, David, you have the knowledge for it. Like, it's a business decision. Right? Like, in the end, this is a business decision. You're evaluating the cost

30:26 of buying versus the cost of managing a team, like having dedicated people who may be busy or not busy. They may just kind of be sitting there. I don't know. Like, this is how I always looked at it. You can look at it the same way as, let's say, if you have a lawn, if you have grass. Maybe you do, maybe you don't. Depends on where you live. Do you mow it yourself? How much is your time worth versus do you pay for lawn care service? Everybody has a different risk question there. But how much is your hourly pay, David?

30:59 Think about it that way. And, like, how much is that hourly pay and maintaining all of the tooling and maintaining all of those things versus calling somebody who's a specialist who has the really big tools. So I look out at the lawn that we have in the backyard of the house that we rent. And sure, we could get, like, a little push lawn mower and one of us could spend time mowing that lawn. Or we could pay a service that has one of those massive ride on lawn mowers, which I think are really cool and I keep wanting to get

31:27 on when they're not looking, but they don't notice me. And they come and they do it in, like, ten minutes, what would take me at least an hour, and then I have to also get rid of all the stuff that's there. It's the same idea of do I run my own Postgres server and my own Postgres database knowing that maybe I don't have the best understanding of how it works, so therefore, who knows how secure I really am because I'm doing it myself? Or do I buy it until I can get to the point where I can hire the

31:57 people and have a team that can maintain it? That's really the business question. Maybe I'm wrong, David. You can correct me. But this is, like, how I've always looked at this. So No. I I agree. And you mentioned at the end, like, hiring folks to do that. There's also a recruiting and expertise sort of question. Right? Like, we we, and this is related to our, to what our our differentiated you know, our our value proposition is. But, like, we haven't really hired database tuning experts. We haven't really hired, you know, Postgres operators experts. We could do it. I'm sure we could

32:32 we could muddle through and, you know, if David can do it, then anybody can. But then Oh, alright. Sorry. But, you know, we would we would rather spend that that limited resource of time and and energy on, you know, the product and on on improving the build pipeline and improving the, you know, serving infrastructure that that is differentiated from from other folks. And there's a hidden cost that Let's back up a minute. Hold on. Hold on. Sorry. I'm cutting you off. I didn't mean it. I didn't mean it. David's like, I have to get my revenge

here. Hold on a minute. Sorry. No. No. No. No. No. But the the what the CNCF wants us all to believe is that we just run a Kubernetes cluster. We stick the CloudNativePG operator on it or any operator for any software, and this is all done for us. No expense here. Like, open source is wonderful. Like Okay. Okay. There is such a thing as a, you know, free as in beer or free as in puppies. Like Yeah. Yep. Where where can I get a free puppy? Hey. I have plenty of places to send

you one. But, you know, free puppy doesn't mean forever free. It's an illusion. I I think there's also a a when it works well, it's free. When it's you know, I I can, in the five minute tutorial timeline, like, spin up CloudNativePG and have it running and do a hello world and insert, you know, a thousand rows and feel good. The day two is another thing, and then day, you know, 390 is another one that, like, oh, no. It was really easy to insert a thousand rows. It's also easy to insert 20,000,000 rows. And now what? And and

so not that we wouldn't have this problem on a managed database either. Right? We we still do, but at least we don't have to there's a lot of operating stuff that we don't have to care about. See, the worst thing is I know David is explicitly nerd sniping a little bit. He's doing it on purpose, and I'm falling into it just as much as anybody else. Hey. I'm just enjoying those talking. That that that is no sniping. But but I will say, right, you mentioned something like, obviously, you know, people listening, they're making these decisions every

34:51 day. Do I go and pay for the managed service? Do I get an operator? Do we run it ourselves? Right? Jason's already gave us great advice. Focus on what your company needs to excel at. Right? I think that's that's ubiquitous to everyone should take that away. But we are missing something here. Right? It's easy for me to say I'm gonna run my own Postgres when it gets, like, 10 queries a week or 10 queries an hour or whatever that number is. So maybe, you know, you don't mention us in the article, and I don't know if I wanna

put you on the spot and say give us numbers. Right? I'm assuming your scale isn't in the hundreds of requests per day, probably not even in the thousands of requests. I'm assuming we're going higher. I don't know how much you're willing to share there, you know, any context you wanna provide? Chainguard gets more traffic than my personal blog, if that is the question you are you are asking. Nice. Yeah. Yeah. I think I think the the answer to everything is it depends. Right? Like, I'm gonna put on my my senior engineer hat and say it depends. I knew that was

35:46 coming. If this is if you are receiving more traffic than you can handle, you figure out why, you figure out what the solution is. If you are not, not currently experiencing that problem, go fix the problem you are currently experiencing. Right? It's never set in stone. It's never you know, we're never even this you know, even while I was writing the the article about our migration to Cloud Run, we were making new changes to that stuff. We were, like, adding new stuff and tweaking stuff and and moving stuff around. So it's never even now, it's not done. It was just sort

36:26 of a birthday, like, one year since we started this. You may look at this. If you're reading this in the year 2026 listener, you may go back and look at this blog post and say, oh, it's all different now. Right? Like, it's all you know, they said that they did everything this way, and now they're using magical global Cloud Run services. And GCLB is deprecated and, you know, whatever. So, yeah, it's never done. Yeah. Alright. Let's I I know I know we're getting to the end. I just I wanna just highlight one thing. This was a lot of technical debt that

37:03 you had to convince leadership to let you go fix. That that's a lot to pay down that you had to kinda catch up on to be able to say, hey. We need time to be able to go do this. In, like, thirty seconds, how did you do it? Because that is, like, the the holy grail of, like, every infrastructure team ever is convincing your leadership that, hey. We need to completely change what we're doing now that we've scaled differently. So Yeah. It's complicated. I thought so. It depends. Oh, yeah. The the it helped a lot that we

37:42 I'm gonna I'm gonna throw Matt under the bus. Matt is our CTO and and sort of, you know, originally led this this decision. Having him on board from the start was helpful. Right? Like, it was his idea. It wasn't, like, our idea, and then we had to convince him. He did a fairly thorough cost analysis of, this is what it costs now to run both Mhmm. The prod environment, the staging environment, and everyone's developer environments. I think this is too much, and I think the trend, you know, going up means it will get worse the longer we put

38:14 it off. So, that definitely helped. He, also just sort of started doing it, like, not not, like, migrating us for real, but, like, demonstrating, hey. This is how easy it is to run the service in Cloud Run instead of, on Knative. Here's how easy it is to get GCLB to work. I think it was, like, a a weekend of of hacking for him or maybe a week to, like, get a demo environment that was entirely in Cloud Run. Not production ready, not ready to start migrating to, but, like, hey. It's not as hard as you think. It won't be a

38:50 nine month, you know, gigantic engineering slog to do. If we rip the Band Aid relatively quickly, we can, you know, get this done and and make the line go down instead of up. All of those are very helpful ways to convince folks that these are useful migrations and not just engineers being nerd sniped and loving migrating for migrating sake. Not that I've ever done that. But, yeah, sort of, like, forecasting the cost, understanding the cost now, and forecasting it and saying, if we don't do anything, this will be a cost in a year. If we do it now, then we

upfront that cost and have much less over time. Right. Alright. I got one more question, Orest. Now, again, you you just mentioned dev environments. Now a common challenge when I'm, you know, talking to people or working with companies is the more they move to the cloud, it typically, their development environments get a little bit harder. Right? It's not just running a Go binary that speaks to a local Postgres, which is okay for Cloud SQL. But when you're using KMS for secrets, you're using PubSub for, you know, webhook receivers and all these other bits and pieces. Maybe you can talk

39:36 Developer Experience

40:06 about the developer experience of working on this on your own machine and how that compares to what you ship to production. Yeah. We every every engineer that has you know, that touches the infrastructure has their own developer environment. In this case, developer environment means GCP project, Cloud Run service, you know, Constellation, Cloud SQL, KMS, etcetera, etcetera. If you want to run this any of this or if you want to run any of this locally, most of the time, you can. If it talks to KMS, you need to, like, you know, set up the auth to be able

to do that. If it talks to Cloud SQL, whatever, you need to do that. So but if it's just, you know, two services talking to each other or a service talking to R2 to create a signed URL or something, you can run it locally. I think, generally, it's gotten easy enough to run in Cloud Run that we just do that. Like, if it's if it's not if it's as easy to get the real actual environment up as it is to run it, you know, two or three services locally and talking to each other, then we'll just

put it in the real one and and run it there. It's not, like, as easy as running it locally a lot of times, but but, generally, like, if you're about to do something that integrates with another service or Cloud SQL or or KMS or R2, like, just run it in Cloud Run, and it's it's fine. That was actually a a a big benefit of this in the first place was that, like, everyone's developer environment was a significant cost when we had, you know, Kubernetes clusters to our you know, multiples of them in each developer environment.

41:51 Not just in terms in terms of, like, unused infrastructure, but, like, okay. It's been three months since I last had to make an infra change. First, I have to get my cluster back, like, into a working state. Right? I'm gonna spend a whole day just, like, kind of getting the cluster back into a workable state, and then I can finally start working on my service. With Cloud Run, there's a lot less bit rot happening underneath you. It's just sort of that service not none, but a little you know, most of that service is already just sort of running there, and it costs

nothing while you're not using it, which no one you know? I think my developer environment gets less traffic than my personal blog. I love that to skip for all of our conversations. It's like Jason's blog in the middle of it. Less than Is it bigger or smaller than my blog? Alright. Well, thank you so much for for sharing all that. I mean, it's we didn't even talk about BigQuery, but I don't really wanna keep you too much longer. But, you know Oh my I got nowhere to be. And I don't have a real job, so,

42:56 I mean, I can sit here for as long as you want. And, yeah, Laura got made unemployed, so I guess we're sat here for a while. Well, just for another, what, thirty six hours, I guess. So Oh, wow. We can get that long. Alright. So we're here for thirty six more hours. Keep it running. Alright. Let's do this. Livestream time. Anyway Well, there's this ongoing meme online. Right? Like, when people I don't know if you've ever searched for the Cloud Run and BigQuery. I do this regularly as I'm exploring. But people often argue over what is the

43:29 best GCP service between Cloud Run and BigQuery. They're like, they're both exceptional Wait. Products. Are over Not arguing, but there's just a there's a debate. Because Cloud Run and BigQuery give you so much for free. Like, they're both just these stupid amazing products. And So so which one have it. Which one's more valuable is what is what, I guess, people are arguing. Now this is interesting. I don't I don't get involved in these meme arguments. Meme argument? Is that, a thing? I just kind of exist and find out about these things on the fringes of the Internet when you tell me about

them. So Well, yes. It's I mean, it's it's not common because that would be tragic. Right? If people would just go on to Google and go, which one's better? But, you know, when it comes down to all of GCP's products, I think that the two shining torches are Cloud Run and BigQuery. They both cost Right. Pennies at really decent scale. Yeah. And I BigQuery is one of these things that I've never personally had a use case to really dive into properly because, again, I I'd be, you know, throwing a tennis ball down a large collider or whatever. I don't that's a

44:38 horrible analogy, but I didn't have anything. Right? So, yeah, I was just curious about the BigQuery stuff. Like, at what point do you decide that, yeah, we're gonna use BigQuery? Because I've never been there personally, and I'd love to know more. Yeah. We for the record, I would say Cloud Run is the better service, and I will fight you if you disagree. I'm just I'm just kidding. We probably also don't you I'm not. I'm really not. That's the title of the episode now. Jason says Cloud Run is better than BigQuery. There we go. Hey, man. I don't

45:07 need people to find out on the Internet. Everyone's gonna go to your blog, and then you're gonna have to come up with a new skill. But then Yeah. There we go. I have to run Postgres now. What was I saying? Oh Either way, the no. Also don't use we probably also don't use BigQuery for something that actually needs BigQuery. There's there's also the meme of, like, your dataset fits in memory. You know? Like, if if you are you don't need, you know, BigQuery and managed services for nearly everything you use it for. We probably don't.

But, again, the nice thing is that it's a managed service that we just sort of, like, plop data into and can query later. That could be Cloud SQL. I think we just like having the not having to care about the scale of these things and just toss it into BigQuery, and and and it works. One thing we use BigQuery heavily for, which I which I do like a lot, is our eventing infrastructure. We didn't get to talk much about our eventing infrastructure. But, basically, like, you know, when a push happens when a when an image push happens, an event is fired inside

46:02 BigQuery for Event Logging

46:16 of our system, and it bounces around, and things can subscribe to it. That's built on PubSub. One of the subscribers to that PubSub is a recorder that records every event into BigQuery. So every thing that happens, every time anything happens, it is emitted into the the event pipe and recorded. And so that is something that's really nice to and we don't even think about it. Like, most of the time, it's just like there is a log of every event somewhere. But when you want to go find that and when you wanna say, like, oh, a

46:52 GitHub pull request was opened that was approved that did this, that pushed this image, that did this, You have this sort of log of everything to to quickly go through and and visualize. We also use these things for, like, you know, generating dashboards and graphs of, like, are people which customers are pulling which images and which free tier images are most popular and and things like that. So Oh. Yeah. I bet your security team really loves that, though, because one of the biggest things about logging that I always find people don't realize is you need a read only archive that no one

47:29 can actually change. Every single logging system should have a read only archive of everything. That way if somebody does access the system because usually what they'll try to do is they'll go in and then they'll change the logs so that they hide their tracks. But Right. If it's pre logged into BigQuery, they can't go in and just change it, hopefully. Maybe they could. They If they have the right permissions. But Anyone can do anything. But I know. I know. But, like, I like to dream that's a possibility. I don't know. But But the to your to your

point is that the recorder service that that, you know, listens to every event and writes it to BigQuery is the only service that or the only account that has access to write to that dataset. Yeah. There we go. We also have an alert set up for if any user besides this modifies the the dataset or or, you know, reads certain GCS buckets or, you know, does certain things that they shouldn't be doing. We call them the lasers. Matt calls them the lasers. You will trip a laser and, like, fire an alert and and, you know, notify our

Slack that, like, someone was editing that dataset in BigQuery. That's weird. Right? Yeah. You can also turn off the lasers. A sufficiently, you know, smart person or or a a motivated person could disable the lasers, but it's nice to have another sort of Yeah. Like, layer of defense there. Yeah. Always a good thing. Yeah. But the the attack vector here is internal. So I guess there's also it's it's well, you're you're not looking for malicious activity on the outside trying to deactivate your lasers like it's, you know, robbery or anything like that. This is more of Well,

49:05 and it it it tends to show up with a Slack alert went off that, like, some coworker was, you know, touching this GCS bucket or something. And and they'll respond to that in Slack and say, yeah. That was me. I was debugging this thing. This is what you know? Or that laser is overly sensitive. We should, you know, detune it so that it's not, you know, gonna bug Slack every time I check my email or something. Yeah. Don't know. It's not good. I wish I could know when they check their email, though. That would be nice. They're just avoiding

49:38 us. I know. Alright. I mean, I guess the GCS bucket thing ties into your usage of Terraform as well, which we didn't really talk about other than the fact that there's a module. But I'm assuming all your state files live in Terraform. So if anyone were to go in there and start manipulating that, there would be a laser as well. I like this concept of lasers. I would have to start building lasers. We have, there's examples in the repo of our of our lasers around, certain things. If a if a human accesses a service account key,

49:59 Infrastructure as Code: Terraform & OpenTofu

like, that is a event. That is a it's something that a person should never be doing. Even if they have permission to, we don't want we don't want you to be unrecorded doing that. See, this is this is something that you you've told David this now, and this is how we get Skynet. If Dave is just going back time. Eddie. Alright. I think we should finish on something controversial. So Oh, no. We haven't been being controversial. Okay. Sure. Okay. Better than BigQuery. I mean, we've already beat we've we've checked off the meme. Right? Now we can drop

into open source OSI definitions and talk about why Terraform over OpenTofu. Have we have we argued about Go and Python yet? We haven't done that yet this episode. Go and Rust. No. Rust. Rust. It was Rust. Was Rust that we need to talk about. Right. Right. We almost have all the memes going. Okay. We almost made it one episode. So me having to say Rust, Laura, and then then you just you poke the bear. Poke. Poke. Poke. Poke. Anyway, so so what are we thinking? Yeah. I mean, with no. Back on track. We're almost done with

this thing. Right? Why not OpenTofu? Was that even in consideration, or is it, like, you don't see the It it is a consideration. I would say, mainly, we are still floating on inertia of having used Terraform so far. Yeah. I don't think any of us have any reason not to use OpenTofu. A couple of or at least one person at Chainguard, John, has contributed to OpenTofu to make it faster, to make it more performant, and I think we would love to migrate to it. It's just another migration to to consider and and prioritize

with the rest. We have we have nothing but love for OpenTofu, and we'll we'll migrate to it whenever we can. That wasn't that controversial. So you need to come up with more controversial questions. But I I guess you could say, so you have nothing but love for OpenTofu, but there's a but in there. Okay. There there's the controversy. Anyway, I'll stop now. It's alright. We'll we'll deepfake his voice and put in something spicy. Okay. Good. Good. There's AI. We added AI into this. Okay. What else do we need to add in going on right now?

52:25 Future Roadmap & The Value of Focus

52:25 If you say cryptic Alright. No. No. No. No. Has been this has been really cool. And, you know, it's a really interesting project. The blog is fantastic. People should go read it. The open source modules examples, I think, are universally adoptable and stealable to anyone listening that just wants to go and do some of this stuff, which is really awesome as well. I'll finish I mean, I don't know if Laura has anything else, but I'll say I'll finish with this. I keep saying one more question, and I'm giving you, like, another six. But, you know, you've said

52:53 twice now that if we come back in twelve months or three months, it'll all be different. I'm curious. Over the next three, six, and nine months, what is your next what's your next mission? What do you think will change? What are you wanting to experiment with? Like, what's on your road map? Focus. Focus. Focus. I have to only do one thing for each of these things. I mentioned the three sort of pillars of infrastructure we have. The serving infrastructure, I think, is really solid. We just recently are wrapping up sort of an investment in our package build infrastructure.

53:25 I think the image build infrastructure is the next the next oldest pillar. Maybe we'll just bounce between all three and just, like, keep, improving them in in shifts. But, yeah, the image build infrastructure is something I think we'd we'd, I would love to start improving more. I don't think that would be a three or six month timeline thing. I would love for it to be. What was the rest of your question? Focus. Focus. Focus. Yeah. Great. I mean There's so many there's so many one thing because this is this is why you're shipping, and I'm sitting

here. I've got 12 plates that you can't see off camera that I have to go spin. That's why I had to reconnect everything. That's like, I just I can't stop messing with everything. So I really it's it always amazes me when people are able to we're gonna do this one thing and and I'm just like, I Very cool. I I think my coworkers listening to this and hearing you say that I'm very good at focus will be laughing very, very much. Oh, then they need to go work with David for a little while and then see

54:30 it all I hope that a place. Yeah. I mean, have you even bought a today? Have what? I haven't. Have you even bought a domain today? I mean, come on. Yeah. It's early. It's early here. There's plenty of time. There you go. There you go. Well, my my one thing is just, hey. Congratulations on getting through, like, some technical debt fixes for infrastructure that like, that is the holy grail for every operations person ever. So congratulations on surviving that. Yeah. There's there's plenty more where that came from too. Yeah. It's just you know? I mean, every little bit of

55:05 debt that we can pay down is always quite nice. Yeah. So Alright. Well, thank you so much for your time. Any final departing words for the audience before we say goodbye? No. Well, now I've known you for an hour, and I still think you're a cool person. So thank you so much for coming Yeah. Thank you. Her next time. Okay. Alright. We'll be back at twelve months to see what's changed. Hope you all have a good day. Take care. Thanks, guys. Thanks for joining us. If you wanna keep up with us, consider subscribing to the podcast

55:12 Conclusion

55:45 on your favorite podcasting app or even go to CloudNativeCompass.fm. And if you want us to talk with someone specific or cover a specific topic, reach out to us on any social media platform. Until next time when exploring the cloud native landscape on three. On three. One, two, three. Don't forget your compass. Forget your compass.

Meet the Cast

David Flanagan

@rawkode

Laura Santamaria

@nimbinatus

Jason Hall

@imjasonh

Code

chainguard-dev/terraform-infra-common
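
The "laser" idea described in the transcript (a single recorder account is the only writer to the event dataset, and any other principal modifying it fires an alert) can be illustrated with a small sketch. This is not taken from chainguard-dev/terraform-infra-common; the `AuditEntry` shape, the `RECORDER` identity, the resource names, and both functions are hypothetical and exist only to show the filtering logic, assuming audit entries are already available as plain records.

```python
from dataclasses import dataclass

# Hypothetical identity of the recorder service; an assumption for this sketch,
# not a real account name.
RECORDER = "serviceAccount:recorder@example-project.iam.gserviceaccount.com"

# Actions treated as modifications in this sketch (assumed vocabulary).
WRITE_ACTIONS = {"insert", "update", "delete"}


@dataclass(frozen=True)
class AuditEntry:
    """Simplified, hypothetical stand-in for a cloud audit log entry."""
    principal: str  # who performed the action
    action: str     # e.g. "insert", "update", "delete", "read"
    resource: str   # e.g. "bigquery:events"


def tripped_lasers(entries, protected_resource, allowed_writer=RECORDER):
    """Return entries where anyone other than the allowed writer modified the resource."""
    return [
        e for e in entries
        if e.resource == protected_resource
        and e.action in WRITE_ACTIONS
        and e.principal != allowed_writer
    ]


def format_alert(entry):
    """Render the message a notifier might post to a chat channel."""
    return (
        f"laser tripped: {entry.principal} performed "
        f"{entry.action} on {entry.resource}"
    )
```

Calling `tripped_lasers(entries, "bigquery:events")` on a list of entries returns only the writes made by principals other than the recorder, and each result can be passed to `format_alert` before being sent to a channel. Reads are ignored here by design; a stricter variant could also flag reads of sensitive buckets, as the transcript mentions.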

Additional Resources

Chainguard blog article about migrating from Kubernetes and Knative to Cloud Run