Event-Driven Architectures at Wix | Rawkode Academy

Overview

About this video

What You'll Learn

When Wix favors events over RPC to decouple services and tolerate failures
How protobuf definitions, linters, and schema registries keep event changes compatible
How Kafka producer, consumer, and admin proxies reduce broker load and bills

Natan Silnitsky shares how Wix operates 2,500+ microservices and 70 billion Kafka events per day. We cover when to choose event-driven over RPC, schema evolution with Confluent Schema Registry and protobuf, Kafka cost optimisation, polyglot services, and ZIO on the JVM.

Transcript

Generated from the English captions. Timestamps jump the player to that moment.

0:00 Introduction

0:00 Okay. So welcome back to the Cloud Native Compass. I'm your host, David Flanagan. And today, I am joined by Natan from Wix Engineering as we discuss and share wonderful insights into the world of event-driven architectures. Hello, Natan. How are you today? Hi, David. It's a pleasure to be here. Well, thank you so much for joining me. I'm really excited for this episode. Event-driven architectures are just what keep me excited about technology even after twenty years, and being able to dig into them with experts and understand how they approach complex domains via event-driven architectures

0:35 is always the highlight of my week. So thank you so much for joining me. I discovered you via the Wix engineering blog, which turned out to be a bit of a gold mine for engineering content. So before we dive into that and event-driven architectures, could you share a little bit more about you? Sure. So I've been a software engineer now for close to twenty years, I think. And it's been such a lovely experience for me. So for the last eight years, I've been at Wix, which really drove me to become a much better engineer in all of the skill sets

0:50 Guest Introduction: Natan

1:14 that an engineer should have, and also to be able to share, and want to share, what I've learned and what Wix has achieved with the rest of the world. And I think it's really important, and I encourage every engineer, no matter what their skill level is, to do the same and try to write a blog post about their experiences and what they learned. There's no such thing as something that is not interesting or something that no one will care about. And even if you don't get a lot of exposure, it's still great to create content.

1:49 It gets you a creative high, so I really recommend it. Yeah. That's great advice so early on in the episode. Like, I often refer to this as collective knowledge when I speak at conferences and such: the more we can get engineers working on their own issues, right, that their organization has, that their team has, that they personally have, and the more we share this knowledge in a public forum, the more that other people can learn and build upon that knowledge. Right? It provides foundations for

2:17 people to go and do wonderful things with technology. So that's such great advice. Thank you for sharing that. So you've been at Wix and you're saying that it's kind of, you know, improved your engineering chops. You're working on fun problems. I mean, just to give people some context and clarity on what Wix does and what kind of scale they're operating at, can you just share a bit more detail there? Sure. So Wix has a very powerful website-building platform. And over the years, it has enabled all kinds of people, with different skill levels, to build websites. We were pioneers in

2:25 The Scale of Wix

2:55 building websites based on AI, for instance, even before it became so trendy and viral, like these days. And in more recent years, we expanded our reach from, like, self-creators to agencies and web professionals and created a whole ecosystem platform. So we don't only manage online presence, but also manage businesses online, from stuff like building and managing your bookings and schedule for your yoga and Pilates studio to sophisticated tools like third-party dropshipping. And as we expanded our offerings, more and more customers were brought in and the scale has really increased. So we have around 2,500

3:43 microservices in production, with even more added every week. And they come with a lot of visitors, around 1 billion unique visitors every month, which gives more than 500 billion HTTP requests per day and 70 billion Kafka events produced every day. So very large scale and a very big distributed system where there are a lot of challenges to keep up the high availability, performance, low latency, resiliency, and fault tolerance. Wow. I mean, I was asking to give the audience clarity on scale, and I think you've just surpassed the numbers I even had in my head. So those are some

4:22 pretty chunky numbers you've got running through your system. So let's touch on something there. Right? You said you've got 2,500 microservices. It's growing every week. Did Wix start off as a monolithic application, or has it always been a microservice architecture or service-oriented architecture to some degree? And were you part of that migration, if so? Yeah. I think very early on, Wix adopted the service-oriented architecture. I think there were monoliths. Like, at the very beginning, I think, there were services that were expanded into a monolith and were written, like, I think back in, like, 2006, 2007,

4:59 2008. It's before my time, but with, like, Java technologies, like, enterprise technologies, monolithic stuff. But very, very quickly after that, especially since we adopted Scala, we moved to smaller and smaller services and microservices and haven't looked back since, really. We kept on improving standardization and the ease of building a new service with all the different cross-cutting concerns that are involved, because it's really a big challenge. So, like, keeping velocity high with all the needed building blocks for every service, because while we may not be monolithic, in some respects you can think about the

5:43 Wix platform as a distributed monolith, right, because you have all these verticals that offer a lot of extra features for the different kinds of websites. So you can have an ecommerce store, but you can also have the bookings features or a restaurant or a hotel, etcetera. And all of these have to work together along with the site editing infrastructure, etcetera. So we have a lot of different services working on very different domains. But at the end of the day, they need to have a basic language to communicate, to propagate user context, and to handle regulation like GDPR

6:24 and a lot of other concerns. Of course, when you work with microservices, it's much harder to debug, much harder to know what's going on. So you wanna have good monitoring infrastructure in place and a lot of tools that make it easy to investigate. So all of that has to be built in to each and every microservice. And that's why I say, okay. They are created independently by different teams, and they have completely different domains, business domains. But in the technical aspect, you can think about them as a distributed monolith. Alright. Awesome. Thank you. Like, if those you

7:00 Developer Experience in a Large Microservice Landscape

7:00 know, you're an engineer. Right? Your nine to five is to write code and help build out the Wix platform. How many of those 2,500 services do you have in your head? I know it's a completely superficial question, but I'm just curious, like, what is your exposure to an application that size? How much of it can you actually grok on your own? Right. So as an infrastructure developer, I'm exposed to potentially all of them. Like, each day, some team from another part of Wix will contact us about some question or issue.

7:33 So at times, I'm more familiar with different parts. But, of course, it's really impossible to keep all 2,500 in your head. So I know the different verticals. I know the different challenges. Like, some verticals have more throughput than others. You know, as the funnel shrinks, you get fewer technical challenges. You can still have a lot of API and business logic challenges, but you'll have fewer technical challenges. So I'm probably more familiar with the higher-scale ones, I would say. But getting all 2,500 is impossible for sure. Yeah.

8:15 I mean, I don't know what the split is between application services and infrastructure services, but I imagine there's a lot of identity stuff going on. There's a lot of caching going on. There's the message processing. There's consumers. Like, you know, and I'm making some assumptions right now. Right? Because I don't know a lot about Wix yet. But I imagine there's a substantial amount of platforming that has to be done to support the developers to be able to, you know, onboard and run their services when you get to this level of scale.

8:46 Why Event-Driven Architectures at Wix

8:46 So, well, in fact, before I ask what the challenges are, let's just start off with, like, you know, why go down this path of event-driven architectures for a system like Wix? Like, what are the pros and what are the cons of this? Sure. So usually, when you start off with microservices, if you come from monoliths, then you're used to function calls. Right? And procedure calls. And then the natural extension of that when you split up to multiple services is to do remote procedure calls or something similar like REST APIs. And that's a very natural thing to

8:50 When & Why Event Driven Architectures

9:23 do. Now what I've noticed that can happen in such situations is that it can be a bit restrictive. For instance, you want to call a service. And actually, like, in ecommerce, your cart service calls the catalog service and the inventory service, etcetera. And what usually happens is, like, a big chain of calls that occur in such situations. Right? Like, you mentioned identity probably will be in the context, but also a lot of business procedures happen on different services, and you start getting a very large chain of requests between these services. And

10:09 this can really form a big challenge where you need to take care of making sure that you still have resiliency and fault tolerance in place. And that can be tricky because you have to make sure that you meet deadlines and give out responses back in an orderly amount of time, like a normal amount of time. And when you spread out the request and get, like, a Russian doll of requests here, you end up being really dependent on the availability and the performance of each and every one of these services along the chain. So it

10:48 can really make your request brittle. So, of course, there are mitigations to stuff like that. Right? If a service is not responding and failing, you can retry. Or if some service is completely down, you want to avoid cascading failure, so you introduce circuit breakers to reduce the impact and not affect the entire system. But while we do have a lot of CRUD applications at Wix with a lot of request-response, at the same time, we really rely very heavily on an event-driven architecture as well. So it's kind of a hybrid

11:27 architecture. And this type of architecture really makes everything a lot more resilient, and I would also say decoupled. So say we have some event happening on the site infrastructure level. Right? All of these verticals are interested in that. So imagine all of them now calling the site's infrastructure and requesting all kinds of information. That can really become a really big bottleneck and really slow things down. So you can cache a lot of the responses, but caching is a mechanism that really needs to be deployed carefully. You want to avoid stale values because stale values can really corrupt

12:08 your data and provide incorrect answers. So if you have just events being emitted from the site infra service, so all the different verticals like e-comm and restaurants and hotels can consume them and decide what's interesting to them and keep that smaller dataset on their side, you get a much more resilient distributed system here. Like, the decoupled nature means that you reduce blast radiuses. You kind of silo issues. Right? If the restaurant's consumer of site events has some issue now, it just cannot affect

12:51 any other vertical. It's impossible. Right? So that's a great advantage. And I'm sorry for the long answer, but also Wix, like I said, evolved from only catering to do-it-yourself site owners to actually working with developers. And they also sometimes need to know what's going on in the site. And one of the ways, like, if they develop JavaScript on top of a Wix site, they sometimes wanna get JavaScript callbacks and do actions when stuff happens. So, naturally, it will be much easier to expose all the rich possibilities that the Wix platform offers as JavaScript events

13:34 or callbacks. Then you would want to have the ability to emit events from all of your microservices, potentially, right, to make it a platformized effort so that it's easily translated over to this JavaScript callback or to an HTTP webhook and stuff like that. So it gives you flexibility, resilience, and the ability to layer stuff one on top of the other much more easily, in a more robust way. Okay. Thank you. So just to kinda recap a little bit of that. Right? You know, you're talking about how you've got services, and some of them have, let's call

14:05 Distributed System Challenges and Resilience

14:14 it, an appetite for eventual consistency. I guess it's maybe not binary, but either they cannot accept eventual consistency, and those are probably gonna be more likely to be your CRUD-based endpoints that you were mentioning. And then there's the ones where it's okay. Maybe we can accept some eventual consistency, so they may take on a more event-driven pattern. And you also spoke there about cascading failures and, you know, the Russian doll of HTTP requests, right, which is a very difficult problem. You know, I think the paper that I

14:45 Service Mesh

14:47 often refer to here is the Twitter paper where they were trying to solve something like this with retry logic, baking it in to, like, a client library, which is now spun out to service mesh. Is that the approach that Wix takes? Do you use service mesh to provide that tooling to your application developers, or have you gone down a different route to kinda help them with retries and circuit breakers? Like, I'd really love to understand how you've implemented that. Sure. So, yes, we do utilize a service mesh, and I don't think there's a standard

15:12 RPC Resilience and Database Choices

15:20 circuit breaker mechanism at Wix. And I think, at the end of the day, it's the responsibility of the specific service, like, as a client. I think service mesh, in our case, mostly helps with easy discoverability and easily addressing, right, the correct service, and not needing to worry about the request getting to the correct place. And so in that respect, what we do try to enforce in these requests, usually, is that the default time that a request has to get a response is quite short. I don't remember the exact amount of seconds,

16:04 but it's quite short. So you have to have a good reason to then increase it. Right? Because the longer the response can potentially be allowed to take, the more problematic it is for the health of the distributed system. Right? So the time budgets here are quite constrained. And if a service does not meet those standards, then they need to work hard on the performance, right, improve their RPC endpoints. And if their database is starting to act up, then they need to investigate how to solve it and maybe

16:44 switch to a different technology and stuff like that. But I would say that for Wix, a lot of the services don't necessarily suffer from scaling issues because of the funnel I talked about. So we have, like, the heavy-duty services that have very specialized data storage solutions, and others usually are quite fine with MySQL. So MySQL, in the last few years, introduced a document database. So because we were heavily invested in MySQL, we took the ability to now work with documents and JSON in MySQL with both hands, and we have a database layer on top of

17:21 that. So it's really simple and fast to create new domain entities and to persist them and update them. And in terms of the latencies and performance, usually, it works quite well, and they get their own tables. So sorry about the rabbit hole, but maybe I'm my own Russian doll here. But in terms of getting the best performance and resilience, it's up to the developer at the end of the day. Like, if I'm starting to notice that my KPIs are starting to be problematic and it's because of this

18:07 service, then I may reach out to them and ask them to fix their performance if they haven't noticed it. And another way I can do it is I can consume their events and have my own materialized view, and then just take the responsibility for performance completely on my end, which is, like you said, if you can do it in an eventually consistent way, then it's definitely an option. And I think at Wix, like, in 99% of the cases, eventually consistent is good enough. Okay. There was a lot there in that answer.
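Editor's sketch: the "consume their events and have my own materialized view" idea above can be illustrated roughly as follows. This is a minimal in-memory sketch, not Wix's implementation; the `SiteEvent` shape, the `version` counter, and the `SiteView` class are all hypothetical names introduced for illustration, and a real consumer would read from Kafka and persist to a database.

```python
from dataclasses import dataclass


@dataclass
class SiteEvent:
    # Hypothetical event shape, assumed for illustration only.
    site_id: str
    kind: str      # "published" or "deleted"
    version: int   # assumed to increase with every change to a site


class SiteView:
    """Local, eventually consistent copy of the fields this service needs."""

    def __init__(self):
        self._rows = {}

    def apply(self, event):
        current = self._rows.get(event.site_id)
        # Skip duplicates and out-of-order deliveries so replays are harmless.
        if current is not None and current["version"] >= event.version:
            return
        self._rows[event.site_id] = {
            "published": event.kind == "published",
            "version": event.version,
        }

    def is_published(self, site_id):
        # Answered from local state: no RPC to the owning service is needed.
        row = self._rows.get(site_id)
        return bool(row and row["published"])
```

Reads are served locally, so the owning service's latency no longer sits on this service's request path; the trade-off is that answers may briefly lag behind the source.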

18:45 So let's cover a few things and remove a few assumptions for the audience. Right? Yeah. When you interview me, you need to keep a notebook. I have been taking notes. You'll have heard my keyboard. But yeah. We both mentioned service mesh. You confirmed there is some service mesh, and you're using some API surface of that. So I just wanna clarify. Does Wix run all these 2,500 services on top of Kubernetes? Yes. Right. Okay. Cool. You also spoke... Short answer. Yeah. Well, we're gonna get to the harder stuff. Don't you worry. Plenty of time for

19:19 us to chat. But you spoke about the application developers and their responsibilities, you know, where they get to select their own tooling, their own databases. They were responsible for client-side retries and such. You're on the back-end infra team. Right? You lead the back-end infra team. So I'd love for us to understand how the responsibilities at Wix work with regards to what the back-end infra team provides. Is it just cluster? Is it SRE semantics? Is it observability? Is it guidance? Are you embedded within application teams? Like, can you go into just how

19:31 Platform vs. Application Developer Roles

19:52 you work with the devs and how you put this all together? Sure. So there are two levels of infrastructure for back end at Wix. There's the more, I would say, low-level part, which says, okay, we're running Kafka. How can we translate that to work well with microservices? Or, okay, we're running MySQL. How can we simplify it for, like, routing layers on top and make it easier to work with? And also caching and stuff like that. So we have a caching solution that is partially based on AWS DynamoDB, like the second layer. So I would

20:39 say this is, like, the level that translates cloud infrastructure to microservices. And then on top of that, we have the, I would call them the platform engineers maybe. Their goal is to make, like, building a microservice super simple and fast. Okay? So the journey for the developer usually starts with writing the interface in protobuf. And it's not only the interface of the service, like what RPC and Kafka endpoints are part of it. But, also, a lot of the tweaks are done in proto, and then code is generated behind the scenes for it to work.

21:19 And then you complete a little bit more configuration in code, and you can focus on the business logic. So because you have a database layer that you work with, and I'm talking about, like, the 80%. Right? It's not like each and every service at Wix, especially not the, like, really high-scale ones. They can't operate under these templates. Right? But for the 80%, like, we have a lot of verticals, a lot of various segments that have a little bit less traffic. And all of these tools and this framework really allow them to focus on, like, the

21:54 domain challenge. Right? How can we make sure that we provide the best booking solution and make it so you can have all these cool features like sending reminders and starting soon, and be able to easily rebook classes and maybe suggest classes that are similar. Right? So all of these rich business domains, they can focus on that. And what do they get behind the scenes? So they get the auto-generated RPC endpoints. So all they need to do is fill in what action to take when they get such a request and how

22:36 to handle and process an incoming event. Right? They need to write that in. And if they need to interact with the database, usually, they just work with our simple DB layer that I mentioned before. It works with the document flavor of MySQL and with JSON support. So what happens? You get your RPC or Kafka domain entity, and you use another tool that you can employ here, which is auto-mapping to your data layer entity. Right? And on that data layer entity, you can easily add indexes if needed to get better performance and stuff like that.
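Editor's sketch: the auto-mapping step described above, from an RPC or Kafka domain entity to a data-layer entity stored as a JSON document, might look something like this. It is an assumption-laden illustration rather than Wix's tooling: `BookingRequest`, `BookingDocument`, `auto_map`, and `to_json_document` are hypothetical names, and the mapping rule (copy fields whose names match) is the simplest one that fits the description.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class BookingRequest:
    # Hypothetical RPC/Kafka domain entity.
    booking_id: str
    class_name: str
    attendee_email: str
    trace_id: str  # transport-level detail that should not be persisted


@dataclass
class BookingDocument:
    # Hypothetical data-layer entity, stored as one JSON document.
    booking_id: str
    class_name: str
    attendee_email: str


def auto_map(source, target_cls):
    """Build a target_cls instance from the fields of source with matching names."""
    wanted = {f.name for f in fields(target_cls)}
    values = {k: v for k, v in asdict(source).items() if k in wanted}
    # Raises TypeError if the target needs a field the source does not have.
    return target_cls(**values)


def to_json_document(entity):
    """Serialize a data-layer entity to the JSON text a document column would hold."""
    return json.dumps(asdict(entity), sort_keys=True)
```

For example, `auto_map(BookingRequest("b1", "Pilates", "a@example.com", "t-9"), BookingDocument)` drops `trace_id` and keeps the three persisted fields.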

23:18 So the idea is to really abstract away the intricacies of the SQL queries or stuff like that. And you also will get built-in, like, PII encryption when needed, or the built-in notifications and segmentation needed for GDPR and stuff like that. Right? And now we're getting to the data locality realm where, because of other regulations, you need to keep something only in Europe or only in the US and stuff like that. So all of that can be taken care of for you seamlessly. Yeah. So if I recap, we get, like, the platform engineers giving

24:00 you really simple interfaces to auto-generate the code. And we're actually really excited about the possibilities of large language models like ChatGPT to harness all the great work that's been done so far and take it to the next level and increase velocity even further. Okay. Awesome. I always like to just give, like, that ten-second recap to make sure that even I understand everything, and we haven't messed up on any of the good details there. You've got the low-level cloud infrastructure team, the back-end infra team, where you're providing the Kubernetes clusters, the cloud

24:22 Building Kafka Infrastructure and Proxies

24:36 resources, and all that stuff. Right? Yeah. So that's another part of Wix. Wix is a big company. So, like, the cloud infrastructure is run, like, outside of the back end. And then, like, I'm part of the, you can say, cloud infra for the back end. And, actually, we developed a lot of cool stuff on top of Kafka. Right? Because you wanna make sure that you get the best performance out of Kafka possible. And to solve all these tricky edge cases of, oh, wait. For some reason, the service pod wasn't able to reach the Kafka

25:17 broker, we added a resiliency layer here. And for the consumer side, we added resiliency because we implemented all kinds of retry strategies built in. It's really easy to configure. And so we have a rich set of, like, event-driven messaging infrastructure on top of Kafka. And we also have self-service tools to manage issues in your production environment. So things like easily pinpointing an event that you're interested in, in our control plane that we built, or investigating why some partitions in Kafka have a consumer lag on them

26:04 and seeing what events potentially got stuck in the processing, and also being able to skip or replay events. So you can see that it's a bit more, like, infrastructure-y and production-y on that level. And the top level on top of that is, like, having the frameworks in place to, like, write a microservice skeleton in a few minutes and focus on the business logic. Okay. Great overview there. So I think we've covered a lot of, like, the back-end platform engineering, providing tooling. We've got the application developers over here. They only really need to focus on their business

26:48 logic. You know, especially, and I wanna kinda look back into event-driven architectures. Right? So we've got all these application developers. And I don't know how many application developers you've got. Right? But I'm gonna take a guess. You've got two and a half thousand services. Let's assume any team is responsible for, say, 10. That's 250 teams. Let's assume five persons per team. We're potentially talking about 1,250 developers. Is that anywhere close? I think it's in the ballpark. Yeah. Okay. Cool. So, you know, that's a lot of people who hopefully focus on their domain

27:15 Schema Evolution & Versioning

27:23 logic. But, you know, based on the conversation we've had so far, I'm curious. We've got event-driven systems, which are notoriously hard. Right? They've got a service. It's emitting an event, and they decide, oh, I'm gonna change the structure of this event. And they're sitting there going, is this an upcast? Do I create a v2? Is this a new event? Do they get support from, like, the back-end teams, the platform teams, the people with experience? Like, are they expected to know all of this, or do you guide them on this? Do you support

27:52 them for that? So I think one of the subjects here is schema evolution and versioning. Right? So definitely, we have, like, design guidelines and platformization guidelines on how to make sure that when you design your microservice, you write a forward-looking API, but also one that can evolve over time. And also, you know, because of the power of protobuf, you can easily add linters that make sure that you don't easily mess that up. And because of the auto-generated code, it's built in with the /v1 for the first version. So it will be

28:33 easy to do a v2 if needed. So we don't really have, like, automatic linting for the backward and forward compatibility of each of the domain entities that we send over the events. But with protobuf, you pretty much know that it will be ill-advised to just go and delete a field. It's better to, like, mark it as deprecated and add another field. And so there are definitely guidelines on how to do these minor changes to your schema. And I think, I don't remember, there are some basic rules that enforce some stuff.
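Editor's sketch: the guideline above (do not delete a field; mark it deprecated and add a new one) is the kind of rule a compatibility lint can check. The speaker says Wix does not run automatic compatibility linting on event entities, so the following is purely an illustration and not Wix tooling. The schema representation (field number mapped to a `(name, type, deprecated)` tuple) and the function `check_compatible` are assumptions, and the type check is deliberately stricter than protobuf's wire format, which tolerates some type changes.

```python
def check_compatible(old, new):
    """Return a list of problems found when evolving schema `old` into `new`.

    Each schema maps a protobuf-style field number to a
    (name, type, deprecated) tuple. This representation is hypothetical.
    """
    problems = []
    for number, (name, ftype, _deprecated) in old.items():
        if number not in new:
            problems.append(
                f"field {number} ({name}) was deleted; mark it deprecated instead"
            )
            continue
        _new_name, new_type, _new_deprecated = new[number]
        if new_type != ftype:
            problems.append(
                f"field {number} ({name}) changed type {ftype} -> {new_type}"
            )
    return problems


# Hypothetical order event, version 1.
v1 = {1: ("order_id", "string", False), 2: ("total", "int64", False)}
# Safe evolution: keep field 2 but deprecate it, add field 3 alongside.
good_v2 = {
    1: ("order_id", "string", False),
    2: ("total", "int64", True),
    3: ("total_cents", "int64", False),
}
# Unsafe evolution: field 2 is simply gone.
bad_v2 = {1: ("order_id", "string", False)}
```

Here `check_compatible(v1, good_v2)` returns an empty list, while `check_compatible(v1, bad_v2)` reports the deleted field.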

29:21 But I can't recollect them right now. So some of this is the responsibility of the developer. But, like, part of the onboarding process at Wix is how you write a product which is API first. Right? It used to be different. It used to be like, okay. So we have some UI and some back-end service behind it. And so the UI is going to ask all kinds of things, so we need to build an API for it. But because we've evolved Wix into a platform that also has developers on top of it

29:57 and because it's a bigger company now, so you need developers to communicate, you don't want, like, Slack or Microsoft Teams to be a barrier to understanding how to consume some events from another service. Right? So we want everything to be well documented, easily discoverable, and easily evolved. Right? So I can safely stay on the older version for a while, especially if I'm an external developer. Right? And then safely migrate on my own time to the new version. And it will be the responsibility of the service owner to make sure that both

30:38 versions are supported. Okay. And are you leveraging, like, just because you mentioned kinda discoverability there. Right? Are you leveraging anything like Backstage, like an IDP that aggregates and collects all of these service definitions and the event definitions and the protobufs and gives someone a place to go and search them? Sure. So, actually, because of, like, the unique nature of Wix's platform, a lot of work has been done to expose things to external developers, and we built it by ourselves. So there was no point in, like, not doing the same for internal developers. So there are products out there. Right?

30:40 Service and Event Discoverability

31:17 For instance, especially for events, you have the Confluent Schema Registry. And there's also Apicurio, I think, or something like that from Red Hat, like open source. And those are great tools for managing and discovering event schemas. Specifically for Wix, we built our own custom services. So each time a new service has event definitions in its protobuf, they are transferred automatically to the main repository that keeps all of the definitions. And you can also listen, if you're, like, a CI/CD service or some other build-related service that cares when

32:09 new events are created so that you can do stuff with them. You can listen to that and create all kinds of cool stuff on top of that. And, also, I think for RPC communication, with the service mesh, it's also important to know that a new service has been created or updated, and so events are emitted on that as well to allow the communication to be simple there as well. And so because everything is based on protobuf, you get the schema. You get well-documented stuff. And once you translate it, hopefully, to a strongly

32:53 typed language like Scala or TypeScript, you also get the type safety when you consume it or call the RPC endpoint as well. Awesome. Let's kinda jump back to something you said earlier as well. So I love that answer. Right? You're saying that you're a developer company. You've got public-facing APIs. There's a bit of dogfooding. The discoverability is the same as for the people that use the platform, which is awesome. That's kind of what I took away from that, so I hope that's correct. But you also said earlier that, you know, sometimes

33:25 you have to approach an application team because they may not have noticed that the round-trip time on an API request is, like, growing exponentially. It's getting out of control. Whatever that regression is, it could be anything. Right? I'm assuming they maybe don't know a lot of the time that they have dashboards, metrics, traces, logging. I am assuming these are potentially provided by, like, the templates you mentioned, which are provided by the platform team. So they're just taking this cookie-cutter boilerplate and injecting their domain logic, maybe not aware of all the stuff that's

33:56 going on in the background for them. It got me kind of thinking. Like, is Wix a polyglot organization? If I was new at Wix, I've got a service that I wanna write, and there's no template for, like, Rust or Elixir or Pony. And I'm like, hey. I really wanna do this in Pony. Are you gonna just tell me no, or is there a process there to bring on, like, new technologies? Sure. So every technological stack, a new one, incurs, like, a cost in an organization. Like, you have one developer that's super

34:03 Polyglot Architecture Approach

34:33 excited about Elixir, and then writes a service in Elixir. But what happens when they decide to leave the company? Like, who is going to maintain this Elixir service? So we started off on the JVM, mostly Scala. And then it was apparent that, because there are a lot of front end developers writing JavaScript and you want them to easily write middleware services and rendering services, relatively quickly, like, Node.js became an additional first class citizen, like with Scala. And in recent years, there's starting to be, like, a bigger and bigger effort to expand, like, the offering. So why

35:25 not write in Java and Go and Python as well? Now in order to achieve that, like I mentioned, there's a very complex, let's call it, platform framework and a lot of stuff, and that's on the JVM, so it's not really so simple to achieve that. So one way that we're experimenting with now is kind of a sidecar pattern with two containers, where we have the host container, which is running the JVM and all of the cool microservice infrastructure that I talked about. Then you have the guest container that can be a Node.js

36:07 container or a Python or a Go container. And they communicate via gRPC. So you just specify the proto, because proto can be in any language, and you can implement your event handler or your RPC endpoint in the languages I mentioned, and it will get invoked from the host that will actually do the heavy lifting. Another approach we're contemplating, perhaps, is utilizing GraalVM also to get that working. But, like, in terms of cost benefit analysis, I think, well, definitely, I don't see, like, the engineering of Wix going to, hey. Just write in any

36:57 code, like, language you ever saw because it's cool. Because at the end of the day, there's this production system that needs to be well maintained and get good performance and keep our customers happy. Awesome. Thank you. Okay. So based on everything you've just said, you know, there's a lot of JVM happening, and you mentioned Graal as well. Right? But you're definitely heavy JVM users. Your event driven architecture on AWS, you've got Kafka NMX, which is notoriously difficult to kind of scale, it's resource intensive. And the costs must be rather astronomical, I

37:17 Scaling and Cost Management for Kafka

37:37 would imagine, for a company of Wix's scale. You know, you mentioned that wonderful number at the start of the episode, 70 billion, I think it was, events per day. So maybe you can dive into that in a little bit more detail and tell us what's interesting from your perspective. Sure. I would love to do that. So, basically, over the last couple of years, we've been trying hard to see how we can keep the costs down. So I mentioned that new microservices spin up every week, but we need to see how the infrastructure costs don't

38:13 increase linearly with that. Right? So we have our infrastructure on top of Kafka, for instance. And we saw that we need to have a lot more Kafka brokers because we have so many microservices, and all of them need to connect to the Kafka brokers and understand, like, is there something for me to consume now? Right? Now, not all pods of all services are gonna get stuff they need to consume. But they're going to run our infrastructure and waste CPU and memory in order to do that. So in order to mitigate this, we actually

38:59 created three services, one for each of the basic Kafka features. So we have a Kafka producer, a Kafka consumer, and a Kafka administrator. So for each of them, we created a proxy, and in this way, we can really, really limit the amount of connections that we need for the brokers, and then we can really limit the amount of brokers that we actually need to run. So while the admin proxy and producer proxy are really kind of straightforward to do because they are stateless, for the consumer proxy, it's been a real adventure

39:34 where you need to keep, like, the state of which consumers are now connected and to shard and load balance between the pods and have a lot of work on the stateful nature of consuming from Kafka, which is really exciting and challenging. And we're also trying to reduce Kafka-based traffic because that's also quite expensive for us. So now that we have our proxy service, we're considering doing all kinds of events caching and stuff like that in order to, if you have a lot of consumers for the same event,

40:22 we can reduce the pressure on Kafka and just sometimes get caches from ourselves, so from our proxy service. So a lot of really challenging technical stuff we're doing now in order to reduce our Kafka bills, basically. Wow. Okay. So the problem statement there, right, is you have so many consumers and producers of Kafka connecting to all of the Kafka brokers that you're having to scale out the brokers horizontally to handle them and their connections. So in order to avoid that, you know, because that's a cost thing. Right? Running the JVM on

41:01 a container on top of a Kubernetes cluster is quite a chunky thing to do, and then you've got that across so many services. Yeah. Okay. I understand the problem statement really well. For the solution, you said you wrote three proxies. Right? But it sounds like they're slightly more sophisticated than just a proxy. Like, they're actually their own little orchestrator within the cluster. So all the consumers and producers, sorry, I'm cutting you off here, but all the producers and the consumers only ever speak to your proxy, and then it handles some, I guess, like,

41:32 delegation of all the Kafka workloads. Yes. So when you have a lot fewer producers, actual Kafka producers, connecting to the Kafka brokers, you don't need the 2,500 services times the number of pods they have to connect. You only need a much smaller set of producers to actually do the work for them, and then you really limit the number of connections. That's how you save producer connections. And over on the consumer side, so, yeah, you consume for them, and then you delegate to them via RPC, which is much, much less cost intensive

42:15 in our data centers. I hope that clarified it. It does. Yeah. It's really interesting. Just fun problems and fun solutions that you get to work on. And, really, I think it's just exciting that you're able to kind of share that with the audience. I think they're gonna take a lot away from that. So thank you for sharing that with us. Yeah. I enjoy coming to work every day. Awesome. We have gone over time, so thank you for being patient and answering all these questions. I mean, I could easily talk to you for another hour. I think

42:38 Essential Tools for Natan

42:48 we'll maybe do a part two in a little while if you're up for it. But let's finish with something a bit more fun just to, you know, get us on the way. You mentioned a lot of tools in today's episode. Right? From Kubernetes to Kafka to MySQL to Scala to Graal. Like, if there were only three tools that you could pick and everything else would just be removed from you, like, what are the three most important tools to you and Wix engineering that you just could not live without? For myself or for Wix

43:16 engineering? Let's do both. Let's start with you. Okay. For myself, I actually really enjoy working with the functional library on top of Scala called ZIO, because it's really tuned for high concurrency and asynchronous work, and you get all the benefits of functional programming. So I really enjoy working with it every day. Don't take it away from me. And I think Kafka has proven to be very powerful and versatile. And so we solve so much with it in Wix. So I definitely don't see, like, Wix without it.

44:00 That's like the two biggest ones for me. Alright. Awesome. Okay. And do you wanna add any more tools, or do you wanna leave it as ZIO and Kafka? I think those are the ones. Like, if you're on the JVM or, like, not satisfied with the concurrency and asynchronous features from your language, I definitely recommend checking out the ZIO library on top of Scala. Awesome. Thank you. Alright. Well, thank you so much for spending time with me today answering all these questions, sharing all of your knowledge and all the

44:34 Guest Plugs and Conclusion

44:39 interesting problems that you have in Wix engineering, and most importantly, how you're navigating and solving them. I'd like to give you just a moment. Feel free to share and shamelessly plug Wix, yourself, your blog, your Twitter, your, if you have a daily farm, like, anything you wish, please feel free to share. Well, first of all, it was really a pleasure being on the show, and you're a great host. And I really enjoyed your questions, kept me on my toes. And a lot of things I talked about today are on my website, so at natansil.com.

44:44 Plugs

45:16 Blog posts ranging from, like, event driven architecture to more, like, Kafka-specific. I talked about skill management. But I also have other stuff there related to software engineering in broader terms. And also, I do a lot of talks at a lot of conferences, so you could check out all that at my website as well. And, yeah, you can find me on Twitter and LinkedIn for sure, where I get updates on everything that we do at Wix Engineering, which is interesting. So that's it. Alright. Well, thank you again. I will ensure that

45:55 all of those links are in the show notes for people to check out. So to everyone listening, thank you very much, and we'll see you all next time. Have a great day.
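
The connection-saving idea discussed from roughly 38:59 to 41:32 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Wix's implementation: the 2,500 services figure comes from the episode, while the pods-per-service, proxy pod, and broker counts are made up, and `ConsumerProxy`, `broker_connections`, and the `orders` topic are hypothetical names. The callback call stands in for the RPC hop to the consuming service, and the single broker read per event reflects the caching idea described as still under consideration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

SERVICES = 2_500        # figure quoted in the episode
PODS_PER_SERVICE = 4    # assumption, for illustration only
PROXY_PODS = 50         # assumption, for illustration only
BROKERS = 10            # assumption, for illustration only


def broker_connections(clients: int, brokers: int) -> int:
    """Worst case: every client holds one connection to every broker."""
    return clients * brokers


# Every pod of every service talks to Kafka directly.
direct = broker_connections(SERVICES * PODS_PER_SERVICE, BROKERS)
# Only the proxy pods talk to Kafka; services talk to the proxy instead.
proxied = broker_connections(PROXY_PODS, BROKERS)

Handler = Callable[[bytes], None]


@dataclass
class ConsumerProxy:
    """Toy stand-in: handles each broker event once, then delegates it."""

    subscribers: Dict[str, List[Handler]] = field(default_factory=dict)
    broker_reads: int = 0

    def subscribe(self, topic: str, handler: Handler) -> None:
        self.subscribers.setdefault(topic, []).append(handler)

    def on_broker_event(self, topic: str, payload: bytes) -> int:
        """Called once per event read from the broker; returns deliveries made."""
        self.broker_reads += 1
        handlers = self.subscribers.get(topic, [])
        for handler in handlers:
            handler(payload)  # in the setup described, this hop would be an RPC call
        return len(handlers)


if __name__ == "__main__":
    print(f"direct connections:  {direct}")   # 100000 with the numbers above
    print(f"proxied connections: {proxied}")  # 500 with the numbers above

    proxy = ConsumerProxy()
    received: List[bytes] = []
    for _ in range(3):
        proxy.subscribe("orders", received.append)
    deliveries = proxy.on_broker_event("orders", b"order-created")
    print(f"{proxy.broker_reads} broker read, {deliveries} deliveries")
```

The sketch leaves out everything that makes the real consumer proxy hard, as described in the conversation: tracking which consumers are connected, sharding and load balancing across pods, and offset management.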

Meet the Cast

David Flanagan

@rawkode

Laura Santamaria

@nimbinatus

Natan Silnitsky

@natansil

Documentation

Confluent Schema Registry

Code

ZIO Scala library

Additional Resources

Wix Engineering Blog

Natan Silnitsky website
