About this video
What You'll Learn
- Understand how OpenEBS implements Container Attached Storage for Kubernetes by managing storage controllers in application pods.
- Explore Mayastor's architecture using SPDK, DPDK, huge pages, and NVMe-oF replication through Nexus and replicas.
- Deploy and validate an OpenEBS Mayastor cluster by configuring pools, creating PVCs, and running fio storage performance tests.
Paul Burt and Jeffry Molanus join Rawkode for OpenEBS and its new Mayastor engine. They cover Container Attached Storage, DPDK/SPDK, huge pages, and NVMe-oF replication via the Nexus, then deploy Mayastor on a Kubernetes cluster and benchmark it with fio.
Jump to a chapter
- 0:00 Holding Screen
- 2:10 Introductions
- 2:13 Introduction and Guest Introductions
- 3:35 Slides - Introduction to OpenEBS
- 4:00 What is OpenEBS and Container Attached Storage?
- 5:20 Industry Trends Driving Cloud-Native Storage
- 8:56 High Performance Storage & User Space Development
- 10:32 Introducing Mayastor - OpenEBS's New Engine
- 10:47 OpenEBS Adoption and Use Cases (Local/Replicated PV)
- 12:53 Mayastor Goals: Rethinking Storage in User Space
- 16:07 Deep Dive: DPDK and SPDK Explained
- 16:15 What is DPDK (Data Plane Development Kit) and SPDK (Storage Performance Development Kit)
- 22:30 My summary of what we've covered
- 31:00 Why Huge Pages / Enabling Huge Pages on Linux
- 38:40 Deploying OpenEBS with Mayastor
- 44:00 Fixing my unhealthy cluster
- 47:30 Adding the nvme kernel modules
- 52:30 Configuring Mayastor
- 59:30 Requesting a PersistentVolumeClaim
- 1:11:30 Deploying fio to run some benchmarks
- 1:18:00 Closing thoughts
- 1:22:30 Summary of OpenEBS & Mayastor Concepts
- 1:24:38 Mayastor Maturity and Production Readiness
- 1:26:10 Mayastor Architecture & Replication (Pools, Replicas, Nexus)
- 1:31:14 Live Demo: Cluster Setup and Prerequisites (Huge Pages)
- 1:38:49 Live Demo: Troubleshooting Kubernetes Cluster
- 1:50:03 Live Demo: Installing OpenEBS Mayastor Components
- 1:51:21 Live Demo: Verifying Installation (MSN, NATS)
- 1:53:06 Live Demo: Configuring Mayastor Storage Pools
- 1:58:11 Live Demo: Creating StorageClass and Persistent Volume Claim (PVC)
- 2:04:10 Live Demo: Examining the MSVolume Architecture (Nexus, Replicas)
- 2:11:35 Live Demo: Testing the Volume with FIO
- 2:17:07 Conclusion and Future Outlook
Full transcript
Generated from the English captions. Timestamps jump the player to that moment.
2:13 Introduction and Guest Introductions
2:13 Hello and welcome to today's episode. This is Rawkode Live. I am Rawkode, slash David McKay, and I am very pleased today to introduce a whole host of people, as we can see here. We have six, five, sorry, six including myself, guests today, and I'm gonna delegate the introductions because it's too many new names for me and I'm not very clever. So I'll introduce Paul, Paul Burt from OpenEBS, MayaData. Please take it away. Yeah. Howdy. Thanks for having us. We're deeply invested in open source, so we figured we'd bring the open source community with us. It's kind of the plan here.
2:50 So I'm the director of community and marketing at MayaData. We build OpenEBS and Litmus Chaos, which are kind of our top tier open source projects that you may recognize. I think we're gonna explore Mayastor, which is a kinda newfangled component of OpenEBS that does some really cool stuff with technology from Intel and Linux Foundation projects and other stuff. So we have Kiran Mova, who is one of the cofounders of MayaData along with us, and he's the architect of OpenEBS. We have Jeffry, who is our CTO at MayaData and works deeply on the
3:28 Mayastor project. And then Glenn Bullingham and Jan, I think, are product management and architect, software developer, respectively. So we've got a full house here today. Yeah. We're setting a record for the most guests I've had on the stream at any one time. That's awesome. Now I believe we're gonna start with a few slides just to talk around what OpenEBS is. So, Kiran, I believe you're gonna share your screen. Yes. Thanks, Paul and David, for having us here today. Let me see if I can figure out how to share the screen here. I haven't prepared slides
4:00 What is OpenEBS and Container Attached Storage?
4:13 specifically for this event, but I do have some slides that I want to share from the last presentation that we had at KubeCon. It'll be a trimmed down version of it, before we get into the most interesting stuff on Mayastor, just to set up some context on the origins of OpenEBS and how we came to be working on Mayastor. So this is a slightly, you know, older survey now. It's from 2019; the 2020 one is going to be published at the upcoming North America KubeCon. But OpenEBS is one of the most evaluated storage options in open source for stateful workloads on
4:56 Kubernetes. And we call the storage pattern that we have implemented container attached storage. And while we are the leading open source example of container attached storage, there are some commercial options available as well. You know, like Portworx, which was recently in the news, is one that was also featured in the survey. So the origins are pretty simple. Like, I think Conway's law is kind of getting realized right now across all industries. In some recent article that I've read, there are no more, I mean, no non-data companies kind
5:20 Industry Trends Driving Cloud-Native Storage
5:36 of survive now, and every company is a data company. Whether it's Bloomberg or could be Ben's, every organization has become like a data organization as well. This is a story from a CNCF end user report which kind of represents the way teams are organized at Bloomberg and how Conway's law can get applied there with respect to disaggregated teams: a small number of teams working on several thousands of microservices, and a much, much smaller team, what do you call it, a platform or SRE team, supporting them. And this is possible with Kubernetes and other CNCF technologies
6:20 that enable these teams to automate their entire infrastructure as a service within their organizations. So Kubernetes started off as an orchestrator for stateless workloads, but it soon transformed, and it's enabling data to be run in Kubernetes as well. And we also see a shift in the way databases are architected as teams transform from specialized or, you know, layered teams to more functional teams. Even the types of databases that are getting deployed within these are different. It's not a single database that all the data goes to, but rather, depending on the
7:01 service that you are providing, your databases are optimized for that. We now tend to see a variety of databases in an organization's stack. So another reason, alongside the agility that's driving teams to go to microservices and disaggregate, is this notion of data gravity. Once we end up using some existing technology or data layer, even though we are becoming better with agility in terms of application development and the choice of databases, the real data gets stored on some platform, and you kind of get locked into that
7:45 one. And this can actually slow down growth. This is one of the reasons why we thought you should not get locked into any kind of storage when you're running your applications. And we'll see how we try to provide you this, like, vendor agnostic way of setting up your data infrastructure. And the other thing is about the new hardware that's available and the new technologies or, like, new techniques that are available to make optimal use of this new hardware. For example, this shows a 96 core machine with close to one terabyte of RAM available on the
8:27 node. And none of the existing technologies can fully exploit the power of such boxes. Right? It's almost like, you know, when you're talking about processing the IO, how do you go from stacks that are optimized for working on four core or, like, eight core CPUs to 96 core CPUs? Right? Things around storage have also been changing. A lot of the stacks around storage infrastructure have been rewritten, in fact, using io_uring or lockless techniques. What we are trying to see is, okay, now we have the NVMe protocol that can work
8:56 High Performance Storage & User Space Development
9:12 at a much faster speed than the traditional protocols. Can we make use of the newer hardware and the transport protocols that are faster and enable you to process more IOs, to deliver you high performance storage? Another interesting aspect was the shift towards user space, to provide you with a better software delivery experience while not compromising on the performance itself. User space tends to make us believe that, okay, we are going to lose out on some performance. But I think we want to show otherwise via the Mayastor project, and Mayastor isn't the first one that we
9:57 attempted. We tried to do this with a couple of other projects before, and this puts together all those learnings from the previous projects. Most of us on the call today come from a ZFS background, which was focused on maintaining data consistency, but at the expense of, like, performance. We tried to take those learnings to build Mayastor, which is developed on open standards, provides you the performance, and optimizes the infrastructure that you have. So a rewrite was inevitable, and we will try to walk through Mayastor, which is our latest engine, and see how some of these aspects that we developed really
10:32 Introducing Mayastor - OpenEBS's New Engine
10:43 came to life with that. Just a few more slides. Like, you know, these are the early adopters in the catalog that we have for the OpenEBS project. Actually, the list is quite big now, with 25 more public references. And we also have many more customers and partners talking about OpenEBS, or container attached storage, as a way to deploy your applications. And we actually support both: we make it easy to run local PVs, as well as supporting replicated storage. Right? The reason for doing that: cloud native workloads
10:47 OpenEBS Adoption and Use Cases (Local/Replicated PV)
11:30 tend to have the distribution logic built into them. Many times, when they're getting started, they just want local storage which operates at the speed of the disk. So for these, we make it easy to, you know, consume the storage that's available on the nodes, carve it out, and provide it to the applications without adding data services like snapshot capabilities or, you know, a few other things like optimized backups or incremental backups, that kind of stuff. But as we have noticed, as people start using this, they get really comfortable. But once they
12:10 run the applications for a longer period, they tend to ask for other capabilities as well, you know, like snapshots, backup and restore, and encryption. Those things are what are getting built into our replicated stack via Mayastor, which operates at the speed that local PVs can provide, along with providing the additional data service capabilities. With that, I will hand it back to David and Jeffry in case they want to add something more to this, and then we'll walk into the live demo session. Yeah. I guess thanks, Kiran, for the quick introduction. I guess the only
12:53 Mayastor Goals: Rethinking Storage in User Space
12:59 thing I want to add to it is that the idea with Mayastor is that we wanted to rearchitect a storage stack in user space, because it runs in containers. And so you need the implicit ability to decouple yourself from the kernel, because all kernels are not the same throughout the cloud providers. And the focus there was to embrace the new technologies like NVMe and NVMe over Fabrics. And the only real way to squeeze out these performance numbers is to use these lockless models. And that's not because it's cool. It's actually very hard. It's
13:41 very, very hard. Asynchronous programming is a nightmare, really, no matter what language you use. But the interesting thing that has happened recently, not that recently, but, you know, relatively recently, is that storage is now much faster than the CPUs. And that in history has never ever been the case. And so it requires some rethinking, and that's what Kiran alluded to with the hardware changes and io_uring, which is a new interface to the Linux kernel, for example, that has a different way of submitting work to the kernel. And that
14:16 is because the hardware is much, much faster these days. But in order to make use of io_uring, for example, it's not like, oh, I install io_uring and off I go. It's like, no, no, no. You really have to rewrite your application, unfortunately. Right? So all these things together, cloud native design, rewriting the software, languages with concurrency primitives built in, like Go with go func instead of pthread_create, you know, these type of things. It's a very exciting time, I think, and for us, it provided justification to, you know, at the very least,
14:54 give this a, you know, a shot. So Mayastor is, you know, it's still early, but we've put a significant amount of work into it already. And, yeah. So that's roughly it. I hope that makes sense. Yeah. I have some color. I feel like a fifth grader who did, like, a book report last week and I get to present it now. Something that has helped this all make sense to me is thinking in terms of, you know, what Kiran was saying: direct attached storage is kind of the easy
15:25 thing you reach for, but it lacks features. And eventually, in certain projects, you may get to a point where you want network attached storage. You want features like encryption and all the fancy, sorry, I've got an angry, hungry cat in the background. That's misfits for you. So you're looking ultimately for getting started simply, but growing into something that's a little more professional or a little more featureful down the line. And OpenEBS has been designed such that we can support multiple storage solutions. So we have a ZFS based solution. We have a Longhorn based solution. And what
16:02 we're exploring today is Mayastor, which is based on kinda groundbreaking technology. So there are projects like DPDK and SPDK. Are you familiar with those, David? I'm cheating, actually. I asked you this before the call. Yeah. I am definitely not familiar with them. Can you, well, let's expand the acronyms first. What do those mean? Alright. So, good question. So DPDK stands for Data Plane Development Kit, and the emphasis here is on development kit. So it's not a turnkey solution that you just, you know, pull in and off you go and everything is hunky dory
16:15 What is DPDK (Data Plane Development Kit) and SPDK (Storage Performance Development Kit)
16:40 and is oriented towards networking. And the genesis there is based on the fact that as you get more bandwidth towards your computer system, like hundred gig, 400 gig, or what have you, you typically have an interrupt driven approach. So a packet comes in, the interrupt happens, the CPU stops, processes the interrupt, and then goes on its merry way. But as the rate of packets increases, what you see is that the dominant behavior of the CPU, or the operating system rather, is getting interrupted all the time. Right? So it doesn't make any forward progress because it's
17:15 constantly getting interrupted. What you see is that you get interrupts that are white hot. So it constantly gets interrupted. One way to get around that is to batch things. So you get an interrupt. It's like, okay. I'm not doing anything. Not doing anything. Not doing anything. And then boom. You handle multiple requests in one interrupt, effectively. The downside of that is that while your bandwidth goes up, your latency increases. Right? So that's not ideal. So the idea of DPDK and projects that are built on top of
17:48 it, for example VPP, Vector Packet Processing, from Cisco, was that, well, instead of having the kernel do all that hard work, why not isolate a CPU, one, two, or three, or maybe four, who knows how many you need. Because these days, we have cores to spare. Right? I mean, the new Threadripper has, like, 64 cores, and it's a consumer product. It's, like, an insane amount. So instead of having the kernel do that, what if you do that in user space and isolate one, two, or three cores to do just the networking? And the benefit of that is the CPU
18:25 doesn't do anything other than that. And so the caches, the level one, two, three caches of that CPU, remain hot, if you will. And combine that with some additional technologies like huge pages and whatnot, and you can actually see that it not only outperforms the kernel in certain situations, but turns out to be far more efficient, because the kernel can now, you know, worry about processes and things like that instead of just constantly handling the interrupts. So that's DPDK. On top of that, Intel also has SPDK, which is DPDK plus
19:04 some storage specific protocols like NVMe and iSCSI, virtio as well. But as mentioned, they are development kits, not turnkey solutions. So what we've done is we took those development kits, so to speak, and we are using those in Mayastor to benefit from the work that they already have done, because it's a significant amount of work. It is micro optimization. Like, I've never seen that many optimizations before, where they pad things to cache lines and, you know, it's insane, really. And Intel has very specific software called
19:44 VTune to do that. So I can't really do that. But suffice to say, it's highly optimized for a very, very specific task. And that's what these development kits stand for, if that makes sense. Yeah. I mean, I understood at least some of the words. Oh, there you go. But the short version for someone like me, who doesn't have the deep kinda expertise of the rest of the panel here, is you end up paying a tax when you're using the Linux kernel. It's been in development since 1991. So it's kinda got this tried and true,
20:20 figured out model that requires locking and doing all these kinda slow processes, where you'll pay, like, a 30% tax just by having the Linux kernel manage things for you. There's probably even deeper taxes that you pay, having things sort of filter through the machine that is the kernel. And what these technologies are doing is they're connecting your program in user space, short circuiting the kernel, going directly to the hardware. And DPDK is a solution supported by the Linux Foundation. SPDK, as Jeffry mentioned, is supported by Intel. So there are big names investing in all
20:57 these technologies, and a lot of effort going into making these things really shine. And what we're seeing is they really do shine. If you've got a need for speed, these do some really cool stuff for you. Yeah. There are other examples. For example, ScyllaDB, which is a reimplementation, to a certain extent, of Cassandra, where they leverage that same framework to do some of the things that they're doing, like network IO and things like that. And I also would like to emphasize that it's not a deficiency of the kernel for it not being
21:36 able to keep up, if you will. It's more like it needs to do so many things. It's like, you know, give it a break. Right? And that's basically what it boils down to. So it's not to say that, you know, the kernel is bad. Not at all, in fact. It's just like, yeah, I mean, there's this thing called physics and, you know, you can't beat that. And there is this notion of serialization that needs to happen in certain cases. And therefore, it turns
22:09 out to be somewhat more efficient to do it that way if low latency and high performance is your primary objective. And when you're dealing with these low latency devices or high bandwidth devices, however you wanna call it, then that, indeed, is what you need. So yeah. Okay. Cool. That's all really, really useful. I'm gonna try my best now to take all of that information that we've just covered for the first kind of fifteen minutes and just summarize that as best as I can, and then I promise we will go straight on to the actual hands on portion of this.
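As a rough illustration of the interrupt batching trade-off described in the DPDK discussion above, the toy model below weighs interrupt overhead against the extra time a packet waits for its batch to fill. The packet count, arrival gap, per-interrupt cost, and batch sizes are made-up numbers chosen for illustration, not measurements, and `batching_tradeoff` is a hypothetical helper name.

```python
from math import ceil


def batching_tradeoff(packets, arrival_gap_ns, interrupt_cost_ns, batch_size):
    """Return (total interrupt overhead, mean extra wait per packet), both in ns.

    Assumes packets arrive at a fixed gap and one interrupt is raised per
    full batch. All inputs are illustrative assumptions.
    """
    interrupts = ceil(packets / batch_size)
    overhead_ns = interrupts * interrupt_cost_ns
    # The first packet of a batch waits for (batch_size - 1) more arrivals,
    # the last waits for none, so the average is (batch_size - 1) / 2 gaps.
    mean_wait_ns = (batch_size - 1) / 2 * arrival_gap_ns
    return overhead_ns, mean_wait_ns


if __name__ == "__main__":
    for batch in (1, 8, 64):
        overhead, wait = batching_tradeoff(
            packets=1_000_000,
            arrival_gap_ns=100,
            interrupt_cost_ns=2_000,
            batch_size=batch,
        )
        print(f"batch={batch:>2}  overhead={overhead} ns  mean wait={wait} ns")
```

Larger batches cut the total interrupt overhead but make each packet wait longer on average, which is the bandwidth-versus-latency tension mentioned in the talk. Polling from a dedicated core, as described for DPDK, sidesteps the interrupt cost at the price of reserving that core.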
22:30 My summary of what we've covered
22:47 And I've already realized, which is really good, as you were talking there, that I've made my first mistake in the preparation. So that's gonna be interesting. But we'll call it a knowledge transfer to the audience. I forgot to enable huge pages, but we can do that live. Right? Yeah. Yeah. Sure. Yeah. We do it all the time. Yeah. Okay. So in summary, and Paul, you're gonna keep me right here. Right? Yeah, for sure. So stateful services on Kubernetes are hard because we have ephemeral compute and containers that have to move around. So
23:19 we need the CSI specification, and OpenEBS is an implementation of CSI, which is the ability to provide storage to workloads on our Kubernetes cluster. Now there seem to be multiple storage engines with OpenEBS. There's cStor, which is the one which is tried and tested, and then Mayastor, which is the one which is currently in development but has some pretty wicked performance gains based on the development kits that we were talking about just there. Which I believe is written in Rust, which just adds an extra cool factor, or have I made that up?
23:51 Yeah. Oh, no. You're right. Yeah. It's cool tech done cool. We've got all the latest tech kinda packed in. Yeah. Yeah. We basically took the most cool keywords and sorted them and just collected them together and, yeah. No. But indeed, it's written in Rust, and for various reasons. But one of which is based on the experience I've had with other storage systems, where 99% of the time, the issue was a null pointer dereference or using something that's not there anymore. And Rust is supposed to help you
24:22 catch these things earlier. Yeah. Oh, yeah. Yeah. So the ownership semantics, if it compiles, it won't crash. Those are all really good benefits, I would imagine, for a storage layer. Yeah. Yeah. So we've got our CSI implementation. We've got our multiple back end stores. I mean, the question in my head right now is: we're using Mayastor today and we're gonna show that off. Is that in a position where people should begin to adopt it in production, or is this very much a wait and see kind of thing? So good question. And the thing is
24:54 there are certain factors that come into play. So we talked about NVMe, and we also mentioned NVMe over Fabrics. If not, then it's basically NVMe over a network, whatever that network is. Could be RDMA, fiber, or TCP/IP. And one of the issues that is still there is that NVMe over TCP was only ratified last November. So it is relatively young. io_uring, which is something we use, same thing. It's relatively young. So we are on the border of it's out there, but nobody's really using it. Right? So it needs to, as
25:33 Kiran always says, it needs to cook. Right? We need to cook it. And it requires more cooking. But we are working on it with a team on a daily basis and trying to make it more robust every day. But it still requires some, you know, some love and attention, let's say. But we're getting there rather fast, I would say. Yeah. It's ready for your proof of concept. Let's put it that way. Yeah. Yeah. Alright. Well, I mean, I'm gonna put it in production anyway. So I'll let you know how I get on. Okay.
26:08 So let's finish the summary here. The multiple back end stores can run in two different variations, one of them being local PV mode and one being replicated PV mode. Does that apply to all of the backing stores or just Mayastor? So I think, per kinda what Jeffry was tuning in on, where NVMe over TCP/IP is relatively young, I think currently local PV is the only supported mode for Mayastor. Is that right? Well, so it becomes a little bit complicated, I suppose, to express without my hands to begin with. But
26:53 so what we do is, let's say you have a Kubernetes node and that has local storage. Right? The first thing that we do is that we can, don't have to, and that's where the complexity comes in. But let's assume we always do, for the sake of argument. We take that local storage, whatever it is, slow or fast, doesn't matter. That we are designed to be fast doesn't mean that we can't operate in slow mode. We'll do that just fine. We take that local storage and we can create what we call logical volumes on that storage device such that you
27:27 can create multiple PVCs, in Kubernetes speak, on that device. Right? So we wire that up, so to speak. And the way that we connect to that local storage, so let's say it's a local NVMe device, there are several ways that we can connect to that. One is directly through the PCIe subsystem and user space IO. We'll forget about that for now. Or io_uring, as we talked about, or AIO, which is the asynchronous interface in Linux that has been there since I don't know how long.
28:00 Once we've done that, that's what we call a pool. From that pool, we can create replicas, as we say. And the replicas can be, or ideally are, scheduled across different nodes. Now you don't have to do a replica, but if you want to, you should be able to. That's the idea. Per developer, per workload, you can determine if that is required. So let's say we create a mirror, two replicas. These two volumes are created through our control plane, which talks and understands CSI and creates CRDs in the control plane and whatnot while doing so. And these replicas are exported over NVMe
28:38 TCP, but in user space. Right? So we don't use the kernel for that. Then we create another NVMe controller that connects to those two replicas, and that NVMe controller is where the workload is, which means that the node that wants to write to the PVC connects to this, what we call, Nexus. It writes to the Nexus, and the Nexus then writes to the replicas, wherever they may be. And that whole data path is NVMe over Fabrics. And so there are some, you know, nuances there in terms of, okay,
29:17 the stability because of the newness and whatnot. But in principle, that's what we usually do. However, we can also directly, remotely connect to existing iSCSI and/or external NVMe over Fabrics targets. But when I start to explain that, I lose people, typically. So I'll leave it at that. Yeah. No. That's, it sounds like generally yes. Yeah. Both are generally supported, but it's like a Facebook status. It's complicated in the case of Mayastor. It's complicated. Yeah. I like that one. Yeah. Yeah. And so for all intents and purposes, you could say it's a proxy. Right?
29:56 Proxy. IO comes in, two IOs go out. Very simple. And then obviously, the rebuild and, you know, there's some of that. But in principle, that's what it is. Alright. Okay. So final part of the summary then. Hopefully, I've not missed anything. What is recommended, or I guess encouraged, would be, if you're running a cloud native data store that handles its own replication, to use local PV mode and avoid that N x storage cost. But if you're running more traditional databases that don't have replication by default, then it may be better to use that
30:28 replicated volume semantics. And then the other thing that was mentioned, but I don't think we went into a lot of detail on, is that OpenEBS provides a lot more than that as well. Snapshotting, backups and other goodies too. I'll share my screen and what it says on the website, because there's some bold claims here. So: Kubernetes storage simplified, which I just love, right? You know, these are hard problems, right? And I'm really glad that you and your team here are working on the hard bits so that I can just have the
30:58 nice, easy to consume API. So that's great. So simplifying storage is great, and then I love this one line install to get started. So I think that now, unless there's anything you wanna cover that I missed in the summary there, we'll just start kicking the tires on this. How does that sound? Let's have at it. Let's break some stuff. David, just to kinda clarify. So if you are gonna use the existing bits, like local PV hostpath or device, this is the command that we proceed with. But if you want to try out Mayastor,
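The Nexus and replica arrangement described above can be pictured with a small in-memory sketch. This is only a conceptual model under stated assumptions: the `Replica` and `Nexus` classes, their methods, and the node names are hypothetical stand-ins, there is no NVMe or networking involved, and rebuild and failure handling are left out.

```python
class Replica:
    """Hypothetical stand-in for a replica carved out of a pool on one node."""

    def __init__(self, node, size_bytes):
        self.node = node
        self.blocks = bytearray(size_bytes)

    def write(self, offset, data):
        if offset < 0 or offset + len(data) > len(self.blocks):
            raise ValueError("write outside replica bounds")
        self.blocks[offset:offset + len(data)] = data

    def read(self, offset, length):
        return bytes(self.blocks[offset:offset + length])


class Nexus:
    """Toy proxy: each incoming write is forwarded to every replica."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def write(self, offset, data):
        for replica in self.replicas:
            replica.write(offset, data)
        # Number of outgoing writes produced by this one incoming write.
        return len(self.replicas)

    def read(self, offset, length):
        # Replicas hold identical data, so any one of them can serve a read.
        return self.replicas[0].read(offset, length)


if __name__ == "__main__":
    mirror = Nexus([Replica("node-a", 4096), Replica("node-b", 4096)])
    fanout = mirror.write(0, b"hello")
    print(fanout, mirror.read(0, 5))
```

With two replicas, one write into the Nexus turns into two writes out, which matches the "proxy" description in the conversation.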
31:00 Why Huge Pages / Enabling Huge Pages on Linux
31:35 there's a different link that we should check out. Oh, yeah. Yeah. We don't have to run this helm command. I just love the fact that it's right there direct on the page. Like, I just love the ability for people to get started with a one liner command. Think as developers, generally we don't wanna read the docs, just wanna install and then start poking at it and you know, that's good. Alright, so let's cover where we are. I try as always with this stream is not to prepare too much in advance. So all I've done thus far is prepare a
32:03 modest sized Kubernetes cluster. So we've got six nodes running on Equinix metal, each with their own. I used basically a heterogeneous cluster. We've got some large nodes, some smaller nodes and some very big nodes. So I'm have lots of RAM, some have lots of disk and I figured we could just talk about the trade offs that we're making as we talk about that applied to OpenEBS as we go. And because we are gonna need to configure huge pages, I've already kind of just split my screen into six so that we can get that out the way
32:33 first. I've never done this before but I'm pretty sure I can replicate my input to all of these. I'll work that out in a second but can we talk about first huge pages? Why do we need that for OpenEBS? So good question. So the the operating system divides memory up in so called pages, and those pages are typically four k. And the operating system needs to know where those pages are loaded, who's owning them, and whatnot. And and it keeps a small table called oh, I I I just used this the the short term of TLB.
33:19 I think it's transition look aside buffer or something like that. But suffice to say, it it keeps a cache of which pages are loaded where. The more memory you have, the more of those pages you have. The more of those pages you have, the more likely you are to have a TLB miss. If you have a TLB miss, it is a rather expensive operation. I mean, it's it's nanoseconds. But, you know, if you have a lot of misses, then they amount to something. So the idea is is, well, why don't we use huger pages, like two meg?
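To put numbers on the page size argument, the short calculation below counts how many pages are needed to cover the same amount of memory with 4 KiB pages versus 2 MiB huge pages. It only counts pages; how often a TLB miss actually occurs depends on the TLB size and the access pattern, which this sketch does not model. The helper name `pages_needed` is an assumption for illustration.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(memory_bytes, page_size_bytes):
    """Number of pages required to cover memory_bytes (rounded up)."""
    return -(-memory_bytes // page_size_bytes)


if __name__ == "__main__":
    small = pages_needed(1 * GIB, 4 * KIB)
    huge = pages_needed(1 * GIB, 2 * MIB)
    print(f"1 GiB in 4 KiB pages: {small}")
    print(f"1 GiB in 2 MiB pages: {huge}")
    print(f"reduction factor: {small // huge}")
```

Covering 1 GiB takes 262144 regular pages but only 512 huge pages of 2 MiB, so there are far fewer mappings for the TLB to keep track of.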
33:51 There's your answer. So you get less TLB misses and that means that the CPU has to, can focus more on the actual work than finding the right pages to load, effectively. So that's, and and and and it really adds up. That's one reason. Second reason is, if you want to do DMA transfer into a PCI address from user space, that memory location where you write from and it's typically done with the scatter gather list as they say. It's like call write this collection of addresses into the PCI registers. At the moment you say that, you cannot have it. You you can't have
34:36 that memory get replaced, because it's virtual memory. Right? So it can be swapped out and yada yada. And if you do DMA, you can't have that. You can't all of a sudden move this piece of memory somewhere else, because by the time that we actually DMA it, what are we DMAing? So an undocumented feature is that the huge pages are what they call pinned into memory. So they are put in a location, and they stay there. They don't move. And so those two reasons are why we use huge pages. Databases use them too, right?
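As a rough illustration of the first reason, the page-count arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope sketch: the helper name `pages_needed` is made up for this note, and real TLB behaviour depends on the CPU, not just on the number of pages.

```python
# Page-count arithmetic behind the TLB argument (illustrative only).
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(memory_bytes: int, page_bytes: int) -> int:
    """Number of pages required to map memory_bytes, rounded up."""
    return -(-memory_bytes // page_bytes)


if __name__ == "__main__":
    for label, page in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB)):
        print(f"{label} pages to map 1 GiB: {pages_needed(GIB, page)}")
```

Mapping 1 GiB takes 262,144 pages at 4 KiB but only 512 at 2 MiB, so far fewer translations compete for the same TLB entries.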
35:17 To reduce these cache misses and whatnot. Okay. That makes sense. So how do I enable huge pages on the Linux kernel? So you don't really have to enable them so much as tell the kernel how many you want. And there is a variable in /sys/mm, or /sys/kernel/mm, I always use tab completion so I forget, where you can basically echo in the number of huge pages that you want. And you have two choices. You have the 2 MB pages or the 1 GB ones, and
35:56 we always use the 2 MB. And so you specify the number of 2 MB huge pages. So if you wanna have 1 GB of huge page memory, you specify 512. So from a usability standpoint, it's not ideal. Right? But it gets the job done. So I could just echo 512 to /sys/... Yeah. So let me open my browser. I always forget it. Because I use a declarative operating system next to a declarative... Oh, you use NixOS. Right? Yes. How do you know that? I run NixOS too. No. Back for work. Yeah. But I... Really?
36:41 Yeah. How interesting. So it is... let me grab that thing here. Alright. We can actually spin up on Equinix Metal. If I had known I had an expert on the stream as well, I would maybe have been as bold to drop off. Expert is a big word, but considering all the other alternatives, this was the least worrisome, I would say. Let me see. Well, it's actually in the quick start. Maybe that's easier. I've just pasted that one, Jeffry, into private chat, if that helps. In any regard. Oh. Yeah. See. Thank you. Okay. So
37:28 let me just... I know people watching this stream are not really gonna be able to tell because of the replication here. But we're echoing 512 to /sys/kernel, hugepages, hugepages, blah blah blah. Yes. Exactly. We will... Yes. We'll put that into the show notes in case anyone is... Yeah. Yeah. And if you then type cat /proc/meminfo, I think it is, it should say, there you go, HugePages_Total, 512. HugePages_Free, 512. Now, one unfortunate thing is that when the node runs Kubernetes, the kubelet needs to know about these huge
38:05 pages. And unfortunately, for a reason that is unknown to me, you need to actually restart the kubelet. The kubelet is not updating those values. So we have to restart the kubelet, and then, you know, if you have a bad day, then it's already game over. But... Alright. We have restarted the kubelet. Let's see if I can still run get nodes on my cluster. Alright. I think we're good. So we've enabled huge pages on all six of our worker nodes at least. I won't bother with the control plane. Yeah. Right. Yep. So and then
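The verification step described here (reading the huge page counters back from /proc/meminfo after writing the count) can be sketched as a small parser. This is a minimal sketch: the sample text is made up to resemble the output mentioned on stream, and `parse_hugepages` is a hypothetical helper name. On Linux the 2 MiB counter is normally written at /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages.

```python
def parse_hugepages(meminfo_text: str) -> dict:
    """Extract the HugePages_* counters from /proc/meminfo-style text."""
    counters = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key.startswith("HugePages_"):
            counters[key] = int(rest.split()[0])
    return counters


# Illustrative sample only; real values come from `cat /proc/meminfo`.
SAMPLE = """\
MemTotal:       32780012 kB
HugePages_Total:     512
HugePages_Free:      512
HugePages_Rsvd:        0
Hugepagesize:       2048 kB
"""

if __name__ == "__main__":
    print(parse_hugepages(SAMPLE))
```

On a real node the same function could be fed the contents of /proc/meminfo, for example via `open("/proc/meminfo").read()`.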
38:40 Deploying OpenEBS with Mayastor
38:45 yeah. Oh, sorry. You just go right ahead. I'll just answer as you go. Well, I mean, normally I try to follow the documentation. But I know that we're playing with Mayastor today. Are we maybe not gonna be following what is recommended in the documentation, or is that an incorrect assumption? Yeah. So we have documentation for Mayastor as well. If you just go to the alpha features, it's right there. Yeah. Mayastor, it should get there. It should point to our... sorry. Multiple redirection. I'll fix that. Yeah. That'd be good. Sweet. Awesome.
39:28 So let's see what we need to do here. There are some prerequisites. Guess maybe we should kind of... Oh, yeah. Well, there's one interesting thing actually. We use certain CPU instructions, and we kinda arbitrarily decided that if your CPU is older than ten years, it's like, you know, buy a new one or don't run Mayastor. So there's some of that. And it turns out, though, that the CPUs in the cloud are older than you think. But in your cluster, I don't think that that's
40:04 a problem. Yeah. You don't think? I mean, that's not the most confident answer I've heard there, but we'll see. Yeah. Yeah. Well, the CPUs are not 10 years old there. Right? So... No. No. No. We've got pretty recent hardware. Yeah. Yeah. Exactly. Alright. So we've got modern CPUs, 4 GB of RAM, like, that's not gonna be an issue, and we've got huge pages, which we've just enabled. So let's see. Preparing the cluster. It's a bit difficult to read. I'll zoom out a little bit. Okay. So we did verify huge pages. We've set... oh, see, this was all documented. Like,
40:43 it is just like I should have just come here first. We aim to please, David. Something about the manual. I know. I think I'm going down some sort of crazy off-the-beaten-path route here and it's like, no. Like, you know, you're just following the docs. So we have to label the nodes that we want to run Mayastor on. Okay. Which means... and oh, I need my node names again. Hate it when I paste stuff and then I can't actually see where I am. I thought the hardest bit would be the copy and paste. Alright. Let's try again.
41:34 Okay. So... Well, just do the first two for now. Is that enough for us to get started, or should I run through them all? Two would be fine. It will limit the number of replicas you can have to two, but it would be fine. I'm almost there now. Three if you want to push the boat out. This is captivating TV right here, so I might as well. Alright, so last one. Okay. Got my labels. Now we're going to the quick start. So this is just gonna apply remote YAML. So this is just creating the
42:23 namespace first. That's right. And this is just for the control plane of OpenEBS. It's data and control plane. Data and control. All of the containers are created in a mayastor namespace. Then RBAC, everybody's favorite. And the CRDs. I call it the cloud native sudo, actually. So this is interesting. So Mayastor is using NATS under the hood. What's going on here? Yeah. So as we started to develop this thing, we realized that you may or may not wanna have Mayastor running on all the nodes. Right? So,
43:15 we needed a way to separate out the CSI nodes from the nodes that were actually running Mayastor, and we needed a way to register ourselves against the control plane. And after some trial and error with various approaches and concepts, we said, well, you know, let's just use a message bus, because, you know, that's what these things are really good at, in terms of creating patterns and things like that. And well, then you Google lightweight plus Kubernetes and we saw NATS and so, okay. Sold. Obviously, a little bit more insight went into
43:54 that. But NATS is very lightweight, and we use it to register nodes to the control plane. And also, we are working on a fault management subsystem. As the I/O errors occur within the data path, one of the things that you cannot do is determine what you should do based on that error. Right? You need a holistic view of the system as a whole. So instead of handling the error, we basically dump the error plus some metadata on the message bus, where then a cluster-level service will receive
44:00 Fixing my unhealthy cluster
44:31 that message and keep track of how many errors and whatnot. And based on a heuristic, it will then decide to take a particular replica online or offline. That's roughly the idea, if that makes sense. Yeah. I think so. I'm concerned that we have some pending pods. Let me check why NATS is pending. If I've given us an unhealthy cluster... I'm not nervous. You're nervous. Ready, ready, ready, ready, ready. Alright. Okay. Let's take a look at that node list again, the pod list again. So it looks like CoreDNS is not happy.
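The fault-management idea described at the start of this passage (the data path only reports errors, and a cluster-level service counts them and decides what to do) could be sketched as below. Everything here is hypothetical: the class name, the fixed error threshold, and the replica identifiers are assumptions made for illustration, and the actual heuristic was described on stream as still being worked on.

```python
from collections import Counter


class FaultTracker:
    """Hypothetical cluster-level consumer of I/O error events."""

    def __init__(self, max_errors: int):
        self.max_errors = max_errors  # assumed threshold, not from the talk
        self.errors = Counter()
        self.offline = set()

    def on_error(self, replica: str) -> bool:
        """Record one reported error; return True if the replica is now offline."""
        self.errors[replica] += 1
        if self.errors[replica] >= self.max_errors:
            self.offline.add(replica)
        return replica in self.offline


if __name__ == "__main__":
    tracker = FaultTracker(max_errors=3)
    for _ in range(3):
        state = tracker.on_error("replica-on-node-1")
    print("offline" if state else "online")
```

The point of the design is that the decision is made where the whole system is visible, rather than inside the data path that saw a single error.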
45:38 Yep. And the Calico kube-controllers is not happy. Wonderful. That's what I get for trusting get nodes. Let's see. That doesn't look right. I'm just gonna delete that pod and see if it magically fixes itself. Computer says no. The error message is about taints and tolerations. Right? Do we have to dig into that? Well, I mean, that's not what I was hoping to do. I'm gonna edit the node and see. I don't know why we have an uninitialized label. I'm just going to remove it. Alright. Hey. I have no idea what is going on there. I should probably remove
46:56 that from all of our nodes, though. This is the OpenEBS slash studying-for-your-Kubernetes-administrator-exam episode. Yeah. Come and get your CKA learning how to unfuck your cluster during a demo. Like... okay. I'm just gonna remove the taints. Oops. Save. Oh, just thinking about whether the nvmf, NVMe over fabrics, kernel module is on the machines as well. I just thought about that actually, which is... Okay. We can take a look at that. Alright. I've got at least a good few there that are running. All those pods are happy. My cluster is happy. NATS is now happy.
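The pending pods in this stretch were blocked by a taint that nothing tolerated. A simplified sketch of how Kubernetes matches tolerations against taints is below; it ignores details such as `tolerationSeconds` and soft `PreferNoSchedule` taints, and the taint key shown is a made-up example rather than the one on the stream's cluster.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified version of the Kubernetes toleration-matching rule."""
    # An empty toleration effect matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # With Exists, an empty key tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def schedulable(pod_tolerations: list, node_taints: list) -> bool:
    """A pod fits only if every hard taint on the node is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in pod_tolerations)
        for taint in node_taints
        if taint["effect"] in ("NoSchedule", "NoExecute")
    )


if __name__ == "__main__":
    # Hypothetical taint key, for illustration only.
    taint = {"key": "example.com/uninitialized", "value": "true", "effect": "NoSchedule"}
    print(schedulable([], [taint]))  # pod without tolerations stays Pending
    print(schedulable([], []))       # taint removed, pod can be scheduled
```

Removing the taint from the node, as done on stream, makes the second case apply to every pod.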
47:30 Adding the nvme kernel modules
47:52 Cool. Problem one resolved. Problem two is you're worried about the kernel module. Right? Well, worried is like... so. Thank you. Pulled his cable out. Alright. Well, you fix that. I will just edit the last node while I'm here. Where's the last two nodes? You still there, Jeffry? Nope. Can you hear me? Sorry. Yeah. I don't know what happened. Wireless technology. What operating system are you using? It is Ubuntu 20.04. Okay. So let me double check. So if I just run an lsmod, is there a name of a module I should be looking for?
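The module check asked about here can be sketched by parsing /proc/modules, which is the data `lsmod` presents. This is a minimal sketch: the sample lines and sizes are illustrative, the helper names are made up, and the module the guests go on to name is the NVMe over TCP one, which the kernel lists with an underscore as `nvme_tcp`.

```python
def loaded_modules(proc_modules_text: str) -> set:
    """Module names from /proc/modules-style text (first column)."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def is_loaded(name: str, proc_modules_text: str) -> bool:
    # The kernel reports module names with underscores, so normalise dashes.
    return name.replace("-", "_") in loaded_modules(proc_modules_text)


# Illustrative sample; real content comes from /proc/modules.
SAMPLE = """\
nvme_tcp 36864 0 - Live 0x0000000000000000
nvme_fabrics 28672 1 nvme_tcp, Live 0x0000000000000000
nvme_core 106496 2 nvme_tcp,nvme_fabrics, Live 0x0000000000000000
"""

if __name__ == "__main__":
    print(is_loaded("nvme-tcp", SAMPLE))
    print(is_loaded("nvmet", SAMPLE))
```

If the check fails on a node, `modprobe nvme-tcp` is the usual way to load the module.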
49:05 Actually, the quick start talks about it as well. But I think it was in the prerequisites. The quick start will talk about it, Jeffry. That's an oversight at this juncture. Yeah. So... We have... Sorry. Sorry. Yeah. So we're looking for nvme-tcp and nvmet. No. Just the TCP one. The other one is not needed. Okay. So I can just modprobe that. Right? Yeah. So... There we go. Easy. Easy peasy lemon squeezy. Okay. Well, that's good. Cool. Now there's a command. If I go back
49:50 to... Let's go back to the deploy one, right? So I deployed NATS, and then it wants me to just check NATS is happy, which I'm actually okay with. I think it was running. And now we need to install the CSI node plugin. This is the preflight checklist. Hey, if all I need to do is remove some taints from nodes and copy and paste a few more commands, I mean, I think we're doing pretty well. So we got five. That looks good, and now the control plane. I love it when the docs just work. I
50:36 mean, it really just makes this a lot easier. Kudos on the docs. I mean, the only thing that's gone wrong so far is everything I've done upfront. Yeah. Don't say that too often, because Glenn is on the call and I... yeah. Yeah. No. But that really helps a lot. Yeah. Yeah. Just, I mean, I'm pretty much a technology magpie. I just get to play with technology every day, and new stuff, and see what happens. And when the docs work, it just makes such a difference. I don't wanna give you too much of
51:08 an ego here, Glenn, but so far so good. It's okay. I feel that you're putting me on a pedestal that I can only fall from. I'll fall from grace probably shortly. Exactly. I just like to build people up so I can watch them fall. Okay. So that gave me... oh, I was just a bit too quick. Right. Is that us? That is MSN. So it's a tribute to the old chat utility that we all grew up with. But that stands for Mayastor node, and that is one of the things that we use NATS
51:45 for. So as Mayastor starts up, it starts to look for the message bus. And then through the message bus, our control plane gets the message and then creates this CRD. And when you describe the CRD, it should show you some information about the node and stuff like that. Oh, yeah. It's namespaced, I think. So... Yeah. It's a little bit annoying sometimes, but yeah. Right. So there you go. So it gives us the information of the gRPC endpoint, and things like that. So this is not, how to say, very useful information for
52:30 Configuring Mayastor
52:30 a developer necessarily, but it does show you, like, you know, all the nodes that are there and their endpoints and whatnot. So yeah. Alright. Sweet. So, I mean, the next step is to configure Mayastor. So just if we had like a thirty-second summary of where we are right now, we have OpenEBS deployed with Mayastor and a control plane and data plane, and that's it. Right? Yes. Yes. You've done a lot of work but don't have anything yet. That's right. If I were to request a PVC
53:02 right now, it's just gonna... It won't work. Yeah. It won't work. Yeah. Exactly. Yeah. So the next step is to create pools, as we say. And we have stuff in the works, actually, that facilitates this to make it a little bit easier, and there is NDM that also makes it easier. But I think for the purpose of this exercise, it is probably just easiest to look on the nodes, see which is the disk device that I wanna use, and create a thing like that.
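A pool definition of the kind being introduced here ties one disk on one node to a named pool. The sketch below builds such a manifest as JSON, which `kubectl apply -f` accepts alongside YAML. The `apiVersion`, `kind`, and the `node` and `disks` field names are assumptions based on the Mayastor quick start of this period, and the pool name, node name, and device path are placeholders, not values from the stream.

```python
import json


def mayastor_pool(name: str, node: str, disk: str, namespace: str = "mayastor") -> dict:
    """Build a pool manifest; field names are assumed, not confirmed by the source."""
    return {
        "apiVersion": "openebs.io/v1alpha1",  # assumed API group/version
        "kind": "MayastorPool",               # assumed kind
        "metadata": {"name": name, "namespace": namespace},
        # One node per pool, and a list holding a single disk device.
        "spec": {"node": node, "disks": [disk]},
    }


if __name__ == "__main__":
    # Placeholder names; substitute a real node name and block device.
    pool = mayastor_pool("pool-on-node-1", "worker-1", "/dev/nvme0n1")
    print(json.dumps(pool, indent=2))
```

Generating one manifest per node in a loop is one way around the copy-and-paste step until pool creation can be driven by node labels.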
53:42 So I think... okay. I was gonna say, we might get away with just using sd on everything, but that's not the case at all. Yeah. Yeah. So you can just, you know, pick one or two nodes or whatever. So I guess we wanna use the ones with the NVMe disks, or am I wrong there? Does it not matter? Yeah. It depends a little bit on the NVMe drive exactly. So if you go back to the quick start guide... Yep. Real quick. So we talk about the schemes. You
54:19 go back up a little bit here. Right? So this is how we connect to the local storage. And by default, we use AIO because that's common in all kernels. If it's a more modern kernel, and I think we just re-enabled it by default in the development branch, it uses io_uring. If you want to go, like, really, really fast, there is a scheme that's not documented here, because it's very difficult to use still, and that is PCIe. Right? So ideally, we would tell the kernel, let the NVMe device go. We'll handle it
54:59 for you. But that's rather complicated. So that's not the one we're doing? What's that? Well, we could, but I would not recommend it. I mean, it can be a long exercise because we would need to figure out the PCIe BDF, and then... yeah. Alright. So we're gonna go with the NVMe over fabrics option? Is that... Well, I would just... if you go to the first example tab. Oh, this one? That one. And just, so the array of disks, we only accept one disk still. Put any disk in there you wanna use,
55:40 and... Okay. It will just use whatever... So I think I'm just getting confused. Because I just assumed from this example that I had to use something that was exposed as sd. But I could still use /dev/nvme0n1. Is that... Yeah. Yeah. A block device. Yeah. Whatever. Yeah. Alright. Let's start with the two disks with the three and a half terabyte NVMe drive. Sure. Okay. So let me create this file. Let's not apply it directly to the cluster. And just so... I'm assuming I'm gonna have
56:18 to modify this a little bit. Right? So... You're certainly gonna have to change the node name in the spec section to the node on which we're creating the pool. Okay. So we'll call this pool1.yaml. And I need to change this to be the value from here, I believe. Was there something else... sorry. You said I had to change there? Yes. The node name. Absolutely. A pool... Is there a way for me to say, you know, my cluster is made up of three different node types. Is there a way for me to say, identify nodes with this
57:02 label and expose this disk? Not at this time. Good suggestion. Not at this time. And, indeed, as Jeffry says, we do have something in the works to automate this aspect going forward. But that's a good suggestion. It'll be the only one I have. Which still comes back to, no, I'm sorry, I gotta be cutting and pasting this for now. Alright, so this will be node pool one and node pool two. I've got this, this. Does that look okay? I've not made a mistake there? That looks good. Alright. Awesome. That looks good. I can apply pool
57:36 one. Okay. So does that mean I can run Mayastor pools? Indeed. However, you would... Namespace. Right? Need to be in the namespace, indeed. And spell it right. And spell it right. You can also type msp. It's, like, shorter. MSP. MSP. Alright. So I have two storage pools online now. Yeah. Am I done? Is that it? Can I go home now? No. We need to actually create a PVC. We'll be needing a storage class. Yep. We'll be needing storage... All that stuff. Yeah. Yeah. Okay. So storage classes we can do. Let's
58:26 just continue to add to this file. I'll make sure all this code is... oh, I mean, it's mostly copy and paste, but I'll still publish the code to GitLab anyway. So Mayastor... Yeah. We should change that to NVMe, or nvmf, however you wanna phrase it. Yeah. And the protocol, nvmf. That will need to be... yeah. That will need an f in it. Yeah. So the e needs to go? Yep. Yep. Alright. And do we want to set the replication factor? Do we want to have two replicas
58:58 or are we happy with one for testing? Oh. I mean, now that you've said that, I kinda wanna set it to two. It'd almost be short-changed to have one, wouldn't it? And I could just apply this. Is there anything else YAML-wise before? Nope. Okay. Let's just get that applied then. Alright. So now I have a storage class. So now I can run some sort of database that consumes a storage class. Right? No. Now you need to create the PVC. Well, I was assuming we'd deploy that database with Helm, and that'll do that for me.
59:30 Requesting a PersistentVolumeClaim
59:37 Is that what's on this next page? Oh, yeah. Well, so I'm like the... yeah. So I always do it this way, because this Helm thing is too fancy for me. But if you wanna run Helm, I guess that would work too. Yeah. Yeah. No. Let's stick to the docs. No. Let's not deviate, because in the event it's my mistake when it goes wrong. And I know I'm already over to this. Let's continue to add to this. So we're gonna create a PVC. We'll just call this super DB.
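The storage class and claim discussed across this part of the stream can be sketched as two manifests. The PersistentVolumeClaim and StorageClass shapes are standard Kubernetes; the provisioner string and the `repl` and `protocol` parameter keys are assumptions based on the Mayastor quick start of the time, and the object names and the size are placeholders chosen for this sketch.

```python
import json


def storage_class(name: str, replicas: int, protocol: str = "nvmf") -> dict:
    """StorageClass sketch; provisioner and parameter keys are assumed."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "io.openebs.csi-mayastor",  # assumed provisioner name
        # StorageClass parameters are always strings.
        "parameters": {"repl": str(replicas), "protocol": protocol},
    }


def volume_claim(name: str, class_name: str, size: str) -> dict:
    """Standard PersistentVolumeClaim bound to the given storage class."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": class_name,
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # Placeholder names and size; two replicas as chosen on stream.
    sc = storage_class("mayastor-nvmf", replicas=2)
    pvc = volume_claim("super-db", sc["metadata"]["name"], "5Gi")
    print(json.dumps([sc, pvc], indent=2))
```

The replica count lives on the storage class, so every claim that names the class inherits it.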
1:00:09 And I mean, I can just pick any value, right, for the size. So it's gonna use... Cool. Let me check my understanding is correct here. We have exposed two three-and-a-half-terabyte disks to Mayastor, and it has created a pool that's replicated. So I've got seven gig of storage, but three and a half gig is available. Is that correct? No. So it's... well, no. Yeah. It's an interesting way of thinking, though. So I can certainly see why you think that way. But the replication model is on a per-replica basis.
1:00:45 Right? So you have two decoupled three and a half terabyte pools that can store whatever. Right? But when you create a PVC, you say, I wanna have two replicas, and we will look for two pools on which to put the replicas. And then, and only then, do we replicate that data. But the pools themselves are a thing on their own. Yeah. So you... oh, sorry. Yeah. Kiran can explain this better. Yeah. So we've done this with almost all the pools, the same convention. So the storage is not aggregated from different nodes.
1:01:23 It kind of stays on that node. And when we talk about, like, a 5 GB volume that you create and you specify two replicas, it actually saves 10 GB of data. Each replica has 5 GB of data when it's full. Right? So it's a synchronous duplication of the entire data on each of the replicas. Okay. So right now, I mean, is a pool always one disk? What I mean is, like, if that actual disk dies at a physical level, then the replication doesn't help me at all. Is that correct? Yeah. So for the PVCs that
1:01:59 have more replicas, they will survive. The PVCs that are on there that only have one replica, they're gone. But that's the idea, that the developer gets what he asks for, even though it might not necessarily always be what he wants. So we are looking into, it's like, well, maybe you wanna have local redundancy, as we call it. So you create a pool, and that pool by itself is also redundant. But that comes at a cost in terms of, you know, storage. So there's traditional... Several ways to skin a cat.
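The model described over the last few minutes (independent pools, with replication decided per volume) can be sketched as a toy placement routine. This is a sketch under assumptions: the "most free space first" choice and all names are invented for illustration, and the real scheduler may pick pools differently.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    capacity_gb: int
    used_gb: int = 0
    alive: bool = True

    @property
    def free_gb(self) -> int:
        return self.capacity_gb - self.used_gb


def place_volume(pools: list, size_gb: int, replicas: int) -> list:
    """Pick `replicas` distinct pools with room; each one stores a full copy."""
    candidates = sorted(
        (p for p in pools if p.alive and p.free_gb >= size_gb),
        key=lambda p: p.free_gb,
        reverse=True,  # assumed policy: emptiest pools first
    )
    if len(candidates) < replicas:
        raise ValueError("not enough pools with free space")
    chosen = candidates[:replicas]
    for pool in chosen:
        pool.used_gb += size_gb
    return [pool.name for pool in chosen]


def survives(volume_pools: list, pools: list) -> bool:
    """A volume survives while at least one of its replica pools is alive."""
    alive = {p.name for p in pools if p.alive}
    return any(name in alive for name in volume_pools)


if __name__ == "__main__":
    pools = [Pool("pool-1", 3500), Pool("pool-2", 3500)]
    mirrored = place_volume(pools, size_gb=5, replicas=2)
    single = place_volume(pools, size_gb=5, replicas=1)
    pools[0].alive = False  # the disk behind pool-1 dies
    print(survives(mirrored, pools), survives(single, pools))
```

A 5 GB volume with two replicas consumes 5 GB on each of two pools, 10 GB in total, and only the single-replica volume is lost when its one pool dies.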
1:02:34 Yeah. Oh, yeah. Sorry. So traditional operations-wise then. I mean, when I'm in this kind of situation, from my experience, I would probably RAID the disks at a hardware level. Is that still something that would be recommended? Depends. Depends. Right? So let me put it this way. Let's say that I am a developer and you're a company and I work for you, and you ask me to deploy something in Kubernetes, and I see five storage nodes and I think, oh, let me make sure that I don't lose that data, I would type replica
1:03:03 as five. So that means that the system finds nodes suitable to put those replicas on. And then if one of the nodes dies, I have four. Another one dies, I have three, and so on and so on. If I say, no, in the cluster, find a pool, just put that replica on that single pool, and that pool dies, I've lost my data. Right? Okay. Now that's not to say that we cannot make the pool redundant. We could. But I'm not sure if that traditional model, right, would always be the preferred way. So
1:03:41 yeah. Okay. That makes sense. That's good. Okay. So I now have a PVC. Let's go back to our... I'm not gonna verify the claim. I'm pretty confident. Famous last words. I know. Okay. Okay. Like, let's run it. And I called it super DB. Okay. It worked. Alright. So now it wants me to actually check the PV itself. So volume. Alright. That looks good to me. So... Yeah. This is a CRD, and that shows you the... so we've created this almost, like, very small storage network in
1:04:41 the Kubernetes cluster, and it can show you the details of how that thing is built up in terms of where the children are, as we call them. So there's an additional CRD that does that. Yeah. Yeah. It'll be good to do a -o yaml on that one. On the MSV? Yep. Oh, wow. So yeah. So let me see. Let me digest this. I... Oh, we've got replication here. Right. So there's the Nexus, and the Nexus has children, which maybe we should call replicas. But anyway, the Nexus writes to those two
1:05:29 children, and the children are replicas on the pool on node two and the pool on node one. So whoever now mounts the PVC eventually writes to the pool on node two and the pool on node one. Okay. It's like simple mirroring, effectively. Yeah. Jeffry, again, like, I'm asking some basic questions here. Why is one URI nvmf and the other one bdev? Maybe it's... Ah, yeah. Yeah. That's a good question, Kiran. Optimization. So the Nexus itself figured out that, hey, wait a minute, one of those two replicas is actually on the machine that I'm sitting on.
1:06:16 It doesn't make sense to do that over the TCP socket. Right? So I'll just, you know, memcpy it. Well, it's actually a zero copy, but, you know, bypass the network there. And as future optimizations, we wanna give it an affinity such that reads always go to the local one, things like that. We're right in time and bless you. Thank you. Yeah. So let me... I wanna make sure I've got this right in my head. So we've created a Mayastor volume and it's available on both of those nodes. So my workload has the freedom to migrate
1:06:57 between either of those two nodes if it wants that PVC. Is that correct? No. No. Close. So the... yeah. I think next time around, we probably should have a picture. So the Nexus is an NVMe over fabrics target. And any node in the cluster can connect to the Nexus over NVMe over fabrics. Also iSCSI, but we're not going into that. But that's how a node would connect to this PVC. And the Nexus then doesn't do anything other than... so it's stateless. There you go. But it needs to get this I/O out
1:07:41 somewhere, and it pushes those I/Os out to the children, or replicas, as is also written down here. So the Nexus receives the I/O from your workload and pushes that out to those two nodes you see there, pool node one and pool node two. It's like a RAID mirror thing. Right? Mirror. Okay. So if I deploy a pod that consumes this PVC, it's gonna be scheduled on one node, and always that node going forward. Is that right? So the workload itself can go anywhere, because the way that you
1:08:15 connect to the Nexus is over a fabric. Right? That's networked. But likely, the scheduler will put it on the same node, because I think we have some affinity rules in there that prefer it to go that way. But that's what I meant with, that's the reason why we have the message bus. Right? You don't wanna have the Mayastor service run on every node. Right? So let's say you have five nodes, only two of them run Mayastor and you mirror them. And the other three nodes just connect to
1:08:48 the Mayastor instances. Right? Okay. We have a question in chat, which will pop up now. Oh, boy. So does the Nexus pick the I/O from the app itself? Or how does that work? Pick the I/O from the app itself. Not sure. I was hoping it made sense to you. If you could give us a little bit more detail in the chat, we'll try and come back to that question. I think, Duffy, it's a question about, does this use POSIX? Does it mount? Or can the app directly communicate with the Nexus? Oh. Oh, right. Right. Ah, good question.
1:09:28 So we can actually do... well, maybe I should stop confusing people, but we can actually do both. So if you write an application, you can use an API to directly write to the Nexus without anything else. Right? But the way it is set up now is that the node that is going to connect to the Nexus will use the NVMe over fabrics TCP module that we loaded in the beginning, right, to connect to the Nexus. So if you're familiar with iSCSI, just replace iSCSI with NVMe over fabrics. That's it.
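The Nexus behaviour described over the last few minutes (a stateless target that mirrors every write to its children, reaching a child on its own node without the network) can be modelled in a few lines. This is a toy model only: the classes, the URI layout, and the `bdev` scheme name for a local child are assumptions made for illustration, and real I/O goes through SPDK rather than Python dictionaries.

```python
class Replica:
    """Toy replica: stores blocks in a dict keyed by logical block address."""

    def __init__(self, uri: str):
        self.uri = uri
        self.blocks = {}

    def write(self, lba: int, data: bytes) -> None:
        self.blocks[lba] = data

    def read(self, lba: int):
        return self.blocks.get(lba)


class Nexus:
    """Stateless fan-out: holds no data itself, only references to children."""

    def __init__(self, children: list):
        self.children = list(children)

    def write(self, lba: int, data: bytes) -> None:
        # Mirror: every child receives every write.
        for child in self.children:
            child.write(lba, data)

    def read(self, lba: int):
        # Any child can serve a read; this sketch simply uses the first.
        return self.children[0].read(lba)


def child_uri(replica_node: str, nexus_node: str, volume: str) -> str:
    """Hypothetical URI layout: local children skip the network transport."""
    scheme = "bdev" if replica_node == nexus_node else "nvmf"
    return f"{scheme}://{replica_node}/{volume}"


if __name__ == "__main__":
    local = Replica(child_uri("node-1", "node-1", "super-db"))
    remote = Replica(child_uri("node-2", "node-1", "super-db"))
    nexus = Nexus([local, remote])
    nexus.write(0, b"hello")
    print(local.uri, remote.uri, local.read(0) == remote.read(0))
```

Because the Nexus keeps no data of its own, the workload can connect to it from any node while the copies stay on the pools that hold the children.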
1:10:03 Okay. So I think there was something really cool there. Like, so what you were saying was my application doesn't have to change. I can write to a disk and then the magic happens for me. But I think what you also alluded to there is I could make my application OpenEBS- or Mayastor-aware and write directly to the Nexus. Yeah. Exactly. Yeah. And in fact, we've done a lot more work there, and don't get me started, but Jan, who was on the call, has actually written a virtio driver that allows you... and we
1:10:31 actually did this for Go. You have these, what's it called, interfaces. Right? ReadAt, WriteAt, that type of stuff. And if you implement the interface, you can, you know, use it as if it were a file. And so you could use a Go reader and writer to write to a virtio device that would immediately write into... well, it wasn't called Mayastor back then, I must say. But no. So, yes, that is possible. And I think that that's, in fact, the real future, because why would you go through the operating system anyway? It's like, yeah, because you have a file
1:11:04 system. Well, file systems, why is there a file system still? They used to be there too because of the caching and da da da, but it's like, do we really need to? That's what they mean with DAX file systems, D-A-X, direct access file system. It bypasses the page cache. It's like, sorry, kernel. You're just slowing us down. Alright. Cool. That's exciting. Definitely something for another session, I think. Okay. I don't wanna keep us too far beyond what we actually had scheduled. So let's assume that in the next twelve minutes we wanna show off everything that we've done
1:11:30 Deploying fio to run some benchmarks
1:11:37 so far. Are we gonna continue with the docs, or is there something we should jump straight to? So if you deploy the fio pod, then it would just mount that PVC and run fio against it with some very odd defaults, like, very small. But, yeah, you could run that, and then it would, you know, show you some output in terms of... Do you want to modify that, or do you want me just to apply it straight up? No. I think it's fine for the sake of the stream. I think
1:12:13 we've changed the name of the claim though, didn't we, Jeffry? Yep. Oh, yeah. Shoot. The fio. Yeah. I'm always expecting it to be called ms-volume-claim, and I think we went with super DB. Did we not? Sorry. No. No worries. No worries. So if I run get deployment... I'm assuming I just created a deployment here. No. No. I know, it's just a flat pod. Just... it's in the default? Yep. Yeah. Okay. Oh, I can't modify it. Okay. So I'm gonna have to delete the pod, and then I will need to pull this
1:12:53 down. That was a CKA question. I think you can modify a pod. Oh, why can't you modify a pod? Why not? The spec's immutable. Yeah. Oh, why? Because you're not supposed to work directly with pods. You're supposed to use an abstraction. So this is awesome. I love that you're using Nixery.dev. Alright. Isn't it amazing? I'm a huge fan of Nixery. It's so cool. Yeah. You're familiar with... It's like a dynamic Docker registry where you just tell it whatever commands you want. Like, I could just add wget, slash wget, to this. Yeah. Yeah. It's so
1:13:29 good. And it's always, like, freshly compiled. If they commit a new version, it's... you know? Yeah. Anyway. Yep. Alright. So change that to super DB. This runs... Yeah. And sleep. Yeah. And so then what we typically do is, we exec into the container, and then in the container run fio, because you wanna see output. Right? So... Alright. So we deploy fio. We should see that exists this time. Yep. Creating. That's better. I'll just run a watch on that. There we go. Good. I thought I was gonna have to think of a joke there. So
1:14:15 and we wanna exec in. I'm in. So let's see. Oh... Yeah. That's alright. I just copied this. So can someone tell me what fio is? Flexible I/O tester. It's written by Jens Axboe. I'm not sure if I pronounced his name correctly. He's a guy from Denmark who works at Facebook, and he is the kernel maintainer of the Linux block layer. So everything block device related, he's kinda, like, responsible for, if you will. And in order to test the stuff, he thought, I need a tool, and there you go, fio. So it's like the de facto standard to
1:15:11 test block devices, file systems, whatever. It has a lot of options. And so... You can basically simulate any kind of workload with this: sequential, random, small size, big size, overwrite. That's a cool part of it. So for example, like, one of the things that we did is Postgres. You kind of can replay that and then convert it into a fio test file and then replay that. Very flexible too. Flexible is also on the other. Right? Alright. It's doing its thing. Yeah. It's only about one minute? Yeah. I think the run time is set to one minute. Yeah.
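A fio run of the kind shown here is driven entirely by command-line options, so the workload knobs mentioned (random versus sequential, block size, runtime) can be sketched as a small command builder. The option names are standard fio flags; the specific values and the target path are illustrative assumptions, not a record of what was run on stream.

```python
import shlex


def fio_command(filename: str, rw: str = "randrw", block_size: str = "4k",
                runtime_s: int = 60, size: str = "800m", iodepth: int = 16) -> list:
    """Build an fio argv list; values are illustrative defaults."""
    return [
        "fio",
        "--name=benchtest",
        f"--filename={filename}",
        f"--size={size}",          # small test file, as noted on stream
        "--direct=1",              # bypass the page cache
        f"--rw={rw}",              # e.g. read, write, randread, randrw
        "--ioengine=libaio",
        f"--bs={block_size}",
        f"--iodepth={iodepth}",
        "--numjobs=1",
        "--time_based",
        f"--runtime={runtime_s}",  # one minute by default
    ]


if __name__ == "__main__":
    # Hypothetical mount point for the claimed volume inside the pod.
    print(shlex.join(fio_command("/volume/test")))
```

Changing `rw` to `read` or `write` and `block_size` to something like `128k` turns the same command into a sequential, large-block test.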
1:16:02 But the... what is the one that say? Yeah. It only creates a small file, obviously. So an interesting thing to know is, SSDs have a certain span. At least, I'm not sure what the nomenclature is, but a big CPU vendor calls it spans. Let's call it that. And if you look at an NVMe SSD, it has several chips. Right? And what you wanna do when you run a workload is touch all the chips, because that's how you get the concurrency. Right? If you write to all chips at the same time.
1:16:34 And one issue with SSDs is, if you have a small dataset, you're only touching a subset of the chips, and therefore you don't have optimal performance. Right? If that makes sense. So you're not touching all the chips at the same time. So there's that kind of thing to consider when running with these things. But long story short, that's it. Excellent. So on a scale of one to 10, how difficult do you think this was, apart from dealing with me, obviously? This was painless. I mean, I'm pleasantly surprised. I always just think
1:17:20 storage is just gonna be one of those things that you just bang your head off of. But I mean, I know you said earlier that this is very much in development, maybe not quite production yet, but, you know, people can start to use it. I didn't expect us to be able to run through the documentation and just get it to work. Again, the only mistakes or problems we had were me. So yeah. It's, you know, it's not trivial for sure. And Kubernetes by itself is also something that, you know,
1:17:47 is not trivial, even though it makes things easier to deploy once you get the hang of it. It's not trivial. Right? So, yeah, you can run some stuff on it and play with it and whatnot. And we haven't touched upon several other things. Like, for example, I still need to blog about it, but we've done some benchmarks where we pushed close to a million IOPS on NVMe devices. And for example, it's like, how do you define
1:18:00 Closing thoughts
1:18:24 how many cores you wanna use? And how do I tell the operating system to not schedule any workloads on the cores that I run Mayastor on? Because, for example, if you type ls, you don't want the kernel to interrupt the Mayastor process. Where the kernel says, hey, Mayastor, stop, ls gets to run, and then... right? So you have these context switches. You want to avoid that from happening. And there are various ways to do that, but the canonical way is to boot the system with certain arguments that isolate CPUs and all
1:18:57 stuff like that. So if you really wanna push the system, there is more work to do, to be fair. So this is like the happy path. Everything's hunky dory. But, you know, if you wanna get to the optimal performance, there's a little bit more work. You can squeeze the Juicero packet with your hand and get some of the juice out, but there's a special machine you can put the packet in and really get all the juice out if you want. Yeah. Sounds like there's a lot of tweaks
1:19:27 and things you can do to get even more. So excited about it because, you know, I work for Equinix Metal. I have all the bare metal compute that I need access to ever in the world. And now I get... Don't make me jealous. I don't think that's appropriate. And now I have a storage layer that is built with really cool technology like Rust and doing really cool things. And I can squeeze the performance out of it to a million IOPS, you know. I mean, I think that's just terribly exciting and, you
1:19:55 know, that's... Yeah. To tie this back to, sorry to interject, the original presentation that Kieran introduced with. To get started, the most important thing is to have it just work, to have it install and just be there and up and running. And over time, as applications grow, as your business or your project grows, that scale tends to favor customization, I like to say. So there's a lot more knobs and a lot more buttons you wanna turn and press. And our solutions, OpenEBS and Mayastor, are kind of built with that
1:20:31 in mind. We want it to be easy to set up and get going, but we also wanna give you those knobs and buttons for when you're ready to scale and do that extra stuff to really get going. Well, okay. So before we wrap this up, is there anything anyone here wants to show before we conclude today's session? Is there anything that we haven't covered that we think is important to... I'm thinking. Well, I mean, yeah, we could go through some logs and see some stuff on how it, you know?
1:21:05 But I think this is pretty much it. We try to keep it very, very simple at the beginning and then get the foundation right and then start to come up with some more sophisticated stuff. Yeah. Awesome. I think this was a great introduction, and we got hands-on and we showed how it works. I mean, I'm definitely gonna speak to you all afterwards and see if we can line up another session where we dive into all those other concerns OpenEBS tries to address, snapshots and backups, and really try and, you know, understand the full landscape
1:21:38 of storage on Kubernetes. But I think for today, this was fantastic. I just wanna say thank you to all of you for joining me. Like I said, that's the first time I've had this many people on this stream and it was an absolute pleasure. Thank you. Well, thank you for having us. Thanks. Alright. Well, you all have a great day, and I will speak to you all soon. Thank you. Bye bye. Thank you.
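The closing thoughts mention booting the system with arguments that isolate CPUs so the kernel does not schedule other work on the cores Mayastor runs on. A minimal sketch of how that could be checked from user space follows, assuming the Linux `isolcpus=` boot parameter was used and that the kernel exposes the result in `/sys/devices/system/cpu/isolated` as a CPU list such as `2-3,6`. The helper names are hypothetical and this is not part of Mayastor itself.

```python
from pathlib import Path

# Sysfs file listing CPUs isolated at boot (for example via isolcpus=).
# Assumed to be present; older kernels may not provide it.
ISOLATED_PATH = Path("/sys/devices/system/cpu/isolated")


def parse_cpu_list(text):
    """Parse a kernel CPU list such as '2-3,6' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means no isolated CPUs
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def isolated_cpus(path=ISOLATED_PATH):
    """Return the set of isolated CPUs, or an empty set if unavailable."""
    try:
        return parse_cpu_list(path.read_text())
    except OSError:
        return set()


if __name__ == "__main__":
    print(sorted(isolated_cpus()))
```

An empty result would suggest that no cores were isolated at boot, in which case other processes can still be scheduled on the cores used for storage I/O.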