How to cure the Kubernetes "storage headache"

Posted May 25, 2020 · 4 min read

If you are already using Kubernetes, the reason is probably simple: it makes your life easier. That, after all, is the entire premise of container orchestration. It makes infrastructure disposable: spin it up when you need it, throw it away when you are done, and never think about it too hard. At least, that is how it is supposed to work.

But as you know if you have ever set up a job that depends on persistent data, there is one place where you immediately hit a big problem: storage.

Although Kubernetes abstracts compute and network infrastructure almost completely, a stateful application with persistent data still needs that data stored in an appropriate way. You still need real knowledge of the underlying storage infrastructure just to find your way to the data you need.
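
To make the problem concrete, here is a minimal sketch of how a stateful workload asks Kubernetes for storage through a PersistentVolumeClaim (the claim name and storage class below are hypothetical). Kubernetes happily accepts the declarative request, but notice that someone still has to know that a class called fast-ssd exists, which backend it maps to, and whether it is appropriate for this data:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: training-data            # hypothetical claim for a stateful job
    spec:
      accessModes:
        - ReadWriteOnce              # single-node read-write, typical for one pod
      storageClassName: fast-ssd     # you still have to know this class exists
      resources:
        requests:
          storage: 100Gi             # and whether the backend can actually serve it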

And it is not just the location of the data. There are all the other fine-grained considerations (performance, protection, resilience, data governance, and cost) that come with the different kinds of storage infrastructure, and most data scientists do not want to think about any of them.

Why, in a cloud-native world where we have automated away so much underlying hardware complexity, is storage still so painful? The answer comes down to two words: data silos.

As long as we keep managing data through the various infrastructures it lives on, rather than focusing on the data itself, we will inevitably end up with a sprawl of storage silos. Fortunately, this is not an intractable problem. By shifting our thinking about data management from an infrastructure-centric approach to a data-centric one, we can finally collect on the promise Kubernetes made us in the first place: making storage SEP (Someone Else's Problem).

Virtualize your data

When the data you need is scattered across different storage silos, each with its own unique attributes ("cloud" or "on-premises", "object", "high-performance", and so on), you cannot simply abstract away the infrastructure considerations. Someone still has to answer all the questions about performance, cost, and data governance before you can build your pipeline. (If that someone is the IT administrator you keep asking for help, you can bet they cringe every time your name pops up, because they know they are about to spend hours in arcane infrastructure interfaces chasing your data across all its different copies and data stores, and they will not be finished before lunch.)

The only way to cure this headache, and the only way to truly get the speed and simplicity Kubernetes is supposed to give you, is to virtualize the data. Essentially, you need to build an intelligent abstraction layer between the data and all the various storage infrastructures beneath it. That layer should let you see and access your data from anywhere, without worrying about whether a given piece of infrastructure has the right cost, location, or governance properties for the operation at hand, and without constantly creating new copies.

Doing this is not as difficult as it sounds. The key is metadata. When you can encode every data requirement, every piece of context, and every lineage consideration as metadata that travels with the data, it no longer matters which infrastructure the data resides on at any given moment. When you build a data pipeline, you can drive it entirely through that metadata, and your virtualization layer can use AI/ML to handle all the underlying data management and infrastructure considerations for you automatically.
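
As an illustration (not any particular product's schema), Kubernetes' standard label and annotation machinery already hints at what this looks like: requirements, governance, and lineage ride along with the object as metadata, so consumers never need to know which box the bytes are on. Every key and value below is a hypothetical example:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: sales-archive-2020                  # hypothetical volume
      labels:
        tier: archive                           # performance expectation
        governance: gdpr                        # compliance zone
        region: eu-west                         # physical locality, if it matters
      annotations:
        data.example.com/lineage: pos-exports   # where this data came from
    spec:
      capacity:
        storage: 500Gi
      accessModes:
        - ReadOnlyMany
      nfs:                                      # placeholder backend; could be anything
        server: archive.example.internal
        path: /exports/sales-2020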

Utilizing infrastructure abstraction

Once the virtualization layer is in place and data is managed through metadata, you can do all sorts of things that were not possible before.

  1. Eliminate data silos: It no longer matters which infrastructure your data lives on or where you need it. To your application, all of those previously isolated storage resources (on-premises, cloud, hybrid, archive) look like one common global namespace.

  2. Access storage resources programmatically: Because you are dealing with metadata (rather than a pile of different underlying hardware infrastructures), you can now set up pipelines and access data through declarative statements: "I need data with this level of performance," and that is all you have to say. The intelligent virtualization layer figures out how to make it happen; neither your application nor your overburdened IT administrator ever has to specify the how. (The first sketch after this list shows what this can look like.)

  3. Make data management self-service: Data scientists should not have to compare the costs of different storage types, enable data protection, or check that security and compliance requirements are met every time they build a pipeline. (For that matter, your IT and security teams probably do not want data scientists making those choices either, unless they want everything running on the most expensive storage without proper compliance.) Once data management is decoupled from the infrastructure via metadata, all of that goes away. The storage administrator sets up the guardrails by configuring basic policies once; users then self-serve most of their data management needs without opening a ticket, and without the errors that come from making those calls by hand every time a pipeline is built. (The first sketch after this list shows both halves of this arrangement.)

  4. Continuously enrich your data: When the system supports customizable, extensible metadata, you can do all kinds of interesting things. For example, you can build a recursive process: run data through the system, get some results, write those results back into the metadata, then run the job again. You begin to accumulate deep contextual understanding around the data itself. The more the data is processed and used, the more useful it becomes to every future job. And that intelligence is now available everywhere, to any other application or data scientist who wants it; it is not trapped in one copy hidden away on a storage silo somewhere. (The second sketch after this list shows one way results can flow back into metadata.)
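
Here is a minimal sketch of points 2 and 3 together, using plain Kubernetes objects (the class name, CSI driver, and parameters are placeholders; real parameters vary by provisioner). The storage administrator encodes the policy once as a StorageClass; the data scientist then self-serves with a declarative claim and never specifies how the storage is provided:

    # Set up once by the storage admin: policy, not plumbing.
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: fast-encrypted             # hypothetical policy name
    provisioner: csi.example.com       # placeholder CSI driver
    parameters:                        # provisioner-specific; placeholders here
      performanceTier: high
      encrypted: "true"
    reclaimPolicy: Retain
    allowVolumeExpansion: true
    ---
    # Self-service by the data scientist: declare the what, not the how.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: feature-store              # hypothetical claim
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: fast-encrypted # "I need data with this performance"
      resources:
        requests:
          storage: 200Gi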
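
And a sketch of point 4: after a pipeline run, results are written back onto the same object's metadata. The keys and values below are illustrative, not a standard schema; the point is that the next job, or the next data scientist, can discover this context instead of hunting for the copy that carries it:

    # The same claim after a successful run, enriched with results.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: feature-store
      labels:
        quality: validated                       # set by the pipeline on success
      annotations:
        pipeline.example.com/last-run: "2020-05-25"
        pipeline.example.com/accuracy: "0.94"    # a result fed back as context
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: fast-encrypted
      resources:
        requests:
          storage: 200Gi

A downstream job can then find enriched datasets with a simple label selector (quality=validated) rather than asking anyone where the "good" copy lives.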

Unlock the data

All of this becomes possible when you virtualize your data, because metadata is far more flexible to work with than siloed storage infrastructure. The storage considerations that used to come with setting up and orchestrating data pipelines can now be resolved for you. Your storage resources become programmable, self-service, and automatically compliant, usually without any manual intervention.

Suddenly, you are actually living in the reality that Kubernetes and software-defined storage were always supposed to deliver. Storage is software-defined, programmable, and consistent across hybrid cloud environments, regardless of the infrastructure underneath. Your data is richer and more flexible. Your IT team no longer tapes a blown-up photo of your badge to the wall for throwing darts at. Most importantly, you spend your time actually working with your data instead of worrying about where it is stored.