How to escape data gravity in a cloud native world

Hybrid cloud headlines have dominated tech news over the last year, with Microsoft investing in Azure Stack, AWS building Outposts and, not to forget, Google, which led the pack with the announcement of its Kubernetes on-premises offering last July. These are advancements in computing, but what about storage? Any truly scalable cloud architecture is built around an object storage interface that serves as a building block for many services. To name a few:

  • repository of system images
  • target for snapshots and backups of virtual machine images
  • long-term repository of logs
  • main storage location for user-generated content such as images, videos or document uploads, depending on the application

What about applications that need to store petabytes of data?

In a cloud native world, Kubernetes is the fundamental building block that makes running and managing applications in hybrid environments easier. The cloud native movement promised, and has delivered, improvements along many important axes:

  • faster application development
  • better quality and agility thanks to automation
  • flexibility in deployment options with hardware and infrastructure abstraction
  • fewer human errors in operations, thanks to automation and common practices

For example, today it’s possible to develop an application on a laptop with a self-contained Kubernetes environment that runs all its dependencies, build a Helm package, and deploy the exact same code in a public cloud environment on Google, AWS or Azure, or on premises with a local K8s distribution like Scality MetalK8s or Red Hat OpenShift.

For stateless applications, like a web proxy or serving static content, that do not require storage capacity beyond what can be offered inside a given hyper-converged compute cluster, this is a reality. On the other hand, if your application requires more capacity, like storing user documents, photos or videos, the dream of deploy-anywhere and automated operations falls short.

In a previous post, I outlined the significant differences and gaps between the AWS S3 API and the Google Cloud Storage REST API, or with Azure Blob Storage. These differences get even more complex with the multitude of smaller object storage services like Wasabi, Backblaze or DigitalOcean.

It’s not practical to think that an application can be developed to support Azure Blob Storage, with its object snapshots and append blobs features, while keeping that application portable to AWS. And with such a set-up, your service could go global and need to run in a region where Microsoft is not present. Do you really want to rewrite a significant piece of your code for each storage service?

There should be a way to stay independent of cloud storage interfaces and vendors, the same way it’s possible to abstract compute clouds thanks to solutions like Kubernetes.

Why is object storage a good fit for cloud native architectures?

For one, they share a lot of similarities:

  • Stateless: a great fit for the Kubernetes/pod model, offering greater flexibility
  • Abstraction of location: container and data life cycles are very different, so this is a good way to separate operational concerns and right-size capacity
  • Predictable: no locking and no complex state management
  • Optimized for heterogeneous networks: multi-cloud and geo-distributed by nature
  • Rich metadata: metadata search is key, especially across multiple clusters and multiple geographies

What if you could access data easily and transparently wherever it’s stored, whether on premises, in a different data center or in any public cloud storage service?

To my knowledge, Zenko is the only solution today that can unify the six leading public cloud storage providers behind a single API and support lifecycle policies between these heterogeneous environments. But don’t take my word for it: you can try Zenko in a few minutes by creating a sandbox environment in Orbit, Zenko’s hosted management portal!
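
To make that concrete, here is a minimal sketch with boto3: it applies a standard S3 lifecycle rule through a single S3 endpoint. The endpoint URL, credentials and bucket name are hypothetical placeholders, and how Zenko maps the bucket to backing clouds is configured on the Zenko side and not shown here.

```python
# A minimal sketch, not an official example: endpoint, credentials and
# bucket name are placeholders. It applies a standard S3 lifecycle rule;
# any cross-cloud behavior behind the bucket is configured in Zenko itself.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://zenko.example.com",   # hypothetical Zenko endpoint
    aws_access_key_id="ZENKO_ACCESS_KEY",
    aws_secret_access_key="ZENKO_SECRET_KEY",
)

# Expire objects under the "logs/" prefix after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-old-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Expiration": {"Days": 90},
        }]
    },
)
```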

And of course, feel free to disagree and comment in our forums over here. We’re eager to see object storage become part of cloud native reference architectures.

Photo by Filipe Dos Santos Mendes on Unsplash

Multi-cloud workflows to save time and money for the media & entertainment industry

The media landscape has changed drastically over the last couple of decades, and it will continue to change as technology advances. Companies are looking to lower costs across every part of the production process, so how can entertainment creation be made faster and more efficient? As the amount of data continues to grow, companies need a better storage solution.

The following is an excerpt from a live chat between Giorgio Regni, CTO, Paul Speciale, CPO, and Wally McDermid, VP Business Development, all from Scality. The group discussed how cloud workflows can save time and money for the media and entertainment industry. View the chat in its entirety and stay tuned for future ZenkoLive chats on the Zenko Forum.

The media and entertainment industry is currently being challenged by four big things:

  • Explosive and unrelenting data growth. The amount and type of content have drastically changed and continue to change. With that comes complexity for media companies in how they generate, curate and distribute that content.
  • Data gravity impeding global team collaboration. For example, how do you easily get data, your video files, from where they’re created in Europe, to your creative team maybe in the US, and then distribute it to a market in Asia? There’s a sense of data gravity, and it slows down the distribution of content internally within a media company.
  • On-demand consumption model reshaping the established landscape. The way we consume content has also drastically changed. It used to be a single television set, and now it’s any number of cell phones, tablets and laptops, and it’s YouTube, Dailymotion and Facebook.
  • Monetization pressure from new emerging business models. It used to be (to oversimplify) that a media company would broadcast a television program, insert two to four commercials every 10 or 15 minutes, and that was the business model. Now we subscribe to Netflix, and ads are embedded within videos or overlaid on top of them.

Media and entertainment customers don’t want to use the cloud just for backup or archive – they actually want to process the data.

Paul: It used to just be about one-time data movement, and now it’s much more about active data workflows. You’re really inserting the cloud services into the workflow. It’s kind of a pipeline from capture to delivery. The services are integral to that. And you need to do that in a very efficient manner.

When you talk about media, there’s always a concept of a media asset manager. How does that play with something like Zenko?

Paul: The whole idea of media asset managers is to provide the business intelligence to the end user to catalogue things, to find things. You’re going to have thousands and millions of assets spread around your on-premises system. But now imagine the idea that you have it not only on-premises but in two or three clouds.

The idea of using an intelligent metadata manager like Zenko to tag the metadata and to be able to have the application and Zenko interplay to do things like intelligent searches, seems to be a perfect marriage.

Customers are very concerned about data transfer costs. How cost-effective is a solution like Zenko?

Wally: If you are actually moving less data around the world overall, because you have a global namespace, that will certainly save you network costs. Even if it’s internal data movement from your origin server in the US to, for example, an origin server in Europe. You’ll save money just on the transfer cost across your network. But you’ll also make your workers more efficient, which is a slightly softer cost.

Content is being created so quickly, and needs to get pushed out to consumers so quickly, that anything media and entertainment companies can do to become more efficient and faster in their workflows will save them money/increase their revenue.

It used to be that the standard workflow was to start on-premises, and then flow the data to the cloud as needed. But we’re starting to hear kind of the reverse – where the origin of the storage is in the cloud.

Wally: The larger media and entertainment companies that have an existing on-premises infrastructure and data center are probably going to go on-prem to cloud. But some of the smaller companies who may not have that same expansive footprint on-prem will likely start in the cloud, do much of their work, and then just bring the important summary results back down on-prem.

Paul: The idea is to use cloud first. It’s probably the right place for collection if you have a distributed team. If you’re doing collaboration with teams across the planet, it makes sense.

There’s the old cliché that with great power comes great responsibility. I think sometimes with the cloud providers, it’s ‘with great power comes great complexity.’ I think that’s the challenge that Zenko aims to solve.

Is there a way that people can benefit from the cloud and also enable existing data to be sent to one of the big cloud providers for processing?

Paul: One of the things we realized that can be done through Zenko is to discover that data. We can discover it, we can basically assess the file system, and not move the data but import the metadata into the Zenko name space.

It’s a lighter-weight operation. It doesn’t incur movement of the heavier assets. Once it’s captured in Zenko, there is the ability to apply policies to it. Some of those policies could be to replicate the data to the cloud, to move it to the cloud. It becomes part of the cloud workflow.

Wally: Nobody likes a wholesale lift-and-shift. Lots of companies have hundreds of filers with a bunch of data on them. The solution we’re talking about simply ingests and creates metadata from your existing data, which can remain in place for as long as you want. Over time you can use it and/or migrate it to different platforms, and you can do that on your own schedule.

How does the cloud impact a collaboration workflow?

Paul: So much of the media industry has been focused around how to move data efficiently between teams. But ultimately, isn’t it better if they’re not moved around?

The cloud provides a global view and global access. But a system like Zenko is even better, because on top of that it can abstract a global namespace. That global namespace will not only consist of one cloud, it may be multiple clouds, and as we just talked about here it can also include the on-premises systems. I think from a sharing and collaboration perspective, having a global namespace – there’s nothing better than that.

Photo by Stuart Grout

With MetalK8s, Scality puts Kubernetes on Bare Metal

Today, we are breaking convention to make it easier to run Kubernetes (K8s) on bare metal servers to support stateful applications that need stable, persistent storage. To do this, we built MetalK8s, an open source, opinionated K8s distribution, to simplify on-premises deployments. It is available now on GitHub, under the Apache Software License v2.

We made the early design choice to trust K8s as the infrastructure foundation for Scality’s Zenko Multi-Cloud Data Controller. Since Zenko is meant to run in any cloud environment and also on-premises for the most demanding enterprise customers, K8s is the perfect tool to run Zenko the same way whether it’s on Amazon, Azure, Google Cloud, in a private cloud or on bare metal.

Because the on-premises bare metal K8s experience is not prevalent in the community, we decided to take the time to find the best way to deploy Kubernetes on bare metal servers for Zenko.

Why are we choosing to go Bare Metal?

Kubernetes itself grew up in virtualized environments, which is natural given its purpose of orchestrating distributed container environments. We realized that very few people dare to run K8s on bare metal, and most have no choice but to run it on virtual infrastructure. In the course of our development, though, we discovered that there are several huge benefits to be gained from deploying on bare metal. But this is only true when developers and operators find all of the tools they need for smooth, long-term operations.

While developing Zenko on Kubernetes, we required efficient access to stateful local storage for both metadata and data. Moreover, as Zenko is a distributed environment, we really wanted to keep compute close to the machine that holds the local storage. For applications that require this type of storage access efficiency, the K8s environment has never been optimal: by default, K8s tends to force the use of an expensive SAN or cloud block-storage volumes. With MetalK8s, we resolve this problem and enable fast local storage access for container-based applications.

Why an Opinionated Distribution?

We chose to go the ‘opinionated’ route because we made specific choices in the course of our development: MetalK8s aims to provide great functionality while reducing complexity for users and delivering the stateful storage efficiencies described earlier.

Our team specifically chose to leverage an existing project rather than reinvent the wheel, so we based MetalK8s on top of the excellent open-source Kubespray ‘playbook’ project. Kubespray lets us reliably install a base Kubernetes cluster, along with its dependencies (e.g., the etcd distributed database system), using the Ansible provisioning tool. This allowed us to quickly iterate and implement the features we need to run Kubernetes at the scale needed by Scality customers. This is where our own Scality DevOps team excels, so it stayed in line with our focus on ease of operations. Contrary to Kubespray’s general-purpose approach, we decided to make hard choices, like using Calico as the only Container Network Interface (CNI) implementation. Further, an “ingress controller” is deployed by default, based on Nginx. And for simplicity, all of these are managed as Helm packages.

The installation is further augmented with a set of powerful operational tools for monitoring and metering, including Prometheus, Grafana, Elasticsearch and Kibana.

Unlike hosted Kubernetes solutions, where network-attached storage is available and managed by the provider, MetalK8s assumes no such system is available in the environments where it is deployed. This means its focus is on managing node-local storage and exposing local volumes to containers managed in the cluster.
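
To illustrate what exposing node-local volumes means in Kubernetes terms, here is a minimal sketch using the official Kubernetes Python client. The node name, disk path and storage class are hypothetical placeholders, and MetalK8s has its own provisioning workflow; this only shows the kind of node-local PersistentVolume object involved.

```python
# A minimal sketch, not MetalK8s' actual provisioning code: node name,
# disk path and storage class are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pv = client.V1PersistentVolume(
    metadata=client.V1ObjectMeta(name="zenko-metadata-pv"),
    spec=client.V1PersistentVolumeSpec(
        capacity={"storage": "100Gi"},
        access_modes=["ReadWriteOnce"],
        persistent_volume_reclaim_policy="Retain",
        storage_class_name="local-storage",
        # The volume is backed by a disk that physically sits in one node...
        local=client.V1LocalVolumeSource(path="/mnt/disk1"),
        # ...so node affinity pins any pod using it to that node,
        # keeping compute next to the data.
        node_affinity=client.V1VolumeNodeAffinity(
            required=client.V1NodeSelector(
                node_selector_terms=[client.V1NodeSelectorTerm(
                    match_expressions=[client.V1NodeSelectorRequirement(
                        key="kubernetes.io/hostname",
                        operator="In",
                        values=["metal-node-1"],
                    )]
                )]
            )
        ),
    ),
)
core.create_persistent_volume(body=pv)
```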

Contributing Back

The team plans to work with upstream projects including Kubespray, Kubernetes, Helm Charts and others to release back all useful contributions and eventually implement new features.

You can learn more from Scality Architect Nicolas Trangez at OpenStack Summit in Vancouver on Tuesday, May 22 at 3:10pm PDT, in the Convention Centre West – Level One – Marketplace Demo Theater. Scality’s CTO Giorgio Regni and the Zenko team are also available for interviews on site: please book time with them.

Nicolas Trangez co-authored this post.

Four critical differences between Google Cloud Storage and Amazon S3 APIs

Many junior DevOps engineers have floated the pipe dream that you could simply point any application to any cloud storage without ever touching the code. As it turns out, that’s not such a pie-in-the-sky idea. Zenko abstracts all major clouds under a single namespace and a single API, namely the AWS S3 API, and this removes the headache of supporting multiple APIs from the get-go.
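
As a minimal illustration of that single-API idea, the sketch below points an ordinary boto3 S3 client at a hypothetical Zenko endpoint; the endpoint URL, credentials and bucket name are placeholders, and the point is simply that the same client code runs unchanged against AWS or Zenko.

```python
# A minimal sketch, assuming a hypothetical Zenko endpoint and placeholder
# credentials: only endpoint_url and keys change between AWS and Zenko.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://zenko.example.com",  # hypothetical Zenko endpoint
    aws_access_key_id="ZENKO_ACCESS_KEY",
    aws_secret_access_key="ZENKO_SECRET_KEY",
)

# Standard S3 calls work unchanged; the multi-cloud routing happens
# behind the bucket, on the Zenko side.
s3.put_object(Bucket="my-bucket", Key="hello.txt", Body=b"hello multi-cloud")
print(s3.get_object(Bucket="my-bucket", Key="hello.txt")["Body"].read())
```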

It’s a common misconception that cloud storage APIs are similar enough that moving from one provider to another is just a matter of changing a host name in a configuration file. This may have been mostly true in the early days of the cloud but, as you’ll see, it’s far from true now.

Let’s compare key elements of the Google Cloud Storage API (GCS) to the AWS S3 API (S3):

  1. Multipart Upload or how to efficiently upload large pieces of data
  2. Object-level tagging or how to assign easily searchable metadata to objects
  3. Object versioning or protecting against accidental deletion and providing rollback to your users
  4. Replication or how to make sure there’s always a copy of your data somewhere else

Here’s a quick summary:

  • Multipart upload: in GCS, the application needs to implement the logic; in S3, the API tracks the pieces for you.
  • Object-level tagging: not available in GCS; supported in S3 since November 2016.
  • Object versioning: in GCS, a DELETE request without a version moves the object from ‘master’ to ‘archive’ and there is no concept of a ‘version stack’; in S3, a DELETE without a version specified applies a DELETE marker to the master, and the latest version of an object can still be retrieved if the master is deleted.
  • Replication: GCS stores data redundantly with Multi-Regional Storage in a fixed manner; S3 offers flexible and dynamic control with the Cross Region Replication API.

Multipart upload

Though GCS does have a method for merging multiple objects into a single larger one, it lacks a counterpart to AWS’s popular multipart upload API. Here’s how multipart upload (MPU) works on S3:

  1. You initiate the upload by creating a multipart upload object
  2. You upload the object parts in parallel over multiple HTTP requests
  3. After you have uploaded all the parts, you complete the multipart upload.
  4. Upon receiving the complete multipart upload request, Amazon S3 constructs the object from the uploaded parts.

In that model, S3 keeps track of all the uploaded parts of an MPU and manages the state of the upload for you; aborting an MPU, for example, removes all associated parts. The object only appears in your bucket after the upload is complete.
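
Sketched with boto3, the flow looks roughly like this; the bucket, key and file name are placeholders, and retries and error handling are omitted.

```python
# A minimal sketch of S3 multipart upload with boto3, assuming a bucket
# named "my-bucket" and a local file "big.bin".
import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "big.bin"
part_size = 8 * 1024 * 1024  # 8 MiB parts (minimum part size is 5 MiB)

# 1. Initiate the multipart upload.
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []

# 2. Upload each part; S3 tracks them under the UploadId.
with open("big.bin", "rb") as f:
    part_number = 1
    while True:
        chunk = f.read(part_size)
        if not chunk:
            break
        resp = s3.upload_part(
            Bucket=bucket, Key=key, PartNumber=part_number,
            UploadId=mpu["UploadId"], Body=chunk,
        )
        parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
        part_number += 1

# 3. Complete the upload; S3 stitches the parts into a single object.
s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)
```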

In GCS, you’re in charge of keeping track of each part and piecing the parts together, and you have to write the corresponding logic:

  1. You upload “parts” of your object as individual objects in a bucket
  2. You perform a compose operation on that list of objects, limited to 32 items per operation
  3. You repeat the compose operation in batches of 32 until the final object is stitched together

This is clearly a cumbersome process. It’s possible to compose in parallel to stitch a large object together faster, but it’s not trivial and requires somewhat complex logic on the client side.
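
For comparison, here is a rough sketch of that compose loop with the google-cloud-storage Python client; the bucket and part-naming scheme are hypothetical, and a production version would also clean up the intermediate part objects.

```python
# A minimal sketch of the GCS "compose" approach, assuming the parts were
# already uploaded as objects named "big.bin.part-0", "big.bin.part-1", ...
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")

part_names = [f"big.bin.part-{i}" for i in range(100)]  # hypothetical parts
parts = [bucket.blob(name) for name in part_names]

# Compose in batches of 32 (the per-operation limit), accumulating the
# result into a single destination object until everything is stitched.
destination = bucket.blob("big.bin")
destination.compose(parts[:32])
for i in range(32, len(parts), 31):
    # Each round composes the partial result with up to 31 more parts.
    destination.compose([destination] + parts[i:i + 31])
```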

Developers also need to keep in mind that GCS allows a maximum of 1,024 components, while S3 allows 10,000 parts; both share the same 5 TB maximum object size.

Update: On June 21, 2018 GCS removed the limit on the number of components in a composite object. Learn more on our forum.

Object-level tagging

Object tagging is a way to categorize data with multiple key-value pairs. It’s a useful way to locate data and is much more powerful than object-name prefix-based search. You can think of object tags as similar to Gmail labels, as opposed to filesystem folders. Object tags can also influence S3 lifecycle and cross-region replication policies. This API is relatively new to S3, but unfortunately it has no equivalent in GCS yet.

This functionality cannot be migrated from S3 to GCS, so check whether your application relies on tagging.
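
For reference, this is roughly what object tagging looks like on the S3 side with boto3; the bucket and key are placeholders.

```python
# A minimal sketch of S3 object tagging with boto3, assuming an existing
# object "report.pdf" in "my-bucket".
import boto3

s3 = boto3.client("s3")

# Attach key-value tags to an existing object.
s3.put_object_tagging(
    Bucket="my-bucket",
    Key="report.pdf",
    Tagging={"TagSet": [
        {"Key": "project", "Value": "zenko"},
        {"Key": "confidential", "Value": "true"},
    ]},
)

# Read the tags back.
print(s3.get_object_tagging(Bucket="my-bucket", Key="report.pdf")["TagSet"])
```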

Object versioning

Both GCS and S3 support object versioning and enable the retrieval of objects that have been deleted or overwritten. But the two implementations differ in subtle ways that make them not fully interchangeable.

Think of AWS object versioning as a stack of versions ordered by time:

  • Each object has a master version that always points to the most recent entry in the stack
  • Any operation that doesn’t specify a version works on that master version
  • This includes delete operations, i.e. deleting an object without specifying a version creates a DELETE MARKER
  • It’s possible to get or delete a specific version by using a version ID

GCS behaves differently: for each object, it maintains a MASTER version and ARCHIVE versions:

  • Deleting an object without specifying a version ID moves it from master to archive and does not create a DELETE MARKER
  • Deleting a master object by using its version ID permanently destroys its data and does not move it to the archive
  • There’s no concept of a stack, so even if an archive version of an object exists, deleting the master version does not promote the archive to master; a get operation on the object will return a 404 Not Found code

These differences are not obvious, and the two versioning implementations are not interchangeable.
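
The S3 side of this behavior can be sketched with boto3 as follows, assuming a bucket that already has versioning enabled; bucket and key names are placeholders.

```python
# A minimal sketch of S3 versioning behavior with boto3, assuming a
# versioning-enabled bucket "my-versioned-bucket".
import boto3

s3 = boto3.client("s3")
bucket, key = "my-versioned-bucket", "doc.txt"

v1 = s3.put_object(Bucket=bucket, Key=key, Body=b"first")["VersionId"]
v2 = s3.put_object(Bucket=bucket, Key=key, Body=b"second")["VersionId"]

# Delete without a version ID: S3 adds a DELETE MARKER on top of the stack.
s3.delete_object(Bucket=bucket, Key=key)

# A plain GET now returns 404, but every version is still there and
# retrievable by its version ID.
print(s3.get_object(Bucket=bucket, Key=key, VersionId=v1)["Body"].read())  # b"first"
print(s3.get_object(Bucket=bucket, Key=key, VersionId=v2)["Body"].read())  # b"second"
```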

Replication

Replication is a way to copy objects across buckets in different geographical locations, increasing both data protection and availability. It’s a storage best practice: keeping a remote copy is one of the best insurance policies you can have, and it increases your data durability.

S3 supports replication through its Cross Region Replication (CRR) API, including two-way synchronization of buckets.

GCS doesn’t have a replication API and lacks the flexibility of S3 CRR, but it can still store data redundantly across locations through a Multi-Regional Storage bucket location. This means that GCS stores your data redundantly in at least two geographic places, separated by at least 100 miles, within the multi-regional location of the bucket, but you cannot precisely control which regions, as you can with AWS.

Both GCS and S3 provide geo-redundant storage, but the AWS implementation supports more locations, more flexibility and more API control.
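
For reference, configuring CRR on the S3 side is a single API call. In the sketch below, the IAM role ARN and bucket names are hypothetical placeholders, and both buckets are assumed to already have versioning enabled.

```python
# A minimal sketch of S3 Cross Region Replication setup with boto3.
# Role ARN and bucket names are placeholders; source and destination
# buckets must both have versioning enabled.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="my-source-bucket",  # e.g. in us-east-1
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/replication-role",
        "Rules": [{
            "ID": "replicate-everything",
            "Prefix": "",      # empty prefix = replicate all objects
            "Status": "Enabled",
            "Destination": {"Bucket": "arn:aws:s3:::my-destination-bucket"},
        }],
    },
)
```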

Key takeaway: two incompatible cloud storage protocols

The GCS and AWS S3 APIs are not interchangeable, and migrating from one to the other requires significant adaptation of your application and client logic. Among object-storage-compatible applications, S3 is clearly the most widely supported API by far. That’s why we decided to implement the Amazon S3 API for our multi-cloud controller, Zenko.

You can try a sandbox version of Zenko in minutes with Zenko Orbit, our hosted management portal.

How to Mirror Cloud Data from Microsoft Azure to AWS S3

Maz from our team at Scality has been working on a simple guide explaining how to use Zenko and Orbit to replicate data between Microsoft Azure Blob storage and Amazon S3.

This is a very important milestone for us, as it shows how easy it is to create an account, log into the Zenko management portal, create a Zenko sandbox and start replicating data between two completely different public clouds with the replication wizard, no command line required. – Giorgio Regni

Why is this newsworthy?

It is all about data durability and availability!

Replicating your data across different providers is a great way to increase its protection and guarantee that your data will always be available, even in the case of a catastrophic failure:

  • In terms of durability, we now have two independent services, each of which has a durability of eleven 9’s. By storing data across both clouds, we can increase our data durability to “22 9’s”, which makes a data loss event a statistically negligible probability (see the quick arithmetic after this list).
  • We can also take advantage of immutability through object versioning in one or more of the cloud services, for even greater protection. We have also gained disaster recovery (D/R) protection, meaning the data is protected in the event of a total site disaster or loss.
  • In terms of data availability, what are the chances that two cloud regions in one service (for example, AWS US East and AWS US West) are unavailable at the same time? Stretching this further, what are the chances that two INDEPENDENT cloud services such as AWS S3 and Azure Blob Storage are unavailable at the same time?
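
As a back-of-the-envelope check on the “22 9’s” figure, assuming (simplistically) that loss events in the two clouds are independent:

```python
# A rough back-of-the-envelope sketch, assuming (simplistically) that
# object-loss events in the two clouds are independent.
single_cloud_loss = 1 - 0.99999999999   # eleven 9's of durability -> ~1e-11 loss probability
combined_loss = single_cloud_loss ** 2  # both independent copies lost
print(f"single-cloud loss probability: {single_cloud_loss:.0e}")  # ~1e-11
print(f"combined loss probability:     {combined_loss:.0e}")      # ~1e-22, i.e. "22 9's"
```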

Download the ebook or read it here: