How to move data to Google Cloud Storage with Zenko

If you want to take advantage of the strengths of different public clouds, you often have to move your data. Take machine learning, where Google Cloud Platform seems to have taken the lead: if you want to use TensorFlow as a service, your training datasets have to be copied to GCP. On top of that, managing data at the application level (in my case, an ML application) was giving me a headache.

I used to move data to the cloud with ad-hoc solutions, but that is inefficient and can leave a lot of abandoned data occupying space. With Zenko, you can copy or move data to Google Cloud while keeping track of stray files, controlling your costs and making the process less painful.

The limits of uploading data straight into GCP

A common objection to installing Zenko is: why not simply upload the data straight into the cloud?

It depends on what you are doing. Google offers the gsutil CLI tool and the Storage Transfer Service. The first is slow and good for small, one-time transfers, though you have to be careful not to terminate your command because gsutil can’t resume an interrupted transfer. The Storage Transfer Service is scheduled as a job on GCP, so you don’t have to babysit it, but if you transfer data from an external source you pay egress and operational GCP fees for using it. It’s also worth mentioning rclone: it is handy for transferring data to GCP but doesn’t manage transfers at the object level.

Zenko is an open source tool you can use to transfer and manage data between your on-prem location and any public cloud. The key difference is that you can use one tool to continuously manage, move, back up, change and search the data.


Manage your data in any cloud – Try Zenko

Move that data

Step 0 – Setup Zenko

You will need to set up your Zenko instance and register it on Zenko Orbit to proceed with this tutorial. If you haven’t completed that step, follow the Getting Started guide.

Step 1 – Create a bucket in Zenko local filesystem

This bucket (or multiple buckets) will be the transfer point for your objects. Zenko follows the same general bucket naming rules as the AWS object storage world, so keep those in mind when naming buckets on Zenko.

Creating a bucket on Zenko local filesystem
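If you prefer to script this step instead of using the Orbit UI, the same bucket can be created with any S3 client pointed at your Zenko endpoint. The endpoint URL and bucket name below are placeholders:

% aws s3api create-bucket --bucket my-transfer-bucket --endpoint-url http://zenko.local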

Step 2 – Prepare your GCP buckets and access keys

For each bucket in GCP storage that you want to add to Zenko, create another bucket whose name ends in “-mpu”. For example, if you want a bucket in GCP named “mydata”, you’ll have to create two buckets: one called “mydata” and another called “mydata-mpu”. This is needed because of the way Zenko abstracts away the differences between public cloud providers. The S3 protocol splits big objects into parts, uploads them in parallel to speed up the process, and stitches them back together once all the parts are uploaded. GCP doesn’t have this concept, so Zenko needs the extra bucket to simulate multipart upload (it’s one of the four differences between the S3 and Google storage APIs we discussed before).

Creating “-mpu” bucket on GCP for multipart upload
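If you would rather create the GCP buckets from the command line than from the console, gsutil can do it; the bucket names and region below are only examples:

% gsutil mb -l us-east1 gs://mydata
% gsutil mb -l us-east1 gs://mydata-mpu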

Find or create your access and secret keys to the GCP storage service to authorize Zenko to write to it.

Creating/getting access and secret keys from GCP Storage
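If you use a service account for Zenko, recent versions of gsutil can also generate the interoperability (HMAC) access and secret keys; the service account address below is a placeholder:

% gsutil hmac create my-zenko-sa@my-project.iam.gserviceaccount.com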

Step 3 – Add your Google Cloud buckets to Zenko

You need to authorize access to the newly created GCP buckets by adding the keys (follow the instructions in the animation above). In this example, I have three buckets on GCP, all in different regions. I will add all three to Zenko and later set rules that decide which data goes to which GCP region.

Adding GCP buckets to “Storage locations” in Zenko

Now you can set up the rules and policies that will move objects to the cloud. If your objective is moving data to GCP, you have two options: replication or transition policies.

You can replicate data to Google Cloud Storage, with as many rules as you like for different kinds of data. Zenko queues each new object for replication using Kafka and, if replication fails, keeps retrying.

Here is how to set a rule for replication. I am not specifying any prefixes for objects I wish to replicate but you can use this feature to distinguish between objects that should follow different replication rules.

Setting up object replication rules to GCP Storage
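Once a rule is in place, a quick way to check from the command line that an object was picked up is to head it through the S3 API; Zenko tracks a replication status per object, which the standard head-object call should surface. The endpoint, bucket and key below are placeholders:

% aws s3api head-object --bucket my-transfer-bucket --key training/images.tar.gz --endpoint-url http://zenko.local
# Look for a ReplicationStatus of COMPLETED (PENDING while Backbeat is still working through the queue)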

Another way to move data with Zenko is through a transition policy. You can specify when and where an object will be transferred. In this case, the current version of the object in the Zenko local bucket is transferred to a specified cloud location, in my example the GCP region in Tokyo.

Creating a transition policy from Zenko to GCP Storage

As you can see, there is no need for manual work. You just set up your desired storage locations once and create the rules that all incoming data will follow. That data could be produced by your application every day (Zenko is just an S3 endpoint) or be a big dataset you wish to move to GCP without babysitting the migration.
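Because Zenko speaks plain S3, your existing tooling keeps working. For instance, a nightly job could sync an application’s output directory into the transfer bucket and let the replication or transition rules take it from there; the paths, bucket and endpoint are placeholders:

% aws s3 sync /var/data/exports s3://my-transfer-bucket/exports --endpoint-url http://zenko.local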


Manage your data in any cloud – Try Zenko

For more information, ask a question on the forum.

How to reset Zenko queue counters

The object counters for target clouds can get out of sync when objects are deleted before they are replicated across regions (CRR), or when deleted or old versions of objects are removed before the delete operations are executed on the target cloud. If this happens, you need to reset the Zenko queue counters in Redis; the instructions below show how.

Step-by-step guide

To clear the counters you first need to make sure the replication queues are empty and then reset the counters in Redis.

1) To check the queues, set maintenance.enabled=true and maintenance.debug.enabled=true for the deployment. You can do this by enabling the values in the chart and running a “helm upgrade”, or by passing them on the command line like this:

% helm upgrade my-zenko -f options.yml --set maintenance.enabled=true --set maintenance.debug.enabled=true zenko

This enables some extra pods for performing maintenance activities and debugging. After it’s done deploying, make sure the “my-zenko-zenko-debug-kafka-client” pod is running.
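A quick way to confirm the debug pods came up (the pod names will vary with your release name):

% kubectl get pods | grep debug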

2) Then you can enter the pod and check the queues:

% kubectl exec -it [kafka-client pod] bash 

# List the available queues (replacing "my-zenko-zenko-queue" with "[your name]-zenko-queue")
root@[pod-name]/opt/kafka# ./bin/kafka-consumer-groups.sh --bootstrap-server my-zenko-zenko-queue:9092 --list

3) Identify the target cloud replication groups relevant to the counters you want to reset and check the queue lag like this:

root@[pod-name]/opt/kafka# ./bin/kafka-consumer-groups.sh --bootstrap-server my-zenko-zenko-queue:9092 --group backbeat-replication-group-example-location --describe

Check the “LAG” column for pending actions; the lag should be zero if the queues are empty. If the queues for all of the targets are quiescent, we can move on.

4) Now we can head over to a Redis pod and start resetting counters.

% kubectl exec -it my-zenko-redis-ha-server-0 bash
no-name!@galaxy-z-redis-ha-server-0:/data$ redis-cli KEYS [location constraint]* |grep pending

# (for example: redis-cli KEYS aws-eu-west-1* |grep pending)
# This will return two keys, one for bytespending and one for opspending
no-name!@galaxy-z-redis-ha-server-0:/data$ redis-cli KEYS aws-eu-west-1* |grep pending
aws-eu-west-1:bb:crr:opspending
aws-eu-west-1:bb:crr:bytespending

# Set the counters to 0
no-name!@galaxy-z-redis-ha-server-0:/data$ redis-cli SET aws-eu-west-1:bb:crr:opspending 0
OK
no-name!@galaxy-z-redis-ha-server-0:/data$ redis-cli SET aws-eu-west-1:bb:crr:bytespending 0
OK

Do this for each target location that you wish to clear.
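If you have several locations to reset, a small loop from inside the Redis pod saves some typing; the location names here are just examples:

# Reset the pending counters for several locations at once
for loc in aws-eu-west-1 gcp-us-east1; do
  redis-cli SET ${loc}:bb:crr:opspending 0
  redis-cli SET ${loc}:bb:crr:bytespending 0
done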

Failed Object Counters

Failed object markers for a location will clear out in 24 hours (if they are not manually or automatically retried). You can force them to clear by zeroing out the “failed” counters: find the keys with “failed” in the name and delete them. Something like this:

##
# Grep out the redis keys that house the failed object pointers
no-name!@galaxy-z-redis-ha-server-0:/data$ redis-cli KEYS aws-eu-west-1* |grep failed

##
# Now delete those keys
no-name!@galaxy-z-redis-ha-server-0:/data$ redis-cli DEL [key name]
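If a location has many failed keys, you can chain the lookup and the delete; double-check what the KEYS call returns before removing anything (the key pattern is an example):

no-name!@galaxy-z-redis-ha-server-0:/data$ redis-cli KEYS aws-eu-west-1* |grep failed | xargs -n 1 redis-cli DEL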

Developing and debugging a highly distributed system can be hard, and sharing what we learn is a way to help others. For everything else, please use the forum to ask more questions 🙂

Photo by Nick Hillier on Unsplash

How to manage data automatically with custom Backbeat extensions

Backbeat, a key Zenko microservice, dispatches work to long-running background tasks. Backbeat uses Apache Kafka, the popular open-source distributed streaming platform, for scalability and high availability. This gives Zenko functionalities like:

  • Asynchronous multi-site replication
  • Lifecycle policies
  • Metadata ingestion (supporting Scality RING today, with other backends coming soon)

As with the rest of the Zenko stack, Backbeat is an open-source project, with code organized to let you use extensions to add features. Using extensions, you can create rules to manipulate objects based on metadata logs. For example, an extension can recognize music files by artist and move the objects into buckets named after the artist. Or an extension can automatically move objects to separate buckets, based on data type (zip, jpeg, text, etc.) or on the owner of the object.

All Backbeat interactions go through CloudServer, which means they are not restricted to one backend and you can reuse existing solutions for different backends.

The Backbeat service publishes a stream of bucket and object metadata updates to Kafka. Each extension applies its own filters to the metadata stream, picking only metadata that meets its filter criteria. Each extension has its own Kafka consumers that consume and process metadata entries as defined.

To help you develop new extensions, we’ve added a basic extension called “helloWorld.” This extension filters the metadata stream to select only objects whose key name is “helloworld” (case insensitive) and, when processing each metadata entry, applies a basic AWS S3 putObjectTagging where the key is “hello” and the value is “world.”

This example extension shows:

  • How to add your own extension using the existing metadata stream from a Zenko 1.0 deployment
  • How to add your own filters for your extension
  • How to add a queue processor to subscribe to and consume from a Kafka topic

There are two kinds of Backbeat extensions: populators and processors. The populator receives all the metadata logs, filters them as needed, and publishes them to Kafka. The processor subscribes to the extension’s Kafka topic, thus receiving these filtered metadata log entries from the populator. The processor then applies any required changes (in our case, adding object tags to all “helloworld” object keys).

Try Zenko now! – https://www.zenko.io/try-zenko

Example

Begin by working on the populator side of the extension. Within Backbeat, add all the configs needed to set up a new helloWorld extension, following the examples in this commit. These configurations are placeholders. Zenko will overwrite them with its own values, as you’ll see in later commits.

Every extension must have an index.js file in its extension directory (“helloWorld/” in the present example). This file must contain the extension’s definitions in its name, version, and configValidator fields. The index.js file is the entry point for the main populator process to load the extension.

Add filters for the helloWorld extension by creating a new class that extends the existing architecture defined by the QueuePopulatorExtension class. It is important to add this new filter class to the index.js definition as “queuePopulatorExtension”.

On the processor side of the extension, you need to create service accounts in Zenko to be used as clients to complete specific S3 API calls. In the HelloWorldProcessor class, this._serviceAuth is the credential set we pass from Zenko to Backbeat to help us perform the putObjectTagging S3 operation. For this demo, borrow the existing replication service account credentials.

Create an entry point for the new extensions processor by adding a new script in the package.json file. This part may be a little tricky, but the loadManagementDatabase method helps sync up Backbeat extensions with the latest changes in the Zenko environment, including config changes and service account information updates.

Instantiate the new extension processor class and finish the setup of the class by calling the start method, defined here.

Update the docker-entrypoint.sh file. These variables point to specific fields in the config.json file. For example, “.extensions.helloWorld.topic” points to the config.json value currently defined as “topic”: “backbeat-hello-world”.

These variable names (e.g. EXTENSION_HELLOWORLD_TOPIC) are set when Zenko is upgraded or deployed as a new Kubernetes pod, which updates these config.json values in Backbeat.
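Once the new extension is deployed (see the remaining steps below), one way to sanity-check that the populator is publishing filtered entries is to tail the extension’s topic from a Kafka client pod in the cluster; the bootstrap server and topic names are the ones used in this example and may differ in your deployment:

% kubectl exec -it [kafka-client pod] bash
root@[pod-name]/opt/kafka# ./bin/kafka-console-consumer.sh --bootstrap-server my-zenko-zenko-queue:9092 --topic backbeat-hello-world --from-beginning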

Finally, add the new extension to Zenko. You can see the variables defined by the Backbeat docker-entrypoint.sh file in these Zenko changes.

Some config environment variables are less obvious to add because we did not include them in our extension configs, but they are necessary for running some of Backbeat’s internal processes. Also, because this demo borrows the replication service accounts, those variables (EXTENSIONS_REPLICATION_SOURCE_AUTH_TYPE, EXTENSIONS_REPLICATION_SOURCE_AUTH_ACCOUNT) must be defined as well.

Upgrade the existing Zenko deployment with:

$ helm upgrade --set ingress.enabled=true --set backbeat.helloworld.enabled=true zenko zenko

Here the Kubernetes deployment name is “zenko”. You must also update the “backbeat” Docker image with the new extension changes.
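A rough sketch of rebuilding and publishing that image (the registry name is a placeholder, and the exact chart value that points Backbeat at your image depends on your Zenko chart version, so treat the --set path as an assumption):

% docker build -t myregistry/backbeat:helloworld .
% docker push myregistry/backbeat:helloworld
# Hypothetical value path; check your chart's values.yaml for the real one
% helm upgrade --set backbeat.image.tag=helloworld zenko zenko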

With the Helm upgrade, you’ve added a new Backbeat extension! Now whenever you create an object with the key name “helloworld” (case insensitive), Backbeat automatically adds an object tag with a “hello” key and a “world” value to the object.
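To verify the whole chain, create a “helloworld” object through the S3 API and read back its tags; the endpoint, bucket and file are placeholders:

% aws s3api put-object --bucket mybucket --key helloworld --body ./somefile --endpoint-url http://zenko.local
% aws s3api get-object-tagging --bucket mybucket --key helloworld --endpoint-url http://zenko.local
# Expect a TagSet entry with Key "hello" and Value "world" once the processor has run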

Have any questions or comments? Please let us know on our forum. We would love to hear from you.

Photo by Jan Antonin Kolar on Unsplash

How to do Event-Based Processing with CloudServer and Kubeless

We want to provide all the tools our customers need for data and storage, but sometimes the best solution is one the customer creates on their own. In this tutorial, available in full on the Zenko forums, our Head of Research Vianney Rancurel demonstrates how to set up a CloudServer instance to perform additional functions from a Python script.

The environment for this instance includes a modified version of CloudServer deployed in Kubernetes (Minikube will also work) with Helm, AWS CLI, Kubeless and Kafka. Kubeless is a serverless framework designed to be deployed on a Kubernetes cluster, which lets users call functions written in several languages through Kafka triggers (full documentation). We’re taking advantage of this feature to call a Python script that produces two thumbnails of any image that is uploaded to CloudServer.

The modified version of CloudServer will generate Kafka events in a specific topic for each S3 operation. When a user uploads a photo, CloudServer pushes a message to the Kafka topic and the Kafka trigger runs the Python script to create two thumbnail images based on the image uploaded.
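As an illustration of the plumbing (not Vianney’s exact commands), deploying a Python function with Kubeless and wiring it to a Kafka topic looks roughly like this; the function file, handler and topic name are assumptions:

% kubeless function deploy thumbnailer --runtime python3.6 --from-file thumbnail.py --handler thumbnail.handler
% kubeless trigger kafka create thumbnailer-trigger --function-selector created-by=kubeless,function=thumbnailer --trigger-topic cloudserver-events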

This setup allows users to create scripts in popular languages like Python, Ruby and Node.js to configure the best solutions to automate their workflows. Check out the video below to see Kubeless and Kafka triggers in action.

If you prefer a written description, follow Vianney’s full tutorial on the Zenko forum.

Photo by Yann Allegre on Unsplash

How to use Azure Video Indexer to add metadata to files stored anywhere

As the media and entertainment industry modernizes, companies are leveraging private and public cloud technology to meet the ever-increasing demands of consumers. Scality Zenko can be integrated with existing public cloud tools, such as Microsoft Azure’s Video Indexer, to help “cloudify” media assets.

Azure’s Video Indexer utilizes machine learning and artificial intelligence to automate a number of tasks, including face detection, thumbnail extraction and object identification. When paired with the Zenko Orbit multi-cloud browser, the metadata created by the Indexer can be automatically imported as tags into Zenko Orbit.

Check out the demo of Zenko Orbit and Video Indexer to see them in action. A raw video file—with no information on content beyond a filename—is uploaded with Zenko Orbit, automatically indexed through the Azure tool, and the newly created metadata is fed back into Zenko as tags for the video file. Note that Orbit also supports user-created tags, so more information can be added if Indexer misses something important.

Why is this relevant?

  • Applications don’t need to support multiple APIs to use the best cloud features. Zenko Orbit uses the S3 APIs and seamlessly translates the calls to Azure Blob Storage API.
  • The metadata catalog is the same, wherever the data is stored. The metadata added by Video Indexer is available even if the files are expired from Azure and replicated to other locations.

Enjoy the demo:

Don’t hesitate to reach out on the Zenko Forums with questions.

Photo by Kevin Ku on Unsplash