How to move data to Google Cloud Storage with Zenko


Written By Dasha Gurova

On June 4, 2019
"

Read more

Solve the challenges of large-scale data, once and for all.

If you want to use the strengths of any public cloud, you often have to move your data. Take machine learning, where Google Cloud Platform seems to have taken the lead: if you want to use TensorFlow as a service, your training datasets have to be copied to GCP. Moreover, managing data at the application level (in my case, an ML application) was giving me a headache.

I used to move data to the cloud with ad-hoc solutions, but that is inefficient and can leave a lot of abandoned data occupying space. With Zenko, you can copy or move data to Google Cloud while keeping track of stray files, controlling your costs and making the whole process less painful.

The limits of uploading data straight into GCP

A common objection to installing Zenko is: why not simply upload the data straight into the cloud?

It depends on what you are doing. Google offers the gsutil CLI tool and the Storage Transfer Service. The first is slow and good for small, one-time transfers, though you have to make sure you don’t end up terminating your command, because gsutil can’t resume the transfer. Storage Transfer Service runs as a scheduled job on GCP, so you don’t have to guard it, but if you transfer data from an external source you pay egress and operational GCP fees for using the service. It’s also worth mentioning rclone: it is handy for transferring data to GCP but doesn’t manage the transfers at the object level.

Zenko is an open source tool you can use to transfer and manage data between your on-premises location and destinations in any public cloud. The key difference is that you can use one tool to continuously manage, move, back up, change and search the data.


Manage your data in any cloud – Try Zenko

Move that data

Step 0 – Setup Zenko

You will need to set up your Zenko instance and register it on Zenko Orbit to proceed with this tutorial. If you haven’t completed that step, follow the Getting Started guide.

Step 1 – Create a bucket in Zenko local filesystem

This bucket (or multiple buckets) will be a transfer point for your objects. The general naming rules of the AWS object storage world apply here too: follow the same rules when naming buckets on Zenko.

Creating a bucket on Zenko local filesystem
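Orbit’s UI is the simplest way to do this, but since Zenko exposes a standard S3 API you can also create the bucket with any S3 client. Below is a minimal sketch using boto3; the endpoint URL, credentials and the bucket name zenko-local are placeholders for your own deployment.

```python
# Minimal sketch: create a bucket on the Zenko local filesystem via its S3 API.
# The endpoint, credentials and bucket name below are placeholders for your deployment.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://zenko.example.com",    # your Zenko endpoint
    aws_access_key_id="ZENKO_ACCESS_KEY",
    aws_secret_access_key="ZENKO_SECRET_KEY",
)

# Bucket names must follow the usual S3 naming rules (lowercase, 3-63 chars, no underscores).
s3.create_bucket(Bucket="zenko-local")
print(s3.list_buckets()["Buckets"])
```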

Step 2 – Create your buckets on GCP

For each bucket in GCP storage that you want to add to Zenko, create a second bucket whose name ends in “-mpu”. For example, if you want a bucket in GCP named “mydata”, you’ll have to create two buckets: one called “mydata” and another called “mydata-mpu”. This is needed because of the way Zenko abstracts away the differences between public cloud providers. The S3 protocol uses multipart upload to split big objects into parts, upload them in parallel to speed up the process, and stitch them back together once all the parts are uploaded. GCP doesn’t have this concept, so Zenko needs the extra bucket to simulate multipart upload (it’s one of the four differences between the S3 and Google storage APIs we discussed before).

Creating “-mpu” bucket on GCP for multipart upload
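You can create the bucket pair from the GCP console or with gsutil, or programmatically. Here is a minimal sketch using the google-cloud-storage Python client; the project ID, region and bucket names are example values, and the client assumes GCP credentials are already configured on your machine.

```python
# Minimal sketch: create the data bucket and its "-mpu" companion on GCP.
# Assumes GCP credentials are configured (e.g. GOOGLE_APPLICATION_CREDENTIALS)
# and that "mydata" / "mydata-mpu" are available names in your project.
from google.cloud import storage

client = storage.Client(project="my-gcp-project")  # example project ID

for name in ("mydata", "mydata-mpu"):
    bucket = client.create_bucket(name, location="ASIA-NORTHEAST1")  # pick your region
    print(f"Created {bucket.name} in {bucket.location}")
```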

Find or create your access and secret keys for the GCP storage service so Zenko is authorized to write to it.

Creating/getting access and secret keys from GCP Storage
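Zenko authenticates to Google Cloud Storage through its S3-interoperable XML API, which uses HMAC access/secret key pairs. You can generate a pair in the GCP console (Storage > Settings > Interoperability) or, as sketched below, with the Python client; the service account email here is a hypothetical example.

```python
# Minimal sketch: create an HMAC key pair for a service account so Zenko can
# authenticate against Google Cloud Storage's S3-interoperable XML API.
from google.cloud import storage

client = storage.Client(project="my-gcp-project")
metadata, secret = client.create_hmac_key(
    service_account_email="zenko-writer@my-gcp-project.iam.gserviceaccount.com"  # example account
)

print("Access key:", metadata.access_id)
print("Secret key:", secret)  # store this securely; it is shown only once
```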

Step 3 – Add your Google Cloud buckets to Zenko

You need to authorize access to the newly created GCP buckets by adding the keys (follow the instructions in the animation above). In this example, I have three buckets on GCP, all in different regions. I will add all three to Zenko as storage locations and later set the rules that decide which data goes to which region on GCP.

Adding GCP buckets to “Storage locations” in Zenko

Now you can set up the rules and policies that move objects to the cloud. If your objective is moving data to GCP, you have two options: replication or transition policies.

You can replicate data to Google Cloud Storage, with as many rules as you like for different kinds of data. Zenko creates a replication queue in Kafka for each new object, and if replication fails it keeps retrying until it succeeds.

Here is how to set up a replication rule. I am not specifying any prefixes for the objects I wish to replicate, but you can use prefixes to distinguish between objects that should follow different replication rules.

Setting up object replication rules to GCP Storage
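Once the rule is active, anything written to the source bucket through Zenko’s S3 endpoint is queued for replication automatically. The sketch below simply uploads a few objects with boto3; the bucket name and key prefix are placeholders that would need to match your own rule.

```python
# Minimal sketch: write objects into the Zenko bucket covered by the replication
# rule; Zenko's replication queue picks them up and copies them to GCP.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://zenko.example.com",
    aws_access_key_id="ZENKO_ACCESS_KEY",
    aws_secret_access_key="ZENKO_SECRET_KEY",
)

for i in range(3):
    s3.put_object(
        Bucket="zenko-local",
        Key=f"training-data/batch-{i}.csv",  # a prefix can be used to match a specific rule
        Body=b"example,payload\n1,2\n",
    )
```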

Another way to move data with Zenko is through a transition policy: you specify when and where an object will be transferred. In this case, the current version of the object in the Zenko local bucket is transferred to the specified cloud location, a GCP region in Tokyo in my example.

Creating a transition policy from Zenko to GCP Storage
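Orbit configures this for you, but under the hood a transition policy maps onto an S3 lifecycle configuration in which the transition’s storage class names the Zenko location. For illustration, here is roughly what the equivalent boto3 call could look like; the location name gcp-tokyo, the one-day delay and the other values are assumptions for this example.

```python
# Rough sketch of a lifecycle transition rule, assuming a Zenko location named
# "gcp-tokyo" has already been defined in Orbit. Values here are illustrative only.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://zenko.example.com",
    aws_access_key_id="ZENKO_ACCESS_KEY",
    aws_secret_access_key="ZENKO_SECRET_KEY",
)

s3.put_bucket_lifecycle_configuration(
    Bucket="zenko-local",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "move-to-gcp-tokyo",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to all objects in the bucket
                "Transitions": [
                    {"Days": 1, "StorageClass": "gcp-tokyo"}  # Zenko location name (assumed)
                ],
            }
        ]
    },
)
```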

As you can see, there is no need for manual work. You just set up your desired storage locations once and create the rules that all incoming data will follow. That data could be produced by your application every day (Zenko is just an S3 endpoint), or it could be a big dataset you want to move to GCP without sitting there babysitting the migration.


Manage your data in any cloud – Try Zenko

For more information ask a question on the forum.
