Here is a brief overview of the architectural differences between AWS, GCP and Azure for data storage and authentication, and additional links if you wish to further deep dive into specific topics.
Working on Zenko at Scality, we have to deal with multiple clouds on a day-to-day basis. Zenko might make these clouds seem very similar, as it simplifies the inner complexities and gives us a single interface to deal with buckets and objects across all clouds. But the way actual data is stored and accessed on these clouds is very different.
Disclaimer: These cloud providers have numerous services, multiple ways to store data and different authentication schemes. This blog post will only deal with storage whose purpose is, give me some data and I will give it back to you. This means it addresses only object storage (no database or queue storage) that deals with actual data and authentication needed to manipulate/access that data. The intent is to discuss the key differences to help you decide which one suits your needs.
Each cloud has its own hierarchy to store data. For any type of object storage everything comes down to objects and buckets/containers. The below table gives a bottom-up comparison of how objects are stored in AWS, GCP and Azure.
| | AWS | GCP | Azure |
|---|---|---|---|
| Base entity | Objects | Objects | Objects, also called blobs |
| Storage class | S3 Standard, S3 Intelligent-Tiering, S3 Standard-IA, S3 One Zone-IA, S3 Glacier, S3 Glacier Deep Archive | Multi-Regional Storage, Regional Storage, Nearline Storage, Coldline Storage | Hot, Cool, Archive |
| Region | Regions and AZs | Multi-regional | Azure Locations |
| Underlying service | S3, S3 Glacier | Cloud Storage | Blob Storage |
| Management | Console, programmatic | Console, programmatic | Console, programmatic |
Following the traditional object storage model, all three clouds (AWS, GCP and Azure) store objects. Objects are identified by ‘keys’, which are essentially names/references to the objects, with the ‘value’ being the actual data. Each cloud has its own metadata engine that lets us retrieve data using keys. In Azure Storage these objects are also called “blobs”. Any key that ends with a slash (/), or with a delimiter in the case of AWS, is treated as a prefix for the underlying objects. This helps group objects into a folder-like structure and can be used for organizational simplicity.
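To make the prefix/delimiter idea concrete, here is a small, hypothetical Python sketch (not any cloud's SDK) that groups flat keys into folder-like "common prefixes", similar in spirit to what an object store's list operation returns:

```python
def common_prefixes(keys, prefix="", delimiter="/"):
    """Group flat object keys into folder-like common prefixes,
    mimicking how object stores list keys under a delimiter."""
    prefixes, objects = set(), []
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the first delimiter acts as a "folder"
            prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return sorted(prefixes), objects

keys = ["logs/2023/a.txt", "logs/2024/b.txt", "readme.md"]
print(common_prefixes(keys))  # → (['logs/'], ['readme.md'])
```

There are no real directories here: "logs/" is just a shared key prefix that listing APIs can fold together on request.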
- AWS: 5TB object size limit with 5GB part size limit
- GCP: 5 TB object size limit
- Azure: 4.75 TB blob size limit with 100 MB block size limit
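As a rough illustration of what the AWS figures above imply for multipart uploads, this plain-Python helper (an illustration, not an SDK call) computes how many parts an object needs at a chosen part size; note that AWS also caps an upload at 10,000 parts, so larger objects force larger parts:

```python
TB = 1024 ** 4
GB = 1024 ** 3

def parts_needed(object_size, part_size, max_parts=10_000):
    """Number of multipart-upload parts needed at the given part size.
    AWS caps uploads at 10,000 parts, so the part size must be large
    enough to fit big objects under that cap."""
    parts = -(-object_size // part_size)  # ceiling division
    if parts > max_parts:
        raise ValueError("part size too small for this object")
    return parts

# A 5 TB object uploaded in 1 GB parts needs 5120 parts:
print(parts_needed(5 * TB, 1 * GB))  # → 5120
```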
In object storage everything is stored under containers, also called buckets. Containers can be used to organize the data or provide access to it but, unlike a typical file system architecture, buckets cannot be nested.
Note that AWS and GCP refer to these containers as buckets, while Azure actually calls them containers.
- AWS: 1000 buckets per account
- GCP: No documented limit on the number of buckets, but there are rate limits on bucket operations.
- Azure: No limit on the number of containers
Each cloud solution provides different storage tiers based on your needs.
- S3 Standard: Data is stored redundantly across multiple devices in multiple facilities and is designed to sustain the concurrent loss of two facilities, with 99.99% availability and 99.999999999% durability.
- S3 Intelligent-Tiering: Designed to optimize costs by automatically transitioning data to the most cost-effective access tier, without performance impact or operational overhead.
- S3 Standard-IA: Used for data that is accessed less frequently but requires rapid access when needed. Lower storage fee than S3 Standard, but you are charged a retrieval fee.
- S3 One Zone-IA: Same as Standard-IA, but data is stored in only one availability zone and will be lost if that availability zone is destroyed.
- S3 Glacier: Cheap storage suitable for archival data or infrequently accessed data.
- S3 Glacier Deep Archive: Lowest cost storage, used for data archival and retention which may be accessed only twice a year.
- Multi-Regional Storage: Typically used for storing data that is frequently accessed (“hot” objects) around the world, such as serving website content, streaming videos, or gaming and mobile applications.
- Regional Storage: Data is stored in the same region as the Google Cloud resources (such as Dataproc) that use it. Has a higher SLA than multi-regional (99.99%).
- Nearline Storage: Available in both multi-regional and regional locations. Very low-cost storage used for archival or infrequently accessed data, with higher operation and data-retrieval costs.
- Coldline Storage: Lowest cost storage, used for data archival and retention which may be accessed only once or twice a year.
- Hot: Designed for frequently accessed data. Higher storage costs but lower retrieval costs.
- Cool: Designed for data that is typically accessed about once a month. It has lower storage costs and higher retrieval costs compared to hot storage.
- Archive: Long term backup solution with the cheapest storage costs and highest retrieval costs.
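The three tier lists above line up roughly by access pattern. The sketch below captures that correspondence in a small lookup table (tier names only, taken from the lists above; pricing, SLAs and semantics still differ between clouds):

```python
# Roughly comparable storage tiers across the three clouds,
# grouped by access pattern. Names come from each provider's
# tier list; this is a coarse mapping, not an exact equivalence.
TIERS = {
    "hot":     {"aws": "S3 Standard",    "gcp": "Multi-Regional / Regional", "azure": "Hot"},
    "cool":    {"aws": "S3 Standard-IA", "gcp": "Nearline",                  "azure": "Cool"},
    "archive": {"aws": "S3 Glacier",     "gcp": "Coldline",                  "azure": "Archive"},
}

def tier_name(access_pattern, cloud):
    """Look up a roughly comparable storage class for an access
    pattern ('hot', 'cool', 'archive') on a given cloud."""
    return TIERS[access_pattern][cloud]

print(tier_name("archive", "gcp"))  # → Coldline
```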
Each cloud provider has multiple data centers, facilities and availability zones divided by regions. Usually, a specific region is chosen for better latency, and multiple regions are used for HA / geo-redundancy. You can find more details about each cloud provider’s storage regions below:
AWS, GCP and Azure combined have thousands of services, not limited to storage: they include compute, databases, data analytics, traditional data storage, AI, machine learning, IoT, networking, IAM, developer tools, migration, and more. Here is a cheat sheet that I follow for GCP. As mentioned before, we are only going to discuss actual data storage services.
AWS provides the Simple Storage Service (S3) and S3 Glacier, GCP uses its Cloud Storage service, and Azure uses Blob Storage. All of these services provide a massively scalable storage namespace for unstructured data, along with their own metadata engines.
This is where the architectures of the clouds deviate from each other. Each cloud has its own hierarchy. Be aware that we are only discussing the resource hierarchy for object storage solutions; for other services, this might be different.
AWS: Everything in AWS lives under an “account”. In a single account there is one S3 service, which holds all the buckets and corresponding objects. Users and groups can be created under this account, and an administrator can give them access to the S3 service and underlying buckets using permissions, policies, etc. (discussed later). There is no hard limit on the amount of data that can be stored under one account. The only limit is on the number of buckets, which defaults to 100 but can be increased to 1,000.
GCP: GCP’s hierarchy model is ‘projects’. A project can be used to organize all your Google Cloud services/resources. Each project has its own set of resources, and all projects are eventually linked to a domain. In the image below, there is a folder for each department, and each folder holds multiple projects; depending on requirements and current usage, each project can use different resources, and the image shows each project’s current resource utilization. It’s important to note that every service is available to every project. Each project has its own set of users, groups, permissions, etc. By default you can create ~20 projects on GCP; this limit can be increased on request. I have not seen any storage limits specified by GCP except the 5 TB single-object size limit.
Azure: Azure is different from both GCP and AWS. In Azure we have the concept of storage accounts. An Azure storage-account provides a unique namespace for all your storage. This entity only consists of data storage. All other services can be accessed by the user and are considered as separate entities from storage accounts. Authentication and authorization are managed by the storage account.
A storage account is limited to 2 PB of storage for the US and Europe, and 500 TB for all other regions, including the UK. The number of storage accounts per region per subscription, including both standard and premium accounts, is 250.
All cloud providers have the option of console access and programmatic access.
- AWS: Console using aws.amazon.com and programmatic using AWS CLI
- GCP: Console using cloud.google.com and programmatic access using gsutil
- Azure: Console access using portal.azure.com and programmatic access using AZ CLI
Identity and Access Management
Information security should ensure proper data flow, at the right level: per the CIA triad, you shouldn’t be able to view or change data you are not authorized to, and you should be able to access the data you have a right to. This ensures confidentiality, integrity and availability (CIA). The AAA model of security calls for authentication, authorization and accounting; here, we will cover authentication and authorization. There are other things to keep in mind while designing secure systems; to learn more about the design considerations, I highly recommend reading the security design principles by OWASP and the OWASP Top 10.
AWS, GCP and Azure provide solid security products with reliable security features. Each one has its own way of providing access to the storage services. I will give an overview of how users can interact with the storage services; there is a lot more going on in the background than what is discussed here. For our purpose, we will stick to everything needed to use the storage services. I will assume you already have AWS, GCP and Azure accounts with the domain configured (where needed). This time I will use a top-down approach:
| | AWS | GCP | Azure |
|---|---|---|---|
| Underlying service | AWS IAM | GCP IAM | AAD, ADDS, AADDS |
| Entities | Users/groups per account | Users/groups per domain per project | Users/groups per domain |
| Authentication | Access keys / secret keys | Access keys / secret keys | Storage endpoint, access key |
| Authorization | Roles, permissions, policies | Cloud IAM permissions, access control lists (ACLs), signed URLs, signed policy documents | Domain user permissions, shared keys, shared access signatures |
| Required details for operations | Credentials, bucket name, authorization | Credentials, bucket name, authorization | Credentials, storage account name, container name |
AWS: AWS Identity and Access Management (IAM) is an AWS web service that helps you securely manage all your resources. You can use IAM to create IAM entities (users, groups, roles) and then grant them access to various services using policies. IAM handles both authentication and authorization for users, groups and resources. Other clouds can have multiple IAM services for multiple entities, but in AWS a single account has only one point of authentication and authorization.
GCP: GCP IAM is similar to AWS IAM, but every project has its own IAM portal and its own set of IAM entities (users, groups, resources).
Azure: Azure uses the same domain services as Microsoft and is known for a very stable authentication service. Azure supports three types of services: Azure AD (AAD), Active Directory Domain Services (ADDS, used with Windows Server 2016/2012 via DCPromo) and Azure Active Directory Domain Services (AADDS, managed domain services).
Azure AD is the most modern of the three services and should be used for any enterprise solution. It can sync with the cloud as well as on-premises services, and supports various authentication modes such as cloud-only, password hash sync + seamless SSO, pass-through authentication + seamless SSO, ADFS, and third-party authentication providers. Once you have configured your AD, you use RBAC to allow your users to create storage accounts.
All cloud providers have the concept of users and groups. In AWS there is a single set of users and groups across an account. In GCP there is a single set of users and groups in every project. In Azure the users and groups depend upon how the domain was configured. Azure AD can sync all users from the domain or an admin can add users on the fly for their particular domain.
Credentials are how an end user proves their identity. By now you may have figured out that the services that let us create users also give us access to the storage services. This is true for AWS and GCP, but not for Azure.
For AWS and GCP, the respective IAM services let us generate an access key / secret key pair for any user. Users can later use these keys to authenticate to cloud services, including AWS S3 and GCP Cloud Storage. For Azure, authentication for the containers is managed by the storage account: when a storage account is created, a set of keys and an endpoint are created along with it, and those keys and the endpoint, or the domain credentials, are used for authentication.
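With the AWS-style key pair, the access key identifies the caller while the secret key never travels with the request; instead it seeds an HMAC chain that signs each request (AWS Signature Version 4). Here is a condensed, stdlib-only sketch of just the key-derivation step (the full algorithm also builds a canonical request and a string to sign):

```python
import hashlib
import hmac

def sigv4_signing_key(secret_key, date, region, service="s3"):
    """Derive the AWS Signature Version 4 signing key: an HMAC-SHA256
    chain over date, region and service, seeded by the secret key."""
    def sign(key, msg):
        return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()
    k_date = sign(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = sign(k_date, region)
    k_service = sign(k_region, service)
    return sign(k_service, "aws4_request")

# Illustrative values only; a real secret key comes from IAM.
key = sigv4_signing_key("EXAMPLE-SECRET", "20240101", "us-east-1")
print(key.hex())
```

Because the signing key is scoped to a date, region and service, a leaked signature is far less useful than a leaked secret key.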
Once a user has proved their identity, they need proper access rights to interact with the S3 buckets or GCP buckets or Azure containers.
AWS: In AWS this can be done in multiple ways. A user can first be given access to the S3 service using roles/permissions/policies, and can then be given bucket-level permissions using bucket policies or ACLs. Here is a small tutorial on how a user can grant permissions on an S3 bucket. There are many other ways to access buckets, but it’s always good to use some kind of authentication and authorization.
GCP: In GCP every project has its own IAM instance. Similar to AWS, you can control who can access a resource and how much access they have. For our use case, this can be done using Cloud IAM permissions, access control lists (ACLs), signed URLs or signed policy documents. GCP has very thorough guides and documentation on these topics. Here is the list of permissions that you might want to use.
Azure: Azure has a lot of moving pieces, given that it uses Azure AD as the default authentication mechanism. For now, we will assume you are already authenticated to AD and only need to access the resources inside a storage account. Every storage account has its own IAM, through which you can grant a domain user permissions to access resources under the storage account. You can also use shared keys or shared access signatures for authorization.
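Azure's shared-key and shared-access-signature schemes are also HMAC-based: the storage account key (a base64 string) signs a canonicalized "string to sign" built from the request. The sketch below shows only the signing step, with the real string-to-sign layout (which has many required fields) simplified away:

```python
import base64
import hashlib
import hmac

def sign_string(account_key_b64, string_to_sign):
    """Sign a canonicalized string with a base64-encoded storage
    account key, as Azure shared-key / SAS authorization does.
    The exact string-to-sign layout is simplified away here."""
    key = base64.b64decode(account_key_b64)
    digest = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode()

# Illustrative key only; a real one comes from the storage account.
fake_key = base64.b64encode(b"not-a-real-account-key").decode()
print(sign_string(fake_key, "GET\n/container/blob"))
```

For a SAS, the resulting signature is appended to the URL as a query parameter, so the account key itself is never handed to the client.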
Required Details for Operations
Now that we are authenticated and authorized to our storage services, we need some details to actually access our resources. Below are the details required for programmatic access:
- AWS S3: Access Key, Secret Key, Bucket name, region(optional)
- GCP Cloud Storage: Access Key, Secret Key, Bucket name
- Azure: Storage Account name, Storage endpoint, Access Key, container name
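As a sanity check before wiring up any SDK, the per-cloud required fields above can be captured in a small, hypothetical validation helper (field names are my shorthand for the list above, not SDK parameter names):

```python
# Required credential/addressing fields per cloud, from the list above.
REQUIRED = {
    "aws":   {"access_key", "secret_key", "bucket"},
    "gcp":   {"access_key", "secret_key", "bucket"},
    "azure": {"account_name", "endpoint", "access_key", "container"},
}

def missing_fields(cloud, details):
    """Return the fields still needed before programmatic access
    to the given cloud's object store can work."""
    return sorted(REQUIRED[cloud] - set(details))

print(missing_fields("azure", {"account_name": "demo", "access_key": "k"}))
# → ['container', 'endpoint']
```

The asymmetry is visible at a glance: AWS and GCP requests are addressed by bucket, while Azure also needs the storage account name and endpoint that scope the container.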
This concludes my take on the key differences I noticed in a multi-cloud storage environment while working with the multi-cloud data controller, Zenko.
Let me know what you think, or ask me a question on the forum.