Thursday, September 2, 2021

Cost Management in the Cloud

Cloud environments differ from on-premise environments in many ways, but one important difference is the focus on operational expense (OPEX) rather than capital expense (CAPEX). Many IT professionals are familiar with on-premise environments, which typically have lifecycle management policies for investment in hardware, software, and other infrastructure. Network connectivity on-premise is usually a fixed, predictable cost, and resource consumption is not a cost concern: hypervisor and storage equipment cost the same regardless of how much capacity is consumed. What is different in the cloud? The cloud pay-as-you-go model requires thinking differently about costs. Based on my experience with design, implementation, and management of projects in Amazon Web Services (AWS), these are some cost concerns to be aware of:

  • Cost Tools. When planning a new cloud deployment or migration, prepare a cost estimate using the AWS Pricing Calculator. The proposed design should meet the project's budget constraints, and an alert should be created in AWS Budgets to flag spending that drifts from the budget. AWS Cost Explorer can be used to further analyze cost and usage of AWS resources.
  • Resource Sizing. In the cloud pay-as-you-go model, the size (CPU, memory, disk), performance (e.g. IOPS, network throughput), and number of virtual machines provisioned all affect cost. In an on-premise data center, the virtualization platform and the storage area network (SAN) or network attached storage (NAS) are typically paid for up front and sized to support the maximum workload, with some headroom for growth (almost always resulting in over-provisioned resources). Analyze workload requirements and size cloud resources appropriately. Use Auto Scaling Groups to automatically increase or decrease capacity based on demand, such as the number of connections, CPU, or memory utilization.
  • Shared versus Dedicated Hosts. When using Infrastructure as a Service (IaaS) in AWS, Elastic Compute Cloud (EC2) instances can be launched on shared infrastructure or on Dedicated Hosts. Shared infrastructure is cheaper; however, be aware of compliance or licensing requirements that may require Dedicated Hosts. For example, the DoD Cloud Computing Security Requirements Guide requires dedicated hosts for Impact Level 5 workloads. Another reason for choosing dedicated hosting is software licensed by the number of sockets or physical cores of the host.
  • Commitment. In AWS, cost savings of up to 72% can be realized by using Reserved Instances (RI) if you are able to commit to a 1-year or 3-year term. Examine workloads: look for steady-state applications (e.g. authentication and authorization services, log collection and analysis, or any web application that is typically "always on") and make use of RIs. For ad-hoc or periodically scheduled workloads such as batch processes, On-Demand or Spot instances can be used. If the workload can tolerate interruptions, Spot instances offer significant savings compared to On-Demand instances. Spot instances are also a good choice for Auto Scaling Groups.
  • Egress and Cross-Region Charges. AWS charges fees for data egress from its network to the Internet as well as for data transfer between Regions within its network. There are also hourly charges for Transit Gateway (TGW) attachments and data transfer charges for traffic that crosses a TGW or, in most cases, a VPC peering connection. On-premise network connectivity is typically a fixed monthly charge regardless of how much bandwidth is consumed. Variable data transfer charges are unpredictable, which makes them difficult to factor into the cost estimate; however, in my experience egress charges amount to 1 to 3% of overall cloud spending.
  • Tags. Tags attached to AWS resources are useful for configuration management, cost reporting, and cost efficiency. For example, by assigning a Tag with Key "Environment" and Value "Dev" to development resources, Amazon EventBridge and AWS Lambda can be used to turn those resources off after business hours, realizing cost savings of 50% or more on those resources. By defining and activating a Tag for "Cost Center" as a User-Defined Cost Allocation Tag in the AWS Billing and Cost Management console, costs can easily be tracked by customer, project, and/or team.
  • Abandoned Resources. Unused resources such as EC2 instances, Simple Storage Service (S3) buckets, Elastic Load Balancers (ELB), and databases in Amazon Relational Database Service (RDS) and DynamoDB can continue to incur charges even though they are not in use. Use the cost tools mentioned above and audit monthly invoices to look for unused resources for which you are being charged.
  • Storage Tiering. AWS shared storage services such as S3 and Elastic File System (EFS) offer automatic tiering ("Intelligent-Tiering") and/or user-defined lifecycle rules that migrate data to a cheaper storage class (e.g. "Infrequent Access") based on object age or time since last access. Amazon FSx for Windows File Server and Amazon FSx for Lustre currently do not offer a storage tiering feature.
  • Storage Lifecycle Management. Amazon Elastic Block Store (EBS) snapshots can be used to protect boot and data volumes attached to EC2 instances. In many environments I've found that, because they are cheap, EBS snapshots are taken too frequently and retained indefinitely, which over time increases cost. Analyze application requirements such as Recovery Point Objective (RPO) and retention policy, and implement an EBS snapshot management policy using Amazon Data Lifecycle Manager. Amazon Data Lifecycle Manager can be used to control the frequency of EBS snapshots and automatically delete them once they reach a specified age, thus meeting the RPO and retention requirements.
  • Serverless. AWS serverless offerings such as AWS Lambda, AWS Fargate, and Amazon API Gateway feature automatic scaling, built-in high availability, and a pay-for-use billing model to increase agility and optimize cost. By running a serverless application, we reduce costs by eliminating the need to provision EC2 instances, and we reduce operational expense by eliminating the need to patch, secure, back up, and monitor the underlying compute resources. We also reduce our cybersecurity exposure and the labor expense of hardening operating systems and web servers (e.g. IIS, Apache, Nginx, etc.).
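
For the Commitment item in the list above, a rough way to decide whether a workload is "steady-state" enough for a commitment is to compare it against paying On-Demand only for the hours used. The sketch below is a simplification under stated assumptions: the hourly rate and discount are hypothetical values rather than AWS prices, the commitment is treated as billed for every hour of the term, and upfront payment options are ignored.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def on_demand_cost(hourly_rate: float, utilization: float, years: int = 1) -> float:
    """Cost of paying the On-Demand rate only for the fraction of hours used."""
    return hourly_rate * utilization * HOURS_PER_YEAR * years


def reserved_cost(hourly_rate: float, discount: float, years: int = 1) -> float:
    """Cost of a commitment billed for every hour of the term at a discounted rate."""
    return hourly_rate * (1 - discount) * HOURS_PER_YEAR * years


def break_even_utilization(discount: float) -> float:
    """Utilization above which the commitment is cheaper than On-Demand."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in the range [0, 1)")
    return 1 - discount


if __name__ == "__main__":
    rate, discount = 0.10, 0.40  # hypothetical values, not AWS prices
    print(f"break-even utilization: {break_even_utilization(discount):.0%}")
    for utilization in (0.25, 0.60, 1.00):
        od = on_demand_cost(rate, utilization)
        ri = reserved_cost(rate, discount)
        print(f"{utilization:.0%} used: on-demand ${od:,.2f} vs reserved ${ri:,.2f}")
```

Under these assumptions the break-even point is simply one minus the discount: with a 40% discount, a workload that runs less than 60% of the time is cheaper On-Demand, which is why batch and ad-hoc workloads are poor candidates for commitments.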
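
For the Tags item in the list above, the savings from stopping tagged development resources outside business hours can be estimated with simple arithmetic. This is a minimal sketch, not the EventBridge/Lambda automation itself: the resource dictionaries, the `select_by_tag` helper, and the 12-hours-per-weekday schedule are all hypothetical and used only for illustration.

```python
HOURS_PER_WEEK = 24 * 7  # 168


def select_by_tag(resources, key, value):
    """Return the resources whose tags contain the given key/value pair."""
    return [r for r in resources if r.get("tags", {}).get(key) == value]


def scheduled_savings_fraction(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of always-on compute hours avoided by running only on a schedule."""
    if not 0 <= hours_per_day <= 24 or not 0 <= days_per_week <= 7:
        raise ValueError("schedule must fit within a 24-hour day and 7-day week")
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


if __name__ == "__main__":
    # Hypothetical inventory; real tags would come from the AWS APIs.
    inventory = [
        {"id": "i-dev-1", "tags": {"Environment": "Dev"}},
        {"id": "i-prod-1", "tags": {"Environment": "Prod"}},
        {"id": "i-untagged"},
    ]
    dev = select_by_tag(inventory, "Environment", "Dev")
    print("would stop after hours:", [r["id"] for r in dev])
    print(f"compute hours avoided: {scheduled_savings_fraction(12, 5):.0%}")
```

With this hypothetical schedule, instances run 60 of 168 hours per week, so roughly 64% of compute hours are avoided. The estimate covers compute charges only; attached EBS volumes continue to be billed while an instance is stopped.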
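
For the Storage Lifecycle Management item in the list above, it can help to work out how many snapshots a policy keeps on hand before configuring it. The sketch below assumes the snapshot interval is chosen to be no longer than the RPO and that snapshots older than the retention period are deleted; the function names and example values are hypothetical, and it does not call Amazon Data Lifecycle Manager.

```python
import math
from datetime import datetime, timedelta


def snapshots_retained(interval_hours: float, retention_days: float) -> int:
    """Approximate number of snapshots per volume once the policy is in steady state."""
    if interval_hours <= 0 or retention_days < 0:
        raise ValueError("interval must be positive and retention non-negative")
    return math.ceil(retention_days * 24 / interval_hours)


def expired(snapshot_times, now, retention_days):
    """Return the snapshot timestamps that fall outside the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [t for t in snapshot_times if t < cutoff]


if __name__ == "__main__":
    # Hypothetical requirements: 4-hour RPO, 14-day retention.
    print("snapshots per volume:", snapshots_retained(4, 14))

    now = datetime(2021, 9, 2)
    taken = [now - timedelta(days=d) for d in (1, 10, 40)]
    print("eligible for deletion:", expired(taken, now, retention_days=14))
```

Because EBS snapshots are incremental, cost does not grow linearly with the snapshot count, but the count is still a useful check that a policy is no more aggressive than the RPO and retention requirements call for.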
Resources: