Startup guide to data cost optimization – summary

I have been reading a lot about FinOps and cloud cost optimization these days, and I came across AWS's short e-book about data cost optimization.

Cost optimization is part of AWS's Well-Architected Framework. When we think about cost optimization, we usually consider only compute resources, but there are significant savings to be found beyond that – storage, network, etc.

Below are the six sections that appear in the e-book, with some comments –

Optimize the cost of information infrastructure – the main point in this section is to use Graviton instances where applicable.
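For concreteness, here is a minimal boto3 sketch of launching a Graviton-based (arm64) instance. The AMI ID is a hypothetical placeholder and must point to an arm64 image, and m7g.large is just one example of a Graviton instance type.

```python
import boto3

ec2 = boto3.client("ec2")

# Launch a Graviton-based (arm64) instance; the AMI ID below is a placeholder
# and must point to an arm64 image for the instance type to be compatible.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical arm64 AMI
    InstanceType="m7g.large",         # Graviton3 general-purpose instance
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```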

Decouple data storage from compute – five suggestions here, which are pretty standard –

  1. Compress data when applicable, and use optimal data structures for your task.
  2. Consider data temperature when choosing a data store and storage class – use the suitable S3 storage class and manage it with a lifecycle policy (see the sketch after this list).
  3. Use low-cost compute resources, such as Spot Instances, when applicable – I have some dissonance here, since I’m not sure Spot Instances are attractive these days (see here), especially given the overhead of handling interrupted instances.
  4. Deploy compute close to data to reduce data transfer costs – trivial.
  5. Use Amazon S3 Select and Amazon S3 Glacier Select to reduce data retrieval – Amazon S3 Select has several limitations (see here), so I’m not sure it is worth the effort; it is better to query via Athena.
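
As an illustration of point 2, here is a minimal boto3 sketch of a lifecycle policy that moves cooling data to cheaper storage classes and eventually expires it. The bucket name, prefix, and day thresholds are placeholders for illustration only.

```python
import boto3

s3 = boto3.client("s3")

# Transition objects under a prefix to cheaper storage classes as they cool
# down, then expire them; bucket, prefix, and thresholds are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "cool-down-raw-events",
                "Filter": {"Prefix": "raw/events/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm data
                    {"Days": 90, "StorageClass": "GLACIER"},      # cold data
                ],
                "Expiration": {"Days": 365},  # delete after a year
            }
        ]
    },
)
```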

Plan and provision capacity for predictable workload usage

  1. Choosing the right instance type based on workload pattern and growth – this is common sense. You’ll save a little less if you purchase Convertible Reserved Instances; however, in a fast-changing startup environment, their flexibility lowers the chance that the commitment will go underutilized.
  2. Rightsizing based on average or medium workload usage – this contradicts the best practices described in the Cloud FinOps book, so I’m a bit hesitant here.
  3. Using automatic scaling capabilities to meet peak demand – this is the most relevant advice in this section. Use Auto Scaling groups or similar to balance both performance and cost (see the sketch after this list).
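
To make point 3 concrete, here is a minimal boto3 sketch of attaching a target-tracking scaling policy to an existing Auto Scaling group. The group name, policy name, and target CPU value are placeholders.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Attach a target-tracking policy so capacity follows demand;
# the group name and target CPU value are placeholders.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="etl-workers-asg",  # hypothetical group
    PolicyName="keep-cpu-around-60",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,  # scale out/in to keep average CPU near 60%
    },
)
```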

Access capacity on demand for unpredictable workloads

  1. Use Amazon Athena for ad hoc SQL workloads – as mentioned above, I prefer Athena over Amazon S3 Select (see the sketch after this list).
  2. Use AWS Glue instead of Amazon EMR for infrequent ETL jobs – I don’t have a strong opinion here, but if you have a data strategy in mind, I would try to align with it. Additionally, I feel that other AWS services can be even easier and more cost-effective to work with – for example, Apache Spark in Amazon Athena, Step Functions, etc.
  3. Use on-demand resources for transient workloads or short-term development and testing needs – having said that, you should still keep an eye on your production services, ensure they are utilized correctly and rightsize them if needed.
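
To illustrate the Athena preference from point 1, here is a minimal boto3 sketch of running an ad hoc query over data already sitting in S3. The database, table, and results location are placeholders.

```python
import time
import boto3

athena = boto3.client("athena")

# Run an ad hoc SQL query; database, table, and results bucket are placeholders.
query_id = athena.start_query_execution(
    QueryString="SELECT event_type, COUNT(*) FROM events GROUP BY event_type",
    QueryExecutionContext={"Database": "analytics"},  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes, then fetch the results.
while True:
    state = athena.get_query_execution(
        QueryExecutionId=query_id
    )["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)
```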

Avoid data duplication with a centralized storage layer

Implement a central storage layer to share data among tenants – I would boil this down to “have a data strategy” – where you are, where you want to go, etc. – which is not trivial in the early startup days.

Leverage up to $100,000 in AWS Activate credits

This might be a bit contradictory to the rest of the document, since it feels like free money and delays your concern about cloud costs.
