A few thoughts on the Cloud FinOps book

I just finished the “Cloud FinOps” book by J.R. Storment and Mike Fuller, and here are a few thoughts –

  1. At first, I wondered whether I should read the 1st edition, which I had easy access to, or the 2nd, which I had to buy. After reading a sample, I decided to buy the 2nd edition, and I am glad I did. This domain and community move quickly; a 2019 version would have been outdated and misleading.
  2. FinOps involves a paradigm shift – developers should consider not only the performance of their architecture (i.e., memory and CPU consumption, speed, etc.) but also the cost of the resources they use. Procurement is no longer done and approved solely by the finance team; developers’ decisions can significantly influence the cloud bill. FinOps teams bridge the engineering and finance teams (and more), speak the language of all parties, and bring additional skill sets and an overview of the entire organization.
  3. A general rule of thumb regarding commitments –
    1. Longer commitment period = lower price (higher discount): 3 years beats 1 year.
    2. More upfront = lower price (higher discount): full upfront beats partial upfront beats no upfront.
    3. More specific = lower price (higher discount): standard RI beats convertible RI beats SP; likewise, committing to a specific region beats staying flexible.
  4. The FinOps team should stay up to date on new cloud technologies and cost reduction options. I have been familiar with reserved and spot instances for a long time, but there are many other cost reduction bits and bytes to pay attention to. For example, the following two points –
    1. When purchasing Savings Plans (SP), which are monetary commitments as opposed to resource-unit commitments, the spend amount you commit to is post-discount. Moreover, AWS will apply the SP to the resources that yield the highest discount first. This implies that the blended discount rate diminishes as you commit to more money (see the sketch after this list).
    2. CloudFront security savings bundle (here) is a savings plan that ties together the usage of CloudFront and WAF. The book predicts that such plans, i.e., plans combining multiple products’ usage, will become common soon.
  5. Commitments (e.g., SP, RI) are one of many ways to reduce costs. Removing idle resources (e.g., unattached drives), using the correct storage classes (e.g., infrequent access, Glacier), or making architecture changes (e.g., rightsizing, moving from servers to serverless, going via VPC endpoints, etc.) can help avoid and reduce cost. Those activities can happen in parallel – a centralized FinOps team manages commitments (aka cost reduction), and decentralized engineering teams optimize the resources they use (aka cost avoidance). Ideally, it is a tango – each team moves a little step at a time to optimize their part.
  6. The FinOps domain-specific knowledge goes even further – for example, costs that engineers tend to miss or wrongly estimate, e.g., network traffic costs, number of events, or data storage costs.
  7. The inform phase is part of the FinOps lifecycle – making the data available to the relevant participants. The Prius effect, i.e., real-time feedback, instantly influences behavior even without explicit recommendations or guidance. Visualizations (done right) can help understand and react to the data better. A point emphasized multiple times in the book – put the data in the path of the engineers or any other stakeholder. Don’t ask them to log in to a different system to review the data; integrate with existing systems they use regularly.
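
To make the Savings Plans point above concrete, here is a minimal sketch with made-up spend and discount numbers (not AWS's actual rates) of why the blended discount diminishes as the commitment grows – the committed dollars are applied to the highest-discount usage first:

```python
# Hypothetical illustration - all numbers are invented for the sketch.
# (on-demand spend per hour, SP discount rate) per resource group,
# ordered the way AWS applies a Savings Plan: highest discount first.
resource_groups = [
    (10.0, 0.40),
    (10.0, 0.25),
    (10.0, 0.10),
]

def blended_discount(commitment: float) -> float:
    """Blended discount for a given post-discount hourly commitment."""
    remaining = commitment
    covered_on_demand = 0.0
    for on_demand, rate in resource_groups:
        sp_cost = on_demand * (1 - rate)   # post-discount cost of the group
        applied = min(remaining, sp_cost)
        covered_on_demand += applied / (1 - rate)
        remaining -= applied
        if remaining <= 0:
            break
    spent = commitment - max(remaining, 0.0)
    return 1 - spent / covered_on_demand

for c in [5, 15, 25]:
    print(f"commit ${c}/hour -> blended discount {blended_discount(c):.0%}")
# the blended discount drops (here: 40% -> 31% -> 25%) as the commitment grows
```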

A few resources I found helpful –

  1. FinOps Foundation website – includes many resources and community knowledge – https://www.finops.org/introduction/what-is-finops/
  2. FinOps podcast – https://www.finops.org/community/finops-podcast/
  3. Infracost – lets engineers see a cost breakdown in the terminal, VS Code, or pull requests and understand costs before making changes. https://www.infracost.io/
  4. Cloud Custodian – “Cloud Custodian is a tool that unifies the dozens of tools and scripts most organizations use for managing their public cloud accounts into one open source tool” – https://cloudcustodian.io/
  5. Finout – a holistic cost management solution for your cloud. I recently participated in a demo, and it looks super interesting. https://www.finout.io/
  6. Startup guide to data cost optimization – my post summarizing AWS’s ebook about data cost optimization for startups – https://tomron.net/2023/06/01/startup-guide-to-data-cost-optimization-summary/
  7. Twitter thread I wrote in Hebrew about the book – https://twitter.com/tomron696/status/1657686198327062529

Startup guide to data cost optimization – summary

I have been reading a lot about FinOps and cloud cost optimization these days, and I came across AWS's short ebook about data cost optimization.

Cost optimization is part of AWS's Well-Architected Framework. When we think about cost optimization, we usually only consider compute resources, while significant optimizations can go beyond that – storage, network, etc.

Below is a walk through the six sections that appear in the ebook, with some of my comments –

Optimize the cost of information infrastructure – the main point in this section is to use Graviton instances where applicable.

Decouple data storage from compute – 5 suggestions here, which are pretty standard –

  1. Compress data when applicable, and use optimal data structures for your task.
  2. Consider data temperature when choosing a data store and storage class – use the suitable S3 storage class and manage it using a lifecycle policy (see the sketch after this list).
  3. Use low-cost compute resources, such as Spot Instances, when applicable – I have some dissonance here since I’m not sure that spot instances are attractive these days (see here), specifically with the overhead of handling preempted instances.
  4. Deploy compute close to data to reduce data transfer costs – trivial.
  5. Use Amazon S3 Select and Amazon S3 Glacier Select to reduce data retrieval – Amazon S3 Select has several limitations (see here), so I’m not sure it is worth the effort; it may be better to query via Athena.
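
As a hedged illustration of the storage-class suggestion above, here is a minimal boto3 sketch (bucket name, prefix, and day thresholds are all hypothetical) of a lifecycle policy that transitions cooling data to cheaper classes instead of paying the STANDARD rate forever:

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-analytics-bucket",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "cool-down-raw-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},  # hypothetical prefix
                "Transitions": [
                    # move to infrequent access after a month...
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    # ...and archive to Glacier after half a year
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```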

Plan and provision capacity for predictable workload usage

  1. Choosing the right instance type based on workload pattern and growth – common sense. You’ll save a little less if you purchase convertible reserved instances; however, in a fast-changing startup environment, there is a lower chance the commitment will end up underutilized.
  2. Deploying rightsizing based on average or median workload usage – this contradicts the best practices described in the Cloud FinOps book, so I’m a bit hesitant here.
  3. Using automatic scaling capabilities to meet peak demand – the most relevant advice in this section. Use auto-scaling groups or similar to accommodate both performance and cost, as sketched below.
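
A minimal boto3 sketch of the auto-scaling point (group name, policy name, and target value are hypothetical): a target tracking policy that keeps average CPU around 50%, so you pay for capacity only when the workload actually needs it –

```python
import boto3

autoscaling = boto3.client("autoscaling")
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",     # hypothetical group name
    PolicyName="keep-cpu-around-50",    # hypothetical policy name
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        # scale in/out to keep the group's average CPU near the target
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```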

Access capacity on demand for unpredictable workloads

  1. Use Amazon Athena for ad hoc SQL workloads – as mentioned above, I prefer Athena over Amazon S3 Select (see the sketch after this list).
  2. Use AWS Glue instead of Amazon EMR for infrequent ETL jobs – I don’t have a strong opinion here, but if you have a data strategy in mind, I would try to adjust to it. Additionally, I feel that other AWS services can be even easier and more cost-effective to work with – for example, Apache Spark in Amazon Athena, Step Functions, etc.
  3. Use on-demand resources for transient workloads or short-term development and testing needs – having said that, you should still keep an eye on your production services, ensure they are utilized correctly, and rightsize them if needed.
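
To illustrate the Athena preference, a minimal boto3 sketch (database, table, and output bucket are hypothetical) of an ad hoc SQL query straight over data in S3 – you pay per data scanned, with no cluster to provision:

```python
import boto3

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM access_logs GROUP BY status",
    QueryExecutionContext={"Database": "analytics"},            # hypothetical
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
# poll get_query_execution with this id until the query completes
print(response["QueryExecutionId"])
```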

Avoid data duplication with a centralized storage layer

Implement a central storage layer to share data among tenants – I would shorten it to “have a data strategy” – where you are, where you want to go, etc. – which is not trivial in early startup days.

Leverage up to $100,000 in AWS Activate credits

This might be a bit contradictory to the rest of the document since it feels like free money and delays your concern about cloud costs.

5 interesting things (16/05/2023)

Women’s health research lacks funding – these charts show how – not a proper tech link, but I liked the infographic very much (though it misses some hover features) and believe this is an important topic.

https://www.nature.com/immersive/d41586-023-01475-2/index.html

Farewell to the Era of Cheap EC2 Spot Instances – spot instances were the holy grail of cloud cost reduction, though they required a suitable architecture to accommodate them. While the cloud vendors suggest more and more ways to reduce cost, this well seems to be drying up, and the claim is backed with data about 5.5 million spot instances the author spun up over almost seven months. I don’t know if it is the end of spot instances, but something is going on.

https://pauley.me/post/2023/spot-price-trends/

Uptime Guarantees — A Pragmatic Perspective – a great down-to-earth analysis of uptime and the meaning of each additional nine –

https://world.hey.com/itzy/uptime-guarantees-a-pragmatic-perspective-736d7ea4

Evidence – Business Intelligence as Code – this project intrigued me. Developers often struggle with creating visualizations; the UI of most tools is confusing and complex for sporadic use. Maybe Evidence will unlock it for developers –

https://github.com/evidence-dev/evidence

How to Debug – “The Missing Semester of Your CS Education” (here) influenced how I think of juniors and recently graduated employees. Debugging is a skill you usually don’t learn during formal studies, yet it is essential in the industry. This post is a good starting point in the debugging journey –

https://philbooth.me/blog/how-to-debug

5 interesting things (25/04/2023)

Load balancing – excellent explanations and visualizations of load balancing and different approaches. I wish for follow-up posts about caching and stickiness, which influence performance, and about practical setups – how to set up load balancers in AWS under those considerations.

https://samwho.dev/load-balancing/

VisiData – a terminal interface for exploring and arranging tabular data. I played with this tool a bit; it is very promising and, at the same time, has a steep learning curve (think vi) that might keep people away.

https://www.visidata.org/

Software accessibility for users with Attention Deficit Disorder (ADHD) – software accessibility is a topic that I always try to keep in mind. The usual software accessibility patterns refer to visual impairment, e.g., color contrast, font size, etc. This post tackles the accessibility topic through the prism of users with ADHD, and I find it groundbreaking. The suggested patterns (e.g., recently opened items, subscription reminders, etc.) are primarily suitable UX for all users, not just those with ADHD.

https://uxdesign.cc/software-accessibility-for-users-with-attention-deficit-disorder-adhd-f32226e6037c

Minimum Viable Process – I liked the post very much, and the following point was the one I related to the most – the Minimum Viable Process is iterative: processes and procedures must be constantly refined. Processes should evolve along with the company and serve the company, rather than the company serving the process.

https://mollyg.substack.com/p/minimum-viable-process

Interactive Calendar Heatmaps with Python — The Easiest Way You’ll Find – always wanted to create a GitHub-like activity visualization? Great, use plotly-calplot for that. See the example here (and a small sketch after the link below) –

https://python.plainenglish.io/interactive-calendar-heatmaps-with-plotly-the-easieast-way-youll-find-5fc322125db7
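
If I read the plotly-calplot README correctly, producing the heatmap is roughly this (the data and column names are made up for the sketch):

```python
import pandas as pd
from plotly_calplot import calplot

# hypothetical daily activity data: one row per date
dates = pd.date_range("2023-01-01", "2023-12-31")
df = pd.DataFrame({"ds": dates, "value": [d.dayofweek for d in dates]})

# GitHub-style calendar heatmap: weeks as columns, weekdays as rows
fig = calplot(df, x="ds", y="value")
fig.show()
```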

Thoughts on Tech Debt

I have been thinking about tech debt for a while now and how to address it day to day. A few example dilemmas –

  • Should we update our Python version or a version of one of the packages we use?
  • We thought of a more efficient implementation for one of our functions. Should we invest time in it? When?
  • What task should we prioritize for the next sprint? 

How do we measure our technical debt? Are we getting better, and what does better even mean? Or are we at least not getting worse?

So I have compiled a small reading list that I can share with my team and align on terminology and ideas. Feel free to add your thoughts.

The 4 main types of technical debt – an explanation of Martin Fowler’s Technical Debt Quadrant, and a first step towards establishing a common language about technical debt.

https://blog.codacy.com/4-types-technical-debt/

The different types of technical debt – another piece in the puzzle of establishing common terminology. While the split into categories makes sense, I don’t entirely agree with the suggested fixes and impacts.

https://techdebtguide.com/types-of-technical-debt

The 25 Percent Rule for Tackling Technical Debt – a post from the Shopify engineering blog about their pillars of tech debt and a recommendation on how to invest time in those topics. It made me think about whether the discussion about tech debt should differ between companies of different stages or sizes.

https://shopify.engineering/technical-debt-25-percent-rule

A Framework for Prioritizing Tech Debt – this post tries to bring the analogy closer to real debt and examine the interest rate. If I took this analogy one step further, I would say that it is a loan, and the prioritization and remediations are the terms we choose for the loan. Sometimes those terms change, either because we refinance the loan, win a big sum and can pay it back, etc., or because something external changed, e.g., the interest rate.

https://www.maxcountryman.com/articles/a-framework-for-prioritizing-tech-debt

Tech Debt Isn’t a Burden, It’s a Strategic Lever for Success – the approach here is closer to the loan analogy – “view tech debt as a strategic lever for your organization’s success over time”. That, together with other points in this post, made me think about the relations and interactions between product debt and tech debt – are they correlated or independent, and what influences them? Are there any patterns in this duo, or maybe trio (with business debt)? I haven’t yet found something satisfying to read about the topic.

https://www.reforge.com/blog/managing-tech-debt

7 Top Metrics for Measuring Your Technical Debt – this post suggests several metrics to measure technical debt, such as code churn, cycle time, and code quality, and tools for measuring it. One issue is that it does not cover infrastructure debt or process debt, for example, the complexity of the deploy process, duplication between repositories, etc. Additionally, the underlying assumption in this post is that “tech debt is bad,” while I view it as a strategy or a trade-off. I also don’t believe that one size fits all – if you want to measure, choose the one thing that is most important and informative for you, and don’t worry if it changes over time.

https://dev.to/alexomeyer/8-top-metrics-for-measuring-your-technical-debt-5bnm

Exploratory Data Analysis Course – Draft

Last week I gave an extended version of my talk about box plots in Noa Cohen‘s Introduction to Data Science class at Azrieli College of Engineering Jerusalem. Slides can be found here.

The students are 3rd and 4th-year students, and some will become data scientists and analysts. Their questions and comments, along with my experience with junior data analysts, made me understand that a big gap they face in pursuing those positions and performing well is doing EDA – exploratory data analysis. This reminded me of “the missing semester of your CS education” – skills that are needed and sometimes perceived as common knowledge in the industry but are not taught or talked about in academia.

“Exploratory Data Analysis (EDA) is the crucial process of using summary statistics and graphical representations to perform preliminary investigations on data in order to uncover patterns, detect anomalies, test hypotheses, and verify assumptions.” (see more here). EDA plays an important role in the everyday life of anyone working with data – data scientists, analysts, and data engineers. It is often also relevant for managers and developers, helping them solve the issues they face better and more efficiently and communicate their work and findings.

I started rolling around in my head what an EDA course would look like –

Module 1 – Back to basics (3 weeks)

  1. Data types of variables, types of data
  2. Basic statistics and probability, correlation
  3. Anscombe’s quartet (a small sketch after this list)
  4. Hands-on lab – Python basics (pandas, numpy, etc.)
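
A minimal sketch of the Anscombe's quartet exercise, using the copy of the dataset that ships with seaborn: near-identical summary statistics, very different pictures –

```python
import seaborn as sns

# Anscombe's quartet is bundled with seaborn as a sample dataset
df = sns.load_dataset("anscombe")

# the means and correlations are (almost) identical across all four datasets...
for name, group in df.groupby("dataset"):
    print(name, round(group.x.mean(), 2), round(group.y.mean(), 2),
          round(group.x.corr(group.y), 3))

# ...but the scatter plots tell four very different stories
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2)
```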

Module 2 – Data visualization (3 weeks)

  1. Basic data visualizations and when to use them – pie charts, bar charts, etc.
  2. Theory of graphical representation (e.g., Grammar of Graphics or something more up-to-date about human perception)
  3. Beautiful lies – graphical caveats (e.g., box plots)
  4. Hands-on lab – Python data visualization packages (matplotlib, plotly, etc.).

Module 3 – Working with non-tabular data (4 weeks)

  1. Data exploration on textual data
  2. Time series – anomaly detection
  3. Data exploration on images

Module 4 – Missing data (2 weeks)

  1. Missing data patterns
  2. Imputations (a small sketch after this list)
  3. Hands-on lab – a combination of missing data / non-tabular data
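
A minimal pandas sketch of the imputation topic (toy data; a real lab would inspect the missingness pattern first):

```python
import numpy as np
import pandas as pd

# toy frame with missing numeric and categorical values
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "city": ["TLV", "JLM", None, "TLV"],
})

# two common (and simplistic) strategies - the right choice depends on
# whether data is missing completely at random, at random, or not at random
df["age"] = df["age"].fillna(df["age"].median())             # numeric: median
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])    # categorical: mode
print(df)
```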

Extras, if time allows –

  1. Working with unbalanced data
  2. Algorithmic fairness and biases
  3. Data exploration on graph data

I’m very open to exploring and discussing this topic more. Feel free to reach out – Twitter / LinkedIn

5 interesting things (03/11/2022)

How to communicate effectively as a developer – writing effectively is the second most important skill after reading effectively and one of the skills that can differentiate you and push you forward. If you read only one thing today, read this –

https://www.karlsutt.com/articles/communicating-effectively-as-a-developer/

26 AWS Security Best Practices to Adopt in Production – this is a periodic reminder to pay attention to our SecOps. This post is very well written, and the initial table of AWS security best practices by service is great.

https://sysdig.com/blog/26-aws-security-best-practices/

EVA Video Analytics System – “EVA is a new database system tailored for video analytics — think MySQL for videos.” Looks cool at first glance, and I can think of use cases for myself, yet I wonder if it could become production-grade.

https://github.com/georgia-tech-db/eva

I see it as somewhat complementary to – https://github.com/impira/docquery

Forestplot – “This package makes publication-ready forest plots easy to make out-of-the-box.” I like it when academia and technology meet, and this is really usable, also for data scientists’ day-to-day work. The next step would probably be deeper integration with scikit-learn and pandas.

https://github.com/lsys/forestplot

Bonus – Python DataViz cookbook – an easy way to navigate between the common Python visualization practices (i.e., via pandas vs. using matplotlib / plotly / seaborn). I would like to see it take the next step – controlling the colors, grid, etc. from the UI and then switching between the frameworks – but it’s a good starting point.

https://dataviz.dylancastillo.co/

roadmap.sh – it is not always clear how to level up your skills and what you should learn next (best practices, which technology, etc.). roadmap.sh attempts to create such roadmaps. While I don’t agree with everything there, I think the format and references are nice, and it is a good inspiration.

https://roadmap.sh/

Shameless plug – Growing A Python Developer (2021); I plan to write a small update in the near future.

Think outside of the Box Plot

Earlier today, I spoke at the DataTLV conference about box plots – what they expose, what they hide, and how they mislead. My slides can be found here, and the code used to generate the plots is here.

Key takeaways

  • Box plots show five summary statistics – min, max, median, Q1, and Q3.
  • The flaws of box plots can be divided into two kinds – data that is not present in the visualization (e.g., number of samples, distribution) and the visualization being counter-intuitive (e.g., quartiles are a hard concept to grasp).
  • I chose solutions that are easy to implement, either by leveraging existing packages’ code or by adding small tweaks. I used plotly (see the sketch after this list).
  • Aside from those adjustments, many times a box plot is just not the right graph for the job.
  • If the statistical literacy of your audience is not well founded, I would avoid using box plots.
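
A minimal sketch in the spirit of those easy tweaks, using plotly's built-in tips sample data: overlaying the raw points on the box so the sample size and distribution are no longer hidden –

```python
import plotly.express as px

df = px.data.tips()  # built-in sample dataset

# points="all" draws every observation next to its box, exposing both
# the number of samples and the shape of the distribution
fig = px.box(df, x="day", y="total_bill", points="all")
fig.show()
```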

Topics I didn’t talk about and worth mentioning

  • Mary Eleanor Hunt Spear – a data visualization specialist who pioneered the development of the bar chart and box plot. I had a slide about her but went too fast and skipped it. See here.
  • How percentiles are calculated – several methods exist, and different Python packages use different default methods (a small sketch below). Read more – http://jse.amstat.org/v14n3/langford.html
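
A small sketch of how much the method matters, via NumPy's `method` argument (added in NumPy 1.22; older versions call it `interpolation`):

```python
import numpy as np

data = [1, 2, 3, 4]
# the 25th percentile of the same four numbers under five methods
for method in ["linear", "lower", "higher", "nearest", "midpoint"]:
    print(f"{method:>8}: {np.percentile(data, 25, method=method)}")
# linear gives 1.75, lower gives 1.0, midpoint gives 1.5, etc.
```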

Resources I used to prepare the talk

5 interesting things (29/08/2022)

Human genetics 101 – a new blog about genetics by Nadav Brandes, who works at UCSF as part of the Ye lab. It is a very accessible read, even for non-biologists (like me :).

https://incrementally.net/2022/07/16/human-genetics-101/

It’s probably time to stop recommending Clean Code – a relatively old post (from 2020) discussing a book published in 2008. The book is a very common recommendation in the industry, and that is why I think this post is so important. It is detailed, gives good examples, and reminds us that everything has to be taken with a grain of salt. I agree with the concluding paragraphs – experienced developers will gain almost nothing from reading the book, and inexperienced developers would have a hard time separating the wheat from the chaff.

https://qntm.org/clean

Bonus – https://gordonc.bearblog.dev/dry-most-over-rated-programming-principle/

The many flavors of hashing – I like to go back to basics from time to time.

https://notes.volution.ro/v1/2022/07/notes/1290a79c/

Five Lessons Learned From Non-Profit Management That Apply to Tech Management – I like those mixes, where practices and ideas from one domain of someone’s life emerge in another. Those intersections are usually very productive and insightful.

https://medium.com/management-matters/5-lessons-learned-from-non-profit-management-that-apply-to-tech-management-add47980498a

Demystifying the Parquet File Format – I finally feel I understand how the parquet format works (although there are many more optimizations).

https://towardsdatascience.com/demystifying-the-parquet-file-format-13adb0206705

5 interesting things (22/07/2022)

I analyzed 1835 hospital price lists so you didn’t have to – this post had a few interesting things. First, learning about CMS’s price transparency law. In Israel, this is a non-issue since the healthcare system works differently, and most procedures are covered by the HMOs, so there is no such concern. I would be interested in further analysis of the missing or non-missing prices, i.e., for which CPT codes most hospitals have prices, for which CPT codes most hospitals don’t, and whether we can cluster them (e.g., cardio codes? women’s health? procedures usually done on elderly people?). This dataset has great potential, and I agree with most of the points in the “Dead On Arrival: What the CMS law got wrong” section.

https://www.dolthub.com/blog/2022-07-01-hospitals-compliance/

How to design better APIs – there are several things I liked in this post – first, it is written very clearly and gives both positive and negative examples. Second, it is language agnostic. The last tip – “Allow expanding resources” – was mind-blowing to me: so simple, yet I never thought of adding such an argument (a small sketch follows the link below). Now I miss a cookiecutter template to implement all that good advice.

https://r.bluethl.net/how-to-design-better-apis
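
A minimal sketch of that expand-resources idea (FastAPI, with made-up data and field names – not the post's own code): the client opts into embedding the related resource instead of making a second round trip –

```python
from typing import Optional

from fastapi import FastAPI

app = FastAPI()

# toy in-memory "tables" - purely hypothetical data
ORDERS = {1: {"id": 1, "customer_id": 7, "total": 42}}
CUSTOMERS = {7: {"id": 7, "name": "Ada"}}

@app.get("/orders/{order_id}")
def get_order(order_id: int, expand: Optional[str] = None):
    """GET /orders/1 returns the bare order;
    GET /orders/1?expand=customer embeds the customer resource."""
    order = dict(ORDERS[order_id])
    if expand == "customer":
        order["customer"] = CUSTOMERS[order["customer_id"]]
    return order
```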

min(DALL-E) – “This is a fast, minimal port of Boris Dayma’s DALL·E Mega. It has been stripped down for inference and converted to PyTorch. The only third-party dependencies are NumPy, requests, pillow, and torch”. Now you can easily generate images using min-dalle on your machine (though it might take a while).

https://github.com/kuprel/min-dalle

Bonus – https://openai.com/blog/dall-e-2-pre-training-mitigations/

4 Things I Learned From Analyzing Menopause Apps Reviews – Dalya Gartzman, She Knows Health CEO, writes about 4 lessons she learned from analyzing menopause app reviews. I think it is interesting in two ways – first, as a product-market fit strategy, to see what users are telling, asking, or complaining about in related apps.

https://medium.com/sheknows-health/4-things-i-learned-from-analyzing-menopause-apps-reviews-2cabf9ca9226

Inconsistent thoughts on database consistency – this post discusses the many aspects and definitions of consistency and how the term is used in different contexts. I absolutely love those topics. Having said that, I wonder if people hold those discussions in real life, or just use common cloud-managed solutions that encapsulate some of those concerns.

https://www.alexdebrie.com/posts/database-consistency/