Exploratory Data Analysis Course – Draft

Last week I gave an extended version of my talk about box plots in Noa Cohen's Introduction to Data Science class at Azrieli College of Engineering Jerusalem. Slides can be found here.

The students are 3rd and 4th-year students, and some will become data scientists and analysts. Their questions and comments, together with my experience with junior data analysts, made me realize that a big gap they face in landing those positions and performing well in them is doing EDA – exploratory data analysis. This reminded me of the missing semester of your CS education – skills that are needed, and sometimes perceived as common knowledge in the industry, but are not taught or talked about in academia.

“Exploratory Data Analysis (EDA) is the crucial process of using summary statistics and graphical representations to perform preliminary investigations on data in order to uncover patterns, detect anomalies, test hypotheses, and verify assumptions.” (see more here). EDA plays an important role in the everyday life of anyone working with data – data scientists, analysts, and data engineers. It is often also relevant for managers and developers, helping them solve the issues they face better and more efficiently and communicate their work and findings.

I started turning over in my head what an EDA course would look like –

Module 1 – Back to basics (3 weeks)

  1. Data types of variables, types of data
  2. Basic statistics and probability, correlation
  3. Anscombe’s quartet
  4. Hands-on lab – Python basics (pandas, numpy, etc.)

Module 2 – Data visualization (3 weeks)

  1. Basic data visualizations and when to use them – pie chart, bar charts, etc.
  2. Theory of graphical representation (e.g. Grammar of Graphics or something more up-to-date about human perception)
  3. Beautiful lies – graphical caveats (e.g. box plot)
  4. Hands-on lab – Python data visualization packages (matplotlib, plotly, etc.)

Module 3 – Working with non-tabular data (4 weeks)

  1. Data exploration on textual data
  2. Time series – anomaly detection
  3. Data exploration on images

Module 4 – Missing data (2 weeks)

  1. Missing data patterns
  2. Imputations
  3. Hands-on lab – a combination of missing data / non-tabular data

Extras if time allows-

  1. Working with unbalanced data
  2. Algorithmic fairness and biases
  3. Data exploration on graph data

I’m very open to exploring and discussing this topic more. Feel free to reach out – Twitter / LinkedIn


5 interesting things (03/11/2022)

How to communicate effectively as a developer – writing effectively is the second most important skill after reading effectively, and it is one of the skills that can differentiate you and push you forward. If you read only one thing today, read this –

https://www.karlsutt.com/articles/communicating-effectively-as-a-developer/

26 AWS Security Best Practices to Adopt in Production – this is a periodic reminder to pay attention to our SecOps. This post is very well written and the initial table of AWS security best practices by service is great. 

https://sysdig.com/blog/26-aws-security-best-practices/

EVA Video Analytics System – “EVA is a new database system tailored for video analytics — think MySQL for videos.” Looks cool at first glance, and I can think of use cases for myself, yet I wonder if it could become production-grade.

https://github.com/georgia-tech-db/eva

I see it as somehow complementary to – https://github.com/impira/docquery

Forestplot – “This package makes publication-ready forest plots easy to make out-of-the-box.” I like it when academia and technology meet, and this is really usable, also for data scientists’ day-to-day work. The next step would probably be deeper integration with scikit-learn and pandas.

https://github.com/lsys/forestplot

Bonus – Python DataViz cookbook – an easy way to navigate between the different common Python visualization practices (i.e. via pandas vs. using matplotlib / plotly / seaborn). I would like to see it go to the next step – controlling the colors, grid, etc. from the UI and then switching between the frameworks – but it is a good starting point.

https://dataviz.dylancastillo.co/

roadmap.sh – it is not always clear how to level up your skills and what you should learn next (best practices, which technology, etc.). Roadmap.sh attempts to create such roadmaps. While I don’t agree with everything there, I think the format and references are nice and it is a good source of inspiration.

https://roadmap.sh/

Shameless plug – Growing A Python Developer (2021); I plan to write a small update in the near future.

Think outside of the Box Plot

Earlier today, I spoke at the DataTLV conference about box plots – what they expose, what they hide, and how they mislead. My slides can be found here, and the code used to generate the plots is here.

Key takeaways

  • Box plots show 5-number summary statistics – minimum, maximum, median, Q1, and Q3.
  • The flaws of box plots can be divided into two groups – data that is not present in the visualization (e.g. number of samples, distribution) and the visualization being counter-intuitive (e.g. quartiles are a hard concept to grasp).
  • I chose solutions that are easy to implement, either by leveraging existing packages’ code or by adding small tweaks. I used plotly (a minimal sketch follows this list).
  • Aside from those adjustments, many times a box plot is just not the right graph for the job.
  • If the statistical literacy of your audience is not well established, I would avoid using box plots.
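
To give a flavor of what I mean by small tweaks, here is a minimal sketch (my own illustration, not the exact code from the talk) using plotly express and one of its bundled demo datasets; drawing the underlying points next to the box exposes the sample size and distribution that a plain box plot hides –

import plotly.express as px

tips = px.data.tips()  # demo dataset bundled with plotly
# points="all" draws every observation next to the box,
# revealing sample sizes and distribution shape
fig = px.box(tips, x="day", y="total_bill", points="all")
fig.show()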

Topics I didn’t talk about and worth mentioning

  • Mary Eleanor Hunt Spear – a data visualization specialist who pioneered the development of the bar chart and box plot. I had a slide about her but went too fast and skipped it. See here.
  • How percentiles are calculated – several methods exist, and different Python packages use different default methods. Read more – http://jse.amstat.org/v14n3/langford.html

Resources I used to prepare the talk

5 interesting things (1/12/21)

Tests aren’t enough: Case study after adding type hints to urllib3 – I read these posts like thrillers (and some of them are about the same length :). This post describes the effort of adding type hints to urllib3 and what the maintainers learned during the process. Super interesting.

https://sethmlarson.dev/blog/2021-10-18/tests-arent-enough-case-study-after-adding-types-to-urllib3

Why you shouldn’t invoke setup.py directly – a post by Paul Ganssle, a Python core developer and setuptools maintainer. TL;DR of the TL;DR in the post – “The setuptools team no longer wants to be in the business of providing a command-line interface and is actively working to become just a library for building packages”. See the table in the summary section for a quick how-to guide.

https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html

Python pathlib Cookbook: 57+ Examples to Master It (2021) – I have had a post about pathlib in my drafts for a while, and now I can delete it since this guide is much more extensive. In short, pathlib has been part of the Python standard library since Python 3.4, and it provides an abstraction over filesystem paths across operating systems. If you still use os for paths, this is a good time to switch.

https://miguendes.me/python-pathlib
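
To give a taste of the switch, a minimal sketch (with a made-up directory layout, not taken from the guide) –

from pathlib import Path

data_dir = Path("data")                   # hypothetical directory
for csv_file in data_dir.glob("*.csv"):   # iterate over CSV files
    print(csv_file.stem, csv_file.suffix)
report = data_dir / "out" / "report.txt"  # "/" joins path parts
report.parent.mkdir(parents=True, exist_ok=True)
report.write_text("done")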

10 Tips For Using OKRs Effectively – I think a lot about OKRs for my team, and moreover about personal OKRs, and how to grow both the team and the product. I found this post (and the associated links) insightful.

https://rushabhdoshi.com/posts/2020-06-18-10-tips-for-making-okrs-effective/

How to Choose the Right Structure for Your Data Team – I started with these posts and soon enough read many more by Barr Moses, Co-Founder and CEO of Monte Carlo. Her posts have two dimensions that are relevant for me – team building (specifically around data-intensive products) and data engineering. If you find at least one of those topics interesting, I believe you’ll enjoy her posts.

https://towardsdatascience.com/how-to-choose-the-right-structure-for-your-data-team-be6c1b66a067
https://towardsdatascience.com/7-questions-to-ask-when-building-your-data-team-at-a-hypergrowth-company-dce0c0f343b4

5 interesting things (21/10/21)

4 Things Tutorials Don’t Tell You About PyPI – this hands-on experience, together with the explanations, is priceless. Even if you don’t plan to upload a package to PyPI anytime soon, these glimpses of how PyPI works are interesting.

https://blog.paoloamoroso.com/2021/09/4-things-tutorials-dont-tell-you-about.html

Responsible Tech Playbook – I totally agree with Martin Fowler’s statement that “Whether asked to or not, we have a duty to ensure our systems don’t degrade our society”. This post promotes the playbook about responsible tech published by Fowler and his colleagues from Thoughtworks. It also references additional resources such as the Tarot Cards of Tech and Ethical Explorer.

https://martinfowler.com/articles/2021-responsible-tech-playbook.html

A Perfect Match – A Python 3.10 Brain Teaser – Python 3.10 was released earlier this month, and the most talked-about feature is structural pattern matching. Read this post to make sure you get it right.

https://medium.com/pragmatic-programmers/a-perfect-match-ef552dd1c1b1
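
For reference, a minimal illustration of the syntax (my own toy example, not the brain teaser from the post) –

def describe(point):
    match point:
        case (0, 0):
            return "origin"
        case (0, y):
            return f"on the y axis at {y}"
        case (x, 0):
            return f"on the x axis at {x}"
        case _:
            return "somewhere else"

print(describe((0, 3)))  # on the y axis at 3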

How I got my career back on track – a career is not a miracle; it doesn’t just happen. It is totally ok if you don’t want one, but if you do and have aspirations, you have to own it and manage your way there.

https://rinaarts.com/how-i-got-my-career-back-on-track

PyCatFlow – a big part of today’s data is time series data combined with categorical data, e.g. changes in the mix of medical diagnoses or shopping categories over time. PyCatFlow is a visualization tool that allows the representation of temporal developments based on categorical data. Check out their Jupyter Notebook with interactive widgets that can be run online.

https://medium.com/@bumatic/pycatflow-visualizing-categorical-data-over-time-b344102bcce2

Growing A Python Developer (2021)

I recently ran into a question from a team lead about how to grow a backend Python developer on her team. Since I have also iterated on this topic with my own team, I already had a few ideas in mind.

A few disclaimers before we start. First, I believe that the developer also has a share in the process and should express her interests and aspirations. The team lead or tech lead can give direction and shine a light on blind spots, but does not hold all the responsibility. It is also ok to dive into an idea or a tool that is not required at the moment; it might come in handy in the future and can inspire you. Second, my view is limited to the areas I work in. Different organizations or products have different needs and focus. Third, build habits to constantly learn and grow – read blogs and books, listen to podcasts, take online or offline courses, watch videos – whatever works for you, as long as you keep moving.

Consider the links below as appetizers. Each subject has many additional resources besides the ones I posted. Most likely I’m just not familiar with them, so please feel free to add them and I’ll update the post. Some subjects are so broad and product-dependent (e.g. cloud) that I didn’t add links at all. Additionally, when using a specific product / service / package, read the documentation and make it your superpower. Know the Python standard library well (e.g. itertools, functools, collections, pathlib, etc.); it can save you a lot of time, effort, and bugs.
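
A tiny taste of why the standard library pays off – a toy sketch of mine, not tied to any specific product –

from collections import Counter
from itertools import groupby

words = ["spam", "eggs", "spam", "ham", "spam"]
print(Counter(words).most_common(1))  # [('spam', 3)]

# groupby expects its input to be sorted by the grouping key
for key, group in groupby(sorted(words)):
    print(key, len(list(group)))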

General ideas and concepts

  1. Clean code – book, book summary
  2. Design patterns – refactoring book, refactoring guru, python design patterns GitHub repo
  3. Distributed design patterns – Patterns of Distributed Systems
  4. SOLID principles – SOLID coding in Python
  5. Cloud
  6. Deployment – CI/CD, Docker, Kubernetes
  7. Version control – git guide
  8. Databases – Using Databases with Python, databases tutorials
  9. Secure Development – Python cheat sheet by Snyk, OWASP

Python specific

  1. Webservices – flask, Django, FastAPI
  2. Testing – Unit Testing in Python — The Basics
  3. Packaging –  Python Packaging User Guide
  4. Data analysis – pandas, NumPy, scikit-learn
  5. Visualization – plotly, matplotlib
  6. Concurrency – Speed Up Your Python Program With Concurrency
  7. Debugging – debugging with PDB, Python debugging in VS Code
  8. Dependency management – Comparison of Pip, Pipenv and Poetry dependency management tools
  9. Type annotation – Type Annotations in Python
  10. Python 3.10 – What’s New in Python 3.10?, Why you can’t switch to Python 3.10 just yet

Additional resources

  1. Podcast.__init__ – The weekly podcast about Python and its use in machine learning and data science.
  2. The real python podcast
  3. Top 8 Python Podcasts You Should Be Listening to
  4. Python 3 module of the week
  5. Lazy programmer – courses on Udemy mainly AI and ML using Python
  6. cloudonaut – podcast and blog about AWS

5 interesting things (22/09/21)

Writing a Great CV for Your First Technical Role – a series of 3 parts about best practices, mistakes, and pitfalls in CVs, showing both good and bad examples. I find the posts relevant not just for first roles but also as a good reminder when updating your CV.

https://naomikriger.medium.com/writing-a-great-cv-for-your-first-technical-role-part-1-75ffc372e54e

Patterns in confusing explanations – writing, and technical writing in particular, is a superpower. Being able to communicate your ideas clearly so that others can engage with them is a very impactful skill. In this post, Julia Evans describes 13 patterns of bad explanations and accompanies them with positive examples.

https://jvns.ca/blog/confusing-explanations/

How We Design Our APIs at Slack – not only do I agree with this advice, I have also had some bad experiences with similar issues, both as an API supplier and as a consumer. Many times when big companies describe their architecture and processes, they are irrelevant to small companies due to cost, lack of data or resources, or other reasons, but the great thing about this post is that it also fits small companies and is relatively easy to implement.

https://slack.engineering/how-we-design-our-apis-at-slack/

Python Anti-Pattern – this post describes a bug that sits at the intersection of Python and AWS Lambda functions. One could say that it is an extreme case, but I tend to think it is more common than one would expect, and you may spend hours debugging it. It is well written and very important to know if you are using Lambda functions.

https://valinsky.me/articles/python-anti-pattern/

Architectural Decision Records – sharing knowledge is hard. Sometimes what is clear to you is not clear to others, sometimes it is not taken into account in the time estimation or takes longer than expected, and other times you just want to move on and deliver. Having templates and conventions makes it easier both for the writers and the readers. ADRs answer a specific need.

https://adr.github.io/

pandas read_csv and missing values

I read the Domino Lab post about “Data Exploration with Pandas Profiler and D-Tale”, where they load diagnostic mammogram data used in the diagnosis of breast cancer from the UCI website. Instead of missing values, the data contains ?. When reading the data with pandas' read_csv function, those values are naively interpreted as strings, and the column type becomes object instead of float in this case.

In the post mentioned above the authors dealt with the issue in the following way –

masses = masses.replace('?', np.NAN)
masses.loc[:,names[:-1]] = masses.loc[:,names[:-1]].apply(pd.to_numeric)

That is, they first replaced the ? values in all the columns with np.NAN and then converted all the columns to numeric. Let’s call this method the manual method.

If we know the non-default missing values in advance, can we do something better? The answer is yes!

See code here

Use na_values parameter

df = pd.read_csv(url, names=names, na_values=["?"])

The na_values parameter accepts scalar, string, list-like, or dict arguments. If you pass a scalar, a string, or a list-like argument, all columns are treated the same way. If you pass a dict, you can specify a different set of NaN values per column.
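
For example, a hypothetical dict form (the column names here are just for illustration; only "BI-RADS" also appears in the converters example below) –

# "?" is treated as NaN only in the "BI-RADS" column,
# while "unknown" is treated as NaN only in a hypothetical "Shape" column
df = pd.read_csv(url, names=names,
                 na_values={"BI-RADS": ["?"], "Shape": ["unknown"]})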

The advantage of this method over the manual method is that you don’t need to convert the columns after replacing the NaN values. In the manual method the column types must be specified after the replacement (in the given case they are all numeric); if there are multiple column types you need to know them and specify them in advance.

Side note – likewise, for non-trivial boolean values you can use the true_values and false_values parameters.

Use converters parameter

df = pd.read_csv(url, names=names, converters={"BI-RADS": lambda x: x if x!="?" else np.NAN})

This is usually used to convert values of specific columns. If you would like to convert values in all the columns the same way, this is not the preferred method, since you would have to add an entry for each column, and if a new column is added it won't be handled by default (this can be both an advantage and a disadvantage). However, for other use cases, converters can help with more complex conversions.

Note that the result here is different from the result of the other methods since we only converted the values in one column.

Conclusion

Pandas provides several ways to deal with non-trivial missing values. If you know the non-trivial values in advance, na_values is most likely the best way to go.

Performance-wise (time), all methods perform roughly the same for the given dataset, but that can change as a function of the dataset size (columns and rows), column types, and the number of non-trivial missing values.

On top of that, make reading documentation your superpower. It will help you use your tools in a smarter and more efficient way, and it can save you a lot of time.

See pandas read_csv documentation here

5 tips for using Pandas

Recently, I worked closely with pandas and found a few things that might be common knowledge but were new to me and helped me write more efficient code in less time.


1. Don’t drop the na

Count the number of unique values including NaN values.

Consider the following pandas DataFrame –

import pandas as pd

df = pd.DataFrame({"userId": list(range(5))*2 + [1, 2, 3],
                   "purchaseId": range(13),
                   "discountCode": [1, None]*5 + [2, 2, 2]})


If I want to count the discount codes by type, I might use df['discountCode'].value_counts(), which yields –

1.0    5
2.0    3

This will miss the purchases without discount codes. If I also care about those, I should do –

df['discountCode'].value_counts(dropna=False)

which yields –

NaN    5
1.0    5
2.0    3

This is also relevant for nunique. For example, if I want to count the number of unique discount codes each user used – df.groupby("userId").agg(count=("discountCode", lambda x: x.nunique(dropna=False)))

See more here – https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html

2. Margins on rows / columns only

Following the above example, assume you want to know for each discount code which users used it, and for each user which discount codes she used. Additionally, you want to know how many unique discount codes each user used and how many unique users used each code. You can use a pivot table with the margins argument –

df.pivot_table(index="userId", columns="discountCode",
               aggfunc="nunique", fill_value=0,
               margins=True)


It would be nice to have the option to get margins only for rows or only for columns. Also, the dropna option does not act as expected – the NaN values are taken into account in the aggregation function but are not added as a column or an index in the resulting DataFrame.
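
A possible workaround (my own sketch, not a pandas feature) is to build the pivot table without margins and append only the margin you want manually –

pivot = df.pivot_table(index="userId", columns="discountCode",
                       values="purchaseId", aggfunc="nunique", fill_value=0)
# append only a column-wise "All" row: unique purchases per discount code
pivot.loc["All"] = df.groupby("discountCode")["purchaseId"].nunique()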

3. plotly backend


Pandas' plotting capabilities are nice, but you can go one step further and use plotly very easily by setting plotly as the pandas plotting backend. Just add the following line after importing pandas (no need to import plotly, but you do need to install it) –

pd.options.plotting.backend = "plotly"
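
With the backend set, the familiar pandas plot calls return plotly figures – a minimal sketch with a made-up DataFrame –

df = pd.DataFrame({"day": ["Mon", "Tue", "Wed"], "sales": [3, 7, 5]})
fig = df.plot(x="day", y="sales", kind="bar")  # returns a plotly Figure
fig.show()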

Note that plotly still doesn't support all pandas plotting options (e.g. subplots, hexbins), but I believe it will improve in the future.


See more here – https://plotly.com/python/pandas-backend/


4. Categorical dtype and qcut

Categorical variables are common – e.g., gender, race, part of day, etc. They can be ordered (e.g. part of day) or unordered (e.g. gender). Using the categorical data type, one can validate data values better and compare them when they are ordered (see the user guide here). qcut allows us to customize binning for discrete and categorical data.

See the documentation here, and the post that caught my attention about it here – https://medium.com/datadriveninvestor/5-cool-advanced-pandas-techniques-for-data-scientists-c5a59ae0625d
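
A minimal sketch of both ideas, with toy data of my own –

import pandas as pd

# ordered categorical – "morning" < "afternoon" < "evening"
s = pd.Series(pd.Categorical(["morning", "evening", "afternoon", "morning"],
                             categories=["morning", "afternoon", "evening"],
                             ordered=True))
print(s.min(), s.max())  # morning evening

# qcut – bin a numeric column into quartiles with readable labels
ages = pd.Series([23, 35, 47, 52, 61, 70, 18, 44])
print(pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"]))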

5. tqdm integration


tqdm is a progress bar that wraps any Python iterable. You can also use it to follow the progress of pandas' apply functionality by using progress_apply instead of apply (you need to initialize tqdm first by calling tqdm.pandas()).
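
In code, a minimal sketch (with a made-up DataFrame) –

import pandas as pd
from tqdm import tqdm

tqdm.pandas()  # registers progress_apply / progress_map on pandas objects

df = pd.DataFrame({"value": range(10_000)})
df["squared"] = df["value"].progress_apply(lambda x: x ** 2)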

See more here – https://github.com/tqdm/tqdm#pandas-integration

5 interesting things (04/12/2020)

How to set compensation using commonsense principles – yet another piece of art by Erik Bernhardsson. I like his analytical approach and the way he models his ideas. His manifesto regarding compensation systems (Good/bad compensation systems) is brilliant; I believe most of us agree with him, but he put it into words. His modeling has some drawbacks that he is aware of, for example assuming certainty in employee productivity and almost perfect knowledge of the market. Yet, it is totally worth your time.

https://erikbern.com/2020/06/08/how-to-set-compensation-using-commonsense-principles.html

7 Over Sampling techniques to handle Imbalanced Data – imbalanced data is a common real-world scenario, specifically in healthcare, where most patients don't have the condition one is looking for. Over-sampling is a way to handle imbalanced data, and this post describes several techniques for doing so. Interestingly, at least in this specific example, most of the techniques do not bring significant improvement. I would therefore compare several techniques and not just try one of them.
https://towardsdatascience.com/7-over-sampling-techniques-to-handle-imbalanced-data-ec51c8db349f

This post uses the following package, which I didn't know before (it would be great if it could become part of scikit-learn) – https://imbalanced-learn.readthedocs.io/en/stable/index.html

It would be nice to see a similar post for downsampling techniques.
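
For a quick feel of the package, a minimal SMOTE sketch on synthetic data (my own example, not taken from the post) –

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# a synthetic imbalanced dataset – roughly 95% negatives, 5% positives
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=42)
print(Counter(y))

X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_resampled))  # the classes are now balanced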


Python’s do’s and don’ts – very nicely written, with good examples –
https://towardsdatascience.com/10-quick-and-clean-coding-hacks-in-python-1ccb16aa571b

Every Complex DataFrame Manipulation, Explained & Visualized Intuitively – can’t remember how pandas functions work? Great, you are not alone. You can use this guide to quickly remind yourself how melt, explode, pivot, and others work.
https://medium.com/analytics-vidhya/every-dataframe-manipulation-explained-visualized-intuitively-dbeea7a5529e

Causal Inference that’s not A/B Testing: Theory & Practical Guide – causality is often overlooked in the industry. Many times you develop a model that is “good enough” and move on. However, this might increase bias and lead to unfavourable results. This post suggests a hands-on approach to causality, accompanied by code samples.

https://towardsdatascience.com/causal-inference-thats-not-a-b-testing-theory-practical-guide-f3c824ac9ed2