5 interesting things (06/05/2021)

How HashiCorp works – HashiCorp develops open-source products that are widely used in the industry, including Terraform, Vault, and Consul. “How HashiCorp Works” provides a glimpse of HashiCorp’s culture and practices. I appreciate this kind of transparency and the chance to learn from it.

https://works.hashicorp.com/

Make boring plans – a more accurate title would be “make predictable plans”. That is, the next tasks should be predictable based on the team’s knowledge of the product’s pains, bugs, customers’ requests, etc. A possible way to measure how boring the plans are is to ask each team member to prioritize the top-k tasks the team should work on in the next period (quarter, sprint, etc.) and check how much the lists overlap. Disclaimer – each team member has their own view, pain points, and features they would like to develop, and might be biased towards them.

https://skamille.medium.com/make-boring-plans-9438ce5cb053
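
The overlap check suggested above can be sketched in a few lines. This is only one possible way to score it (mean pairwise Jaccard similarity of the top-k sets); the function name `plan_overlap` and the team rankings are made up for illustration and do not come from the linked post.

```python
from itertools import combinations


def plan_overlap(rankings, k=3):
    """Mean pairwise Jaccard similarity of each member's top-k tasks.

    1.0 means everyone picked the same top-k tasks (a very "boring" plan),
    0.0 means no two members share a single task.
    """
    top_sets = [set(tasks[:k]) for tasks in rankings.values()]
    scores = [len(a & b) / len(a | b) for a, b in combinations(top_sets, 2)]
    return sum(scores) / len(scores)


# Hypothetical prioritized task lists, one per team member.
rankings = {
    "alice": ["fix login bug", "billing export", "onboarding flow", "dark mode"],
    "bob": ["billing export", "fix login bug", "api rate limits", "dark mode"],
    "carol": ["fix login bug", "onboarding flow", "billing export", "sso"],
}

score = plan_overlap(rankings, k=3)
print(f"overlap score: {score:.2f}")
```

A score close to 1 suggests the plan is predictable; a low score is a hint that the team does not share the same picture of the priorities.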

Explainable AI Cheat Sheet – a cheat sheet, video, and resources regarding XAI. This is a very good way to get into this field.

https://ex.pegg.io/

I’ve code reviewed over 750 pull requests at Amazon. Here’s my exact thought process – code review is an art and a way in which personal relations manifest themselves. One day I might write a longer post about code reviews, but for now I want to focus on the last 2 points in this post – “I approve when the PR is good, not perfect” and “I seek feedback for whether I’m reviewing well”. “Good not perfect” – this depends on the team standards, DoD, the PR scope, etc. Specifically, in startups, where time and money are limited, each delay has its costs. “I seek feedback” – how is the quality of my CR measured? What are the goals of CR (familiarity with the code, finding bugs, enforcing standards, something else)? I would like to see or find ways to assess the quality of a CR and give feedback to the code reviewer.

https://curtiseinsmann.medium.com/ive-code-reviewed-over-750-pull-requests-at-amazon-here-s-my-exact-thought-process-cec7c942a3a4


My Clean and Tidy Checklist for Clean and Tidy Data – it is commonly believed that “Data scientists spend 80% of their time cleaning data”. This post provides a conceptual framework for cleaning data, so the time data scientists spend on cleaning data might drop to 79% 😉

https://towardsdatascience.com/my-clean-and-tidy-checklist-for-clean-and-tidy-data-fbdeacb3736c

5 interesting things (23/04/21)

You Are Probably Not Making The Most of Pandas “read_csv” Function – this might seem trivial, and everything can be found in the documentation, but it is well presented here with many examples.
https://towardsdatascience.com/you-are-probably-not-making-the-most-of-pandas-read-csv-function-51bcf069e646
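
To give a taste of what read_csv can do beyond the defaults, here is a small sketch. The parameters are my own pick (sep, usecols, dtype, parse_dates, na_values) and are not necessarily the ones the post covers; the inline CSV is made up so the snippet runs without any file.

```python
import io

import pandas as pd

# Made-up data: semicolon-separated, with a custom marker for missing values.
raw = io.StringIO(
    "user_id;signup_date;plan;spend\n"
    "1;2021-01-05;basic;10.5\n"
    "2;2021-02-11;pro;missing\n"
    "3;2021-03-20;basic;7.0\n"
)

df = pd.read_csv(
    raw,
    sep=";",                                      # non-default delimiter
    usecols=["user_id", "signup_date", "spend"],  # load only the needed columns
    dtype={"user_id": "int32"},                   # set the dtype up front
    parse_dates=["signup_date"],                  # parse dates while reading
    na_values=["missing"],                        # treat this marker as NaN
)

print(df.dtypes)
```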

Disasters I’ve seen in a microservices world – I experienced most of the disasters described in this post and totally agree with the bottom line – “These edge cases become the new normal at a certain scale, and we should cope with them.”

https://world.hey.com/joaoqalves/disasters-i-ve-seen-in-a-microservices-world-a9137a51

Chess2Vec – while there have been many x2Vec works in recent years, this work is passion-based. The writer is an informatics professor who wanted to apply the algorithm to a hobby of his – chess. I think this is a great example of a side project and I would love to see more such combinations.
https://towardsdatascience.com/chess2vec-map-of-chess-moves-712906da4de9

Driving Cultural Change Through Software Choices – there are several approaches to driving change. This post presents a more immediate, straightforward, role-model approach. The author’s idea is that if you choose or provide tools that reflect your values, your team will also adopt them.

https://skamille.medium.com/driving-cultural-change-through-software-choices-bf69d2db6539

Letter to (new) managers – an insightful post for managers and people who strive to become managers. Two quotes I liked – “Trust is consistency over time” and “We start managing others the way we manage ourselves, but to do better, we need to learn new tools and use them adaptively.” Managing others the way we manage ourselves is one of the most common mistakes I have seen managers make, and I try to be very aware of it myself.

https://productlessons.substack.com/p/letter-to-new-managers

5 tips for using Pandas

Recently, I worked closely with Pandas and found out a few things that might be common knowledge but were new to me and helped me write more efficient code in less time.


1. Don’t drop the na

Count the number of unique values including Na values.

Consider the following pandas DataFrame –

import pandas as pd

df = pd.DataFrame({"userId": list(range(5))*2 +[1, 2, 3],
                   "purchaseId": range(13),
                   "discountCode": [1, None]*5 + [2, 2, 2]})

If I want to count the discount codes by type I might use –  df['discountCode'].value_counts() which yields – 

1.0    5
2.0    3

This will miss the purchases without discount codes. If I also care about those, I should do –

df['discountCode'].value_counts(dropna=False)

which yields –

NaN    5
1.0    5
2.0    3

This is also relevant for nunique. For example, if I want to count the number of unique discount codes a user used, including missing ones – df.groupby("userId").agg(count=("discountCode", lambda x: x.nunique(dropna=False)))

See more here – https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html
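
Putting the two variants side by side makes the difference easier to see. The snippet below rebuilds the DataFrame from above; the output column names (codes_used, codes_used_incl_missing) are just names I chose for the illustration.

```python
import pandas as pd

df = pd.DataFrame({"userId": list(range(5)) * 2 + [1, 2, 3],
                   "purchaseId": range(13),
                   "discountCode": [1, None] * 5 + [2, 2, 2]})

per_user = df.groupby("userId")["discountCode"].agg(
    # Default behaviour: NaN is ignored.
    codes_used="nunique",
    # NaN counts as its own value, i.e. "purchased without a code".
    codes_used_incl_missing=lambda s: s.nunique(dropna=False),
)

print(per_user)
```

Every user in this toy data has at least one purchase without a discount code, so the second column is always larger by one.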

2. Margins on rows / columns only

Following the above example, assume you want to know, for each discount code, which users used it and, for each user, which discount codes she used. Additionally, you want to know how many unique discount codes each user used and how many unique users used each code. You can use a pivot table with the margins argument –

df.pivot_table(index="userId", columns="discountCode",
               aggfunc="nunique", fill_value=0,
               margins=True)

It would be nice to have the option to get margins only for rows or only for columns. The dropna option does not act as expected – the NA values are taken into account in the aggregation function but are not added as a column or an index in the resulting DataFrame.

3. plotly backend


Pandas’ plotting capabilities are nice, but you can go one step further and use plotly very easily by setting plotly as the pandas plotting backend. Just add the following line after importing pandas (no need to import plotly, but you do need to install it) –

pd.options.plotting.backend = "plotly"

Note that plotly still doesn’t support all pandas plotting options (e.g., subplots, hexbins) but I believe it will improve in the future.


See more here – https://plotly.com/python/pandas-backend/


4. Categorical dtype and qcut

Categorical variables are common – e.g., gender, race, part of day, etc. They can be ordered (e.g., part of day) or unordered (e.g., gender). Using the categorical data type, one can validate data values better and compare them in case they are ordered (see user guide here). qcut bins a numeric variable into quantile-based buckets and returns the result as an ordered categorical, optionally with custom labels.

See documentation here and the post that caught my attention about it here – https://medium.com/datadriveninvestor/5-cool-advanced-pandas-techniques-for-data-scientists-c5a59ae0625d
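
A minimal sketch of both ideas follows. The category names, spend values, and bucket labels are made up for illustration; the exception raised when assigning an unknown category differs between pandas versions, so the snippet catches both candidates.

```python
import pandas as pd

# An ordered categorical: comparisons follow the declared order, not the alphabet.
day_parts = pd.CategoricalDtype(
    categories=["morning", "noon", "evening", "night"], ordered=True
)
s = pd.Series(["noon", "morning", "night"], dtype=day_parts)
before_evening = s < "evening"

# Validation: assigning a value outside the declared categories is rejected.
try:
    s.iloc[0] = "dawn"
    rejected = False
except (TypeError, ValueError):
    rejected = True

# qcut: quantile-based binning of a numeric column into labelled buckets.
spend = pd.Series([5, 12, 18, 25, 40, 55, 70, 90])
buckets = pd.qcut(spend, q=4, labels=["low", "mid-low", "mid-high", "high"])

print(before_evening.tolist(), rejected)
print(buckets.tolist())
```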

5. tqdm integration


tqdm is a progress bar that wraps any Python iterable. You can also use it to follow the progress of pandas’ apply functionality by using progress_apply instead of apply (you need to initialize tqdm beforehand by calling tqdm.pandas()).

See more here – https://github.com/tqdm/tqdm#pandas-integration
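
In practice it looks roughly like this. The DataFrame and the squared column are made up; the only tqdm-specific parts are tqdm.pandas() and progress_apply. tqdm has to be installed.

```python
import pandas as pd
from tqdm import tqdm

# Register progress_apply on pandas objects; desc is an optional label for the bar.
tqdm.pandas(desc="squaring")

df = pd.DataFrame({"x": range(1000)})

# Same result as df["x"].apply(...), but a progress bar is printed while it runs.
df["x_squared"] = df["x"].progress_apply(lambda v: v * v)
```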

5 interesting things (04/12/2020)

How to set compensation using commonsense principles – yet another piece of art by Erik Bernhardsson. I like his analytical approach and the way he models his ideas. His manifesto regarding compensation systems (Good/bad compensation systems) is brilliant. I believe most of us would agree with him, but he is the one who put it into words. His modeling has some drawbacks that he is aware of, for example, assuming certainty in employee productivity and almost perfect knowledge of the market. Yet, it is totally worth your time.

https://erikbern.com/2020/06/08/how-to-set-compensation-using-commonsense-principles.html

7 Over Sampling techniques to handle Imbalanced Data – imbalanced data is a common real-world scenario, specifically in healthcare, where most of the patients don’t have the condition one is looking for. Over-sampling is one way to handle imbalanced data, and this post describes several over-sampling techniques. Interestingly, at least in this specific example, most of the techniques do not bring significant improvement. I would therefore compare several techniques rather than try just one of them.
https://towardsdatascience.com/7-over-sampling-techniques-to-handle-imbalanced-data-ec51c8db349f

This post uses the following package, which I didn’t know before (it would be great if it could become part of scikit-learn) – https://imbalanced-learn.readthedocs.io/en/stable/index.html

It would be nice to see a similar post for downsampling techniques.
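
To make the idea concrete, here is a sketch of the simplest technique, random over-sampling, written with plain NumPy. It does not use imbalanced-learn and is not the post's code; the function name `random_oversample` and the toy data are my own.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """Resample minority classes with replacement until all classes are equal in size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Draw the missing number of samples (zero for the majority class).
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        parts.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(parts)
    return X[idx], y[idx]


# Toy imbalanced data: 8 negatives, 2 positives.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

X_res, y_res = random_oversample(X, y)
print(np.bincount(y_res))
```

Over-sampling should be applied to the training set only; otherwise duplicated samples leak into the evaluation.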


Python’s dos and don’ts – nicely written, with good examples –
https://towardsdatascience.com/10-quick-and-clean-coding-hacks-in-python-1ccb16aa571b

Every Complex DataFrame Manipulation, Explained & Visualized Intuitively – can’t remember how pandas functions work? Great, you are not alone. You can use this guide to quickly remind yourself how melt, explode, pivot and others work.
https://medium.com/analytics-vidhya/every-dataframe-manipulation-explained-visualized-intuitively-dbeea7a5529e

Causal Inference that’s not A/B Testing: Theory & Practical Guide – causality is often overlooked in the industry. Many times you develop a model that is “good enough” and move on. However, this might increase bias and lead to unfavourable results. This post suggests a hands-on approach to causality accompanied by code samples.

https://towardsdatascience.com/causal-inference-thats-not-a-b-testing-theory-practical-guide-f3c824ac9ed2

Plotly back to back horizontal bar chart

Yesterday I read Boris Gorelik’s post – Before and after: Alternatives to a radar chart (spider chart) – and I wanted to use this visualization as well, but with Plotly.

So I created the following gist –

import numpy as np
import plotly.graph_objects as go

# Values per category for each group.
women_pop = np.array([5., 30., 45., 22.])
men_pop = np.array([5., 25., 50., 20.])
y = list(range(len(women_pop)))

# Negate one series so its bars extend to the left of zero (back to back).
fig = go.Figure(data=[
    go.Bar(y=y, x=-women_pop, orientation='h', name="women"),
    go.Bar(y=y, x=men_pop, orientation='h', name="men"),
])
fig.update_layout(
    barmode='relative',  # negative values are drawn left of zero, positive ones right
    title={'text': "Men vs Women",
           'x': 0.5,
           'xanchor': 'center'},
)
# Replace the numeric positions with category labels.
fig.update_yaxes(
    ticktext=['aa', 'bb', 'cc', 'dd'],
    tickvals=y,
)
fig.show()

How to manage remote teams – 3 insights from GitLab’s course in Coursera

I have recently taken GitLab’s “Remote management course” on Coursera. While some companies are remote-first, and were either built like that or chose to adopt this structure (or other remote work structures), with the outbreak of COVID-19 many companies were forced to change and adopt remote work. The timing of this course is more relevant than ever. Here are a few insights –

  1. This course is great PR for GitLab. They present a well-studied background and ideas about remote work, including pros, cons, and trade-offs. They present their take on remote work and how it is implemented at GitLab, and refer to their own handbook and resources. Additionally, the course lecturers are very diverse, which I believe can be attractive to many candidates.
  2. Physical health and, specifically, mental health are mentioned multiple times during the course. Working remotely raises mental health challenges which are important to mention, and I’m glad they did. This is especially true during the pandemic, when interactions with others decrease and many employees didn’t choose this form of work in advance.
  3. What is a vital capability for remote employees? Communication. They advocate valuing strong communication skills and emphasize that they are crucial in an asynchronous work environment where your colleagues are in different time zones, have varied cultural backgrounds, and might take hours or days to answer. Well, I think it is also important for colocated employees. Interactions with others shape our days, sometimes more than our actual tasks, and it is important to be in an environment where we feel comfortable and where, even if we have disagreements (and I would be worried if there weren’t any), they can be discussed and settled. Generally, I find many of their ideas relevant for colocated work as well.

Personally, I like working remotely as it saves the commute and allows me flexible working hours. However, in the past, I had the experience of being the only remote employee and that didn’t work well. Much of the communication happened in the coffee corner and was not accessible to me, promotion paths were blocked, etc.

If you are looking for a remote position, I strongly suggest this great resource – Established remote companies.

Getting started with Fairness in ML – 5 resources

Mirror Mirror – Reflections on Quantitative Fairness – this is one of the first pieces I read about algorithmic fairness and it caught my attention. It surveys the most common definitions of fairness together with relevant examples in a readable format.

https://shiraamitchell.github.io/fairness/

Fairness in Machine Learning book – a WIP of an online textbook about fairness and machine learning by very notable researchers in this field – Solon Barocas, Moritz Hardt, Arvind Narayanan.

https://fairmlbook.org/

Equality of Opportunity in Supervised Learning – if you want to read a research paper about fairness, this paper by Moritz Hardt, Eric Price and Nathan Srebro is a good starting point. It is central and relatively easy to read. It defines the “Equalized odds” and “Equal opportunity” fairness measures, which are commonly used, and also gives a geometric intuition.

https://arxiv.org/abs/1610.02413
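
As a rough illustration of the two measures: equal opportunity asks for equal true positive rates across groups, and equalized odds asks for equal true positive and false positive rates. The sketch below reports the gaps between groups; the function names and the toy labels are mine, and summarizing each criterion as a max-minus-min gap is just one convenient choice, not something taken from the paper.

```python
import numpy as np


def group_rates(y_true, y_pred):
    """True positive rate and false positive rate of binary predictions."""
    tpr = y_pred[y_true == 1].mean()
    fpr = y_pred[y_true == 0].mean()
    return tpr, fpr


def fairness_gaps(y_true, y_pred, group):
    """Largest TPR and FPR differences between any two groups.

    Equal opportunity holds when tpr_gap is 0;
    equalized odds holds when both tpr_gap and fpr_gap are 0.
    """
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    rates = [group_rates(y_true[group == g], y_pred[group == g])
             for g in np.unique(group)]
    tprs, fprs = zip(*rates)
    return {"tpr_gap": max(tprs) - min(tprs), "fpr_gap": max(fprs) - min(fprs)}


# Toy data with two groups, "a" and "b".
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]

gaps = fairness_gaps(y_true, y_pred, group)
print(gaps)
```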

Responsible Data Science course – a wider view than just fairness. This course is given by Julia Stoyanovich at New York University and includes topics such as data cleaning, anonymity, explainability, etc. If you are looking for a pertinent reading list, you can find it there.

https://dataresponsibly.github.io/courses/spring20/

AIF360 – a toolkit containing metrics for datasets and models to test for biases, and algorithms to mitigate bias in datasets and models. The package is available in both Python and R.
AIF360 was originally developed by IBM and was recently donated to Linux Foundation AI. Among the currently available software tools, I find this the most mature and stable one.

https://github.com/Trusted-AI/AIF360

5 interesting things (8/6/2020)

DBScan practitioners guide – DBScan is a density-based clustering method. One important advantage compared to K-means is DBScan’s ability to identify noise and outliers. I feel that DBScan is often underestimated. See this guide to learn more about how DBScan works, how to choose hyperparameters and more.

https://towardsdatascience.com/a-practical-guide-to-dbscan-method-d4ec5ab2bc99
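
The noise-detection ability is easy to see on a toy example. The sketch below uses scikit-learn's DBSCAN (not code from the guide); the points and the eps / min_samples values are hand-picked for the illustration. Points labelled -1 are the ones DBSCAN treats as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of points and one far-away outlier.
X = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],
    [10.0, 0.0],
])

# eps: neighbourhood radius; min_samples: points needed to form a dense region.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)

print(labels)
```

Unlike K-means, there is no need to set the number of clusters in advance, and the outlier is not forced into one of them.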

Fourier Transform for Data Science – when I was an undergrad I learned FFT out of context. It was just an algorithm in the textbook and I didn’t understand what it was good for. Later I was asked about it in an oral exam in grad school and was able to mumble something. Much later I tried to run some analysis on ECG waves and then I finally understood what it was about. Read this post if you want to demystify the Fourier transform.

https://medium.com/swlh/fourier-transformation-for-a-data-scientist-1f3731115097

Bonus – OpenCV tutorial on Fourier Transform

https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_transforms/py_fourier_transform/py_fourier_transform.html
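
The context I was missing back then fits in a few lines: given a signal, the Fourier transform tells you which frequencies it is made of. A small sketch with NumPy on a synthetic signal (the 5 Hz and 12 Hz components and the 100 Hz sampling rate are arbitrary choices for the example):

```python
import numpy as np

fs = 100                      # sampling rate in Hz
t = np.arange(fs) / fs        # one second of samples

# A signal made of a strong 5 Hz component and a weaker 12 Hz component.
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

# Magnitude spectrum of the real-valued signal and the frequency of each bin.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)

# The two strongest bins recover the frequencies we mixed in.
top_two = sorted(float(f) for f in freqs[np.argsort(spectrum)[-2:]])
print(top_two)
```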

Dataset shift – dataset shift happens when the test set and the train set come from different distributions. There are multiple expressions of this phenomenon, such as covariate shift, concept shift, and prior distribution shift. I believe that every data scientist working in the industry has come across at least one of those manifestations. This post provides a very good introduction to the topic and useful links if you want to delve deeper.

https://towardsdatascience.com/understanding-dataset-shift-f2a5a262a766

A Practical Framework for AI Adoption: A Five-Step Process – having several years of experience as a data scientist, I have noticed that data products are often not deployed, do not meet stakeholders’ expectations, are not used as the data scientist intended, etc. This post introduces a framework that tries to remedy some of those problems.

https://datascientia.blog/2020/05/23/framework-for-ai-adoption-a-five-step-process/

Diagrams as code – a code-based tool to draw system diagrams, possibly easier than fighting draw.io. It contains many icons, including several cloud providers (AWS, GCP, Azure, etc.) and common services (K8S, Elastic, Spark). All in all, it seems very promising.

https://diagrams.mingrammer.com/

Connected Papers is Here

Almost a year ago I published a blog post about “5 ways to follow publications in your field”. Yesterday I was exposed to a new tool – Connected Papers.

Connected Papers presents a graph of papers that are similar to each other. However, note that this is not a citation tree. Each paper is a node and there is an edge between papers if they are similar. The similarity matrix is based on co-citation and co-references.
The node’s colour represents the publication year and the node’s size corresponds to the number of citations. The graphs are built on the fly given a link or title.
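
To get a feeling for what similarity based on co-citation and co-references can mean, here is a toy sketch. It is not Connected Papers' actual formula, which I don't know; averaging two Jaccard scores is my own simplification and the small corpus is made up.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection over size of the union (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical corpus: paper -> set of papers it references.
references = {
    "A": {"X", "Y", "Z"},
    "B": {"X", "Y"},
    "C": {"Z", "W"},
    "D": {"A", "B"},
    "E": {"A", "B", "C"},
}

# Invert the mapping: paper -> set of papers that cite it.
cited_by = {}
for paper, refs in references.items():
    for ref in refs:
        cited_by.setdefault(ref, set()).add(paper)


def similarity(p, q):
    # Co-references: how much the two reference lists overlap.
    co_references = jaccard(references.get(p, set()), references.get(q, set()))
    # Co-citation: how often the two papers are cited by the same papers.
    co_citation = jaccard(cited_by.get(p, set()), cited_by.get(q, set()))
    return (co_references + co_citation) / 2


edges = {(p, q): similarity(p, q) for p, q in combinations(["A", "B", "C"], 2)}
print(edges)
```

Note that A and B never cite each other, yet they come out as the most similar pair, which is exactly why such a graph is not a citation tree.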

The interface presents the papers’ abstracts, which makes it easier to browse and jump between the different graphs.

Two small features I would like to see are filtering papers according to publication year and an option to download citations (e.g., BibTeX, APA).

I believe that I’ll use it extensively when working on related work.

5 interesting things (11/3/2020)

Never attribute to stupidity that which is adequately explained by opportunity cost – if you have capacity for only 10 words, those are the 10 words you should take – “prioritization is the most value creating activity in any company”. In general, I really like Erik Bernhardsson’s writing and ideas, and I find his posts insightful.

Causal inference tutorials thread – check this thread if you are interested in causal inference.

The (Real) 11 Reasons I Don’t Hire you – I have followed Charity Majors’ blog since I heard her at “Distributed Matters” in 2015. Besides being a very talented engineer, she also writes about being a founder and manager. Hiring is hard for both sides and, although it is hard to believe, it is not always personal. A candidate has to be a good match for the present company, for the future company, and for the team. The company would like to have exciting and challenging tasks for the candidate so she will be happy and grow in the company. And of course, we are all human and make mistakes from time to time. Read Charity’s post in order to examine the decision not to hire from the employer’s point of view.

The 22 Most-Used Python Packages in the World – an analysis of the most downloaded Python packages on PyPI over the past 365 days. Some of the results are surprising – I expected pip to be the most used package, and it is only the fourth, after urllib3, six and botocore, and I expected requests to be ranked a bit higher. Some of the packages are internals we are not even aware of, such as idna and pyasn1. An interesting reflection.

Time-based cross-validation – it might seem obvious when reading, but there are a few things to notice when training a model that has a time dependency. This post also includes Python code to support the suggested solutions.
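
The core idea, sketched below, is that every fold trains only on data that comes before its test period. This is a minimal expanding-window version written from scratch, not the code from the post; the function name and parameters are my own, and it assumes the samples are already sorted by time.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) pairs for time-ordered data.

    Each test window directly follows its training window, and the
    training window grows from one split to the next.
    """
    for i in range(n_splits):
        test_end = n_samples - (n_splits - 1 - i) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough samples for the requested splits")
        yield list(range(test_start)), list(range(test_start, test_end))


splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

A random K-fold split would let the model peek at the future, which usually makes the validation score look better than what you get in production.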