Getting started with Fairness in ML – 5 resources

Mirror Mirror – Reflections on Quantitative Fairness – this is one of the first pieces I read about algorithmic fairness, and it caught my attention. It surveys the most common definitions of fairness together with relevant examples, in a readable format.

https://shiraamitchell.github.io/fairness/

Fairness in Machine Learning book – a work-in-progress online textbook about fairness and machine learning by notable researchers in this field – Solon Barocas, Moritz Hardt, and Arvind Narayanan.

https://fairmlbook.org/

Equality of Opportunity in Supervised Learning – if you want to read a research paper about fairness, this paper by Moritz Hardt, Eric Price and Nathan Srebro is a good starting point. It is central and relatively easy to read. It defines the “equalized odds” and “equal opportunity” fairness measures, which are commonly used, and also gives a geometric intuition. A tiny numeric sketch of the idea follows the link.

https://arxiv.org/abs/1610.02413
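
To make the definition concrete, here is a tiny sketch (made-up numbers, not the authors’ code) that computes the equal opportunity gap – the difference in true positive rates between two groups –

import numpy as np

# made-up labels, predictions and a binary group membership
y_true = np.array([1, 1, 0, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1, 0, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def tpr(y_true, y_pred, mask):
    # true positive rate within a group: P(y_pred=1 | y_true=1, group)
    positives = mask & (y_true == 1)
    return (y_pred[positives] == 1).mean()

# equal opportunity asks these two rates to be (roughly) equal
gap = abs(tpr(y_true, y_pred, group == 0) - tpr(y_true, y_pred, group == 1))
print(gap)
# ~0.17 for these made-up numbers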

Responsible Data Science course – a wider view than just fairness. This course is given by Julia Stoyanovich at New York University and includes topics in data cleaning, anonymity, explainability, etc. If you are looking for a relevant reading list, you can find one there.

https://dataresponsibly.github.io/courses/spring20/

AIF360 – a toolkit containing metrics for datasets and models to test for biases, and algorithms to mitigate bias in datasets and models. The package is available in both Python and R. AIF360 was originally developed by IBM and was recently donated to Linux Foundation AI. Among the currently available software tools, I find it the most mature and stable one; see the short sketch after the link.

https://github.com/Trusted-AI/AIF360
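
As a taste of what it enables, here is a minimal sketch on a toy dataframe; I am assuming the basic BinaryLabelDataset / BinaryLabelDatasetMetric API here, so treat the exact calls as illustrative –

import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

# toy data: 'sex' is the protected attribute, 'label' is the binary outcome
df = pd.DataFrame({'sex': [0, 0, 1, 1, 1, 0],
                   'feature': [1, 2, 3, 4, 5, 6],
                   'label': [0, 1, 1, 1, 0, 0]})
dataset = BinaryLabelDataset(df=df, label_names=['label'], protected_attribute_names=['sex'])

metric = BinaryLabelDatasetMetric(dataset,
                                  unprivileged_groups=[{'sex': 0}],
                                  privileged_groups=[{'sex': 1}])
print(metric.mean_difference())   # difference in favorable outcome rates between the groups
print(metric.disparate_impact())  # ratio of favorable outcome rates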

5 interesting things (8/6/2020)

DBScan practitioners guide – DBSCAN is a density-based clustering method. One important advantage compared to K-means is DBSCAN’s ability to identify noise and outliers. I feel that DBSCAN is often underestimated. See this guide to learn more about how DBSCAN works, how to choose its hyperparameters, and more; a tiny scikit-learn example follows the link.

https://towardsdatascience.com/a-practical-guide-to-dbscan-method-d4ec5ab2bc99
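
To see the noise handling in action, here is a small scikit-learn sketch on made-up points; DBSCAN labels outliers as -1 –

import numpy as np
from sklearn.cluster import DBSCAN

# two dense blobs plus one far-away point
X = np.array([[1, 1], [1.1, 1], [0.9, 1.1],
              [8, 8], [8.1, 7.9], [8, 8.2],
              [50, 50]])
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)
#[ 0  0  0  1  1  1 -1] – the isolated point gets the noise label -1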

Fourier Transform for Data Science – when I was in undergrad school I learned the FFT out of context; it was just an algorithm in the textbook and I didn’t understand what it was good for. Later I was asked about it in an oral exam in grad school and managed to mumble something. Much later I tried to run some analysis on ECG waves, and then I finally understood what it was about. Read this post if you want to demystify the Fourier transform. A small NumPy sketch follows the links below.

https://medium.com/swlh/fourier-transformation-for-a-data-scientist-1f3731115097

Bonus – OpenCV tutorial on Fourier Transform

https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_transforms/py_fourier_transform/py_fourier_transform.html
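
And if code demystifies things faster than prose, here is a tiny NumPy sketch – a made-up signal built from two sine waves, and the FFT recovering their frequencies –

import numpy as np

fs = 100                          # sampling rate (Hz)
t = np.arange(0, 1, 1 / fs)       # one second of samples
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
print(freqs[spectrum.argsort()[-2:]])  # the two dominant frequencies: 20 Hz and 5 Hz
#[20.  5.]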

Dataset shift – dataset shift happens when the test set and the train set come from different distributions. There are multiple expressions of this phenomenon, such as covariate shift, concept shift, and prior probability shift. I believe that every data scientist working in the industry has come across at least one of those manifestations. This post provides a very good introduction to the topic and useful links if you want to delve deeper; a small shift-detection sketch follows the link.

https://towardsdatascience.com/understanding-dataset-shift-f2a5a262a766
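
A simple way to check for covariate shift, sometimes called adversarial validation, is to train a classifier to distinguish train samples from test samples; if it does much better than chance, the two distributions differ. A minimal sketch on synthetic data –

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(500, 5))   # "old" data
X_test = rng.normal(0.5, 1.0, size=(500, 5))    # "new", shifted data

X = np.vstack([X_train, X_test])
is_test = np.r_[np.zeros(len(X_train)), np.ones(len(X_test))]

# AUC well above 0.5 means the sets are distinguishable, i.e. covariate shift
auc = cross_val_score(RandomForestClassifier(n_estimators=100), X, is_test,
                      scoring='roc_auc', cv=5).mean()
print(auc)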

A Practical Framework for AI Adoption: A Five-Step Process – having several years of experience as a data scientist, I have noticed that data products are often not deployed, do not meet stakeholders’ expectations, or are not used as the data scientist intended. This post introduces a framework that tries to remedy some of those problems.

https://datascientia.blog/2020/05/23/framework-for-ai-adoption-a-five-step-process/

Diagrams as code – a code-based tool for drawing system diagrams, possibly easier than fighting draw.io. It includes many icons, covering several cloud providers (AWS, GCP, Azure, etc.), common services (K8S, Elastic, Spark), etc. All in all, it seems very promising; a small example follows the link.

https://diagrams.mingrammer.com/
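
A small sketch in the spirit of the project’s hello-world example (the icon classes vary by provider, so treat the exact imports as illustrative) –

from diagrams import Diagram
from diagrams.aws.compute import EC2
from diagrams.aws.database import RDS
from diagrams.aws.network import ELB

# renders a PNG of a tiny load-balanced web service
with Diagram("Web Service", show=False):
    ELB("lb") >> EC2("web") >> RDS("db")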

Connected Papers is Here

Almost a year ago I published a blog post about “5 ways to follow publications in your field”. Yesterday I was exposed to a new tool – Connected Papers.

Connected Papers presents a graph of papers that are similar to each other. Note, however, that this is not a citation tree: each paper is a node, and there is an edge between two papers if they are similar. The similarity is based on co-citations and co-references.
The node’s colour represents the publication year, and the node’s size corresponds to the number of citations. The graphs are built on the fly given a link or a title.

The interface presents the papers’ abstracts, which makes it easier to browse and jump between the different graphs.

Two small features I would add are filtering papers by publication year and an option to download citations (e.g. BibTeX, APA).

I believe that I’ll use it extensively when working on related work.

5 interesting things (11/3/2020)

Never attribute to stupidity that which is adequately explained by opportunity cost – if you have capacity for only 10 words, these are the 10 words you should take: “prioritization is the most value-creating activity in any company”. In general, I really like Erik Bernhardsson’s writing and ideas; I find his posts insightful.

Causal inference tutorials thread – check this thread if you are interested in causal inference.

The (Real) 11 Reasons I Don’t Hire you – I have followed Charity Majors’ blog since I heard her at “Distributed Matters” in 2015. Besides being a very talented engineer, she also writes about being a founder and a manager. Hiring is hard for both sides, and although it is hard to believe, it is not always personal: a candidate has to be a good match for the company as it is today, for the company it will become, and for the team. The company would like to have exciting and challenging tasks for the candidate so she will be happy and grow in the company. And of course, we are all human and make mistakes from time to time. Read Charity’s post to examine the decision not to hire from the employer’s point of view.

The 22 Most-Used Python Packages in the World – an analysis of the most downloaded Python packages on PyPI over the past 365 days. Some of the results are surprising – I expected pip to be the most used package (it is only fourth, after urllib3, six, and botocore), and I expected requests to be ranked a bit higher. Some of the packages are internals we are not even aware of, such as idna and pyasn1. An interesting reflection.

Time-based cross-validation – it might seem obvious when reading, but there are a few things to notice when training a model that has a time dependency. This post also includes Python code to support the suggested solutions.
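
The key point is that the folds must respect time order – always train on the past and validate on the future. A minimal sketch using scikit-learn’s TimeSeriesSplit –

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)   # 12 time-ordered samples
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
    print(train_idx, val_idx)
# each validation fold comes strictly after its training fold:
# [0 1 2] [3 4 5]
# [0 1 2 3 4 5] [6 7 8]
# [0 1 2 3 4 5 6 7 8] [ 9 10 11]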

DevSkiller 2019 report – few comments

I read a blog post analyzing some trends from the DevSkiller report, then read the original report, and I have a few comments to make.

“Companies from Israel are the most selective”

According to the post, companies in Israel are considered the most selective since they consider only 12.26% of the developers they test. Another point of view is that Israeli companies give less weight to résumés and would rather test applicants’ skills. This can increase diversity and give a chance to more people. I think this is a good practice for an industry that is short more than 10,000 professionals (see here).

“72% of companies are looking for JavaScript developers and JS is the most popular IT skill developers are tested in (40%)”

JavaScript is used both as a front-end tool (e.g. React) and a back-end tool (e.g. Node.js), and I think its popularity partly stems from that. Grouping the two together is too broad. Also, the SQL skill is somehow a side requirement of many positions and is almost worthless on its own. The big gap between JavaScript (40%) and HTML/CSS (20%) is strange, specifically when compared to the StackOverflow 2019 survey, where JavaScript had 67.8% and HTML/CSS had 63.5%, one right after the other. The numbers themselves do not matter, just the gap and the order.

On a side note, interpersonal skills such as teamwork, leadership, and responsibility are ignored in this report, and that’s a shame. They are sometimes more important when hiring someone.

“React, Spring, ASP.NET, MySQL, HTML, Data Analysis, and Laravel are the most popular technologies in their respective tech stacks”

CSS is a tech stack? A weird analysis from my point of view; maybe “web development” would have been a better title.

Is it a surprise that HTML and CSS are tested together? I find it hard to believe that an employer looks for an employee who is skilled in only one of them; they are tightly coupled.

In this section of the post, I miss a mention of NoSQL technologies.

General comments

Obviously, they use their own data, which is great, but there might be a selection bias to pay attention to. For example, companies that look for JavaScript developers may not be able to screen candidates themselves and therefore use DevSkiller’s services, compared to companies that look for Python developers. In the section about technologies tested together, they point out that the most common combination last year was Java + SQL and this year it is JavaScript + CSS. Maybe their screening service for Java + SQL is not as good as their screening for JavaScript + CSS, and therefore companies do not use it.

The completion rate of the test is impressive (93%).

There is a detailed analysis of the geography of the hiring companies and the candidates’ origins. I am also interested in demographics such as age, gender, and marital status. I know that not all of this data is legal to collect or ask about. But I wonder whether parents are more or less likely to complete the tests (or even start them). Are women as likely as men to pass the tests? Are there feminine and masculine technologies?

Data about years of experience and its correlation with the technologies used and the probability of passing the tests would also be interesting.

argparse choices

I saw this post on Medium regarding argparse, which suggests the following –

parser.add_argument("--model", help="choose model architecture from: vgg19 vgg16 alexnet", type=str)

I think the following variant is better –

parser.add_argument("--model", help="choose model architecture from: vgg19 vgg16 alexnet", type=str, choices=['vgg19', 'vgg16', 'alexnet'], default='alexnet'])

If an illegal parameter is given, for example --model vgg20, the desired behavior of almost every program is to throw an error. This won’t happen in the first case: if the user mistypes the model name, the script will silently fall back to the pre-trained AlexNet (implemented later in the script) instead of failing. Using the choices argument solves this. Adding default='alexnet' takes care of the case where the user does not actively choose a model; for the example presented in the original post, this is the desired behavior.
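
Putting it together, a minimal (hypothetical) script would behave like this –

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model", help="choose model architecture from: vgg19 vgg16 alexnet",
                    type=str, choices=['vgg19', 'vgg16', 'alexnet'], default='alexnet')
args = parser.parse_args()
print(args.model)

# python script.py                  -> prints 'alexnet' (the default)
# python script.py --model vgg16    -> prints 'vgg16'
# python script.py --model vgg20    -> argparse exits with an "invalid choice" error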

See more here –

Missing data in Python – 5 resources

Bonus – R-miss-tastic – theoretical background and resources related to missing values in R. I recommend the lecture notes.
https://rmisstastic.netlify.com/lectures/

Working with missing data in Pandas – pandas is the swiss army knife of data scientists. Pandas allows dropping records with missing values, filling missing values, interpolating missing data points, etc. (a few one-liners follow the link).

https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
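
A few one-liners that cover most day-to-day needs, on a toy dataframe –

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan], 'b': [np.nan, 2.0, 3.0, 4.0]})
print(df.isna().sum())        # missing values per column
print(df.dropna())            # drop rows with any missing value
print(df.fillna(df.mean()))   # fill with the column mean
print(df.interpolate())       # linear interpolation between known points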

Missing data visualization – missingno provides several levels and types of visualizations – per sample, per feature, a features heat map, and a dendrogram – in order to gain a better understanding of missing values in a dataset; a sketch of typical calls follows the link.

https://github.com/ResidentMario/missingno
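
A sketch of the typical calls on a toy dataframe (the plots are rendered via matplotlib) –

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import missingno as msno

df = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan],
                   'b': [np.nan, 2.0, 3.0, 4.0],
                   'c': [1.0, 2.0, np.nan, 4.0]})
msno.matrix(df)      # per-sample nullity matrix
msno.bar(df)         # missing-value counts per feature
msno.heatmap(df)     # nullity correlation between features
msno.dendrogram(df)  # hierarchical clustering of missingness patterns
plt.show()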

FancyImpute – multivariate imputation and matrix completion algorithms. This package was partially merged into scikit-learn (see the sketch after the link). It focuses on viewing the data as a matrix and not as a composition of columns; unfortunately, it is no longer actively maintained, but maybe that will change in the future.

https://github.com/iskandr/fancyimpute
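
To the best of my understanding, the merged part lives in scikit-learn as IterativeImputer (still behind an experimental import); a minimal sketch –

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa – enables the import below
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])
print(IterativeImputer(random_state=0).fit_transform(X))
# missing entries are filled using a model fitted on the other columns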

Missingpy – a scikit-learn-consistent API for data imputation. Implements KNN imputation (also implemented in FancyImpute) and Random Forest imputation (MissForest). Seems unmaintained.

https://github.com/epsilon-machine/missingpy

MDI – Missing Data Imputation Package – accompanying code to Missing Data Imputation for Supervised Learning (https://arxiv.org/abs/1610.09075)

https://github.com/rafaelvalle/MDI

3 follow-up notes on 3 python list comprehension tricks

I saw the following post about list comprehension tricks in Python. I really like Python’s comprehension functionality – dict, set, list, I don’t discriminate. So here are 3 follow-up notes about this post –

1. Set Comprehension

Besides dictionaries and lists, comprehensions also work for sets –

{s for s in [1, 2, 1, 0]}
#{0, 1, 2}
{s**2 for s in [1, 2, 1, 0, -1]}
#{0, 1, 4}

2. Filtering (and a glimpse to generators)

To filter a list, one can iterate over the list (or generator), apply the filter function, and output a list, or one can use the built-in filter function and receive a generator, which is more efficient, as described further in the original post.

words = ['deified', 'radar', 'guns']
palindromes = filter(lambda w: w==w[::-1], words)
list(palindromes)
#['deified', 'radar']

Another nice-to-know built-in function is map, which, for example, can yield the words’ lengths as a generator –

words = ['deified', 'radar', 'guns']
lengths = map(lambda w: len(w), words)
list(lengths)
#[7, 5, 4]

3. Generators

Another nice usage of generators is to create an infinite sequence – 


def infinite_sequence():
    num = 0
    while True:
        yield num
        num += 1

gen = infinite_sequence()
next(gen)
#0
next(gen)
#1
next(gen)
#2

Generators can be piped, return multiple outputs, and more. I recommend this post to better understand generators.
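
For example, a small sketch of piping two generators (illustrative names) –

def squares(nums):
    for n in nums:
        yield n ** 2

def evens(nums):
    for n in nums:
        if n % 2 == 0:
            yield n

pipeline = evens(squares(range(10)))
list(pipeline)
#[0, 4, 16, 36, 64]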

 

3 interesting features of NetworkX

“NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.”

NetworkX lets the user create a graph and then study it. For example – find the shortest path between nodes, find a node’s degree, find the maximal clique, find a coloring of a graph, and so on. In this post, I’ll present a few features that I find interesting and that are maybe less known.

Multigraphs

A multigraph is a graph that can store multiedges: multiple edges between the same two nodes (different from a hypergraph, where an edge can connect any number of nodes, not just two). NetworkX has 4 graph types – the well-known, commonly used directed and undirected graphs, and 2 multigraphs: nx.MultiDiGraph for directed multigraphs and nx.MultiGraph for undirected multigraphs.

In the example below, we see that if the graph type is not defined correctly, functionalities such as degree calculation may yield the wrong value –

import networkx as nx

G = nx.MultiGraph()
G.add_nodes_from([1, 2, 3])
G.add_edges_from([(1, 2), (1, 3), (1, 2)])
print(G.degree())
#[(1, 3), (2, 2), (3, 1)]

H = nx.Graph()
H.add_nodes_from([1, 2, 3])
H.add_edges_from([(1, 2), (1, 3), (1, 2)])
print(H.degree())
#[(1, 2), (2, 1), (3, 1)]

Create a graph from pandas dataframe

Pandas is the swiss army knife of every data scientist, so naturally, it is a good idea to be able to create a graph from a pandas dataframe. The other way around is also possible. See the documentation here. The example below shows how to create a multigraph from a pandas dataframe where each edge has a weight property.

import pandas as pd

df = pd.DataFrame([[1, 1, 4], [2, 1, 5], [3, 2, 6], [1, 1, 3]],
                  columns=['source', 'destination', 'weight'])
print(df)
#    source  destination  weight
# 0       1            1       4
# 1       2            1       5
# 2       3            2       6
# 3       1            1       3

G = nx.from_pandas_edgelist(df, 'source', 'destination', ['weight'], create_using=nx.MultiGraph)
print(nx.info(G))
# Name:
# Type: MultiGraph
# Number of nodes: 3
# Number of edges: 4
# Average degree: 2.6667

Graph generators

One of the features I find the most interesting and powerful. The graph generator interface allows creating several graph types with just one line of code. Some of the graphs are deterministic given a parameter (e.g. a complete graph of k nodes) while some are random (e.g. a binomial graph). Below are a few examples of deterministic graphs and random graphs; they are just the tip of the iceberg of the graph generator capabilities.

Complete graph – creates a graph with n nodes and an edge between every two nodes.

Empty graph – creates a graph with n nodes and no edges.

Star graph – creates a graph with one central node connected to n external nodes.

G = nx.complete_graph(n=9)
print(len(G.edges()), len(G.nodes()))
# 36 9
H = nx.complete_graph(n=9, create_using=nx.DiGraph)
print(len(H.edges()), len(H.nodes()))
# 72 9
J = nx.empty_graph(n=9)
print(len(J.edges()), len(J.nodes()))
# 0 9
K = nx.star_graph(n=9)
print(len(K.edges()), len(K.nodes()))
# 9 10

Binomial graph – creates a graph with n nodes where each edge is created with probability p (alias for gnp_random_graph and erdos_renyi_graph).

G1 = nx.binomial_graph(n=9, p=0.5, seed=1)
G2 = nx.binomial_graph(n=9, p=0.5, seed=1)
G3 = nx.binomial_graph(n=9, p=0.5)
print(G1.edges()==G2.edges(), G1.edges()==G3.edges())
# True False

Random regular graph – creates a graph with n nodes, where edges are created randomly and each node has degree d.

import matplotlib.pyplot as plt

G = nx.random_regular_graph(d=4, n=10)
nx.draw(G)
plt.show()

Random regular graph

Random tree – creates a uniformly random tree of n nodes.

G = nx.random_tree(n=10)
nx.draw(G)
plt.show()

random_tree

All the code in this post can be found here

Additional Resources

Official site

SO questions

https://www.datacamp.com/community/tutorials/networkx-python-graph-tutorial

https://www.geeksforgeeks.org/directed-graphs-multigraphs-and-visualization-in-networkx/amp/