DevSkiller 2019 report – few comments

I read a blog post about the DevSkiller report analyzing some trends, then read the original report, and I have a few comments to make.

“Companies from Israel are the most selective”

According to the post, companies in Israel are considered the most selective since they consider only 12.26% of the developers they test. Another point of view is that Israeli companies give less weight to résumés and would rather test the applicants' skills. This can increase diversity and give a chance to more people. I think that it is a good practice for an industry that lacks more than 10,000 professionals (see here).

“72% of companies are looking for JavaScript developers and JS is the most popular IT skill developers are tested in (40%)”

JavaScript is used both as a front-end tool (e.g. React) and as a back-end tool (e.g. Node.js), and I think grouping the two together is too broad. Also, the SQL skill is somehow a side requirement of many positions and is almost worthless on its own. The big gap between JavaScript (40%) and HTML/CSS (20%) is strange, specifically when compared to the StackOverflow 2019 survey, where JavaScript had 67.8%, HTML/CSS had 63.5%, and they were one after the other. The numbers themselves do not matter, just the gap and the order.


On a side note, interpersonal skills such as teamwork, leadership, and responsibility are ignored in this report, and that's a shame. They are sometimes more important when hiring someone.

“React, Spring, ASP.NET, MySQL, HTML, Data Analysis, and Laravel are the most popular technologies in their respective tech stacks”

CSS is a tech stack? A weird analysis from my point of view; maybe "web development" would have been a better title.

Is it a surprise that HTML and CSS are tested together? I find it hard to believe that an employer looks for an employee who is skilled in only one of those; they are tightly coupled.

In this section of the post, I miss a mention of NoSQL technologies.

General comments

Obviously, they use their own data, which is great, but there might be a selection bias to pay attention to. For example, maybe companies that look for JavaScript developers are not able to screen candidates themselves and therefore use these services, compared to companies that look for Python developers. In the section about technologies tested together, they point out that the most common combination last year was Java + SQL and this year it is JavaScript + CSS. Maybe their screening service for Java + SQL is not as good as their screening for JavaScript + CSS, and therefore companies do not use it.

The completion rate of the test is impressive (93%).

There is a detailed analysis of the geography of the hiring companies and the candidates' origins. I am also interested in demographics such as age, gender, and marital status. I know that not all of this is legal to collect or ask about. But I wonder whether parents are more or less likely to complete the tests (or even start them). Are women as likely as men to pass the tests? Are there feminine and masculine technologies?

Data about years of experience and its correlation with the technologies used and the probability of passing the tests would also be interesting.


How To Capitalize All Words in a String in Python – The Easy way

I ran into this Medium post today – “How To Capitalize All Words in a String in Python“.

It explains how to convert “hello world” to “Hello World”, but it does it the hard way. An easier solution would be to use the str.title method.

x = "hello world"
print(x.title())
# Hello World
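One caveat worth knowing (my own note, not from the original post): str.title capitalizes the letter after every non-letter character, so strings containing apostrophes come out wrong. string.capwords from the standard library splits on whitespace only and avoids that:

```python
import string

x = "it's a hello world"
# title() treats the apostrophe as a word boundary
print(x.title())           # It'S A Hello World
# capwords() splits on whitespace only
print(string.capwords(x))  # It's A Hello World
```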

See more documentation here

argparse choices

I saw this post on Medium regarding argparse, which suggests the following –

parser.add_argument("--model", help="choose model architecture from: vgg19 vgg16 alexnet", type=str)

I think the following variant is better –

parser.add_argument("--model", help="choose model architecture from: vgg19 vgg16 alexnet", type=str, choices=['vgg19', 'vgg16', 'alexnet'], default='alexnet')

If an illegal parameter is given, for example --model vgg20, the desired behavior of almost every program is to raise an error. This won’t happen in the first case: if the user mistypes the model name, the script will use the pre-trained AlexNet (the fallback implemented later in the script) instead of raising an error. Using the choices argument solves this. Adding default='alexnet' takes care of the case where the user does not actively choose a model. For the example presented in the original post this is the desired behavior.
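A minimal sketch of the difference (the argument is the one from the snippet above):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str,
                    choices=['vgg19', 'vgg16', 'alexnet'],
                    default='alexnet',
                    help="choose model architecture from: vgg19 vgg16 alexnet")

# The default is used when the flag is omitted; a legal value parses normally.
print(parser.parse_args([]).model)                    # alexnet
print(parser.parse_args(['--model', 'vgg16']).model)  # vgg16

# An illegal value is rejected by argparse itself: it prints an error
# and exits (raises SystemExit) instead of silently falling back.
try:
    parser.parse_args(['--model', 'vgg20'])
except SystemExit:
    print('vgg20 rejected')
```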

See more here –

Missing data in Python – 5 resources

Bonus – R-miss-tastic – theoretical background and resources relating to R missing-values packages. I recommend the lecture notes.

Working with missing data in Pandas – pandas is the Swiss Army knife of data scientists. It allows dropping records with missing values, filling missing values, interpolating missing data points, etc.
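A quick sketch of the three operations mentioned above (the toy dataframe is my own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [4.0, 5.0, np.nan]})

print(df.dropna())        # drop rows that contain any missing value
print(df.fillna(0))       # replace missing values with a constant
print(df.interpolate())   # linearly interpolate along each column
```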

Missing data visualization – provides several levels and types of visualizations – per sample, per feature, a features heat map, and a dendrogram – in order to gain a better understanding of the missing values in a dataset.

FancyImpute – implements multivariate imputation and matrix completion algorithms. This package was partially merged into scikit-learn. It views the data as a matrix rather than a composition of columns. Unfortunately, it is no longer actively maintained, but maybe it will be in the future.

Missingpy – a scikit-learn-consistent API for data imputation. Implements KNN imputation (also implemented in FancyImpute) and random forest imputation (MissForest). Seems unmaintained.

MDI – Missing Data Imputation Package – accompanying code to Missing Data Imputation for Supervised Learning (

3 follow-up notes on 3 python list comprehension tricks

I saw the following post about list comprehension tricks in Python. I really like Python’s comprehension functionality – dict, set, list, I don’t discriminate. So here are 3 follow-up notes about this post –

1. Set Comprehension

Besides dictionaries and lists, comprehensions also work for sets –

{s for s in [1, 2, 1, 0]}
# {0, 1, 2}
{s**2 for s in [1, 2, 1, 0, -1]}
# {0, 1, 4}

2. Filtering (and a glimpse to generators)

In order to filter a list, one can iterate over the list or generator, apply the filter function, and output a list, or one can use the built-in filter function and receive a generator, which is more efficient, as described further in the original post.

words = ['deified', 'radar', 'guns']
palindromes = filter(lambda w: w == w[::-1], words)
print(list(palindromes))
# ['deified', 'radar']

An additional nice-to-know built-in function is map, which, for example, can lazily yield the words’ lengths –

words = ['deified', 'radar', 'guns']
lengths = map(lambda w: len(w), words)
print(list(lengths))
# [7, 5, 4]

3. Generators

Another nice usage of generators is to create an infinite sequence – 

def infinite_sequence():
    num = 0
    while True:
        yield num
        num += 1


gen = infinite_sequence()
next(gen)
# 0
next(gen)
# 1

Generators can be piped, return multiple outputs, and more. I recommend this post to better understand generators.
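As a small illustration of piping (my own sketch, not from the original post) – each generator stage lazily consumes the previous one:

```python
def naturals():
    # infinite sequence 0, 1, 2, ...
    num = 0
    while True:
        yield num
        num += 1

def squares(seq):
    # a second generator stage piped onto the first
    for n in seq:
        yield n * n

pipeline = squares(naturals())
print([next(pipeline) for _ in range(5)])  # [0, 1, 4, 9, 16]
```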


3 interesting features of NetworkX

“NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.”

NetworkX lets the user create a graph and then study it. For example – find the shortest path between nodes, find node degrees, find the maximal clique, find a coloring of a graph, and so on. In this post, I’ll present a few features that I find interesting and are maybe less known.


A multigraph is a graph that can store multiedges – multiple edges between the same two nodes (this is different from a hypergraph, where an edge can connect any number of nodes, not just two). NetworkX has 4 graph types – the well-known, commonly used directed and undirected graphs, and 2 multigraph types – nx.MultiDiGraph for directed multigraphs and nx.MultiGraph for undirected multigraphs.

In the example below, we see that if the graph type is not defined correctly, functionalities such as degree calculation may yield the wrong value –

import networkx as nx

G = nx.MultiGraph()
G.add_nodes_from([1, 2, 3])
G.add_edges_from([(1, 2), (1, 3), (1, 2)])
print(G.degree())
# [(1, 3), (2, 2), (3, 1)]

H = nx.Graph()
H.add_nodes_from([1, 2, 3])
H.add_edges_from([(1, 2), (1, 3), (1, 2)])
print(H.degree())
# [(1, 2), (2, 1), (3, 1)]

Create a graph from pandas dataframe

Pandas is the Swiss Army knife of every data scientist, so naturally it would be a good idea to create a graph from a pandas dataframe. The other way around is also possible. See the documentation here. The example below shows how to create a multigraph from a pandas dataframe where each edge has a weight property.

import pandas as pd

df = pd.DataFrame([[1, 1, 4], [2, 1, 5], [3, 2, 6], [1, 1, 3]],
                  columns=['source', 'destination', 'weight'])
print(df)
#    source  destination  weight
# 0       1            1       4
# 1       2            1       5
# 2       3            2       6
# 3       1            1       3

G = nx.from_pandas_edgelist(df, 'source', 'destination', ['weight'],
                            create_using=nx.MultiGraph)
print(nx.info(G))
# Name:
# Type: MultiGraph
# Number of nodes: 3
# Number of edges: 4
# Average degree: 2.6667

Graph generators

This is one of the features I find the most interesting and powerful. The graph generator interface allows creating several graph types with just one line of code. Some of the graphs are deterministic given a parameter (e.g. a complete graph of k nodes) while some are random (e.g. a binomial graph). Below are a few examples of deterministic and random graphs; they are just the tip of the iceberg of the graph generator capabilities.

Complete graph – creates a graph with n nodes and an edge between every two nodes.

Empty graph – creates a graph with n nodes and no edges.

Star graph – creates a graph with one central node connected to n external nodes.

G = nx.complete_graph(n=9)
print(len(G.edges()), len(G.nodes()))
# 36 9
H = nx.complete_graph(n=9, create_using=nx.DiGraph)
print(len(H.edges()), len(H.nodes()))
# 72 9
J = nx.empty_graph(n=9)
print(len(J.edges()), len(J.nodes()))
# 0 9
K = nx.star_graph(n=9)
print(len(K.edges()), len(K.nodes()))
# 9 10

Binomial graph – creates a graph with n nodes where each edge is created with probability p (an alias for gnp_random_graph and erdos_renyi_graph).

G1 = nx.binomial_graph(n=9, p=0.5, seed=1)
G2 = nx.binomial_graph(n=9, p=0.5, seed=1)
G3 = nx.binomial_graph(n=9, p=0.5)
print(G1.edges()==G2.edges(), G1.edges()==G3.edges())
# True False

Random regular graph – creates a graph with n nodes where edges are created randomly so that each node has degree d.

G = nx.random_regular_graph(d=4, n=10)

Random regular graph
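A quick way to sanity-check the regularity claim (my own snippet):

```python
import networkx as nx

G = nx.random_regular_graph(d=4, n=10)
# Every node should have exactly degree 4.
print(set(dict(G.degree()).values()))  # {4}
```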

Random tree – creates a uniformly random tree with n nodes.

G = nx.random_tree(n=10)


All the code in this post can be found here

Additional Resource

Official site

SO questions

5 interesting things (23/09/2019)

All the best engineering advice I stole from non-technical people – the length of this post could definitely be cut in half. However, I find the main idea important: useful, meaningful advice and insights can pop up anywhere, and you just need to listen. There is something to learn from everyone; we just need to be willing to do that.

The Ultimate Guide to Acing Your First Coding Interview – looking for a job, especially your first job as a junior developer, is hard work. Read Dana’s tips here –

(full disclosure – I know Dana and helped to edit the post a bit)

Python Libraries for Interpretable Machine Learning – interpretability in machine learning has become an important topic lately. While I have my doubts about it and about what interpretability really is, this post presents the main Python packages for interpretability –
Bonus point – a missing package: SHAP (SHapley Additive exPlanations) –

Document Embedding techniques – (another) excellent post by Shay Palachy which reviews the main and prominent approaches regarding document embedding including many references to the relevant literature.

Style Pandas Dataframe Like a Master – while I always knew I could do better with my styling, I didn’t know how far I could take it. Check this post for a quick ramp-up on the topic.

5 ways to follow publications in your field

This post was published on Medium

An important part of data scientists’ and researchers’ life is keeping track of publications in their field. Depending on your field and needs, publications range from papers in academic conferences and proceedings (some of which you can find as YouTube videos) to new technologies and code packages, blog posts, etc. This post focuses on how to keep track of academic research and innovation.

  1. Follow the relevant conferences and journals — make sure you are familiar with the main conferences in your field (e.g. List of Machine Learning and Deep Learning conferences in 2019 / 2020) and follow their publications. You can usually read the accepted papers on the conference website once paper acceptance is published. Talk slides and videos are usually accessible a short while after the conference. Identifying the relevant conferences may require some initial effort, but once you have identified them, it is easy to keep going.
  2. Google Scholar e-mail alerts — track authors and/or keywords you find relevant. E.g. if you are interested in causal inference, you would probably want to follow Judea Pearl. You can track new articles, citations, and new related articles by author or keyword. I prefer to track only new articles because I found the benefit from citations and related articles low. You can also get email alerts for more complex queries. Set your alerts here.
  3. arXiv E-Mail Alerting Service — arXiv provides a daily digest of new submissions by subject. It is less granular and less focused than Google Scholar but gives you access to the newest, hottest submissions. Subscribe to the arXiv E-Mail Alerting Service here.
  4. Follow blogs and publications of companies and research institutes which interest you — these are usually softer publications that give you a taste of the company’s recent advances and research. If something lights up your imagination, move on to reading the full paper. Examples of such blogs — Facebook research blog, OpenAI blog, Google AI blog.
  5. Social media — follow researchers relevant to your field on Twitter, see the papers they publish and recommend, and read the discussions they are involved in. Join Facebook groups that discuss the topics you are interested in.

Now, you can lean back and enjoy the new ideas coming to you. The next challenge is to wisely invest your time and to pick the papers which will be most beneficial for you.

Junior Data Science — Choosing your first job

This post was published on Medium

While there are many people who would like to become data scientists and are looking for their first position, junior data science positions are rare. Data science positions range from very research-oriented positions in companies that also publish in scientific conferences (quite rare) to positions that are more hands-on and involve lots of coding. (Junior) data scientists also come from diverse backgrounds: recent grads (BSc, MSc, and PhD in different fields), experienced developers who would like to learn new skills, people retraining, and so on.

Since junior data science positions are rare, it is important to make an accurate choice and avoid common pitfalls. This post was triggered by Ori Cohen’s post “Data-Science Recruitment — Why You May Be Doing It Wrong”, which was oriented to the recruiting side. This post is for the data scientist who is looking for their first job. Here are a few insights.

Don’t be the first data scientist in the company

This sounds like a very sexy position — you recently graduated from university and you were able to impress a small startup with your skills. They offer you the chance to be the first data scientist in the company — boom! You will be able to shape the methods, processes, and tools the right way, like you always envisioned!

“In theory, theory and practice are the same. In practice, they are not” (Benjamin Brewster).

Many practical tasks are not like in the textbook or in Andrew Ng’s course. You will most probably need guidance and advice from an experienced data scientist who has already made her mistakes, is familiar with the data and the product’s constraints, and is simply more experienced. The skills you want to learn vary over time, but it is always a good idea to have someone around that you can learn from.

An additional issue is that small companies usually have little data, often not enough to train models, and data quality might also be an issue. This may require changes in the product, which need to be defined and implemented. As a junior data scientist, it might be complicated to handle both the technical part and the politics required for such a change.

How would you know you are interviewing for the first data science position:

  1. You will be told so explicitly — “you will be our first data scientist”
  2. None of your interviewers is a data scientist, and the questions they ask don’t reflect a deep understanding of the topic.

People Don’t Quit Jobs — They Quit Bosses

And before quitting — people work for bosses.

Interviews are two-sided. The company interviews you, but you also interview the company. Does the product excite you? Do you think the company has the right values and is a culture fit for you? Would you like to work for this manager?

Most likely you will work closely with your manager and teammates. Did they impress you? Would you value their feedback?

In order to learn and improve, a lot of feedback and communication is required, especially in a junior position. Are there regular 1:1s? Is there an onboarding plan? Do they participate in conferences \ is there an education budget? Does the company have the work-life balance you are looking for?

During an interview, the interviewer might want to please you, so if you ask these questions directly, they might answer what you expect to hear. Talking with teammates and other co-workers in the company can give you additional insights about the team and the company.

Tools and Technologies

If you mainly focus on research, you might find this point secondary. However, for your next position, hands-on experience may be required. Be sure to choose a place which uses reasonable technologies and not niche, esoteric ones, e.g. using assembly for machine learning, working in a mainframe environment, etc.

A current reasonable technology stack for a data scientist includes: Python (maybe Scala, maybe R, depending on your risk aversion) and the scientific Python packages (pandas, numpy, scipy, etc.), a cloud environment, and some kind of database (postgres \ mysql \ elasticsearch \ mongodb).

Last but not least — choose something you are passionate about so you will be happy to go to work in the morning and dream about your code at night 🙂

Special thanks to Liad Pollak and Idit Cohen who made this text readable

5 interesting things (23/07/2019)

Five Talks from spaCy-IRL Worth Watching – a great summary of 5 talks from the spaCy-IRL conference which took place in Berlin at the beginning of July. The summaries are spot-on – not too deep, not too shallow – and make you want to watch the talks. From the meta perspective – a very nice connection between academia and industry, leveraging ideas from academia to solve industry problems.


King – Man + Woman = King? In 2016, “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings” was published and showed that the pre-trained word2vec model trained on Google News articles exhibited gender stereotypes to “a disturbing extent”. Apparently, according to “Fair is Better than Sensational: Man is to Doctor as Woman is to Doctor”, at least some of the bias stems from optimisations \ restrictions done in order to present better results. The most significant one: the answer to “a is to b as c is to ..?” cannot be b. This does not mean that there is no bias; it only means that it was not measured and formalised correctly. This emphasises once again the need to understand the algorithms we use and their limitations.



Bonus – linear digression episode – Revisiting Biased Word Embeddings


10 tips for code review – code review can be a stressful task for both the reviewer and the person whose work is being reviewed. This post is written from the reviewer’s point of view: how to make the process more efficient and constructive for both sides. A good follow-up post would be how to listen and react to code review. From my experience, it is often a boiling point for relationships inside teams and can break teams when not done correctly.



How to label data – if you have ever done a data science project, you know that obtaining tagged data is a real hassle. You often discover that you don’t have enough data, the tagging is not what you need, etc. This guide will help you avoid pitfalls when issuing a labelling project.



Data-Science Recruitment — Why You May Be Doing It Wrong – a post by a data science team lead at Zencity regarding do’s and don’ts in the interviewing process for data scientists. In the last few years I have witnessed many of these flaws – asking non-relevant riddles, giving a very long, ill-defined home exercise with doubtful data. I would like to emphasise to candidates, especially junior candidates, that if you have doubts during the interview process, consider looking for another place.