5 interesting things (29/12/2021)

7 PyTest Features and Plugins That Will Save You Tons of Time – I read many tutorials and posts about PyTest, and this is the first time I ran into these flags (features 1-5), which I find very useful. As always – if you can, use your superpower and read the documentation directly.


Patterns for Authorization in Microservices – I find this post interesting since I currently face a similar problem: designing an authorization and authentication architecture for the product I work on, which can have complex access patterns, such as a user accessing multiple resources at different access levels, owned by different organizations.


Related bonus – https://blog.miguelgrinberg.com/post/api-authentication-with-tokens

Database Indexing Anti-Patterns – I find this post slightly too high level. Yes, it states possible issues with indexing, but a more effective post would show how to detect those anti-patterns on specific databases, e.g., measuring index usage on MongoDB – https://docs.mongodb.com/manual/tutorial/measure-index-use/

This link is part of this post as a periodic reminder to think about and take care of those topics before they become issues.

How to safely think in systems – “Effective systems thinking comes from the tension between model and reality, without a healthy balance you’ll always lose the plot.” I’m not sure if this post should be in the parenting category or in the career \ professional \ management category.

How Improvised Stand-up Comedy Taught Me to Interview Better – “After all, questions in an interview are mostly a means for getting to know the candidate better, just as pulling words out of a hat is just the framework for a show.” A great post that connects two domains that usually aren’t brought up together.


We raised a $12M seed round

For the last year and three months, I have been working at Lynx.md. We develop a medical data science platform that bridges the gap between data owners and data consumers while taking care of the de-identification, privacy, and security aspects of sharing data.

10 days ago we announced that we raised a $12M seed round and we are hiring – DevOps engineers, data engineers, data scientists, backend \ full-stack \ frontend developers, and product managers. Our tech stack includes Python (mainly using FastAPI, Django, pandas, etc.), AWS (but we will soon add Azure too), Postgres, Elasticsearch, Redis, and Docker. Super interesting challenges with added value. Feel free to reach out if you want to learn more.

Read more about us – https://www.calcalistech.com/ctech/articles/0,7340,L-3924888,00.html

5 interesting things (1/12/21)

Tests aren’t enough: Case study after adding type hints to urllib3 – I read posts like this as thrillers (and some of them are about the same length :). This post describes the effort of adding type hints to urllib3 and what the maintainers’ team learned during the process. Super interesting.


Why you shouldn’t invoke setup.py directly – a post by Paul Ganssle, a Python core developer and setuptools maintainer. TL;DR of the TL;DR in the post – “The setuptools team no longer wants to be in the business of providing a command-line interface and is actively working to become just a library for building packages”. See the table in the summary section for a quick how-to guide.


Python pathlib Cookbook: 57+ Examples to Master It (2021) – I had a post about pathlib in my drafts for a while, and now I can delete it since this guide is much more extensive. In short, pathlib has been part of the Python standard library since Python 3.4, and it provides an abstraction for filesystem paths over different operating systems. If you still use os.path for paths, this is a good time to switch.
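To get a flavor, here is a minimal sketch (the file and directory names are made up for illustration) of common os.path idioms expressed with pathlib:

```python
from pathlib import Path

# Paths are objects; "/" replaces os.path.join
data_dir = Path("data") / "raw"

# Path parts are attributes instead of os.path function calls
archive = Path("report.tar.gz")
name, suffix, stem = archive.name, archive.suffix, archive.stem

# Create a directory and write a file without os / open boilerplate
data_dir.mkdir(parents=True, exist_ok=True)
(data_dir / "hello.txt").write_text("hello pathlib")
```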


10 Tips For Using OKRs Effectively – I think a lot about OKRs for my team, and even more about personal OKRs and how to grow both the team and the product. I found this post (and the associated links) insightful.


How to Choose the Right Structure for Your Data Team – I started with this post and soon enough read many more posts by Barr Moses, co-founder and CEO of Monte Carlo. Her posts have two dimensions that are relevant to me – team building (specifically around data-intensive products) and data engineering. If you find at least one of those topics interesting, I believe you’ll enjoy her posts.


Other pie chart

This morning I read “20 ideas for better data visualization”. I liked it very much, and I especially found the 8th idea – “Limit the number of slices displayed in a pie chart” – very relevant for me. So I jumped into the plotly express code and created a figure of type other_pie which, given a number (n) and a label (other_label), creates a pie chart with n sectors. n-1 of those sectors are the top values according to the `values` column, and the other sector is the sum of the remaining rows.

A gist of the code can be found here (see here for how to build plotly).

I used the following code to generate a standard pie chart and a pie chart with 5 sectors –

import plotly.express as px
df = px.data.gapminder().query("year == 2007").query("continent == 'Europe'")
df.loc[df['pop'] < 2.e6, 'country'] = 'Other countries' # Represent only large countries
pie_fig = px.pie(df, values='pop', names='country', title='Population of European continent')
otherpie_fig = px.other_pie(df, values='pop', names='country', title='Population of European continent', n=5, other_label="others")
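Note that other_pie is not part of plotly express – it is the figure type I added. As a rough sketch of the idea (the helper name and signature here are illustrative, not the actual gist code), the aggregation boils down to keeping the top n-1 rows and summing the rest:

```python
import pandas as pd

def to_other(df, values, names, n, other_label="others"):
    # Keep the n-1 largest rows by `values`; sum the rest into one "other" row
    top = df.nlargest(n - 1, values)
    rest = df.loc[~df.index.isin(top.index), values].sum()
    other = pd.DataFrame({names: [other_label], values: [rest]})
    return pd.concat([top[[names, values]], other], ignore_index=True)

# The reduced frame can then be fed to the regular pie chart:
# px.pie(to_other(df, "pop", "country", n=5), values="pop", names="country")
```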

And this is how it looks –

Pie chart
Other pie chart

5 interesting things (21/10/21)

4 Things Tutorials Don’t Tell You About PyPI – this hands-on experience together with the explanations is priceless. Even if you don’t plan to upload a package to PyPI anytime soon those glimpses of how PyPI works are interesting.


Responsible Tech Playbook – I totally agree with Martin Fowler’s statement that “Whether asked to or not, we have a duty to ensure our systems don’t degrade our society”. This post promotes the playbook about responsible tech published by Fowler and his colleagues from Thoughtworks. It also references additional resources such as the Tarot Cards of Tech and the Ethical Explorer.


A Perfect Match – A Python 3.10 Brain Teaser – Python 3.10 was released earlier this month, and the most talked-about feature is pattern matching. Read this post to make sure you get it right.


How I got my career back on track – a career is not a miracle. It’s totally OK if you don’t want to have one, but if you do and have aspirations, you have to own it and manage your way there.


PyCatFlow – a big part of current data is time series data combined with categorical data, e.g., changes in the mix of medical diagnoses \ shopping categories over time. PyCatFlow is a visualization tool that allows the representation of temporal developments based on categorical data. Check their Jupyter Notebook with interactive widgets that can be run online.


Growing A Python Developer (2021)

I recently ran into a team lead’s question regarding how to grow a backend Python developer on her team. Since I have also iterated around this topic with my team, I already had a few ideas in mind.

A few disclaimers before we start. First, I believe that the developer also has a share in the process and should express her interests and aspirations. The team lead or tech lead can give direction and light up blind spots but does not hold all the responsibility. It is also OK to dive into an idea or a tool that is not required at the moment; they might come in handy in the future, and they can inspire you. Second, my view is limited to the areas I work in. Different organizations or products have different needs and focus. Third, build habits to constantly learn and grow – read blogs and books, listen to podcasts, take online or offline courses, watch videos, whatever works for you as long as you keep moving.

Consider the links below as appetizers. Each subject below has many additional resources besides the ones that I posted. Most likely I’m just not familiar with them; please feel free to add them and I’ll update the post. Some subjects, e.g. cloud, are so broad and product-dependent that I didn’t add links at all. Additionally, when using a specific product \ service \ package, read the documentation and make it your superpower. Know the Python standard library well (e.g. itertools, functools, collections, pathlib, etc.); it can save you a lot of time, effort, and bugs.

General ideas and concepts

  1. Clean code – book, book summary
  2. Design patterns – refactoring book, refactoring guru, python design patterns GitHub repo
  3. Distributed design patterns – Patterns of Distributed Systems
  4. SOLID principles – SOLID coding in Python
  5. Cloud
  6. Deployment – CI\CD, docker, Kubernetes
  7. Version control – git guide
  8. Databases – Using Databases with Python, databases tutorials
  9. Secure Development – Python cheat sheet by Snyk, OWASP

Python specific

  1. Webservices – flask, Django, FastAPI
  2. Testing – Unit Testing in Python — The Basics
  3. Packaging –  Python Packaging User Guide
  4. Data analysis – pandas, NumPy, scikit-learn
  5. Visualization – plotly, matplotlib
  6. Concurrency – Speed Up Your Python Program With Concurrency
  7. Debugging – debugging with PDB, Python debugging in VS Code
  8. Dependency management – Comparison of Pip, Pipenv and Poetry dependency management tools
  9. Type annotation – Type Annotations in Python
  10. Python 3.10 – What’s New in Python 3.10?, Why you can’t switch to Python 3.10 just yet

Additional resources

  1. Podcast.__init__ – The weekly podcast about Python and its use in machine learning and data science.
  2. The Real Python podcast
  3. Top 8 Python Podcasts You Should Be Listening to
  4. Python 3 module of the week
  5. Lazy programmer – courses on Udemy mainly AI and ML using Python
  6. cloudonaut – podcast and blog about AWS

5 tips to ace coding interview assignments

Nowadays, it is very common practice to give a coding home test as part of the interview process. Besides solving the task you are asked to, I believe there are a few additional things you can do in order to impress the reviewers and ace this step of the process.

1. Push the code to a private repository and share it with the reviewers – this has a twofold advantage. First, it demonstrates to the reviewers that you are familiar with version control tools, and second, it shows your working process and that you keep track of your work. Don’t forget to write meaningful commit messages.

2. Write a README file – the README file gives context to the entire project and reflects the way you understand the assignment. One of the annoying things as a reviewer is having to guess how to run the code, what the requirements are, and so on. Besides packaging or building the code in a way that runs smoothly (e.g., in Python, if using pip, add a requirements.txt), a README file should help me find my way inside the project. In such assignments, where you don’t have direct communication with the reviewers, the README file also serves as a place to document your decisions and thoughts.

What should you include in the README file? A short introduction explaining the project’s purpose and scope. How to install and run or use it, preferably with a snippet that one can just copy-paste. How to run the tests (see next section :). Additional sections can include explanations about choices you made, architecture-wise or implementation-wise, charts, performance evaluation, future ideas, dependencies, etc. This will help the reviewers get into your code quickly and run it, understand your thinking, and show that you are eager to share your knowledge with your peers.

For ease of use, check the template suggested here.

3. Write tests – usually unit tests are enough for this scope. They will help you debug your code and make sure it works properly. They will also signal to the reviewers that you care about the quality of your code and know your job.
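For example, a minimal test file that pytest picks up automatically (slugify here is a made-up stand-in for whatever function the assignment asks for):

```python
# test_slugify.py - pytest collects test_* functions automatically
def slugify(title):
    # In a real assignment this would be imported from your package
    return "-".join(title.lower().split())

def test_lowercases_and_joins():
    assert slugify("Hello World") == "hello-world"

def test_collapses_extra_whitespace():
    assert slugify("  a   b ") == "a-b"
```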

4. Run linters and spell checks, and proofread everything – make sure your code uses the common conventions and style of the tools you are using (e.g., PEP-8 for Python). Bonus points if you add the linters as pre-commit hooks to your repository. This makes your code smoother and easier for the reviewers to read. Formatted code indicates that you are used to sharing your code with others, and the hooks signal that you are productive and lazy by automating stuff.

5. Document everything – the idea behind this tip is not to annoy the reviewers by letting them guess what you meant. That is, document what each module, function, and parameter does. For example, in Python, use type annotations and docstrings.
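For instance, a hypothetical function documented in this spirit – type annotations for the signature, a docstring for the intent:

```python
def average_rating(ratings: list[float], default: float = 0.0) -> float:
    """Return the mean of `ratings`, or `default` for an empty list.

    Args:
        ratings: individual scores, e.g. [4.0, 5.0, 3.5].
        default: value to return when there are no ratings.
    """
    if not ratings:
        return default
    return sum(ratings) / len(ratings)
```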

5 interesting things (22/09/21)

Writing a Great CV for Your First Technical Role – a series of 3 parts about best practices, mistakes, and pitfalls in CVs, showing both good and bad examples. I find the posts relevant not just for first roles but also as a good reminder when updating your CV.


Patterns in confusing explanations – writing, and technical writing in particular, is a superpower. Being able to communicate your ideas in a clear way that others can engage with is a very impactful skill. In this post, Julia Evans describes 13 patterns of bad explanations and accompanies them with positive examples.


How We Design Our APIs at Slack – not only do I agree with this advice, I have had some bad experiences with similar issues, both as an API supplier and as a consumer. Many times when big companies describe their architecture and processes, they are irrelevant to small companies due to cost, lack of data or resources, or other reasons, but the great thing about this post is that it also fits small companies and is relatively easy to implement.


Python Anti-Pattern – this post describes a bug that is at the intersection of Python and AWS Lambda functions. One can say that it is an extreme case, but I tend to think it is more common than one would expect, and you may spend hours debugging it. It is well written and very important to know if you are using Lambda functions.


Architectural Decision Records – sharing knowledge is hard. Sometimes what is clear to you is not clear to others; sometimes it is not taken into account in the time estimation or takes longer than expected; other times you just want to move on and deliver. Having templates and conventions makes it easier both for the writers and for the readers. ADRs answer a specific need.


5 interesting things (19/08/2021)

The 7 Branches of Software Gardening – “A small refactoring a day keeps the tech debt away” (paraphrasing^2). Great examples of small activities and improvements every developer can make on a daily basis that pile up to a big impact.


What is the right level of specialization? For data teams and anyone else – I like Erik Bernhardsson’s posts; I have linked to them more than once in the past. Bernhardsson highlights the tension between being very professional and specific (“I only do ETL processes on Wednesdays 15:03-16:42 on Windows machines”) versus being less proficient in more concepts and technologies. And this leads us to the next item.


7 Key Roles and Responsibilities in Enterprise MLOps – in this post, Domino Data Lab introduces its view of the different roles and responsibilities in MLOps \ data teams. For sure it is more suitable for enterprises; small and even medium companies cannot afford, and sometimes don’t need, all those roles, and as suggested in Erik Bernhardsson’s post, very specific specialization makes it harder to move people between teams according to the organization’s needs. Having said that, a title is a signal (inside and outside the organization) of what a person likes to do and which capabilities (not necessarily specific tools) s\he probably has.


The Limitations of Chaos Engineering – I decided to read a bit about chaos engineering, as I had never experimented with it before, and came across this post, which is almost 4 years old. While it is important to validate the resilience of our architecture and its implementation de facto, the common practice of fault injection also has its limitations, which are good to know.


It’s Time to Retire the CSV – if you ever worked with CSVs, you probably see this title and yell “Hell Yes!”. If you want to gain a historic view and a few more arguments, have a look here –


pandas read_csv and missing values

I read the Domino Data Lab post about “Data Exploration with Pandas Profiler and D-Tale”, where they load data about diagnostic mammograms used in the diagnosis of breast cancer from the UCI website. Instead of missing values, the data contains ?. When reading the data naively with pandas’ read_csv function, this value is interpreted as a string, and the column type becomes object instead of float in this case.

In the post mentioned above the authors dealt with the issue in the following way –

import numpy as np
import pandas as pd

masses = masses.replace('?', np.NAN)
masses.loc[:, names[:-1]] = masses.loc[:, names[:-1]].apply(pd.to_numeric)

That is, they first replaced the ? values in all the columns with np.NAN and then converted all the columns to numeric. Let’s call this the manual method.

If we know the non-default missing values in advance, can we do something better? The answer is yes!

See code here

Use na_values parameter

df = pd.read_csv(url, names=names, na_values=["?"])

The na_values parameter accepts a scalar, string, list-like, or dict. If you pass a scalar, string, or list-like value, all columns are treated the same way. If you pass a dict, you can specify a different set of NaN values per column.
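For example, with a small made-up CSV where each column has its own missing-value marker:

```python
import io

import pandas as pd

# Hypothetical data: "?" means missing in age, "unknown" means missing in city
raw = io.StringIO("age,city\n34,haifa\n?,unknown\n28,?\n")
df = pd.read_csv(raw, na_values={"age": ["?"], "city": ["unknown"]})

# "?" became NaN only in the age column; in city it stays a regular value
```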

The advantage of this method over the manual method is that you don’t need to convert the columns after replacing the NaN values. In the manual method, the column types must be specified after the replacement (in the given case they are all numeric); if there are multiple column types, you need to know them and specify them in advance.

Side note – likewise, for non-trivial boolean values you can use the true_values and false_values parameters.
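For example, with a made-up CSV that encodes booleans as yes/no:

```python
import io

import pandas as pd

raw = io.StringIO("name,subscribed\nalice,yes\nbob,no\n")
# A column containing only values from these lists is parsed as boolean
df = pd.read_csv(raw, true_values=["yes"], false_values=["no"])
```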

Use converters parameter

df = pd.read_csv(url, names=names, converters={"BI-RADS": lambda x: x if x!="?" else np.NAN})

This is usually used to convert the values of specific columns. If you would like to convert values in all the columns in the same way, this is not the preferred method, since you would have to add an entry for each column, and if a new column is added you won’t take care of it by default (this can be both an advantage and a disadvantage). However, for other use cases, converters can help with more complex conversions.

Note that the result here is different from the result of the other methods, since we only converted the values in one column.


Pandas provides several ways to deal with non-trivial missing values. If you know the non-trivial values in advance, na_values is most likely the best way to go.

Performance-wise (time), all methods perform roughly the same for the given dataset, but that can change as a function of the dataset size (columns and rows), column types, and the number of non-trivial missing values.

On top of that, make reading documentation your superpower. It can help you use your tools smarter and more efficiently, and it can save you a lot of time.

See pandas read_csv documentation here