Other pie chart

This morning I read “20 ideas for better data visualization“. I liked it very much, and I found the 8th idea – “Limit the number of slices displayed in a pie chart” – especially relevant for me. So I jumped into the Plotly Express code and created a figure of type other_pie which, given a number (n) and a label (other_label), creates a pie chart with n sectors: n-1 of those sectors are the top values according to the `values` column, and the remaining sector is the sum of the other rows.

A gist of the code can be found here (see here for how to build Plotly).

I used the following code to generate a standard pie chart and a pie chart with 5 sectors –

import plotly.express as px
df = px.data.gapminder().query("year == 2007").query("continent == 'Europe'")
df.loc[df['pop'] < 2.e6, 'country'] = 'Other countries' # Represent only large countries
pie_fig = px.pie(df, values='pop', names='country', title='Population of European continent')
otherpie_fig = px.other_pie(df, values='pop', names='country', title='Population of European continent', n=5, other_label="others")

And this is how it looks –

[Figures: the standard pie chart and the other pie chart]
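
The actual implementation lives in the gist; as a rough idea of the grouping logic, here is a minimal sketch of my own (written as a plain wrapper around px.pie rather than a registered figure type):

import pandas as pd
import plotly.express as px

def other_pie(df, values, names, n, other_label="other", **kwargs):
    # keep the n-1 rows with the largest `values`
    top = df.nlargest(n - 1, values)
    # sum all the remaining rows into a single "other" row
    rest_sum = df.loc[~df.index.isin(top.index), values].sum()
    other_row = pd.DataFrame({names: [other_label], values: [rest_sum]})
    grouped = pd.concat([top[[names, values]], other_row], ignore_index=True)
    return px.pie(grouped, values=values, names=names, **kwargs)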

5 interesting things (21/10/21)

4 Things Tutorials Don’t Tell You About PyPI – this hands-on experience together with the explanations is priceless. Even if you don’t plan to upload a package to PyPI anytime soon, those glimpses of how PyPI works are interesting.

https://blog.paoloamoroso.com/2021/09/4-things-tutorials-dont-tell-you-about.html

Responsible Tech Playbook – I totally agree with Martin Fowler’s statement that “Whether asked to or not, we have a duty to ensure our systems don’t degrade our society”. This post promotes the Responsible Tech playbook published by Fowler and his colleagues from Thoughtworks. It also references additional resources such as the Tarot Cards of Tech and the Ethical Explorer.

https://martinfowler.com/articles/2021-responsible-tech-playbook.html

A Perfect Match – A Python 3.10 Brain Teaser – Python 3.10 was released earlier this month, and the most talked-about feature is pattern matching. Read this post to make sure you get it right.

https://medium.com/pragmatic-programmers/a-perfect-match-ef552dd1c1b1

How I got my career back on track – a career is not a miracle. It’s totally ok if you don’t want one, but if you do and have aspirations, you have to own it and manage your way there.

https://rinaarts.com/how-i-got-my-career-back-on-track

PyCatFlow – a big part of current data is time series data combined with categorical data, e.g., changes in the mix of medical diagnoses or shopping categories over time. PyCatFlow is a visualization tool which allows the representation of temporal developments based on categorical data. Check their Jupyter Notebook with interactive widgets that can be run online.

https://medium.com/@bumatic/pycatflow-visualizing-categorical-data-over-time-b344102bcce2

Growing A Python Developer (2021)

I recently ran into a question from a team lead about how to grow a backend Python developer on her team. Since I had also iterated around this topic with my own team, I already had a few ideas in mind.

A few disclaimers before we start. First, I believe that the developer also has a share in the process and should express her interests and aspirations. The team lead or tech lead can direct and illuminate blind spots but does not hold all the responsibility. It is also ok to dive into an idea or a tool that is not required at the moment; it might come in handy in the future and can inspire you. Second, my view is limited to the areas I work in. Different organizations or products have different needs and focus. Third, build habits to constantly learn and grow – read blogs and books, listen to podcasts, take online or offline courses, watch videos, whatever works for you as long as you keep moving.

Consider the links below as appetizers. Each subject has many additional resources besides the ones I posted; most likely I’m just not familiar with them, so please feel free to add them and I’ll update the post. Some subjects are so broad and product-dependent (e.g., cloud) that I didn’t add links at all. Additionally, when using a specific product / service / package, read the documentation and make it your superpower. Know the Python standard library well (e.g., itertools, functools, collections, pathlib, etc.); it can save you a lot of time, effort, and bugs.
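
For instance, a few standard-library modules can do real work in a handful of lines (a toy example of mine, with a hypothetical docs directory):

from collections import Counter
from pathlib import Path

# count word frequencies across all text files under a directory
words = Counter(
    word.lower()
    for path in Path("docs").glob("*.txt")
    for word in path.read_text().split()
)
print(words.most_common(5))  # the five most frequent words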

General ideas and concepts

  1. Clean code – book, book summary
  2. Design patterns – refactoring book, refactoring guru, python design patterns GitHub repo
  3. Distributed design patterns – Patterns of Distributed Systems
  4. SOLID principles – SOLID coding in Python
  5. Cloud
  6. Deployment – CI/CD, Docker, Kubernetes
  7. Version control – git guide
  8. Databases – Using Databases with Python, databases tutorials
  9. Secure Development – Python cheat sheet by Snyk, OWASP

Python specific

  1. Webservices – Flask, Django, FastAPI
  2. Testing – Unit Testing in Python — The Basics
  3. Packaging – Python Packaging User Guide
  4. Data analysis – pandas, NumPy, scikit-learn
  5. Visualization – plotly, matplotlib
  6. Concurrency – Speed Up Your Python Program With Concurrency
  7. Debugging – debugging with PDB, Python debugging in VS Code
  8. Dependency management – Comparison of Pip, Pipenv and Poetry dependency management tools
  9. Type annotation – Type Annotations in Python
  10. Python 3.10 – What’s New in Python 3.10?, Why you can’t switch to Python 3.10 just yet (see the short pattern matching example after this list)
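
As a taste of the pattern matching mentioned in the last item, here is a minimal toy example of mine (Python 3.10+):

def describe(point):
    # structural pattern matching on an (x, y) tuple
    match point:
        case (0, 0):
            return "origin"
        case (x, 0):
            return f"on the x-axis at {x}"
        case (0, y):
            return f"on the y-axis at {y}"
        case (x, y):
            return f"at ({x}, {y})"
        case _:
            return "not a point"

print(describe((0, 3)))  # on the y-axis at 3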

Additional resources

  1. Podcast.__init__ – The weekly podcast about Python and its use in machine learning and data science.
  2. The Real Python Podcast
  3. Top 8 Python Podcasts You Should Be Listening to
  4. Python 3 module of the week
  5. Lazy programmer – courses on Udemy, mainly AI and ML using Python
  6. cloudonaut – podcast and blog about AWS

5 tips to ace coding interview assignments

Nowadays, it is a very common practice to give a coding home test as part of the interview process. Besides solving the task you are asked to, I believe there are a few additional things you can do in order to impress the reviewers and ace this step of the process.

1. Push the code to a private repository and share it with the reviewers – this creates a twofold advantage. First, it demonstrates to the reviewers that you are familiar with version control tools, and second, it shows your working process and that you keep track of your work. Don’t forget to write meaningful commit messages.

2. Write a README file – the README file gives context to the entire project and reflects the way you understand the assignment. One of the annoying things as a reviewer is having to guess how to run the code, what the requirements are, and so on. Besides packaging or building the code in a way that runs smoothly (e.g., in Python, if using pip, add a requirements.txt), a README file should help me find my way inside the project. In such assignments, where you don’t have direct communication with the reviewers, the README file also serves as a place to document your decisions and thoughts.

What should you include in the README file? A short introduction explaining the project’s purpose and scope. How to install and run or use it, preferably with some snippet that one can just copy-paste. How to run the tests (see the next tip :). Additional sections can include explanations about choices you made architecture-wise or implementation-wise, charts, performance evaluation, future ideas, dependencies, etc. This will help the reviewers get into your code quickly and run it, understand your thinking, and show that you are eager to share your knowledge with your peers.

For ease of use, check the template suggested here.

3. Write tests – usually unit tests are enough for this scope. They will help you debug your code and make sure it works properly, and they will also signal the reviewers that you care about the quality of your code and know your job.
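
For illustration, a minimal unit test might look like this (a toy slugify helper of my own, tested with pytest):

# test_slugify.py – run with `pytest`
def slugify(title: str) -> str:
    """Toy helper: turn a title into a URL-friendly slug."""
    return "-".join(title.lower().split())

def test_slugify_lowercases_and_joins_words():
    assert slugify("Hello World") == "hello-world"

def test_slugify_handles_extra_whitespace():
    assert slugify("  Hello   World  ") == "hello-world"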

4. Run linters, spell check, and proofread everything – make sure your code uses the common conventions and style of the tools you are using (e.g., PEP 8 for Python). Bonus points if you add the linters as pre-commit hooks to your repository. This makes your code smoother and easier for the reviewers to read. The formatted code indicates that you are used to sharing your code with others, and the hooks signal that you are productive and lazy by automating stuff.

5. Document everything – the idea behind this tip is not to annoy the reviewers by letting them guess what you meant. That is, document what each module, function, and parameter does. For example, in Python, use type annotations and docstrings.
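
Here is a small sketch (a made-up function) of what that can look like:

def moving_average(values: list[float], window: int = 3) -> list[float]:
    """Return the simple moving average of `values`.

    Args:
        values: the input series.
        window: the number of samples per average (must be positive).
    """
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]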

5 interesting things (22/09/21)

Writing a Great CV for Your First Technical Role – a three-part series about best practices, mistakes, and pitfalls in CVs, showing both good and bad examples. I find the posts relevant not just for first roles but also as a good reminder when updating your CV.

https://naomikriger.medium.com/writing-a-great-cv-for-your-first-technical-role-part-1-75ffc372e54e

Patterns in confusing explanations – writing and technical writing are superpowers. Being able to communicate your ideas in a clear way that others can engage with is a very impactful skill. In this post, Julia Evans describes 13 patterns of bad explanations and accompanies them with positive examples.

https://jvns.ca/blog/confusing-explanations/

How We Design Our APIs at Slack – not only do I agree with this advice, I have had some bad experiences with similar issues, both as an API supplier and as a consumer. Many times when big companies describe their architecture and processes, they are irrelevant to small companies due to cost, lack of data or resources, or other reasons, but the great thing about this post is that it also fits small companies and is relatively easy to implement.

https://slack.engineering/how-we-design-our-apis-at-slack/

Python Anti-Pattern – this post describes a bug that is at the intersection of Python and AWS Lambda functions. One can say that it is an extreme case, but I tend to think it is more common than one would expect, and you may spend hours debugging it. It is well written and very important to know if you are using Lambda functions.

https://valinsky.me/articles/python-anti-pattern/

Architectural Decision Records – sharing knowledge is hard. Sometimes what is clear to you is not clear to others, sometimes it is not taken into account in the time estimation or takes longer than expected, and other times you just want to move on and deliver. Having templates and conventions makes it easier both for the writers and the readers. ADRs answer a specific need.

https://adr.github.io/

5 interesting things (19/08/2021)

The 7 Branches of Software Gardening – “A small refactoring a day keeps the tech debt away” (paraphrasing^2). Great examples of small activities and improvements every developer can make on a daily basis that pile up to a big impact.

https://martinthoma.medium.com/the-6-branches-of-software-gardening-a90b3c0d6220

What is the right level of specialization? For data teams and anyone else – I like Erik Bernhardsson’s posts; I have linked to them more than once in the past. Bernhardsson highlights the tension between being very professional and specific (“I only ETL process on Wednesdays 15:03-16:42 on Windows machines”) versus being less proficient in more concepts and technologies. And this leads us to the next item.

https://erikbern.com/2021/07/23/what-is-the-right-level-of-specialization.html

7 Key Roles and Responsibilities in Enterprise MLOps – in this post Domino Data Lab introduces its view of the different roles and responsibilities in MLOps / data teams. For sure it is more suitable for enterprises; small and even medium companies cannot afford, and sometimes don’t need, all those roles, and as suggested in Erik Bernhardsson’s post, very specific specialization makes it harder to move people between teams according to the organization’s needs. Having said that, a title is a signal (inside and outside the organization) of what a person likes to do and which capabilities (not necessarily specific tools) they are likely to have.

https://blog.dominodatalab.com/7-roles-in-mlops/

The Limitations of Chaos Engineering – I decided to read a bit about chaos engineering, as I had never experimented with it before, and came across this post, which is almost 4 years old. While it is important to validate the resilience of our architecture and its de facto implementation, the common practice of fault injection also has its limitations, which are good to know.

https://sharpend.io/the-limitations-of-chaos-engineering/

It’s Time to Retire the CSV – if you have ever worked with CSV you probably saw this title and yelled “Hell yes!“. If you want to gain a historical view and a few more arguments, have a look here –

https://www.bitsondisk.com/writing/2021/retire-the-csv/

pandas read_csv and missing values

I read the Domino Data Lab post “Data Exploration with Pandas Profiler and D-Tale”, where they load data on diagnostic mammograms used in the diagnosis of breast cancer from the UCI website. Instead of missing values, the data contains ?. When reading the data with pandas’ read_csv function, the ? is naively interpreted as a string value, and the column type becomes object instead of float in this case.

In the post mentioned above the authors dealt with the issue in the following way –

import numpy as np
import pandas as pd
masses = masses.replace('?', np.NAN)
masses.loc[:, names[:-1]] = masses.loc[:, names[:-1]].apply(pd.to_numeric)

That is, they first replaced the ? values in all the columns with np.NAN and then converted all the columns to numeric. Let’s call this method the manual method.

If we know the non-default missing values in advance, can we do something better? The answer is yes!

See code here
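
For reference, these snippets assume url and names roughly as follows (my guess at the setup, matching the UCI mammographic masses dataset used in the post):

# assumed setup – the UCI mammographic masses dataset and its column names
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.data"
names = ["BI-RADS", "age", "shape", "margin", "density", "severity"]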

Use na_values parameter

df = pd.read_csv(url, names=names, na_values=["?"])

The na_values parameter accepts scalar, string, list-like, or dict values. If you pass a scalar, string, or list-like value, all columns are treated the same way. If you pass a dict, you can specify a different set of NaN values per column.
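
For example, a dict can mark different values as missing in different columns (a hypothetical variation on the dataset above):

df = pd.read_csv(url, names=names, na_values={"BI-RADS": ["?"], "age": ["?", "-1"]})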

The advantage of this method over the manual method is that you don’t need to convert the columns after replacing the NaN values. In the manual method the column types are specified explicitly (in the given case they are all numeric); if there are multiple column types you need to know them and specify them in advance.

Side note – likewise, for non-trivial boolean values you can use the true_values and false_values parameters.
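
For instance (a hypothetical survey.csv with yes/no columns):

df = pd.read_csv("survey.csv", true_values=["yes", "Y"], false_values=["no", "N"])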

Use converters parameter

df = pd.read_csv(url, names=names, converters={"BI-RADS": lambda x: x if x != "?" else np.NAN})

This is usually used to convert values of specific columns. If you would like to convert values in all the columns in the same way, this is not the preferred method, since you will have to add an entry for each column, and if a new column is added it won’t be handled by default (this can be both an advantage and a disadvantage). However, for other use-cases, converters can help with more complex conversions.
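
If you did want to convert all the columns this way, it would look something like this sketch (reusing names from above):

# apply the same converter to every column
clean = lambda x: np.NAN if x == "?" else x
df = pd.read_csv(url, names=names, converters={name: clean for name in names})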

Note that the result of the single-column converter above is different from the result of the other methods, since we only converted the values in one column.

Conclusion

Pandas provides several ways to deal with non-trivial missing values. If you know the non-trivial values in advance, na_values is most likely the best way to go.

Performance-wise (time), all methods perform roughly the same for the given dataset, but that can change as a function of the dataset size (columns and rows), column types, and the number of non-trivial missing values.

On top of it, make reading documentation your superpower. It can help you use your tools in a smarter and more efficient way, and it can save you a lot of time.

See pandas read_csv documentation here

Things I learned today (04/08/2021)

AWS Lambda functions can now mount an Amazon Elastic File System (Amazon EFS)

AWS announcement

What is AWS Lambda?

AWS Lambda is a FaaS (function as a service) offering. It is an event-driven, serverless computing platform which integrates with many other AWS services. For example, you can trigger a Lambda function from API Gateway, an S3 event notification, etc.

The AWS Lambda runtimes include Python, Node.js, Ruby, Java, Go, and C#.

It is very useful and cost-effective when you have infrequent and relatively short executions, so you don’t need to provision any infrastructure. Lambda has its limitations, mainly its running time – max 15 minutes. Storage was also a limitation up to this announcement, but this is a breakthrough.

What is Amazon EFS?

Amazon Elastic File System (EFS) is cost-optimized file storage (no setup costs, just pay as you use) that can automatically scale from gigabytes to petabytes of data without needing to provision storage. It also allows multiple instances to connect to it simultaneously.

EFS is accessible from EC2 instances, ECS containers, EKS, AWS Fargate, and AWS Lambda.

Compared to EBS, EFS is usually more expensive. However, the use case is different. EFS is an NFS file system (which means that it is not supported on Windows instances), while EBS is block storage and is usually not multi-attached (there are some EC2 + EBS configurations which allow multi-attach, but that’s not the main use case).

Why does it matter?

By default, Lambda only has /tmp storage of up to 512MB; mounting EFS enables working with larger files. This means that you can import large machine learning models or packages. It also means that you can use an up-to-date version of files, since they are easy to share.
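
A sketch of what that can look like in a handler (hypothetical paths and names; assumes the function is configured to mount an EFS access point at /mnt/models and the model exposes a scikit-learn-style predict):

import pickle

MODEL_PATH = "/mnt/models/classifier.pkl"  # hypothetical EFS mount path
_model = None  # cached across warm invocations

def handler(event, context):
    global _model
    if _model is None:
        # the model can be far larger than the 512MB /tmp limit
        with open(MODEL_PATH, "rb") as f:
            _model = pickle.load(f)
    return {"prediction": _model.predict([event["features"]]).tolist()}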

Additionally, you can share information or state across invocations, since EFS is a shared drive. I would not say it is optimal, and generally I would rather decouple it, but it is possible and it is faster than S3.

In some cases it can also enable moving data-intensive workloads (in AWS or on-premise) to AWS Lambda and save costs.

See more here

Things I learned today (26/07/2021)

ElastiCache for Redis is HIPAA compliant while ElastiCache for Memcached is not

What is ElastiCache?

ElastiCache is “Fully managed in-memory data store, compatible with Redis or Memcached. Power real-time applications with sub-millisecond latency” (here).

The most common use cases for ElastiCache are session stores, a general cache to increase throughput and decrease the load on other services or databases, deployment of machine learning models, and real-time analytics.
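
As a quick illustration of the cache-aside use case (hypothetical endpoint and db_lookup, using the redis Python client):

import json
import redis

# hypothetical ElastiCache for Redis endpoint
r = redis.Redis(host="my-cache.abc123.use1.cache.amazonaws.com", port=6379)

def get_user(user_id, db_lookup):
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit
    user = db_lookup(user_id)  # cache miss – fall back to the database
    r.setex(key, 300, json.dumps(user))  # cache the result for 5 minutes
    return user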

AWS offers two flavours of ElastiCache – ElastiCache for Redis and ElastiCache for Memcached. To understand the differences better, and for a recommendation on how to choose an engine, see here.

What is HIPAA?

“The Healthcare Insurance Portability and Accountability Act (HIPAA) is an act of legislation passed in 1996 which originally had the objective of enabling workers to carry forward healthcare insurance and healthcare rights between jobs.”

https://www.hipaajournal.com/hipaa-explained/

Over the years, and specifically after 2013, the HIPAA rules were updated to fit technological developments and to expand the requirements to include business associates, where previously only covered entities were held to uphold the HIPAA restrictions.

Why does it matter?

Better safe than sorry – if you develop a product that needs to be HIPAA compliant, it is better to choose the right, compliant services in advance rather than replace them later.

To read more –