5 interesting things (19/08/2021)

The 7 Branches of Software Gardening – “A small refactoring a day keeps the tech debt away” (paraphrasing^2). Great examples of small activities and improvements every developer can make on a daily basis that pile up to a big impact.

https://martinthoma.medium.com/the-6-branches-of-software-gardening-a90b3c0d6220

What is the right level of specialization? For data teams and anyone else – I like Erik Bernhardsson’s posts; I have linked to them more than once in the past. Bernhardsson highlights the tension between being very professional and specific (“I only run ETL processes on Wednesdays 15:03-16:42 on Windows machines”) and being less proficient in more concepts and technologies. This leads us to the next item.

https://erikbern.com/2021/07/23/what-is-the-right-level-of-specialization.html

7 Key Roles and Responsibilities in Enterprise MLOps – in this post Domino Data Lab introduces its view of the different roles and responsibilities in MLOps / data teams. It is certainly more suitable for enterprises; small and even medium companies cannot afford, and sometimes don’t need, all those roles, and as suggested in Erik Bernhardsson’s post, very narrow specialization makes it harder to move people between teams according to the organization’s needs. Having said that, a title is a signal (inside and outside the organization) of what a person likes to do and which capabilities (not necessarily specific tools) they are likely to have.

https://blog.dominodatalab.com/7-roles-in-mlops/

The Limitations of Chaos Engineering – I decided to read a bit about chaos engineering, as I had never experimented with it before, and came across this post, which is almost 4 years old. While it is important to validate the resilience of our architecture and its de facto implementation, the common practice of fault injection also has limitations that are good to know.

https://sharpend.io/the-limitations-of-chaos-engineering/

It’s Time to Retire the CSV – if you have ever worked with CSVs, you probably saw this title and yelled “Hell Yes!“. If you want to gain a historical view and a few more arguments, have a look here –

https://www.bitsondisk.com/writing/2021/retire-the-csv/


pandas read_csv and missing values

I read a Domino Data Lab post about “Data Exploration with Pandas Profiler and D-Tale”, where they load data on mammograms used in the diagnosis of breast cancer from the UCI website. Instead of missing values, the data contains ?. When reading the data with pandas’ read_csv function, it naively interprets the value as a string and changes the column type to object instead of, in this case, float.

In the post mentioned above the authors dealt with the issue in the following way –

import numpy as np
import pandas as pd

# masses is the loaded DataFrame; names is the list of column names
masses = masses.replace('?', np.nan)
masses.loc[:, names[:-1]] = masses.loc[:, names[:-1]].apply(pd.to_numeric)

That is, they first replaced the ? values in all the columns with np.nan and then converted all the columns to numeric. Let’s call this the manual method.

If we know the non-default missing values in advance, can we do something better? The answer is yes!

See code here

Use na_values parameter

df = pd.read_csv(url, names=names, na_values=["?"])

The na_values parameter accepts a scalar, string, list-like, or dict. If you pass a scalar, string, or list-like, all columns are treated the same way. If you pass a dict, you can specify a different set of NaN values per column.
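For illustration, here is a minimal sketch on a small inline CSV (column names borrowed from the dataset above) showing both the list form and the dict form of na_values –

```python
import io
import pandas as pd

# Small inline CSV standing in for the dataset, with ? as the missing marker.
raw = "BI-RADS,Age,Severity\n5,67,1\n?,43,0\n4,?,1\n"

# List form: '?' becomes NaN in every column, so all affected
# columns are parsed as numeric (float64 where NaN appears).
df = pd.read_csv(io.StringIO(raw), na_values=["?"])

# Dict form: only BI-RADS treats '?' as NaN here, so the Age column
# keeps the literal '?' string and stays an object column.
df_per_col = pd.read_csv(io.StringIO(raw), na_values={"BI-RADS": ["?"]})
```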

The advantage of this method over the manual method is that you don’t need to convert the columns after replacing the NaN values. In the manual method you have to know the column types in advance and convert accordingly (in the given case they are all numeric).

Side note – likewise, for non-trivial boolean values you can use the true_values and false_values parameters.
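A quick sketch of how that looks, using a hypothetical yes/no column – when every value in the column matches true_values or false_values, pandas parses it directly as bool –

```python
import io
import pandas as pd

raw = "id,approved\n1,yes\n2,no\n3,yes\n"

# Every value in the approved column is covered by the two lists,
# so the column is parsed as bool instead of object.
df = pd.read_csv(io.StringIO(raw), true_values=["yes"], false_values=["no"])
```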

Use converters parameter

df = pd.read_csv(url, names=names, converters={"BI-RADS": lambda x: x if x != "?" else np.nan})

This is usually used to convert the values of specific columns. If you want to convert the values in all columns the same way, this is not the preferred method: you would have to add an entry for each column, and a newly added column won’t be handled by default (which can be either an advantage or a disadvantage). However, for other use cases, converters can help with more complex conversions.

Note that the result here differs from the result of the other methods, since we only converted the values in one column.
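A minimal sketch of the same idea on a small inline CSV – note that a converter in this style returns the original string when the value is not ?, so the converted column still holds strings (you would still need pd.to_numeric to make it numeric), while the other columns are untouched –

```python
import io
import numpy as np
import pandas as pd

raw = "BI-RADS,Age\n5,67\n?,43\n"

# Only BI-RADS goes through the converter; the converter receives each
# cell as a string and returns either NaN or the original string.
df = pd.read_csv(
    io.StringIO(raw),
    converters={"BI-RADS": lambda x: np.nan if x == "?" else x},
)
```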

Conclusion

Pandas provides several ways to deal with non-trivial missing values. If you know the non-trivial values in advance, na_values is most likely the best option.

Performance-wise (time), all methods perform roughly the same on the given dataset, but that can change as a function of the dataset size (number of columns and rows), the column types, and the number of non-trivial missing values.

On top of that, make reading documentation your superpower. It can help you use your tools in smarter and more efficient ways, and it can save you a lot of time.

See pandas read_csv documentation here

Things I learned today (04/08/2021)

AWS Lambda functions can now mount an Amazon Elastic File System (Amazon EFS)

AWS announcement

What is AWS Lambda?

AWS Lambda is a FaaS (function as a service) offering. It is an event-driven, serverless computing platform that integrates with many other AWS services. For example, you can trigger a Lambda function from API Gateway, an S3 event notification, etc.

The AWS Lambda runtimes include Python, Node.js, Ruby, Java, Go, and C#.

It is very useful and cost-effective when you have infrequent and relatively short executions, since you don’t need to provision any infrastructure. Lambda has its limitations, mainly its running time – 15 minutes max. Storage was also a limitation up to this announcement, but this is a breakthrough.

What is Amazon EFS?

Amazon Elastic File System (EFS) is a cost-optimized file storage service (no setup costs, just pay as you use) that can automatically scale from gigabytes to petabytes of data without the need to provision storage. It also allows multiple instances to connect to it simultaneously.

EFS is accessible from EC2 instances, ECS containers, EKS, AWS Fargate, and now AWS Lambda.

Compared to EBS, EFS is usually more expensive. However, the use case is different. EFS is an NFS file system (which means it is not supported on Windows instances), while EBS is block storage and is usually not multi-attached (there are some EC2 + EBS configurations that allow multi-attach, but that’s not the main use case).

Why does it matter?

By default, Lambda has /tmp storage of up to 512 MB; mounting EFS enables working with larger files. This means that you can import large machine learning models or packages. It also means that you can use an up-to-date version of files, since they are easy to share.

Additionally, you can share information or state across invocations, since EFS is a shared drive. I would not say it is optimal – generally I would rather decouple it – but it is possible, and it is faster than S3.

In some cases this can also enable moving data-intensive workloads (in AWS or on-premises) to AWS Lambda and saving costs.

See more here