5 interesting things (22/09/21)

Writing a Great CV for Your First Technical Role – a three-part series about best practices, mistakes, and pitfalls in CV writing, showing both good and bad examples. I find the posts relevant not just for first roles but also as a good reminder when updating your CV.

https://naomikriger.medium.com/writing-a-great-cv-for-your-first-technical-role-part-1-75ffc372e54e

Patterns in confusing explanations – writing and technical writing are superpowers. Being able to communicate your ideas in a clear way that others can engage with is a very impactful skill. In this post, Julia Evans describes 13 patterns of bad explanation and accompanies that with positive examples.

https://jvns.ca/blog/confusing-explanations/

How We Design Our APIs at Slack – not only do I agree with this advice, I have also had bad experiences with similar issues both as an API supplier and as a consumer. Many times when big companies describe their architecture and processes, the descriptions are irrelevant to small companies due to cost, lack of data or resources, or other reasons, but the great thing about this post is that it also fits small companies and is relatively easy to implement.

https://slack.engineering/how-we-design-our-apis-at-slack/

Python Anti-Pattern – this post describes a bug that sits at the intersection of Python and AWS Lambda functions. One can say that it is an extreme case, but I tend to think it is more common than one would expect, and you may spend hours debugging it. It is well written and very important to know if you are using Lambda functions.

https://valinsky.me/articles/python-anti-pattern/

Architectural Decision Records – sharing knowledge is hard. Sometimes what is clear to you is not clear to others, sometimes it is not taken into account in the time estimation or takes longer than expected, and other times you just want to move on and deliver. Having templates and conventions makes it easier for both the writers and the readers. ADRs answer a specific need.

https://adr.github.io/

5 interesting things (19/08/2021)

The 7 Branches of Software Gardening – “A small refactoring a day keeps the tech debt away” (paraphrasing^2). Great examples of small activities and improvements every developer can make on a daily basis that add up to a big impact.

https://martinthoma.medium.com/the-6-branches-of-software-gardening-a90b3c0d6220

What is the right level of specialization? For data teams and anyone else – I like Erik Bernhardsson’s posts and have linked to them more than once in the past. Bernhardsson highlights the tension between being highly specialized (“I only do ETL processes on Wednesdays 15:03-16:42 on Windows machines”) and being less proficient in a broader set of concepts and technologies. And this leads us to the next item.

https://erikbern.com/2021/07/23/what-is-the-right-level-of-specialization.html

7 Key Roles and Responsibilities in Enterprise MLOps – in this post Domino Data Lab introduces its view of the different roles and responsibilities in MLOps / data teams. For sure it is more suitable for enterprises; small and even medium companies cannot afford, and sometimes don’t need, all those roles, and as suggested in Erik Bernhardsson’s post, very specific specialization makes it harder to move people between teams according to the organization’s needs. Having said that, a title is a signal (inside and outside the organization) of what a person likes to do and which capabilities (not necessarily specific tools) s/he is likely to have.

https://blog.dominodatalab.com/7-roles-in-mlops/

The Limitations of Chaos Engineering – I decided to read a bit about chaos engineering as I have never had any experience with it, and came across this post, which is almost 4 years old. While it is important to validate the resilience of our architecture and its de facto implementation, the common practice of fault injection also has its limitations, which are good to know.

https://sharpend.io/the-limitations-of-chaos-engineering/

It’s Time to Retire the CSV – if you have ever worked with CSV you probably see this title and yell “Hell Yes!”. If you want to gain a historical view and a few more arguments, have a look here –

https://www.bitsondisk.com/writing/2021/retire-the-csv/

pandas read_csv and missing values

I read Domino Data Lab’s post “Data Exploration with Pandas Profiler and D-Tale”, where they load a dataset of diagnostic mammograms, used in the diagnosis of breast cancer, from the UCI website. Instead of missing values the data contains ?. When reading the data, the pandas read_csv function naively interprets this value as a string and makes the column type object instead of float in this case.

In the post mentioned above the authors dealt with the issue in the following way –

masses = masses.replace('?', np.nan)
masses.loc[:,names[:-1]] = masses.loc[:,names[:-1]].apply(pd.to_numeric)

That is, they first replaced the ? values in all the columns with NaN and then converted all the columns to numeric. Let’s call this method the manual method.

If we know the non-default missing values in advance, can we do something better? The answer is yes!

See code here

Use na_values parameter

df = pd.read_csv(url, names=names, na_values=["?"])

The na_values parameter accepts a scalar, a string, a list-like or a dict. If you pass a scalar, string or list-like, all columns are treated the same way. If you pass a dict, you can specify a different set of NaN values per column.

The advantage of this method over the manual method is that you don’t need to convert the columns after replacing the NaN values. In the manual method the columns to convert are specified explicitly (in the given case they are all numeric); if there are multiple column types you need to know them and specify them in advance.
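A minimal sketch of the difference between the list-like and dict forms, using a small inline CSV rather than the UCI file. The column names, the values and the "unknown" marker are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data: "?" marks a missing value in both columns,
# and "unknown" marks a missing value in the "shape" column only.
raw = "age,shape\n67,3\n?,1\n58,unknown\n28,?\n"

# Naive read: "?" is kept as text, so neither column ends up numeric.
naive = pd.read_csv(io.StringIO(raw))

# List-like na_values: the same markers apply to every column.
same_for_all = pd.read_csv(io.StringIO(raw), na_values=["?"])

# Dict na_values: a different set of markers per column.
per_column = pd.read_csv(
    io.StringIO(raw),
    na_values={"age": ["?"], "shape": ["?", "unknown"]},
)

print(naive.dtypes)
print(same_for_all.dtypes)
print(per_column.dtypes)
```

With the list form, "shape" still contains the text "unknown" and therefore stays non-numeric; only the dict form turns both columns into floats.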

Side note – likewise, for non-trivial boolean values you can use the true_values and false_values parameters.
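As a small sketch of the boolean case, assuming a made-up CSV in which a column uses "yes" and "no" instead of True and False:

```python
import io

import pandas as pd

# Hypothetical data: the "malignant" column spells booleans as yes/no.
raw = "patient,malignant\na,yes\nb,no\nc,yes\n"

df = pd.read_csv(
    io.StringIO(raw),
    true_values=["yes"],
    false_values=["no"],
)

# The "patient" column is left as text; "malignant" is parsed as booleans.
print(df.dtypes)
```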

Use converters parameter

df = pd.read_csv(url, names=names, converters={"BI-RADS": lambda x: x if x!="?" else np.nan})

This is usually used to convert values of specific columns. If you would like to convert the values in all the columns in the same way, this is not the preferred method, since you have to add an entry for each column, and if a new column is added it won’t be handled by default (this can be both an advantage and a disadvantage). However, for other use cases, converters can help with more complex conversions.

Note that the result here is different than the result of the other methods since we only converted the values in one column.
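To make the "only one column" effect visible, here is a sketch on a small inline CSV with made-up values. Unlike the lambda above, the hypothetical to_float_or_nan helper also parses the remaining values as floats, which is a variation and not what the original snippet does.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample: both columns contain a "?" marker.
raw = "BI-RADS,age\n5,67\n?,43\n4,?\n"


def to_float_or_nan(value: str) -> float:
    """Converters receive the raw cell text; map "?" to NaN, else parse it."""
    return np.nan if value == "?" else float(value)


df = pd.read_csv(io.StringIO(raw), converters={"BI-RADS": to_float_or_nan})

# "BI-RADS" is numeric now, while "age" still holds the "?" text.
print(df.dtypes)
```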

Conclusion

Pandas provides several ways to deal with non-trivial missing values. If you know the non-trivial values in advance, na_values is most likely the best way to go.

Performance-wise (time), all methods perform roughly the same for the given dataset, but that can change as a function of the dataset size (columns and rows), row types, and the number of non-trivial missing values.

On top of that, make reading documentation your superpower. It can help you use your tools in a smarter and more efficient way, and it can save you a lot of time.

See pandas read_csv documentation here

Things I learned today (04/08/2021)

AWS Lambda functions can now mount an Amazon Elastic File System (Amazon EFS)

AWS announcement

What is AWS Lambda?

AWS Lambda is a FaaS (function as a service) offering. It is an event-driven, serverless computing platform which integrates with many other AWS services. For example, you can trigger a Lambda function from API Gateway, an S3 event notification, etc.

AWS Lambda runtimes include Python, Node.js, Ruby, Java, Go and C#.

It is very useful and cost-effective when you have infrequent and relatively short executions, since you don’t need to provision any infrastructure. Lambda has its limitations, mainly its running time – max 15 minutes. Storage was also a limitation up to this announcement, so this is a breakthrough.

What is Amazon EFS?

Amazon Elastic File System (EFS) is cost-optimized file storage (no setup costs, just pay as you use) that can automatically scale from gigabytes to petabytes of data without needing to provision storage. It also allows multiple instances to connect to it simultaneously.

EFS is accessible from EC2 instances, ECS containers, EKS, AWS Fargate and AWS Lambda.

Compared to EBS, EFS is usually more expensive. However, the use case is different. EFS is an NFS file system (which means that it is not supported on Windows instances) and EBS is block storage and is usually not multi-attached (there are some EC2 + EBS configurations which allow multi-attach but that’s not the main use case).

Why does it matter?

By default, Lambda has /tmp storage of up to 512 MB; mounting EFS enables working with larger files. This means that you can import large machine learning models or packages. This also means that you can use an up-to-date version of files since they are easy to share.

Additionally, you can share information or state across invocations since EFS is a shared drive. I would not say it is optimal, and generally I would rather decouple it, but it is possible and it is faster than S3.
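A rough sketch of sharing state through the mounted drive. The EFS_MOUNT_PATH environment variable, the /mnt/shared default and the state.json file name are all hypothetical; the function simply treats the mount as an ordinary directory.

```python
import json
import os
from pathlib import Path

# Assumption: the file system is mounted at the path given by a hypothetical
# EFS_MOUNT_PATH environment variable (Lambda mount paths live under /mnt).
DEFAULT_MOUNT = "/mnt/shared"


def handler(event, context):
    """Count invocations in a file that every invocation can see."""
    mount = Path(os.environ.get("EFS_MOUNT_PATH", DEFAULT_MOUNT))
    state_file = mount / "state.json"

    # Read the state left by a previous invocation, if there is one.
    state = {"invocations": 0}
    if state_file.exists():
        state = json.loads(state_file.read_text())

    state["invocations"] += 1
    # Note: no locking here, so concurrent invocations can overwrite each other.
    state_file.write_text(json.dumps(state))
    return state
```

The lack of locking is one reason this kind of coupling is not ideal: with concurrent invocations, the last writer wins.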

In some cases it can also enable moving data-intensive workloads (in AWS or on-premises) to AWS Lambda and save costs.

See more here

Things I learned today (26/07/2021)

ElastiCache for Redis is HIPAA compliant while ElastiCache for Memcached is not


What is ElastiCache?

ElastiCache is “Fully managed in-memory data store, compatible with Redis or Memcached. Power real-time applications with sub-millisecond latency” (here).

The most common use cases for ElastiCache are session stores, a general cache to increase throughput and decrease the load on other services or databases, deployment of machine learning models, and real-time analytics.

AWS offers two flavours of ElastiCache – ElastiCache for Redis and ElastiCache for Memcached. To better understand the differences, and for recommendations on how to choose an engine, see here.
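The general cache use case is often implemented as a cache-aside lookup. The sketch below uses a hypothetical InMemoryCache stand-in so that it runs without a server; a real client such as redis-py offers get and set(..., ex=...) calls of a similar shape, but connecting it to an ElastiCache endpoint is left out here.

```python
import time
from typing import Callable, Optional


class InMemoryCache:
    """Stand-in for a cache client; only mimics get/set with a TTL."""

    def __init__(self) -> None:
        self._data: dict[str, tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None or entry[1] < time.monotonic():
            return None  # missing or expired
        return entry[0]

    def set(self, key: str, value: str, ex: int) -> None:
        self._data[key] = (value, time.monotonic() + ex)


def get_with_cache(cache, key: str, load: Callable[[str], str], ttl: int = 60) -> str:
    """Cache-aside: try the cache first, fall back to the slow source."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load(key)
    cache.set(key, value, ex=ttl)
    return value


calls = []


def slow_lookup(key: str) -> str:
    calls.append(key)  # record every trip to the slow source
    return f"profile-for-{key}"


cache = InMemoryCache()
first = get_with_cache(cache, "user-1", slow_lookup)
second = get_with_cache(cache, "user-1", slow_lookup)
print(first == second, len(calls))
```

The second lookup is served from the cache, so the slow source is only hit once.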


What is HIPAA?

“The Health Insurance Portability and Accountability Act (HIPAA) is an act of legislation passed in 1996 which originally had the objective of enabling workers to carry forward healthcare insurance and healthcare rights between jobs.”

https://www.hipaajournal.com/hipaa-explained/


Over the years, and specifically after 2013, the HIPAA rules were updated to keep up with technological developments and to expand the requirements to include business associates, where previously only covered entities were required to uphold the HIPAA restrictions.


Why does it matter?

Better safe than sorry – if you develop a product that needs to be HIPAA compliant, it is better to choose the right, compliant services in advance rather than replacing them later.

To read more – 

Things I learned today (23/07/2021)

S3 event notifications support standard SNS topics and standard SQS queues as destinations but don’t support SNS FIFO topics or SQS FIFO queues.

S3 event notifications enable you to be notified whenever a specific event happens in your bucket. To receive the notifications you must define the events you are interested in and the destination. The notifications are usually delivered within seconds but can sometimes take longer.

The events are –

  • New object creation
  • Object removal (versioned and non-versioned objects)
  • Object restore (e.g. from Glacier)
  • Object lost in Reduced Redundancy Storage
  • Object Replication

The possible destinations include –

  • SQS – as mentioned above, standard queues only and not FIFO queues
  • SNS – as mentioned above, standard topics only and not FIFO topics
  • Lambda

If you write back to S3 when processing the events, be careful not to create an execution loop.
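One common way to avoid such a loop is to write results under a separate prefix and skip notifications for it. The sketch below assumes a Lambda destination and a hypothetical processed/ output prefix; the bucket name and keys in the sample event are made up, and the event is trimmed to the fields the function reads.

```python
from urllib.parse import unquote_plus

# Hypothetical prefix that this function writes its results under.
OUTPUT_PREFIX = "processed/"


def keys_to_process(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 notification, skipping our own output."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in the notification payload.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(OUTPUT_PREFIX):
            # Written by this function: processing it again would loop forever.
            continue
        pairs.append((bucket, key))
    return pairs


sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-bucket"}, "object": {"key": "incoming/report+1.csv"}}},
        {"s3": {"bucket": {"name": "my-bucket"}, "object": {"key": "processed/report+1.csv"}}},
    ]
}
print(keys_to_process(sample_event))
```

Restricting the notification itself to an input prefix is another option, and the two can be combined.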

See more here – https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html

Bonus for the weekend – bumped into this and I cannot deny (of service) that it can happen to me too

Things I learned today (21/07/2021)

You can use Amazon SNS FIFO (first in, first out) topics and Amazon Simple Queue Service (Amazon SQS) FIFO queues together to provide strict message ordering and message deduplication

AWS documentation


While SQS FIFO queues were introduced in 2016, SNS FIFO capabilities were introduced only in October 2020.

This capability is important for cases in which the order matters, e.g. bank transactions where you commit a transaction only if the balance remains non-negative.

Messages are grouped and ordered according to the message group ID. When sending a message you must specify a message group ID, otherwise the action fails. If all the messages have the same message group ID then all the messages are sent and received in strict order. The message group ID can be any value, e.g. 12, “hello”, “user_id-123”, etc.
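A sketch of what publishing to a FIFO topic involves. The fifo_publish_kwargs helper, the ARN and the message are hypothetical; the helper only builds the arguments that a boto3 SNS client's publish call would receive, so it runs without AWS access. Deriving the deduplication ID from the body is an assumption that applies when content-based deduplication is not enabled on the topic.

```python
import hashlib


def fifo_publish_kwargs(topic_arn: str, message: str, group_id: str) -> dict:
    """Build keyword arguments for publishing one message to a FIFO topic."""
    if not topic_arn.endswith(".fifo"):
        raise ValueError("FIFO topic names end with .fifo")
    if not group_id:
        raise ValueError("a message group ID is required for FIFO topics")
    return {
        "TopicArn": topic_arn,
        "Message": message,
        "MessageGroupId": group_id,
        # Assumption: content-based deduplication is off, so we derive an ID
        # from the body; identical bodies are then treated as duplicates.
        "MessageDeduplicationId": hashlib.sha256(message.encode()).hexdigest(),
    }


# Hypothetical ARN; with boto3 this would be sns_client.publish(**kwargs).
kwargs = fifo_publish_kwargs(
    "arn:aws:sns:eu-west-1:123456789012:payments.fifo",
    '{"account": "user_id-123", "amount": -20}',
    group_id="user_id-123",
)
print(kwargs["MessageGroupId"])
```

Using the account identifier as the group ID keeps the transactions of one account in order while leaving different accounts independent of each other.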


Note that, as with SQS queues, the topic name must end with .fifo, and the suffix counts toward the name-length limit (80 characters for SQS queue names).

For further reading –

Things I learned today (20/07/2021)

Delay queues let you postpone the delivery of new messages to a queue for a number of seconds

AWS documentation

This means that all the messages which are pushed to this queue become visible to the consumer only after the delay period. The minimum delay, which is also the default, is 0 seconds and the maximum is 15 minutes.

Note that when changing the delay of a queue the behaviour of FIFO queues and standard queues is different – 


For standard queues, the per-queue delay setting is not retroactive—changing the setting doesn’t affect the delay of messages already in the queue.

For FIFO queues, the per-queue delay setting is retroactive—changing the setting affects the delay of messages already in the queue.

AWS documentation

If you need to delay the visibility of specific messages and not all messages in the queue, you can use message timers and add an initial invisibility period for a message. This is only supported by standard queues. Note that setting a message timer for individual messages overrides the delay period of the delay queue.
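The rules above can be summarised in a few lines of plain Python. The effective_delay function is a hypothetical model for illustration, not an AWS API; in boto3 the corresponding settings are the DelaySeconds queue attribute and the DelaySeconds argument of send_message.

```python
from typing import Optional

MAX_DELAY_SECONDS = 15 * 60  # the 15-minute maximum, in seconds


def effective_delay(
    queue_delay: int,
    message_timer: Optional[int] = None,
    fifo: bool = False,
) -> int:
    """Seconds a new message stays invisible after being sent."""
    if message_timer is not None and fifo:
        raise ValueError("per-message timers are only supported on standard queues")
    # A message timer, when present, overrides the queue-level delay.
    delay = queue_delay if message_timer is None else message_timer
    if not 0 <= delay <= MAX_DELAY_SECONDS:
        raise ValueError("delay must be between 0 and 900 seconds")
    return delay


print(effective_delay(60))                   # queue-level delay applies
print(effective_delay(60, message_timer=5))  # the timer wins
```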

See the image below to understand message timeline in a queue –

See more here –
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-delay-queues.html
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-message-timers.html

Things I learned today (19/07/2021)

[AWS] independently maps Availability Zones to names for each account

AWS documentation


This means that eu-west-1a in my account is not necessarily the same as eu-west-1a in your account.

Why does this matter? For example, if you want to share subnets across accounts, or if you want to ensure that services in different accounts are not in the same Availability Zone.

So how can you achieve this? Use Availability Zone IDs, which are unique and consistent identifiers for Availability Zones.
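As an illustration, the EC2 DescribeAvailabilityZones response (for example from a boto3 EC2 client's describe_availability_zones call) carries both a ZoneName and a ZoneId for each zone. The sketch below works on a made-up sample response so that it runs without AWS access; the specific name-to-ID pairs are invented.

```python
def zone_ids_by_name(response: dict) -> dict[str, str]:
    """Map account-specific zone names to account-independent zone IDs."""
    return {
        zone["ZoneName"]: zone["ZoneId"]
        for zone in response["AvailabilityZones"]
    }


# Made-up sample shaped like an EC2 DescribeAvailabilityZones response;
# the actual name-to-ID mapping differs from account to account.
sample_response = {
    "AvailabilityZones": [
        {"ZoneName": "eu-west-1a", "ZoneId": "euw1-az2"},
        {"ZoneName": "eu-west-1b", "ZoneId": "euw1-az3"},
    ]
}
print(zone_ids_by_name(sample_response))
```

Comparing the IDs, rather than the names, is what tells you whether two accounts are actually using the same physical zone.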

See more here – https://docs.aws.amazon.com/ram/latest/userguide/working-with-az-ids.html