Spark, EMR, Lambda, AWS

I know the title of this post looks like a collection of buzzwords, but there is code behind it.

AWS Lambda is a service that lets you run an action (in this example, adding an EMR step) in response to all kinds of events. Such events can be schedule events defined by cron expressions (once an hour, once a day, etc.), a change in S3 files, a change in a DynamoDB table, and so on.

The goal of the code is to add an EMR step to an existing EMR cluster. The step can actually be anything: a MapReduce job, a Spark job, a JAR step, etc. This example shows how to add a Spark job, but it is easy to adjust it to your needs.

The parameters for the Spark job can depend on the specific event – for example, an S3 put event will include data about the specific file (see the sketch after the code below).

In this example there are only placeholders for the script parameters and the Spark configuration parameters.

Note – in order to run this Spark job, your code should already be on the Spark master machine. Possibly this could be done in the script as well, but here we assume it is already in CODE_DIR.

import sys
import time

import boto3


def lambda_handler(event, context):
    conn = boto3.client("emr")
    # choose the first cluster which is Running or Waiting
    # possibly can also choose by name or already have the cluster id
    clusters = conn.list_clusters()
    clusters = [c["Id"] for c in clusters["Clusters"]
                if c["Status"]["State"] in ["RUNNING", "WAITING"]]
    if not clusters:
        sys.stderr.write("No valid clusters\n")
        sys.exit(1)  # no cluster to add the step to
    # take the first relevant cluster
    cluster_id = clusters[0]
    # code location on your EMR master node
    CODE_DIR = "/home/hadoop/code/"
    # spark-submit example; the configuration and parameters are placeholders
    step_args = ["/usr/bin/spark-submit", "--conf", "your-configuration",
                 CODE_DIR + "your_file.py", "--your-parameters", "parameters"]
    step = {"Name": "what_you_do-" + time.strftime("%Y%m%d-%H:%M"),
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "s3n://elasticmapreduce/libs/script-runner/script-runner.jar",
                "Args": step_args
            }}
    action = conn.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return "Added step: %s" % action


5 interesting things (25/02/2016)

Caffe On Spark – it is pretty clear that one of the next steps for Spark MLlib is to add deep learning. There are several deep learning frameworks out there, and I guess a few (if not all of them) will try to integrate with Spark in one way or another, and in the end "the fittest will survive", i.e. will be officially supported by Spark. What will it be? My guess is an Apache project, maybe SINGA, maybe a popular project that will become an Apache project. However, other extensions will continue to exist for specific use cases and for users who prefer specific frameworks (e.g. for reusing already existing code).

https://github.com/yahoo/CaffeOnSpark

Kibana plugins basics – I discovered this post while looking for a way to add a reasonable KPI visualization for Elasticsearch. This is actually a series of 4 posts going over how to write and deploy (i.e. register) a custom Kibana plugin. Unfortunately we use the AWS Elasticsearch service, which does not (currently) support custom plugins, and therefore we suspended this for now.

https://www.timroes.de/2015/12/02/writing-kibana-4-plugins-basics/

Bonus track – creating traffic light visualization – http://logz.io/blog/kibana-visualizations/

ThoughtWorks radar – every few months ThoughtWorks publishes its technology ratings. There are 4 ratings – adopt, trial, assess and hold. The products / tools / methodologies are divided into 4 groups – techniques, tools, platforms, and languages and frameworks. It is especially interesting to see how rankings change over time. I found database products to be a little underrepresented. I guess there is also a bit of bias towards tools and companies they work with, but nonetheless it is an interesting read.

https://www.thoughtworks.com/radar

Dotfiles – dot files are a common way to set all kinds of configurations, from .gitignore to .bashrc and everything in between. You probably have a configuration that you like and copy from project to project. You still might find this repository helpful, as it collects common configuration settings –

https://github.com/holman/dotfiles

Bonus track – git ignore by programming language

https://github.com/github/gitignore

SQL tabs – simple, but gets the job done. Yet another visualization tool on top of PostgreSQL. Super simple installation and startup (on macOS), even for non-technical people.

http://www.sqltabs.com/

Why not AWS Elasticsearch service

Almost 5 months ago AWS launched a new service – "Elasticsearch service". So we can now have a managed Elasticsearch cluster, with Kibana on top of it, without the hassle of managing it ourselves. But then we need to work with it and discover the bleeding edges –

  1. Old version – the current Elasticsearch version is 2.2.0, and 1.7.* is maintained. The Elasticsearch version on AWS is 1.5.2. It is an 8-month-old version, which is ages in this industry. The current Kibana version is 4.4.0, while the Kibana version offered on AWS is 4.0.3. Many enhancements, bug fixes and breaking changes, and AWS is behind. This is especially interesting compared to the recent velocity of EMR images. As there are some breaking changes between the 1.* and 2.* versions, I would at least have expected the option to choose a version. Even more annoying, there is currently no information about when and how AWS is going to upgrade the versions.
  2. Plugins not supported – adding plugins is relevant for expanding Elasticsearch usability, for monitoring, and for adding customized Kibana visualisations (e.g. for KPIs). I believe every company which uses Elasticsearch will have this need at some point. AWS currently supports only a closed set of plugins, does not enable installing your own plugin, and gives no information about future plans.
  3. Scripting not supported – scripting is relevant for adding data, processing data, interactions between fields, etc. Again, a need that I believe arises in every company after a while. This feature is also not enabled in the AWS Elasticsearch Service. There are ways to bypass some of it, but why work hard if Elastic already created this functionality? Here too, it is not clear if there are any future plans to enable it.
  4. No VPC support – this means that the Elasticsearch endpoint is publicly accessible. There are a few workarounds, but it is a security issue, and moreover it does not comply with the logic of the other AWS products.
  5. Backup issues – there is no easy way to export data or to control how frequently and when your data is backed up. The AWS cloud plugin for Elasticsearch could solve this, but it is not supported (see the sketch after this list).
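For comparison, here is roughly what point 5 looks like on a self-managed cluster. This is a minimal sketch using the snapshot API via Python's requests; the endpoint, repository and bucket names are made up, and it assumes the cloud-aws plugin is installed – which is exactly what the AWS service does not allow:

import requests

ES = "http://localhost:9200"  # a self-managed cluster, not the AWS service

# register an S3 snapshot repository (needs the cloud-aws plugin)
requests.put(ES + "/_snapshot/my_s3_repo", json={
    "type": "s3",
    "settings": {"bucket": "my-backup-bucket", "region": "us-east-1"}})

# take a named snapshot whenever and as frequently as you like
requests.put(ES + "/_snapshot/my_s3_repo/snapshot-20160301")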

Conclusion – Elasticsearch is a very strong tool, and even stronger when combined with Kibana. On the other hand, AWS usually produces very good products. But this time, the combination of the two is not there yet. It seems to me like they might have jumped into the water too early (possibly in order to compete with Found \ Elastic Cloud) with a product which is not mature enough.

Most of the services AWS offers are based on their own innovation and \ or development and are not so strongly coupled with an external product (the other such product I can think of is ElastiCache, which I have not worked with yet). From a business point of view, it is strange for me to see such a dependency on an external product. If there is high demand for Elasticsearch on AWS, I would prefer a tutorial explaining how to set up a cluster rather than a doubtful product.

5 interesting things (13/02/2016)

Ads @ Stack Overflow – Steve from the Stack Overflow ad management team explains their view on ads on the site. This is a very user-centric approach, along with strong branding and differentiating themselves. This approach is not scalable for sites with many topics and a diverse audience, but it fits their niche.

https://blog.stackoverflow.com/2016/02/why-stack-overflow-doesnt-care-about-ad-blockers/

Serverless Python web applications – taking scalability and cost effectiveness to the next level. Zappa runs Python web applications where the server is created using an AWS Lambda function after an HTTP request comes through AWS API Gateway. The Python web framework under it is currently Django, but it should be customizable to also support Flask, Pylons, etc.

https://gun.io/blog/announcing-zappa-serverless-python-aws-lambda/

Slack @ Home – the interesting thing for me is the transfer between business tools and free-time tools \ technologies. Can we use the knowledge, tools and methods we use at work at home, and the other way around?

The first feature for me would be shopping – easily collecting the list of things I need to get at the supermarket, drug store, etc. and updating it back in Slack.

Scheduling is also interesting, but there is a Google Calendar integration, so I think that is pretty easy. And maybe a WhatsApp integration, possibly for image backup and sharing.

Of course, it does not necessarily have to be Slack; any equivalent, preferably open source, tool is good as well.

http://labs.earthpeople.se/2016/02/my-family-uses-slack/

Towards an Open Banking API Standard – one can probably not overstate this move. I believe that an open banking API will lead to much technology-driven progress in personal banking. For example – aggregating data from several accounts, comparing fees, etc. It will be interesting to see how the banks adopt this standard, as banks are heavily regulated and are slow-moving organizations by their nature. This specific effort is directed at the UK banking system, but it will be interesting to follow whether additional financial organizations in and outside of the UK adopt it as well.

http://jackgavigan.com/2016/02/10/towards-an-open-banking-api-standard/

3.57 Degrees of separation – as a matter of fact, I find the idea of six degrees of separation romantic and not very interesting. In the age of social networks and big data we can actually check it, or at least estimate it. This post by Facebook research tries to do exactly that. The interesting part for me was the mathematical background and the engineering effort of scaling it. Please note that the data is mostly available for first world countries, so although it is an estimation, it probably needs another correction factor.

https://research.facebook.com/blog/three-and-a-half-degrees-of-separation/

Cassandra Day Berlin – 8 things I learnt about Apache Cassandra

Since I didn't know anything about Apache Cassandra (besides it being a distributed database), I could probably replace the 8 in the title with any number. Here is a summary of a few things I learned at Cassandra Day today.

  1. What is Apache Cassandra – an open source distributed database donated to the Apache foundation by Facebook (actually first to Google Code and then to the ASF). It is decentralized – meaning all nodes are born equal and there are no masters or slaves. It is a schema-full database; see more about the myth of schema-less databases here. It supports replication, i.e. data redundancy, with a default replication factor of 3, and multiple data centers (both physical and virtual), and you can control the consistency level (see next). Therefore it is AP in the CAP theorem. It uses CQL – the Cassandra Query Language.
  2. Controlling consistency – having the data replicated on several nodes, one can read and write data in 3 ways. The higher the consistency level, the longer the latency (see the sketch after this list).
    • One – reading or writing to one node is enough.
    • Quorum – the value was written \ retrieved from at least half of the relevant nodes. Latest wins – when retrieving the data from several nodes, the value with the latest timestamp counts.
    • All – need to write \ retrieve the data from all the relevant nodes. Note that the All mode is dangerous, since it cancels the high availability: if one of the replicas is not available, we will get no answer.
  3. Data modelling is important – isn't that true for every database? Well, yes. But the issue here is, again, the trade-off. On one hand, I feel that data modelling is sometimes neglected in NoSQL databases, since we can just throw data there and scale it. On the other hand, due to some limitations (see CQL next), data modelling for Cassandra is quite the opposite of best practices in RDBMS: know your queries in advance and build the schema (e.g. keyspace) accordingly.
  4. CQL – Cassandra Query Language. At the first talk they said "Yes, it is exactly like SQL". Well, not exactly. First – no joins, which dramatically influences the data modelling. Aggregation functionality is limited. The update command always works, even if the record does not exist, i.e. it is an upsert (this can be controlled). See here for more about CQL vs SQL.
  5. CQL containers – there are 3 types of containers
    • Set – a container of items sorted by the type's comparison operator.
    • List – a container of items sorted by insertion order.
    • Map – a key-value container sorted by the keys' comparison operator. This is kind of a hack which allows you to be a bit schema-less. Up to 64k items in a map.
  6. Spark connector – connecting to one of the trendiest technologies around.
  7. Solr integration – Cassandra is not a document database. However, sometimes users who choose Cassandra as their main solution have indexing \ search needs. DataStax has a patch which also allows you to search the data in RAM (data which was not yet written to the Lucene indices).
  8. How to go on from here? How can I learn more –
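To make points 2 and 4 above concrete, here is a minimal sketch with the DataStax Python driver; the contact point, keyspace, table and values are hypothetical:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo_keyspace")

# UPDATE in CQL is an upsert - it succeeds even if the row did not exist
session.execute("UPDATE users SET name = %s WHERE user_id = %s",
                ("alice", 42))

# consistency is set per statement; QUORUM waits for a majority of replicas
stmt = SimpleStatement("SELECT name FROM users WHERE user_id = %s",
                       consistency_level=ConsistencyLevel.QUORUM)
for row in session.execute(stmt, (42,)):
    print(row.name)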
There is much more for me to learn about Apache Cassandra, and more things I learned during the day, but this is a short review.

Swiss Python Summit 2016

I participated in the Swiss Python Summit today, where I gave a talk – "Python's guide to the Galaxy". Code and slides are available in my GitHub. All the talks were recorded, and they will probably be online within a few hours \ days.


Besides the best signs I can think of, the program was great and very diverse in its topics – music, 3D graphics, astrophysics and so on. Even the order of the program was very good – super cool talks after lunch to keep everyone awake. The speakers were also relatively diverse – 4 out of 9 speakers were women, with different backgrounds, etc. Unfortunately, there were only a few women in the crowd.

I was very lucky to give the first talk, so I could then concentrate on the other talks.

The second talk was "API Design is Hard" by Dave Halter, creator and maintainer of Jedi. It was more of a conceptual talk than a technical one, enriched by lessons he learned creating and maintaining Jedi. I believe this is an important talk regardless of the specific programming language.

The next session opened with Armin Rigo talking about CFFI – the C Foreign Function Interface – and the PyPy interpreter. An interesting datum – CFFI is currently ranked 23rd in the PyPI ranking, above Django and the AWS CLI, probably thanks to the fact that it is integrated into the cryptography package and others. PyPy is a Python interpreter written in Python (rather than in C, as in the most common implementation) and is designed for speed optimization.

Martin Christen had 3 parts to his talk. First he talked about 3D modeling with Blender, then he introduced some lower-level possibilities for 3D modeling, and he ended by introducing some of the many open source projects he works on. Those projects mostly involve open 3D maps and globe maps.

After lunch, Matthieu Amiguet, co-founder of Les Chemins de Travers, showed how you can manipulate sound using the relevant packages. This was a really outstanding and non-standard talk. One of the packages he currently uses is pyo.

The next talk was given by Chihway Chang from ETH Zürich, who spoke about "Coding / Decoding the Cosmos – Python applications in Astrophysics". As she said, and I fully agree – "Python is everywhere in science today". She talked about the usage of Python in her department, both in astrophysics-dedicated packages (Astropysics) and in more general-purpose packages (SciPy, NumPy, pandas), as well as packages they write themselves. She presented 2 problems they work on – mapping dark matter and calibrating a radio telescope with a drone. I strongly suggest seeing at least the second part of this talk for the original scientific thinking.


Michael Rüegg demonstrated a Scrapy pipeline which starts by crawling running-race results and ends up in Elasticsearch. This was probably the most technical talk of the day, and the concrete project was of additional interest for me.

Jacinda Shelly showed us how magic is done, live coding. Specifically, she spoke about magic functions in IPython and other cool features.

Last but not least, Florian Bruhin spoke about pytest – why and how we should use it, how we can expand its basic functionality, and so on. In addition, he also spoke about other projects he is involved in: qutebrowser – a Python browser, and tox – a generic virtualenv management and test command line tool.

This was the first edition of the conference, and for me it was super interesting; I really loved the mixture of the talks. Hope to be there again next year.