Spark, EMR, Lambda, AWS

I know the title of this post looks like a collection of buzz words but there is code behind it.

AWS Lambda is a service which allows you to trigger an action (in this example, adding an EMR step) in response to all kinds of events. Such events can be schedule expressions (once an hour, once a day, etc.), changes in S3 files, changes in a DynamoDB table, etc.

The goal of the code is to add an EMR step to an existing EMR cluster. The step can actually be anything – a MapReduce job, a Spark job, a JAR step, etc. This example shows how to add a Spark job, but it is easy to adjust it to your needs.

The parameters for the Spark job can depend on the specific event – for example, an S3 put event will include data about the specific file.
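
For instance, in the case of an S3 put event, the bucket and key can be read directly from the event object passed to the handler. A minimal sketch (the field layout follows the standard S3 event notification format):

def lambda_handler(event, context):
    # an S3 put event carries a list of records, one per affected object
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    # bucket and key can now be forwarded as parameters to the Spark job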

In this example there is only a placeholder for the script parameters and the Spark configuration parameters.

Note – in order to run this Spark job, your code should already be on the Spark master machine. Possibly this could be done in the script as well, but here we assume it is already in CODE_DIR.

import sys
import time

import boto3


def lambda_handler(event, context):
    conn = boto3.client("emr")
    # chooses the first cluster which is Running or Waiting
    # possibly can also choose by name or already have the cluster id
    clusters = conn.list_clusters()
    # choose the correct cluster
    clusters = [c["Id"] for c in clusters["Clusters"]
                if c["Status"]["State"] in ["RUNNING", "WAITING"]]
    if not clusters:
        sys.stderr.write("No valid clusters\n")
        sys.exit(1)
    # take the first relevant cluster
    cluster_id = clusters[0]
    # code location on your emr master node
    CODE_DIR = "/home/hadoop/code/"
    # spark configuration example
    step_args = ["/usr/bin/spark-submit", "--conf", "your-configuration",
                 CODE_DIR + "your_file.py", "--your-parameters", "parameters"]
    step = {"Name": "what_you_do-" + time.strftime("%Y%m%d-%H:%M"),
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "s3n://elasticmapreduce/libs/script-runner/script-runner.jar",
                "Args": step_args
            }}
    action = conn.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return "Added step: %s" % action


5 interesting things (25/02/2016)

Caffe On Spark – it is pretty clear that one of the next steps for Spark MLlib is to add deep learning. There are several deep learning frameworks out there, and I guess a few (if not all of them) will try to integrate with Spark in one way or another, and in the end “the fittest will survive”, i.e. will be officially supported by Spark. What will that be? My guess is an Apache project, maybe SINGA, maybe a popular project that will become an Apache project. However, other extensions will continue to exist for specific use-cases and for users who prefer specific frameworks (e.g. reusing already existing code).

https://github.com/yahoo/CaffeOnSpark

Kibana plugins basics – I discovered this post while looking for a way to add a reasonable KPI visualization for Elasticsearch. This is actually a series of 4 posts going over how to write and deploy (i.e. register) a custom Kibana plugin. Unfortunately, we use the AWS Elasticsearch service, which does not (currently) support custom plugins, so we have put it on hold for now.

https://www.timroes.de/2015/12/02/writing-kibana-4-plugins-basics/

Bonus track – creating traffic light visualization – http://logz.io/blog/kibana-visualizations/

ThoughtWorks Radar – every few months ThoughtWorks publishes its technology ratings. There are 4 ratings – adopt, trial, assess and hold. The products / tools / methodologies are divided into 4 groups – techniques, tools, platforms, and languages and frameworks. It is especially interesting to see how rankings change over time. I found database products to be a little underrepresented. I guess there is also a bit of bias towards tools and companies they work with, but nonetheless it is an interesting read.

https://www.thoughtworks.com/radar

Dot files – dot files are a common way to set all kinds of configuration, from .gitignore to .bashrc and everything in between. You probably have a configuration that you like and copy from project to project. You still might find this repository helpful, as it collects common configuration settings –

https://github.com/holman/dotfiles

Bonus track – git ignore by programming language

https://github.com/github/gitignore

SQL tabs – simple, but gets the job done. Yet another visualization tool on top of PostgreSQL. Super simple installation and start-up (on a Mac OS environment), even for non-technical people.

http://www.sqltabs.com/

Why not AWS Elasticsearch service

Almost 5 months ago AWS exposed a new service – “Elasticsearch Service”. So we can now have a managed Elasticsearch cluster with Kibana on it, without the hassle of managing it ourselves. But then we need to work with it and discover the bleeding edges –

  1. Old version – the current Elasticsearch version is 2.2.0 and 1.7.* is maintained. The Elasticsearch version on AWS is 1.5.2 – an 8-month-old version, which is ages in this industry. Kibana's current version is 4.4.0, while the Kibana version offered on AWS is 4.0.3. Many enhancements, bug fixes and breaking changes, and AWS is behind. This is especially interesting compared to the recent velocity of EMR images. As there are some breaking changes between the 1.* and 2.* versions, I would at least have expected the option to choose a version. Even more annoying, there is currently no data about when and how AWS is going to upgrade the versions. A quick way to check a domain's version yourself is sketched below.
  2. Plugins not supported – adding plugins is relevant for expanding Elasticsearch usability, for monitoring, and for adding custom Kibana visualizations (e.g. for KPIs). I believe every company which uses Elasticsearch will have this need at some point. AWS currently supports only a closed set of plugins, does not enable installing your own plugins, and there is no data about future plans.
  3. Scripting not supported – scripting is relevant for adding data, processing data, interactions between fields, etc. Again, a need that I believe arises in every company after a while. This feature is also not enabled in the AWS Elasticsearch Service. There are ways to bypass some of it, but why work hard if Elastic has already created this functionality? Here too, it is not clear whether there are any future plans to enable it.
  4. No VPC support – this means that the Elasticsearch endpoint is publicly accessible. There are a few workarounds, but it is a security issue, and moreover it does not comply with the logic of other AWS products.
  5. Backup issues – no easy way to export data or to control how frequently and when your data is backed up. The AWS Cloud plugin for Elasticsearch could solve this, but it is not supported.
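
As a small illustration of the version issue (and, incidentally, of the public endpoint from point 4), the root endpoint of any Elasticsearch cluster reports its version, so checking what AWS actually runs is a one-liner. A minimal sketch – the domain endpoint below is hypothetical, and it assumes the domain's access policy allows unsigned requests:

import requests

# any Elasticsearch root endpoint returns cluster metadata, including version
endpoint = "https://search-your-domain.us-east-1.es.amazonaws.com"
info = requests.get(endpoint).json()
print(info["version"]["number"])  # at the time of writing: "1.5.2"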

Conclusion – Elasticsearch is a very strong tool, and even stronger when combined with Kibana. On the other hand, AWS usually produces very good products. But this time, the combination of the two is not there yet. It seems to me like they might have jumped into the water too early (possibly in order to compete with Found \ Elastic Cloud) with a product which is not mature enough.

Most of the services AWS offers are based on their own innovation and \ or development and are not coupled so strongly with an external product (the other such product I can think of is ElastiCache, which I have not worked with yet). From a business point of view, it is weird for me to see such a dependency on an external product. If there is high demand for Elasticsearch on AWS, I would prefer a tutorial explaining how to set up a cluster rather than a doubtful product.

5 interesting things (13/02/2016)

Ads @ Stack Overflow – Steve from Stack Overflow's ad management team explains their view on ads on the site. This is a very user-centric approach, along with strong branding and self-differentiation. This approach is not scalable for sites with many topics and a diverse audience, but it fits their niche.

https://blog.stackoverflow.com/2016/02/why-stack-overflow-doesnt-care-about-ad-blockers/

Serverless Python web applications – taking scalability and cost effectiveness to the next level. Zappa enables Python web applications where the server is created by an AWS Lambda function after an HTTP request comes through AWS API Gateway. The Python web framework under it is currently Django, but it should be customizable to also support Flask, Pylons, etc.

https://gun.io/blog/announcing-zappa-serverless-python-aws-lambda/
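
The underlying pattern can be sketched without any framework: API Gateway turns the HTTP request into an event, and the Lambda handler returns something API Gateway can turn back into an HTTP response. A minimal sketch – the event and response fields follow API Gateway's Lambda proxy convention, which is an assumption here, not Zappa's actual implementation:

import json

def lambda_handler(event, context):
    # API Gateway passes the HTTP method and path inside the event
    path = event.get("path", "/")
    method = event.get("httpMethod", "GET")

    # here a real framework (Django in Zappa's case) would dispatch a view
    body = {"message": "hello from Lambda", "path": path, "method": method}

    # API Gateway expects a status code, headers and a string body back
    return {"statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(body)}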

Slack @ Home – the interesting thing for me is the transfer between business tools and free-time tools \ technologies. Can we use the knowledge, tools and methods we use at work at home, and the other way around?

The first feature for me would be shopping – easily collecting the list of things I need to get at the supermarket, drugstore, etc. and updating it back to Slack.

Scheduling is also interesting, but there is a Google Calendar integration, so I think it is pretty easy. And maybe an integration with WhatsApp, possibly for image backup and sharing.

Of course, it does not necessarily have to be Slack; any equivalent, preferably open source, tool is good as well.

http://labs.earthpeople.se/2016/02/my-family-uses-slack/

Towards an Open Banking API Standard – one can hardly overestimate this move. I believe that an open banking API will lead to much technology-driven progress in personal banking. For example – aggregating data from several accounts, comparing fees, etc. It will be interesting to see how the banks adopt this standard, as banks are heavily led by regulation and are slow-moving organizations by nature. This specific effort is directed at the UK banking system, but it will be interesting to follow whether additional financial organizations in and outside of the UK adopt it as well.

http://jackgavigan.com/2016/02/10/towards-an-open-banking-api-standard/

3.57 Degrees of separation – as a matter of fact, I find the idea of six degrees of separation romantic and not very interesting. In the age of social networks and big data, we can actually check it, or at least estimate it. This post by Facebook Research tries to do exactly that. The interesting parts for me were the mathematical background and the engineering effort of scaling it. Please note that the data is mostly available for first-world countries, so although it is an estimation, it probably needs another distance correction factor.

https://research.facebook.com/blog/three-and-a-half-degrees-of-separation/

Cassandra Day Berlin – 8 things I learnt about Apache Cassandra

Since I didn’t know anything about Apache Cassandra (besides it being a distributed database), I could probably replace the 8 in the title with any number. Here is a summary of a few things I learned at Cassandra Day today.

  1. What is Apache Cassandra – an open source distributed database donated to the Apache Foundation by Facebook (actually first to Google Code and then to the ASF). It is decentralized – meaning all nodes are born equal and there are no masters or slaves. It is a schema-full database – see here for more about the myth of schema-less databases. It supports replication, i.e. data redundancy, with a default replication factor of 3 and multiple data centers (both physical and virtual), and you can control the consistency level (see next). Therefore it is AP in the CAP theorem. It uses CQL – the Cassandra Query Language.
  2. Controlling consistency – having the data replicated across several nodes, one can read and write data in 3 ways; the higher the consistency level, the longer the latency. A short Python sketch appears below.
    • One – reading or writing to one node is enough.
    • Quorum – i.e. the value was written \ retrieved from at least half of the relevant nodes. Latest wins – when retrieving data from several nodes, the value with the latest timestamp counts.
    • All – need to write \ retrieve the data from all the relevant nodes. Note that this mode is dangerous since it cancels the high availability: if one of the replicas is not available, we will get no answer.
  3. Data Modelling is important – isn’t that true for every database? Well, yes. But the issue here is again the trade-off. On one hand, I feel that data modelling is sometimes neglected in NoSQL databases, since we can just throw data in and scale it. On the other hand, due to some limitations (see CQL next), data modelling for Cassandra is quite the opposite of best practices in RDBMS: know your queries in advance and build the schema (e.g. keyspace) accordingly.
  4. CQL – Cassandra Query Language. In the first talk they said “Yes, it is exactly like SQL”. Well, not exactly. First – no joins, which dramatically influences the data modelling. Limited aggregation functionality. The update command always works, even if the record does not exist (this can be controlled). See here for more about CQL vs. SQL.
  5. CQL containers – there are 3 types of containers
    • Set – a container of items sorted by the natural order of the item type.
    • List – a container of items ordered by insertion.
    • Map – a key-value container sorted by the natural order of the key type. This is kind of a hack which allows you to be a bit schema-less. Up to 64K items in a map.
  6. Spark connector – connecting to one of the most trending technologies.
  7. Solr integration – Cassandra is not a document database. However, sometimes users who choose Cassandra as their main solution have indexing \ search needs. DataStax has a patch which also allows you to search the data in RAM (data which was not yet written to the Lucene indices).
  8. How to go on from here? How can I learn more –
There is much more for me to learn about Apache Cassandra, and more things I learned during the day, but this is a short review.
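
To make the consistency levels above concrete, here is a minimal sketch using the DataStax Python driver – the contact point, keyspace, table and query are hypothetical:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

# ONE    - fastest: a single replica answering is enough
# QUORUM - a majority of replicas must answer; latest timestamp wins
# ALL    - every replica must answer, so one node down means no answer
query = SimpleStatement("SELECT * FROM users WHERE user_id = %s",
                        consistency_level=ConsistencyLevel.QUORUM)
rows = session.execute(query, ("some-user-id",))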

Swiss Python Summit 2016

I participated in the Swiss Python Summit today, where I gave a talk about “Python’s guide to the Galaxy”. Code and slides are available in my GitHub. All the talks were recorded, and they will probably be online within a few hours \ days.


Besides the best signs I can think of, the program was great and very diverse in its topics – music, 3D graphics, astrophysics and so on. Even the order of the program was very good – super cool talks after lunch to keep everyone awake. The speakers were also relatively diverse – 4 out of 9 speakers were women, from different backgrounds, etc. Unfortunately, there were only a few women in the crowd.

I was very lucky to give the first talk so I could then concentrate on the other talks. One very strong point of the conference was the diversity of the talks.

The second talk was “API Design is Hard” by Dave Halter, creator and maintainer of Jedi. It was more a conceptual talk than a technical one, enriched by lessons he learned creating and maintaining Jedi. I believe this is an important talk regardless of the specific programming language.

The next session opened with Armin Rigo talking about CFFI – the C Foreign Function Interface – and the PyPy interpreter. Interesting datum – CFFI is currently ranked 23rd in the PyPI Ranking, above Django and the AWS CLI, probably thanks to the fact that it is integrated into the cryptography package and others. PyPy is a Python interpreter written in Python (rather than in C, as in the most common implementation) and is designed for speed optimization.

Martin Christen had 3 parts to his talk. First he talked about 3D modeling with Blender, then he introduced some lower-level possibilities for 3D modeling, and he ended by introducing some of the many open source projects he works on. Those projects mostly involve open 3D maps and globe maps.

After lunch, Matthieu Amiguet, co-founder of Les Chemins de Travers, showed how you can manipulate sound using the relevant packages. This was a really outstanding and non-standard talk. One of the packages he currently uses is pyo.

The next talk was given by Chihway Chang from ETH Zürich, who spoke about “Coding / Decoding the Cosmos – Python applications in Astrophysics”. As she said, and I fully agree – “Python is everywhere in science today”. She talked about the usage of Python in her department, both in astrophysics-dedicated packages (Astropysics) and in more general-purpose packages (SciPy, NumPy, pandas), as well as packages they write themselves. She presented 2 problems they work on – mapping dark matter and calibrating a radio telescope with a drone. I strongly suggest seeing at least the second part of this talk for the original scientific thinking.


Michael Rüegg demonstrated a Scrapy pipeline which starts with crawling running race results and ends up in Elasticsearch. This was probably the most technical talk of the day, and the concrete project was of additional interest to me.
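
Such a pipeline boils down to very little code: Scrapy hands every scraped item to a pipeline class, which indexes it into Elasticsearch. A minimal sketch – the index and document type names are hypothetical, not the talk's actual code:

from elasticsearch import Elasticsearch

class ElasticsearchPipeline(object):
    """A Scrapy item pipeline that indexes each scraped race result."""

    def __init__(self):
        self.es = Elasticsearch(["http://localhost:9200"])

    def process_item(self, item, spider):
        # every item yielded by the spider becomes one document
        self.es.index(index="races", doc_type="result", body=dict(item))
        return item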

Jacinda Shelly showed us how magic is done with live coding – specifically, magic functions in IPython and other cool features.

Last but not least, Florian Bruhin spoke about pytest – why and how we should use it, how we can expand its basic functionality and so on. In addition, he also spoke about other projects he is involved in: qutebrowser – a Python browser – and tox – a generic virtualenv management and test command line tool.

This was the first edition of the conference, and for me it was super interesting; I really loved the mixture of talks. I hope to be there again next year.

5 interesting things (24/01/2016)

Top 30 books ranked by total number of links to Amazon in Hacker News comments – I must say this analysis surprises me. First, because of the low numbers – the most common book was mentioned only 53 times. Second, I expected “The Lean Startup” to be one of the top 30 books. I guess one of the caveats, as described in the post itself, is the links themselves, which may differ a bit or change over the years and are therefore hard to match.

http://ramiro.org/vis/hn-most-linked-books/

LDA2VEC – getting the best of both worlds: LDA + word2vec. Stitch Fix definitely brands itself as one of the leading companies technology- and research-wise, doing some very interesting things.

http://www.slideshare.net/ChristopherMoody3/word2vec-lda-and-introducing-a-new-hybrid-algorithm-lda2vec-57135994

Automatic colorizing – coloring grayscale images using deep learning. There are many works now trying out TensorFlow in particular and deep learning in general. I really liked this case, maybe also because it uses a model which was already trained for other purposes.
Gesture Typing – first of all, I liked the question-and-answer format of the post. The post looks for an optimal keyboard. But what is an optimal keyboard? It depends who you ask. On the other hand, I think QWERTY (or QWERTZ in Germany) is such a strong standard that I wonder what will make it change.
Roomba + voice recognition – problems of the very rich part of the world. Last night we spoke with friends who complained that their Roomba sometimes gets stuck in all kinds of corners, cables and so on in their house. We talked about several ideas to solve it; one was voice commands, and this morning I found a guy who has already tried it –

5 interesting things (10/01/2016)

Everything.me open sources its inheritance – Everything.me was an Israeli startup which closed a few weeks ago. They have now open sourced major parts of their code and tools, including their prediction algorithm, Re:dash and others.

https://medium.com/@joeysim/everythingme-open-and-out-6ed94b436e4c

Python 3 module of the week – over the years, Python Module of the Week by Doug Hellmann has been one of the most reliable documentation resources for Python’s standard library. The documentation is always accompanied by very good examples of almost any functionality. It is also available as a command line tool and was translated into Chinese, German, Italian, Spanish and Japanese. And now it is updated for Python 3.5. Kudos.

https://pymotw.com/3/

The Star Wars social network – although I’m not a Star Wars fan, I think this provides a very accessible introduction to graph algorithms and measurements.

http://evelinag.com/blog/2015/12-15-star-wars-social-network/
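
As a taste of the measurements such a network enables, here is a minimal sketch using networkx – the characters and edges are a made-up toy example, not the post's actual data:

import networkx as nx

# a toy interaction graph: nodes are characters, edges mean shared scenes
G = nx.Graph()
G.add_edges_from([("Luke", "Leia"), ("Luke", "Han"), ("Leia", "Han"),
                  ("Luke", "Obi-Wan"), ("Obi-Wan", "Vader"), ("Leia", "Vader")])

print(nx.degree_centrality(G))             # who is most connected?
print(nx.average_shortest_path_length(G))  # "degrees of separation" flavor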

D3 in Jupyter – an intersection of 2 tools which I use quite a lot and find very important for data scientists. Not those tools specifically, but tools which make the data science magic more approachable to others, so one can share one's findings and get feedback.

http://multithreaded.stitchfix.com/blog/2015/12/15/d3-jupyter/

Bonus – Building interactive dashboards with Jupyter – http://blog.dominodatalab.com/interactive-dashboards-in-jupyter/

Tl;dr man page – given a bash command, creates a tl;dr of the man page. More of a gimmick, but a nice one.

http://www.ostera.io/tldr.jsx/

Surviving Black Friday & Turning Behavioural Signals Into User Profiles

I was on a visit to Israel and went to the “Surviving Black Friday & Turning Behavioural Signals Into User Profiles” meetup. It is a very long name, which actually implies 2 talks. The meetup took place in the Sears Israel offices, the department behind Shop Your Way.

The first talk – “Surviving Black Friday” – was given by Omri Fima, who is a resilience tech lead at Sears. Omri talked about resilience and scalability lessons learned from Black Friday. Every internet shopping site knows that traffic is much higher on Black Friday, so how can you prepare and test whether your system can deal with such a load? What happens if one service fails? What is a graceful failure and what is less graceful? He presented 7 steps to make your service more stable and mentioned a few tools for both testing and development.
Pablo Rosenman, VP Development @ Adience, gave the second talk – “Turning Behavioural Signals Into User Profiles” (slides, video). Pablo presented 2 of Adience's products – the Adience SDK and the Events SDK – and showed how they use AWS services in their pipeline. He talked about the Adience pipeline and their main concerns and focus when designing and implementing it – scalability, decoupling and cost effectiveness. In the end he also presented what they would do differently if they were designing it today.
All in all, interesting talks and a very good atmosphere. Looking forward to Omri’s slides also being available online.

5 interesting things (13/12/2015)

SystemML – a distributed and declarative machine learning platform. Looks like a promising project, initially developed by IBM, which has now joined the Apache Software Foundation. I wonder how it will influence the development of Spark MLlib.

http://systemml.apache.org/

Monopoly as Markov Chain – I guess I am developing a fetish for Markov chains. Although this model is not always realistic (as in this case), it is amazing what we can get out of it and the approximate simulations we can create. This post simulates a Monopoly game using Markov chains and does a very interesting job. Monopoly is linear in the sense that you must move according to the dice and the choices you make are limited (buy or don’t buy), in contrast to backgammon, where strategy is also involved; this makes Monopoly a better fit for modeling with Markov chains.

http://koaning.io/monopoly-simulations.html
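
A stripped-down version of the idea – ignoring jail, Chance cards and everything except the dice – fits in a few lines. A minimal sketch, not the post's actual model:

import numpy as np

N = 40  # squares on a Monopoly board

# probability of each two-dice sum (2..12)
dice = {s: sum(1 for a in range(1, 7) for b in range(1, 7) if a + b == s) / 36.0
        for s in range(2, 13)}

# transition matrix: from square i, move 2-12 squares ahead (wrapping around)
P = np.zeros((N, N))
for i in range(N):
    for step, prob in dice.items():
        P[i, (i + step) % N] += prob

# iterate towards the stationary distribution: long-run time spent per square
dist = np.ones(N) / N
for _ in range(1000):
    dist = dist @ P
# uniform for dice alone; jail and cards are what skew the real board
print(dist.round(4))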

Vocabulary – “Python Module to get Meanings, Synonyms and what not for a given word”. This module brands itself as an alternative to NLTK, presenting data about meanings, synonyms, antonyms, parts of speech, pronunciation, etc. with a leaner and more Pythonic approach. I don’t know if it is as good as NLTK or will evolve to that point, but it sure looks like an alternative worth checking.

http://vocabulary.readthedocs.org/en/latest/

Probability recap – if you have forgotten your probability class from university, this will probably be a good recap. However, if you work as a data scientist, you probably use these daily. Still, it has code, visualizations and good examples, and it is a good way to share your knowledge with colleagues.

http://blog.cloudera.com/blog/2015/12/common-probability-distributions-the-data-scientists-crib-sheet/

What we talk about when we talk about distributed systems – for once, a non-misleading title. Thinking about the distributed systems course I took in grad school, this would have been a great introduction.

http://videlalvaro.github.io/2015/12/learning-about-distributed-systems.html