I got a free ticket to the Distributed Matters conference by winning the challenge of the Python Big Data Analytics meetup group. The lineup was impressive (including Kyle Kingsbury of the “Call Me Maybe” project), the topics were very interesting, and the location was very nice (the first time I heard a talk in a disco party hall) and close to my house: a perfect match. Due to earlier commitments I could only stay for half of the day, but what a day 🙂
The day started with a keynote by Kyle Kingsbury (Stripe) about “Call Me Maybe” and the Jepsen tests. The basic setup of a test is to create a cluster of some database, inject random events (read, write, update, query, etc.) and verify the database’s behavior. The next level includes injecting all kinds of failures – network errors, node pauses, etc. Following the description of the tests, Kingsbury gave a brief summary of the results. The results of “Call Me Maybe” put pressure on the database vendors to improve their products and live up to the promises they make. It definitely serves as the watchdog of this industry. It is sometimes amazing how poorly tested some products are.
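To make the test setup concrete, here is a toy sketch in Python. It is not Jepsen itself, which drives real clusters; the `ToyReplicatedRegister` class, the `partitioned` flag and the stale-read checker are all hypothetical stand-ins, assuming a two-node register that keeps acknowledging writes while replication is silently cut off. The checker is also far simpler than a real consistency checker, since it assumes operations happen one at a time.

```python
import random


class ToyReplicatedRegister:
    """Hypothetical two-node register; a stand-in, not a real database."""

    def __init__(self):
        self.nodes = [None, None]
        self.partitioned = False

    def write(self, node, value):
        self.nodes[node] = value
        if not self.partitioned:
            # Replication only happens while the network is healthy.
            self.nodes[1 - node] = value
        return "ok"  # acknowledged even when replication was skipped

    def read(self, node):
        return self.nodes[node]


def run_test(seed, steps=200, faults=True):
    """Inject random operations (and optionally faults); record a history."""
    rng = random.Random(seed)
    db = ToyReplicatedRegister()
    events = ["write", "read"] + (["partition", "heal"] if faults else [])
    history = []
    for i in range(steps):
        event = rng.choice(events)
        node = rng.randrange(2)
        if event == "write":
            db.write(node, i)
            history.append(("write", node, i))
        elif event == "read":
            history.append(("read", node, db.read(node)))
        elif event == "partition":
            db.partitioned = True
        else:
            # Healing restores the network but does not reconcile the nodes.
            db.partitioned = False
    return history


def find_stale_reads(history):
    """Return reads that did not see the latest acknowledged write."""
    latest = None
    stale = []
    for op, node, value in history:
        if op == "write":
            latest = value
        elif value != latest:
            stale.append((node, value, latest))
    return stale
```

Without faults the checker finds nothing to complain about; once partitions are injected, any stale read it reports is a broken promise of the kind the talk was about.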
The sentence I took with me from his talk is that choosing a database is a contract, and you should check all the relevant aspects of it. Settings, complexity and needs sometimes change over time, but as long as the initial contract is honored, no complaints.
The second talk I went to was Michael Hackstein’s (ArangoDB) talk “NoSQL meets microservices”. Hackstein presented the problem of making multiple calls to multiple databases, each answering a different need, and presented the solution – a multi-model NoSQL database, where a single call can answer all needs. ArangoDB is an example of such a database; OrientDB is a major competitor. The multi-model database problem might be phrased as just multi-model data, i.e. document data, graph data, key-value data, etc. (see footnote 1).
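As a rough illustration of what “multi-model data” means, the sketch below holds one tiny data set in three shapes using plain Python structures. The records, keys and the `followed_names` helper are invented for this example and say nothing about ArangoDB’s actual storage or API.

```python
# Document model: self-contained, nested records.
documents = {
    "users/alice": {"name": "Alice", "languages": ["python", "js"]},
    "users/bob": {"name": "Bob", "languages": ["java"]},
}

# Graph model: edges between document keys.
edges = [("users/alice", "follows", "users/bob")]

# Key-value model: opaque values looked up by key only.
sessions = {"session:42": "users/alice"}


def followed_names(session_key):
    """Answer one question that touches all three models."""
    user = sessions[session_key]  # key-value lookup
    targets = [dst for src, rel, dst in edges
               if src == user and rel == "follows"]  # graph traversal
    return [documents[t]["name"] for t in targets]  # document read
```

With separate databases each of the three steps would be its own call to its own system; the multi-model pitch is that one query against one database covers all of them.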
The rest of the talk focused on ArangoDB’s different features –
- Foxx – a REST API over the database which works as a microservice.
- Drivers – including Python, PHP, Java, JS, Ruby, etc.
- AQL – the ArangoDB query language
The next talk was given by Charity Majors (Parse – acquired by Facebook), who introduced herself with “hate software for a living”. This talk was a really, really great one. Majors is super experienced, has worked with many technologies and is very charismatic. She spoke about her many years of experience upgrading databases. Sounds easy? Well.. not that much. Taking a step back, why would you want to upgrade a database? New features, improved performance, bug fixes, etc. So what can go wrong? Well, many things: downtime, losing data, losing transactions, different hashing and indexing, optimizations which are not optimized for your data, changes in clients, and much more fun. Good news – you have enough interesting and important data to care for.
What should you do about it? tl;dr – be paranoid. Longer version –
- Change only one component at a time
- Read the docs, change logs and release notes (also mentioned by Kyle Kingsbury)
- Don’t believe database vendors’ benchmarks; test for your expected load (if you can predict it, another painful point) – with at least 24h of data, and make sure to clear the cache
- Take consistent snapshots
- Prepare a rollback option
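The “test for your expected load” advice can be sketched as replaying a captured workload against both versions and comparing the answers. This is only a minimal sketch under assumptions: `run_old` and `run_new` are hypothetical callables wrapping whatever client you use, and `queries` stands in for a real capture of at least 24h of traffic. It is my illustration, not something shown in the talk.

```python
import time


def replay(queries, run_query):
    """Run every captured query; return (results, total seconds)."""
    start = time.perf_counter()
    results = [run_query(q) for q in queries]
    return results, time.perf_counter() - start


def compare_versions(queries, run_old, run_new):
    """Return the queries whose results differ, plus both timings."""
    old_results, old_secs = replay(queries, run_old)
    new_results, new_secs = replay(queries, run_new)
    mismatches = [q for q, a, b in zip(queries, old_results, new_results)
                  if a != b]
    return mismatches, old_secs, new_secs
```

Any entry in `mismatches` is a reason to stop and read the release notes again before touching production; the timings are only meaningful if both runs start from a cleared cache.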
I really hope her slides will be published soon since it was a very informative talk.
Tour de Force: Stream Based Textanalysis by Hendrik Saly, Stefan Siprell (codecentric).
This talk was classified as expert level and I expected a bit more than what it actually delivered. Many buzzwords were thrown around – R, machine learning, Twitter, Apache Spark, Elasticsearch, Kibana, AWS, etc. – but the big picture was missing: how those technologies relate to each other and what the alternatives are. For example, for the goal of enriching the data, R can easily be replaced with Python. Or, in order to read from Twitter, why use files and not Spark’s streaming capabilities (or some queue)? Besides Kibana being a great and easy-to-use tool, the reason for preferring Elasticsearch over other options was not clear. To sum up – I also missed some reference to the distributed properties of those tools. This is the reason we were all there after all, no?
Footnotes
1. Multi-model data is just a way of thinking about data structures in patterns we are used to. Maybe there are other, more efficient ways to think of this data?
2. Sponsor-wise – on one hand there were a few sponsors that didn’t give any talk, for example idealo. On the other hand there were a few companies I expected to sponsor the conference, e.g. crate.io, Elastic, and MongoDB, which was mentioned a lot but unfortunately didn’t give any dedicated talk.
3. Database vendors’ business models. I think that the most popular model as of today is to develop a database, possibly open source (Elasticsearch, ArangoDB, etc.), and make money from supporting it. Is there any other way to make money from database vendoring these days (hosting is not included under vendoring)?
4. Diversity in tech – I saw no more than 5 women in the crowd.