Best programming language ever!!!!!1 – I really agree with this guy's approach and the way he presents it. There is no better tool, only better choices.
Apache Spark @ TripAdvisor – the title of this post is “Using Apache Spark for Massively Parallel NLP”, but this is a bit misleading. Apache Spark is discussed only in the last third of the post, and the post does not dive into the technical details as I expected or explain how Spark integrates with the rest of their architecture. However, the methodology they present for asking users questions, and the inherent user bias he points out, are interesting. I’ll wait for the next post, where he says he will talk more about the algorithm.
Let me guess where are you from? I love this post because it uses an interesting dataset (a list of people who have Wikipedia pages, by country) and he explains the process he followed and the tools he used step by step – getting the data, choosing the model, optimizing it, fitting it to production constraints and deploying it. A real pleasure to read.
Do you need a data scientist? I agree with many points in this post – sometimes data science and data engineering get mixed up. I think that since “data scientist” has become something of a buzzword, organizations might try to attract employees with this title. Regarding the point of having a problem to solve vs. wanting to do something cool with data – yes, you have to have concrete tasks for a person hired as a data scientist, but you also have to leave time for innovation and vision based on data, and existing employees do not always have those capabilities and knowledge.
AWS CLI Jungle – this tool is a kind of wrapper around the AWS CLI that aims to make it more intuitive. The lack of wildcard support in the AWS CLI bothers me as a user, and I am glad they solved it. It currently implements functionality for ec2 and elb; I hope they will also implement s3 functionality soon.
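
To show the kind of wildcard matching I mean, here is a minimal Python sketch. It is not how Jungle is implemented and it does not call AWS at all; the instance names and the `match_names` helper are made up for the example. It only demonstrates shell-style pattern matching over a list of names using the standard `fnmatch` module.

```python
from fnmatch import fnmatchcase


def match_names(names, pattern):
    """Return the names that match a shell-style wildcard pattern.

    Hypothetical helper for illustration; not part of any AWS tool.
    fnmatchcase is used so matching is case-sensitive on every platform.
    """
    return [name for name in names if fnmatchcase(name, pattern)]


# Hypothetical instance names, invented for this example.
instances = ["web-1", "web-2", "db-1", "cache-1"]

# Select every instance whose name starts with "web-".
print(match_names(instances, "web-*"))
```

With something like this, one pattern such as `web-*` picks out a whole group of machines instead of forcing me to list each name separately.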