6 interesting things (20/2/2015)
Making an exception, but I really came across some interesting things –
Topy – a Python script that fixes typos in text, using rule sets developed by the RegExTypoFix project from Wikipedia. The basic rule set is English, but rule sets for other languages are also available. Having tried it, I'm positive about it, but it is not baked \ mature enough yet, and I would like it to be easier to use from code rather than only as a command-line tool.
5 interesting things (03/02/2015)
5 interesting things (04/01/2015)
Json as python modules – making life simpler: making it possible to do “import x” for x.json.
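The idea can be sketched with a small importlib meta path finder that resolves `import x` to an `x.json` file on `sys.path`. This is a minimal illustrative sketch, not the linked project's actual implementation; the `JsonFinder` name is mine.

```python
import json
import os
import sys
import importlib.abc
import importlib.machinery


class JsonFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Meta path finder/loader that makes x.json importable as module x."""

    def find_spec(self, fullname, path=None, target=None):
        search = path if path is not None else sys.path
        for entry in search:
            candidate = os.path.join(entry or ".", fullname.rsplit(".", 1)[-1] + ".json")
            if os.path.isfile(candidate):
                return importlib.machinery.ModuleSpec(fullname, self, origin=candidate)
        return None  # let other finders try

    def create_module(self, spec):
        return None  # use the default module object

    def exec_module(self, module):
        with open(module.__spec__.origin) as f:
            data = json.load(f)
        if isinstance(data, dict):
            # expose top-level keys as module attributes
            module.__dict__.update(data)
        module.data = data


# appended last, so regular .py modules still win
sys.meta_path.append(JsonFinder())
```

After running this, a file `cfg.json` containing `{"port": 8080}` on the path can be used as `import cfg; cfg.port`.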
Data Science Ontology – just having fun with d3.js.
Nips Experiment – there was a relatively large amount of chatter regarding the experiment done by the NIPS committee. Nonetheless, it leaves a very uncomfortable feeling about committees.
5 interesting things (14/12/2014)
Databases compression – summarizes the high-level picture and the things to know about each database very well. A good starting point when evaluating several solutions.
Common Crawl meetup
- Monthly builds (they currently prefer bigger monthly builds over more frequent ones)
- Latest build was 220TB with crawl data of 2.98 billion web pages.
- The data includes 3 types of files –
- WARC files of the raw crawl data
- WAT files which include the metadata for the data stored in the WARC (about 1/3 of the raw crawl data)
- WET files which hold the plaintext from the data stored in the WARC (about 15% of the raw crawl data).
- The delay between publication of a page and its appearance in the public crawl is approximately a month to a month and a half.
- No incremental dumps are planned at the moment.
- The data is currently skewed toward dot-com domains. They plan to improve this; changes will hopefully be visible in the January dump.
- They crawl using Apache Nutch, an open-source web crawler, and cooperate with Blekko in order to avoid spam.
What can you do with Common Crawl
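Since a WARC file is just a sequence of records, each starting with a version line and a small header block that ends at a blank line, the header layout can be sketched with the stdlib. This is an illustrative parser for a single record, not a full WARC reader; the `parse_warc_headers` name is mine.

```python
def parse_warc_headers(record_bytes):
    """Parse one WARC record: version line, header lines, blank line, body."""
    head, _, body = record_bytes.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]                      # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        key, _, value = line.partition(":")
        headers[key.strip()] = value.strip()
    return version, headers, body


# a tiny hand-made record for illustration
record = (b"WARC/1.0\r\n"
          b"WARC-Type: response\r\n"
          b"WARC-Target-URI: http://example.com/\r\n"
          b"Content-Length: 2\r\n"
          b"\r\n"
          b"ok")
```

Real tooling (e.g. a streaming WARC library) also handles gzip-per-record compression and record trailers, which this sketch ignores.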
5 interesting things (06/11/2014)
Toolz – we all know the term design patterns. Toolz provides implementation patterns: everyday utilities every developer needs – iterators, dictionaries, etc. Two links – a blog post with common use cases and the Toolz documentation.
http://matthewrocklin.com/blog/work/2014/07/04/Streaming-Analytics/
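To give a feel for the kind of utilities Toolz bundles, here are illustrative pure-Python reimplementations of two of them, `groupby` and `frequencies` (toolz's own versions are more general; this is just a sketch of the pattern):

```python
def groupby(key, seq):
    """Group items of seq into a dict of lists, keyed by key(item)."""
    groups = {}
    for item in seq:
        groups.setdefault(key(item), []).append(item)
    return groups


def frequencies(seq):
    """Count how many times each item appears in seq."""
    counts = {}
    for item in seq:
        counts[item] = counts.get(item, 0) + 1
    return counts
```

For example, `groupby(len, ["a", "ab", "b"])` groups the strings by length, and `frequencies("abracadabra")` counts letters.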
Running my first EMR – lessons learned
Today I was trying to run my first EMR job; here are a few lessons I learned during the day. I had previously run Hadoop streaming MapReduce jobs, so I was familiar with the MapReduce state of mind. However, I was not familiar with the EMR environment.
I used boto – Amazon's official Python interface.
Operating system: Debian 5.0 (Lenny)
Applications: Hadoop 0.20 and 0.18 (default); Hive 0.5, 0.7 (default), 0.7.1; Pig 0.3 (on Hadoop 0.18), 0.6 (on Hadoop 0.20)
Languages: Perl 5.10.0, PHP 5.2.6, Python 2.5.2, R 2.7.1, Ruby 1.8.7
File system: ext3 for root and ephemeral
Kernel: Red Hat
- Python 2.5.2 does not include json – json is new in version 2.6.
- collections is new in Python 2.4, but not all of its members were added in that version –
| namedtuple() | factory function for creating tuple subclasses with named fields | New in version 2.6. |
| deque | list-like container with fast appends and pops on either end | New in version 2.4. |
| Counter | dict subclass for counting hashable objects | New in version 2.7. |
| OrderedDict | dict subclass that remembers the order entries were added | New in version 2.7. |
| defaultdict | dict subclass that calls a factory function to supply missing values | New in version 2.5. |
- dict comprehensions are also a relatively late addition (Python 2.7)
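On an interpreter as old as EMR's Python 2.5, the portable substitute for a dict comprehension is passing a generator expression to `dict()`; on a modern interpreter both forms produce the same result:

```python
# dict comprehension literal – requires Python 2.7+
modern = {n: n * n for n in range(5)}

# equivalent generator-expression form – works on Python 2.4+
legacy = dict((n, n * n) for n in range(5))
```

Both build `{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}`.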
output="s3n://<my-bucket>/output/"+str(int(time.time()))
4. Why did my process fail – one option that produces a relatively understandable explanation is – conn.describe_jobflow(jobid).laststatechangereason
5. cache_files – enables you to import files you need for the MapReduce process. It is super important to specify a fragment, i.e. the local file name –
cache_files=['s3n://<file-location>/<file-name>#<local-file-name>']
5 interesting things (04/09/2014)
C3.js – I have previously written a post about the importance of visualization in a data scientist's skill set. C3.js is a JavaScript chart library based on d3.js which seems, at least at a glance, to be simple and intuitive. I would like to see a Python client for it, but that's for the future.
nvd3 does something similar – charts based on d3.js – and it also has a Python client, which I have worked with a bit. Comparing the two, c3.js seems a little more mature than nvd3, ignoring the lack of a Python client, but I'm sure that gap will be filled soon.
Harvard Visualization course – I went through some of the slides and it was fascinating, but what is even more exciting is the great collection of links about visualization examples, theory and tools. Great work.
http://www.cs171.org/#!index.md
textract – I needed high flexibility of input types in a project I'm working on, and of course I wanted to handle it as transparently as possible, without locking myself into all the relevant packages or adjusting my code to the API of each one. Fortunately somebody already did it – textract. The package is not perfect and there are some “glitches”, mostly concerning the parsers themselves (line splitting, non-ASCII, etc.) and not the unified API textract provides. However, it is a very good start.
http://textract.readthedocs.org/en/latest/
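The unified-API idea behind textract – one entry point that dispatches to a format-specific parser by file extension – can be sketched in a few lines. This is an illustrative stand-in, not textract's actual code; real textract registers parsers for pdf, docx, and many more formats.

```python
import os


def read_text(path):
    """Trivial parser for plain-text formats."""
    with open(path, "rb") as f:
        return f.read().decode("utf-8", errors="replace")


# extension -> parser; textract's real table covers pdf, docx, images, ...
PARSERS = {".txt": read_text, ".csv": read_text}


def process(path):
    """Single entry point: pick a parser by extension, like textract.process."""
    ext = os.path.splitext(path)[1].lower()
    try:
        parser = PARSERS[ext]
    except KeyError:
        raise ValueError("no parser registered for %r" % ext)
    return parser(path)
```

Callers only ever see `process(path)`; swapping or fixing a parser never touches the calling code, which is exactly the decoupling the entry above is after.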
Visualizing Garbage Collection Algorithms – both a very cool visualization and good explanations. Design-wise I think the visualization should be larger, but the concept itself is very neat.
http://spin.atomicobject.com/2014/09/03/visualizing-garbage-collection-algorithms/
SmartCSV – makes reading CSVs more structured by defining a model and validating it while reading. Enables skipping rows (and soon also skipping columns). It is an ongoing project, and feature requests and issues are currently addressed quickly.
5 interesting things (26/08/2014)
Gooey – command line to application! Works by mapping ArgumentParser objects to GUI widgets. Very cool, and it makes it easier to build instant tools to play around with at the office. Future tasks include supporting more themes, different argument parsers (e.g. docopt http://docopt.org/), etc. I believe there is much potential in this project.
https://github.com/chriskiehl/Gooey
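The core of the ArgumentParser-to-GUI mapping can be illustrated without any GUI toolkit: walk the parser's actions and decide a widget type for each. This peeks at the private `parser._actions` list (as Gooey itself does), and the widget names here are illustrative, not Gooey's actual widget classes:

```python
import argparse


def widget_map(parser):
    """Decide an (illustrative) GUI widget for each argparse action."""
    widgets = {}
    for action in parser._actions:
        if action.dest == "help":
            continue  # the auto-generated -h/--help action
        if action.choices:
            widgets[action.dest] = "Dropdown"
        elif action.nargs == 0:           # store_true / store_false flags
            widgets[action.dest] = "CheckBox"
        else:
            widgets[action.dest] = "TextField"
    return widgets


parser = argparse.ArgumentParser()
parser.add_argument("name")
parser.add_argument("--verbose", action="store_true")
parser.add_argument("--mode", choices=["fast", "slow"])
```

Here `widget_map(parser)` turns the positional into a text field, the flag into a checkbox, and the choices argument into a dropdown – which is the trick that lets Gooey wrap an existing CLI with a single decorator.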
Datamash – a stats command-line tool. I like working from the command line, at least to begin with and to get a feel for the data before doing complex things, and this tool is really what I need. However, the documentation is missing \ not clear enough \ open-source style.
http://www.gnu.org/software/datamash/
In that sense, q, http://harelba.github.io/q/, is also very cool and answers the need to do different tasks within the same framework.
Getting started with Python internals – how to dive into the deep water. A very enriching post which also includes links to other interesting posts.
http://akaptur.github.io/blog/2014/08/03/getting-started-with-python-internals/
python-ftfy – unicode strings can be very painful when they include special characters such as umlauts (ö), HTML entities (&gt;) and everything else users can think of :). The goal of this package is to at least partly ease this pain.
https://github.com/LuminosoInsight/python-ftfy
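The most common case ftfy fixes – mojibake, i.e. UTF-8 bytes that were mis-decoded as latin-1 – can be sketched with the stdlib, along with the HTML-entity case. ftfy's `fix_text` applies this kind of repair automatically and handles many messier variants; this sketch assumes you already know which fix applies:

```python
import html

# mojibake: "Löwe" encoded as UTF-8 but decoded as latin-1 becomes "LÃ¶we";
# re-encoding as latin-1 recovers the original bytes, which decode cleanly
broken = "L\u00c3\u00b6we"                          # "LÃ¶we"
fixed = broken.encode("latin-1").decode("utf-8")     # "Löwe"

# leftover HTML entities in user text
unescaped = html.unescape("3 &gt; 2")                # "3 > 2"
```

The point of ftfy is precisely that you usually do not know which of these (or several other) corruptions happened, so it detects and applies the right fix heuristically.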
Fraud detection Lyft vs Uber – I think that the interesting part here is data visualization as a tool to understand the problem better.