Today I was trying to run my first EMR job; here are a few lessons I learned during the day. I had previously run Hadoop streaming MapReduce jobs, so I was familiar with the MapReduce state of mind, but I was not familiar with the EMR environment.
I used boto, the Python interface to Amazon Web Services.
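For context, here is a minimal sketch of launching a streaming job flow with boto; the bucket names, script locations and instance count are placeholders of my own, not the exact values I used.

import time
from boto.emr.connection import EmrConnection
from boto.emr.step import StreamingStep

# Credentials are picked up from the usual boto config / environment variables.
conn = EmrConnection()

step = StreamingStep(
    name='my streaming step',
    mapper='s3n://<my-bucket>/scripts/mapper.py',
    reducer='s3n://<my-bucket>/scripts/reducer.py',
    input='s3n://<my-bucket>/input/',
    output='s3n://<my-bucket>/output/' + str(int(time.time())),
)

jobid = conn.run_jobflow(
    name='my first EMR job flow',
    log_uri='s3n://<my-bucket>/logs/',
    steps=[step],
    num_instances=2,
)
print(jobid)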
1. The environment on the EMR machines:
- Operating system: Debian 5.0 (Lenny)
- Applications: Hadoop 0.20 and 0.18 (default); Hive 0.5, 0.7 (default), 0.7.1; Pig 0.3 (on Hadoop 0.18), 0.6 (on Hadoop 0.20)
- Languages: Perl 5.10.0, PHP 5.2.6, Python 2.5.2, R 2.7.1, Ruby 1.8.7
- File system: ext3 for root and ephemeral
- Kernel: Red Hat
2. The Python version on the machines is 2.5.2, which means:
- No json module – it was only added in Python 2.6.
- The collections module exists (new in Python 2.4), but not all of its container types were available yet:

| Container | Description | New in version |
|---|---|---|
| namedtuple() | factory function for creating tuple subclasses with named fields | 2.6 |
| deque | list-like container with fast appends and pops on either end | 2.4 |
| Counter | dict subclass for counting hashable objects | 2.7 |
| OrderedDict | dict subclass that remembers the order entries were added | 2.7 |
| defaultdict | dict subclass that calls a factory function to supply missing values | 2.5 |
- Dict comprehensions are also a relatively late addition (Python 2.7), so they are not available either (see the compatibility sketch after this list).
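As an illustration, here is a small Python 2.5 compatibility sketch for mapper/reducer scripts; the simplejson fallback and the defaultdict-based counting are my own workarounds rather than anything EMR-specific, and they assume simplejson is installed on the nodes.

# Fall back to simplejson when the stdlib json module (Python 2.6+) is missing.
try:
    import json
except ImportError:
    import simplejson as json

from collections import defaultdict  # available since Python 2.5

# Counter (Python 2.7) is not available, so count with a defaultdict instead.
counts = defaultdict(int)
for word in ['emr', 'hadoop', 'emr']:
    counts[word] += 1

print(json.dumps(counts))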
output="s3n://<my-bucket>/output/"+str(int(time.time()))
4. Why did my process fail? One option that produces a relatively understandable explanation is:
conn.describe_jobflow(jobid).laststatechangereason
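For example, a small polling loop along these lines (the sleep interval and the set of terminal states are my own choices) prints the job flow state and, once it stops, the reason:

import time

# Poll the job flow until it reaches a terminal state, then print the reason.
while True:
    status = conn.describe_jobflow(jobid)
    print(status.state)
    if status.state in ('COMPLETED', 'FAILED', 'TERMINATED'):
        print(status.laststatechangereason)
        break
    time.sleep(30)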
5. cache_files – lets you ship files that your MapReduce process needs to every node. It is super important to specify a "fragment", i.e. the local file name after the # sign:
cache_files=['s3n://<file-location>/<file-name>#<local-file-name>']
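Putting it together, a step could ship a lookup file and the mapper would then open it by its local (fragment) name; the file names below are placeholders of my own:

import time
from boto.emr.step import StreamingStep

# lookup.txt is downloaded to every node and exposed under the local name 'lookup.txt'.
step = StreamingStep(
    name='step with a cached file',
    mapper='s3n://<my-bucket>/scripts/mapper.py',
    reducer='s3n://<my-bucket>/scripts/reducer.py',
    input='s3n://<my-bucket>/input/',
    output='s3n://<my-bucket>/output/' + str(int(time.time())),
    cache_files=['s3n://<my-bucket>/data/lookup.txt#lookup.txt'],
)

# Inside mapper.py the file can then be opened by its local name:
# lookup = dict(line.strip().split('\t', 1) for line in open('lookup.txt'))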