Running my first EMR – lessons learned

Today I was trying to run my first EMR, here are few lessons I learned during this day. I have previously run hadoop streaming mapreduce so I was familiar with the mapreduce state of mind. However, I was not familiar with the EMR environment.

I used boto – Amazon official python interface.

1. AMI version – default AMI version is 1.0.0 – first release. This means the following specifications –

Operating system: Debian 5.0 (Lenny)

Applications: Hadoop 0.20 and 0.18 (default); Hive 0.5, 0.7 (default), 0.7.1; Pig 0.3 (on Hadoop 0.18), 0.6 (on Hadoop 0.20)

Languages: Perl 5.10.0, PHP 5.2.6, Python 2.5.2, R 2.7.1, Ruby 1.8.7

File system: ext3 for root and ephemeral

Kernel: Red Hat

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html

For me Python 2.5.2 means –

Does not include json – new in version 2.6.

collections is new in python 2.4, but not all the models were added in this version –

namedtuple()	factory function for creating tuple subclasses with named fields	New in version 2.6.
deque	list-like container with fast appends and pops on either end	New in version 2.4.
Counter	dict subclass for counting hashable objects	New in version 2.7.
OrderedDict	dict subclass that remembers the order entries were added	New in version 2.7.
defaultdict	dict subclass that calls a factory function to supply missing values	New in version 2.5.

dict comprehensions is also kind of late addition (python 2.7)

Therefore specifying the ami_version version can be critical. Version 2.2.0 worked fine for me.

2. Must process all the input!

Naturally we will want to process all the input. However, for testing I went over only the n-first lines and then added a break to make things run faster. I was not consuming all the lines and therefore got an error. More about it here –

http://stackoverflow.com/questions/9881269/broken-pipe-error-causes-streaming-elastic-mapreduce-job-on-aws-to-fail

3. Output folder must not exists. This is the same as in hadoop streaming map reduce, for me the way to avoid it was to add a timestamp –

output="s3n://<my-bucket>/output/"+str(int(time.time()))

4. Why my process failed – one option which produces are relatively understandable explanation is – conn.describe_jobflow(jobid).laststatechangereason

5. cache_files – enables you to import files you need for the map reduce process. Super important to “specify a fragment”, i.e. specify the local file name

cache_files=['s3n://<file-location>/<file-name>#<local-file-name>']

Otherwise you will obtain the following error –

“Streaming cacheFile and cacheArchive must specify a fragment”

6. Status – the are 7 different status your flow may have – COMPLETED, FAILED, TERMINATED, RUNNING, SHUTTING_DOWN, STARTING and WAITING. The right order of statuses if everything goes well is STARTING -> RUNNING ->SHUTTING_DOWN -> COMPLETED.

The SHUTTING_DOWN may take a while even for a very simple flow I measured about 1 minute of SHUTTING_DOWN process.

Resources I used –

http://boto.readthedocs.org/en/latest/emr_tut.html

http://atbrox.com/2010/10/01/programmatic-deployment-to-elastic-mapreduce-with-boto-and-bootstrap-action/

	Nicole S on 5 Python NLP pacakges
	blissful4bdd2399fa on CSV to radar plot
	tom on CSV to radar plot
	Matt on CSV to radar plot
	“ – Tom… on 📚 Book club Q1 2024 – 3…

Running my first EMR – lessons learned

Published by tom

Leave a comment Cancel reply

Share this:

Related

Published by tom

Leave a comment Cancel reply