Running my first EMR – lessons learned

Today I was trying to run my first EMR, here are few lessons I learned during this day. I have previously run hadoop streaming mapreduce so I was familiar with the mapreduce state of mind. However, I was not familiar with the EMR environment.

I used boto – Amazon official python interface.

1. AMI version – default AMI version is 1.0.0 – first release. This means the following specifications –

Operating system: Debian 5.0 (Lenny)

Applications: Hadoop 0.20 and 0.18 (default); Hive 0.5, 0.7 (default), 0.7.1; Pig 0.3 (on Hadoop 0.18), 0.6 (on Hadoop 0.20)

Languages: Perl 5.10.0, PHP 5.2.6, Python 2.5.2, R 2.7.1, Ruby 1.8.7

File system: ext3 for root and ephemeral

Kernel: Red Hat

For me Python 2.5.2 means –
  • Does not include json – new in version 2.6.
  • collections is new in python 2.4, but not all the models were added in this version –
namedtuple() factory function for creating tuple subclasses with named fields

New in version 2.6.

deque list-like container with fast appends and pops on either end

New in version 2.4.

Counter dict subclass for counting hashable objects

New in version 2.7.

OrderedDict dict subclass that remembers the order entries were added

New in version 2.7.

defaultdict dict subclass that calls a factory function to supply missing values

New in version 2.5.

 
Therefore specifying the ami_version version can be critical. Version 2.2.0 worked fine for me.
2. Must process all the input!
Naturally we will want to process all the input. However, for testing I went over only the n-first lines and then added a break to make things run faster. I was not consuming all the lines and therefore got an error. More about it here –
3. Output folder must not exists. This is the same as in hadoop streaming map reduce, for me the way to avoid it was to add a timestamp –

output="s3n://<my-bucket>/output/"+str(int(time.time()))

4. Why my process failed – one option which produces are relatively understandable explanation is  – conn.describe_jobflow(jobid).laststatechangereason

5. cache_files – enables you to import files you need for the map reduce process. Super important to “specify a fragment”, i.e. specify the local file name

cache_files=['s3n://<file-location>/<file-name>#<local-file-name>']
Otherwise you will obtain the following error –
“Streaming cacheFile and cacheArchive must specify a fragment”
6. Status – the are 7 different status your flow may have – COMPLETED, FAILED, TERMINATED, RUNNING, SHUTTING_DOWN, STARTING and WAITING. The right order of statuses if everything goes well is STARTING -> RUNNING ->SHUTTING_DOWN -> COMPLETED.
The SHUTTING_DOWN may take a while even for a very simple flow I measured about 1 minute of SHUTTING_DOWN process.
Resources I used –

http://boto.readthedocs.org/en/latest/emr_tut.html

Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s