How to create an EMR job with multiple inputs using the ruby client.
I’m using the ruby client to launch Hadoop jobs on Amazon’s Elastic Map Reduce framework. Things had gone very nicely until I tried scripting a job that draws input from multiple buckets. You can’t use the ‘--input’ option twice, and the advice around the internet is to use --args. So, I added:

--args -input,"s3n://SomeSecondInput"

to the end of my command line and was displeased to see this option completely omitted from my job.
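In other words, the failing shape looked roughly like this (bucket and script names are placeholders):

```shell
# Doesn't work: --args dangles at the end of the invocation, far from the
# --stream step it is supposed to modify, so the client silently drops it.
~/amazon/elastic-mapreduce --create --stream \
  --input "s3n://FirstInput" \
  --mapper "s3n://MyBucket/map.rb" \
  --reducer "s3n://MyBucket/reduce.rb" \
  --output "s3n://MyBucket/output" \
  --args -input,"s3n://SomeSecondInput"
```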
Grepping the source for ‘--args’ brings up:
commands.parse_options(step_commands + ["--bootstrap-action", "--stream"], [
[ ArgsOption, "--args ARGS", "A command separated list of arguments to pass to the step" ],
[ ArgOption, "--arg ARG", "An argument to pass to the step" ],
[ OptionWithArg, "--step-name STEP_NAME", "Set name for the step", :step_name ],
[ OptionWithArg, "--step-action STEP_ACTION", "Action to take when step finishes. One of CANCEL_AND_WAIT, TERMINATE_JOB_FLOW or CONTINUE", :step_action ],
Which shows my problem: --arg and --args are associated with a specific step in the job, and as such have to follow the --bootstrap-action or --stream option they modify. They can’t be tacked on to the end.
While on the subject, be wary of using --args: the client splits its value on every comma, so it does not play nicely with arguments that themselves contain commas. Prefer repeated --arg options for those.
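To see why, here is a quick simulation of that comma split in plain bash (this is not the client’s actual code, and the option names below are just illustrative):

```shell
# Simulate the comma split that --args performs on its value.
value='-D,mapred.reduce.tasks=4'          # two intended arguments
IFS=',' read -r -a parts <<< "$value"
printf '%s\n' "${parts[@]}"               # -D / mapred.reduce.tasks=4 -- fine

# But an argument that itself contains a literal comma gets mangled:
value='-jobconf,my.list=a,b'              # meant as two arguments
IFS=',' read -r -a parts <<< "$value"
printf '%s\n' "${parts[@]}"               # splits into three pieces
```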
Finally, you need to include --input even if you are also using --arg/--args; otherwise you get the default wordcount input. So my final command line looked something like:
~/amazon/elastic-mapreduce --create \
  --stream --args -input,"s3n://SomeSecondInput" \
  --enable-debugging \
  --num-instances 16 \
  --master-instance-type c1.xlarge \
  --slave-instance-type c1.xlarge \
  --name "Script Name" \
  --mapper "s3n://MyBucket/map.rb" \
  --reducer "s3n://MyBucket/reduce.rb" \
  --log-uri "s3n://MyBucket/logs" \
  --output "s3n://MyBucket/output" \
  --input "s3n://FirstInput" \
  --bootstrap-action "s3n://MyBucket/bootstrap.sh" \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive
The last bootstrap action is magic, btw: it applies Amazon’s stock memory-intensive configuration, which retunes Hadoop’s memory settings for jobs that need more memory per task.