Create an EMR cluster and run a Spark Zeppelin notebook on AWS
Note: the information in this post is based on the AWS console as of August 2018. You need an AWS account to get access to the console. If you’re based in Western Europe, I suggest using the eu-west-1 (Ireland) region for its speed, price and available services. Once signed in to your console, search for EMR and open the EMR management page, as below.
Look for the big blue “Create Cluster” button and click on it; you will see the next screen, as in the picture below. Keep the defaults (assuming you’ve already set up an S3 bucket), change the applications to Spark, define the size and number of instances, and select your EC2 key pair. Then you’re ready to go: click the big blue “Create Cluster” button again and grab yourself a cup of coffee, as spinning up the EMR cluster takes a while.
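If you’d rather script this step than click through the console, here is a rough sketch using boto3. It is not what the console does internally, just an approximation of the same choices. The release label, instance type, instance count, bucket path and key pair name are all placeholders I made up, and I’m assuming the default EMR roles (EMR_DefaultRole and EMR_EC2_DefaultRole) already exist in your account. As far as I know the console’s Spark option bundles Zeppelin, so the sketch asks for both applications explicitly.

```python
def build_cluster_request(name, key_name, log_uri,
                          instance_type="m4.large",      # placeholder size
                          instance_count=3,              # placeholder count
                          release_label="emr-5.16.0"):   # assumed release label
    """Build the keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": release_label,
        # Spark plus Zeppelin, so the notebook is available on the cluster.
        "Applications": [{"Name": "Spark"}, {"Name": "Zeppelin"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            "Ec2KeyName": key_name,
            # Keep the cluster alive (status "Waiting") when it has no steps.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumes the default EMR roles already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "VisibleToAllUsers": True,
    }


def create_cluster(region="eu-west-1", **kwargs):
    """Start the cluster and return its id (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    emr = boto3.client("emr", region_name=region)
    response = emr.run_job_flow(**build_cluster_request(**kwargs))
    return response["JobFlowId"]
```

A call would look like create_cluster(name="spark-zeppelin", key_name="my-key-pair", log_uri="s3://my-bucket/emr-logs/"), where the name, key pair and bucket are again hypothetical.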
Now wait 5–10 minutes for the EMR cluster to spin up. Once it’s done, you will notice the “Waiting” status in your console.
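Instead of refreshing the console, you could poll the cluster state from a script. The sketch below is one possible way to do it: it takes any function shaped like boto3’s EMR describe_cluster and loops until the state is WAITING. The poll interval, the poll limit and the function name wait_until_waiting are my own choices, not anything AWS prescribes.

```python
import time

# States after which the cluster will never reach WAITING.
FAILED_STATES = {"TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_until_waiting(describe_cluster, cluster_id,
                       poll_seconds=30, max_polls=40, sleep=time.sleep):
    """Poll describe_cluster until the cluster reports the WAITING state.

    describe_cluster is expected to behave like the boto3 EMR client method
    of the same name, e.g. boto3.client("emr").describe_cluster.
    """
    for _ in range(max_polls):
        response = describe_cluster(ClusterId=cluster_id)
        state = response["Cluster"]["Status"]["State"]
        if state == "WAITING":
            return state
        if state in FAILED_STATES:
            raise RuntimeError(f"cluster {cluster_id} ended in state {state}")
        sleep(poll_seconds)
    raise TimeoutError(f"cluster {cluster_id} was not ready after {max_polls} polls")
```

With the defaults this gives up after roughly 20 minutes. boto3 also ships a built-in waiter for EMR, obtained with get_waiter("cluster_running"), which may be all you need if you don’t want a loop of your own.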
To make it accessible through the web interface, you first need to SSH into the EMR master node. The procedure is explained by Amazon.
In short, copy “hadoop@your-master-public-dns” into PuTTY as the host name, with port 22. Don’t forget to attach your private key, the “ppk” file you got from the EC2 key pair, in PuTTY’s left panel: Category -> Connection -> SSH -> Auth.
Then click the “Open” button. If everything goes all right, you will see the huge “EMR” logo in the terminal.
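If you’re not on PuTTY, the same connection can be made with the OpenSSH client that comes with Linux and macOS. The sketch below only builds the command lines, so you can inspect them before running anything. Note that OpenSSH expects the “pem” version of your key rather than PuTTY’s “ppk” file. The key path is a placeholder, and the second function, which forwards Zeppelin’s port 8890 through the SSH connection, is an optional extra I’m adding as an assumption about what you might want, not a step from the procedure above.

```python
def ssh_command(master_dns, key_path, user="hadoop", port=22):
    """Command line for an interactive SSH session on the master node."""
    return ["ssh", "-i", key_path, "-p", str(port), f"{user}@{master_dns}"]


def zeppelin_tunnel_command(master_dns, key_path, local_port=8890,
                            remote_port=8890, user="hadoop"):
    """Command line that forwards a local port to Zeppelin on the master node.

    -N opens no remote shell; -L sets up the local port forwarding.
    """
    forward = f"{local_port}:localhost:{remote_port}"
    return ["ssh", "-i", key_path, "-N", "-L", forward, f"{user}@{master_dns}"]
```

Either list can be passed to subprocess.run from the standard library, or joined with spaces and pasted into a terminal.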
You can now connect to Zeppelin from your favourite browser. The port is 8890, over plain HTTP rather than HTTPS.
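If the page doesn’t load, a quick check from a script can tell you whether the port is reachable at all. This is a small sketch using only the Python standard library; the function names are mine, and the host is whatever your master public DNS happens to be.

```python
import socket

ZEPPELIN_PORT = 8890  # the Zeppelin port mentioned above


def zeppelin_url(master_dns, port=ZEPPELIN_PORT):
    """Plain-HTTP address of the Zeppelin web interface."""
    return f"http://{master_dns}:{port}"


def port_is_open(host, port=ZEPPELIN_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If port_is_open returns False for your master node, the usual suspects are the security group of the master instance or a cluster that has not finished starting.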
In addition to the port for Zeppelin, I’ve listed some other widely used ports here:
Hadoop HDFS NameNode
Hadoop HDFS DataNode