This post introduces the basics of doing "big data" work with Spark using a Databricks notebook, which is probably the best and easiest option I know of (and have tried) so far.
Ok, first things first: where should I start if I want to go for Spark? I suggest learning Scala, even if you've used Python, R, etc. before. For the reasons, please google it; here's one link: https://www.dezyre.com/article/scala-vs-python-for-apache-spark/213
Fine, you've decided to give Scala + Spark a try. Same question: where should I start? I suggest starting with the Databricks Community Edition: no complicated setup, free, and easy to use. Just register an account at https://community.cloud.databricks.com/. Once you like it and want your own sandbox, a further option is to go for the images from the big names, like MapR, Hortonworks, Cloudera, etc.
I think that should be enough for people new to Spark today.
■■ TASK / Homework ■■ : register for the Databricks Community Edition. If you can't even complete the registration step, you're out; Spark is too difficult for you (joke).
Good, once registered you should see the following page in the Databricks Community Edition. There's a lot you can do now, and you'll probably see quite a few familiar things, like Python, R, and SQL.
To start with this "Hello World", let's create a single notebook and run some code. Click Home -> Users -> [Your ID] -> down-pointing triangle icon -> Create -> Notebook:
Give it a name and select the language "Scala". Don't worry too much about not knowing Scala at all; I did exactly this when I started learning it. Basically this will give you some feel for the language, and you'll be fine. Of course you can try anything later using Python or R.
For now, your notebook is up but NOT running (as you can see underneath the notebook title, the cluster status is "Detached"). No worries, let's do a simple data-loading test.
Write the following code into the first cell (if you don't know what a cell is, I suggest learning a bit about notebooks first).
val diamonds = sqlContext.read.format("csv")
Basically, this would load a default diamonds.csv data file which is stored on the Databricks server, so you don’t have to upload anything, and then display the table.
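The single line above only creates the reader; a fuller version of the cell would look something like the sketch below. Note the dataset path and the `option` settings are my assumptions based on the sample data Databricks mounts under `/databricks-datasets`; check your own workspace for the exact path.

```scala
// Sketch: load the sample diamonds CSV that ships with Databricks.
// The path below is an assumption -- browse /databricks-datasets to confirm.
val diamonds = sqlContext.read.format("csv")
  .option("header", "true")       // first row holds the column names
  .option("inferSchema", "true")  // let Spark guess column types
  .load("/databricks-datasets/Rdatasets/data-csv/ggplot2/diamonds.csv")

// display() is a Databricks notebook helper that renders a DataFrame as a table.
display(diamonds)
```

The `inferSchema` option makes Spark read the file twice (once to guess types), which is fine for a small demo file like this.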
Once you've done that, hit the run icon at the top right of the cell to execute it (keyboard shortcut Shift + Enter). As soon as you do, you should see a pop-up asking you to attach to a cluster; select the "Auto" option and click "Launch and Run". This can take a while, especially if the cluster isn't up yet. Normally a cluster like this would cost you a small fortune to run on, say, AWS or Azure (outside the free tier), but Databricks gives you one for free, though it's a very tiny cluster with limited storage.
Now when it's done, everything is up and running, and you'll see the table show up. Feel free to poke around, and of course, to finish this "Hello World", start a new cell and type
or even just:
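The original post's screenshots of these two cells are missing here, so the following is a hedged guess at what they showed, based on how Scala notebooks usually do hello-world:

```scala
// The classic version:
println("Hello World")

// ...or even just the bare expression: a Scala notebook cell echoes the value
// of its last expression, so this alone prints something like
// res0: String = Hello World
"Hello World"
```

Either way, if you see output under the cell, your cluster is alive and you've officially said hello to Spark.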