Getting Started With Spark and Scala

What is Spark?


Spark is a distributed cluster-computing framework from the Apache Software Foundation. It extends the Hadoop MapReduce model and is designed for fast computation. One of its main features is in-memory cluster computing, which keeps working data in memory between operations and so greatly increases the processing speed of an application.
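As a quick taste of what in-memory computing looks like, here is a minimal Scala sketch you can run in the spark-shell once everything below is installed (the file path is a placeholder):

// sc is the SparkContext the spark-shell provides
val lines = sc.textFile("/tmp/sample.txt")  // lazily defines an RDD; path is a placeholder
lines.cache()                               // keep the RDD in memory after the first action
println(lines.count())                      // first count reads from disk, then caches
println(lines.count())                      // second count is served from memory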


Before you install Spark, make sure Java and Hadoop are already installed on your system. These instructions target Ubuntu 14.04 LTS.


Java Installation


Check whether Java is installed by typing the following at the command line:

$ java -version

You should see output like this:

java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)

If Java is not installed, please refer to this link.
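Alternatively, on Ubuntu 14.04 you can install OpenJDK from the standard repositories (note this installs OpenJDK, which may differ from the Oracle build shown above):

$ sudo apt-get update
$ sudo apt-get install default-jdk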


Hadoop Installation


Verify Hadoop installation with the following command:

$ hadoop version

Output should be something like this:

Hadoop 2.7.3
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r baa91f7c6bc9cb92be5982de4719c1c8af91ccff
Compiled by root on 2016-08-18T01:41Z
Compiled with protoc 2.5.0
From source with checksum 2e4ce5f957ea4db193bce3734ff29ff4


If Hadoop is not installed, please refer to this link.
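If Hadoop is installed and your HDFS daemons are running, you can also confirm filesystem access (this assumes a started single-node cluster):

$ hdfs dfs -ls /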


Scala Installation


Verify the Scala version using the following command:

$ scala -version

Output should be something like this:

Scala code runner version 2.9.2 -- Copyright 2002-2011, LAMP/EPFL

If Scala needs to be installed, please refer to this link.
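On Ubuntu, the scala package from the standard repositories is one quick option (the packaged version may differ from the one shown above):

$ sudo apt-get install scala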


Spark Installation


Download Spark from the official downloads page at https://spark.apache.org/downloads.html.
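For example, to fetch the 2.1.1 build used below directly from the Apache archive (pick the package matching your Hadoop version):

$ wget https://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz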


Extract the downloaded tgz file (use the build that matches your Hadoop version; Hadoop 2.7 here):

$ tar xvf spark-2.1.1-bin-hadoop2.7.tgz

Move the Spark files to the /usr/local/spark directory:

$ sudo mv spark-2.1.1-bin-hadoop2.7 /usr/local/spark

Set up the environment for Spark by adding the following line to your ~/.bashrc file:

export PATH=$PATH:/usr/local/spark/bin
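Optionally, many setups also define SPARK_HOME, which other tools and scripts sometimes expect (this assumes the /usr/local/spark location used above):

export SPARK_HOME=/usr/local/spark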

Source the ~/.bashrc file so the change takes effect:

$ source ~/.bashrc

Verify the Spark installation by launching the Spark shell:

$ spark-shell

Output:

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/09/18 11:09:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/09/18 11:09:37 WARN util.Utils: Your hostname, dwnpcpu190 resolves to a loopback address: 127.0.1.1; using 192.168.2.176 instead (on interface eth0)
17/09/18 11:09:37 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/09/18 11:09:59 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/09/18 11:09:59 WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException
17/09/18 11:10:04 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.2.176:4040
Spark context available as 'sc' (master = local[*], app id = local-1505712279237).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
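As a quick smoke test, try a few expressions at the prompt. sc is the SparkContext the shell created for you; the exact REPL echo (console line numbers, res numbers) will vary:

scala> val data = sc.parallelize(1 to 100)
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> data.filter(_ % 2 == 0).count()
res0: Long = 50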


