Exploring Pyspark.ml for Machine Learning: How to Run PySpark on Local Anaconda Jupyter Notebook
You may refer to this link for step-by-step instructions. This article just documents what I did, some of which was not mentioned in the link.
Open Anaconda Prompt with admin privileges. I was not able to install the required packages without admin privileges.
Before we start installing any packages, it is important to know about conda environments. Conda environments are isolated, self-contained environments in which you can install and manage the dependencies for your Python projects, so that different sets of packages / libraries stay separate from each other. The “base” environment is the default Python environment when you first install conda. It is best practice not to install packages directly into the base environment and to keep it clean. For this project, I will be installing into the rip environment.
You may check out the environments you have using conda info --envs. For me, I have another environment known as rip. If you want to change to your environment, you may use conda activate yourenv. As an example, if I want to change to the rip env, I will use conda activate rip. To make a new environment, you may use conda create --name myenv python=3.7.
Let's proceed to the actual installation.
Key in conda install openjdk to install Java. After it loads, when prompted y / n, type y to agree to install it. If all goes smoothly you should see something as below.
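If you want to double-check that Java is visible from Python, here is a minimal sketch (this assumes the java executable installed by conda is on your PATH in the activated environment):
import subprocess
# java prints its version information to stderr, so capture both streams
result = subprocess.run(["java", "-version"], capture_output=True, text=True)
print(result.stderr or result.stdout)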
Next, key in conda install pyspark to install PySpark. You will see something as below. Similarly, press y, and after the download completes, it should look as below.
Next we will install FindSpark. If you use conda install findspark, the installation will fail. Instead, use conda install -c conda-forge findspark and it will work.
To further clarify, the -c flag specifies a channel from which to install a package. Channels are repositories that host conda packages. So what happened was: when we used conda install findspark, the command searched the default channels and, since the package is not found there, it resulted in an error. When we used conda install -c conda-forge findspark, the command explicitly told conda to search for findspark in the conda-forge channel. Conda-forge is a community-maintained channel that provides a wide range of packages that are not in the default channels.
Now open Jupyter Notebook and run a quick check; you should be able to get the version of the PySpark you installed.
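For example, a minimal check in a notebook cell would be:
# Import pyspark and print the installed version
import pyspark
print(pyspark.__version__)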
If you are not able to import pyspark in Jupyter Notebook, you can use findspark to locate where pyspark is installed.
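A minimal sketch of the typical findspark usage, run before importing pyspark:
import findspark
# Locate the Spark installation and add pyspark to sys.path
findspark.init()
# Optionally, print the path findspark found
print(findspark.find())
import pyspark
print(pyspark.__version__)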
Try and see whether PySpark works by creating a simple pyspark.sql.DataFrame.
# Import SparkSession to be able to initiate a Spark session
from pyspark.sql import SparkSession
# Initiate a Spark session
spark = SparkSession.builder.appName('Sample').getOrCreate()
# Create a DataFrame from column values and column names
col_names = ['Col_A', 'Col_B']
col_values = [('Lim', 29), ('Sze', 68), ('Zhong', 63), ('Medium', 33)]
df = spark.createDataFrame(col_values).toDF(*col_names)
print(type(df))
print(df)
# show() prints the DataFrame itself and returns None, so call it directly
df.show()
The output should look like the one below.
Now we have PySpark on our machine and can play around with it.
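As a small, hypothetical example of the kind of operations you can now try (it rebuilds the same DataFrame from the snippet above), you could filter rows and compute a simple aggregate:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('Sample').getOrCreate()
df = spark.createDataFrame(
    [('Lim', 29), ('Sze', 68), ('Zhong', 63), ('Medium', 33)],
    ['Col_A', 'Col_B']
)

# Keep only rows where Col_B is greater than 30
df.filter(F.col('Col_B') > 30).show()

# Compute the average of Col_B across all rows
df.select(F.avg('Col_B').alias('avg_Col_B')).show()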