Installation#
General note on Windows installations#
When a system-wide change, such as adding or changing environment variables or installing a program, is performed, it does not take effect in an already running session of a Jupyter Notebook or any other program started (directly or indirectly) from the command line. A restart of the whole Jupyter Notebook server may therefore be needed for such changes to take effect.
Company managed Windows computers#
Some companies and organisations, e.g., the Norwegian University of Life Sciences (NMBU), do not give administrative rights to their employees by default. Instead they have a Software Center with installable programs. For NMBU the Software Center includes Anaconda, VS Code, Docker (with automatic WSL installation), Java, and Power BI Desktop.
Python#
An up-to-date installation of Python is recommended.
This book has been built using the minimal conda installer miniconda.
Conda environments#
A conda environment is a self-contained Python environment with its own Python version and installed packages.
Conda environments can be created and activated in a terminal on POSIX systems (macOS, Unix, Linux) or from a Conda prompt on Windows (use \ instead of / in paths).
A new conda environment can be created and prepared, e.g., as follows:
```
conda create --name D2D_env
conda activate D2D_env
conda install python==3.11.9 pip
pip install -r /path_to_D2Dbook/requirements.txt
```
If you get an error regarding Rust, you may need to install Rust separately, e.g., from their webpage or using Homebrew.
As of September 2024 the Python version has been specified above due to trouble with the Cassandra driver from DataStax in Python 3.12.
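Because the Python version matters here, it can be useful to verify which interpreter an activated environment actually uses, e.g.:

```python
import sys

# Version of the interpreter in the currently active environment.
version = sys.version_info[:3]
print("Python", ".".join(str(v) for v in version))
```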
Integrated Developer Environment - IDE#
There are many IDEs to choose from, e.g., the popular freeware applications PyCharm and Visual Studio Code (VS Code).
This book was developed using VS Code.
VS Code#
Download and install VS Code.
Add extensions for Python and Jupyter, possibly also for JSON viewer, GitHub Copilot, etc.
You may have to set the Conda and Python paths in the settings for the Python extension.
If your favourite Conda environment is not detected, press Ctrl/Cmd+Shift+P and type Python: Select Interpreter to enable manual input of the environment path.
Git#
Python pip installation from GitHub requires a local git installation (as of September 2024 this is not required when using this book).
If your computer does not have git installed by default, this can be installed from git-scm.com.
Docker#
On POSIX systems, Docker desktop can be installed from docker.com.
On Windows computers, Docker requires installation/activation of the Windows Subsystem for Linux (WSL) for full functionality. This can be installed from a command prompt using the following command:
```
wsl --install
```
After WSL is installed, the Docker desktop can be installed as for POSIX systems.
Both WSL and Docker are found in the Software Center of NMBU computers.
Cassandra#
Assumptions:
Docker installed on system
“cassandra:latest” image installed in docker
Python/Conda environment with Python >=3.8 and <=3.11.9 *
*As of September 2024 there is a lack of compatibility between Python 3.12 and Cassandra drivers unless one installs additional software.
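Given these assumptions, a minimal connectivity check might look like the following sketch. It assumes the cassandra-driver package (pip install cassandra-driver) and a container running locally on the default port 9042; adjust host and port to your setup.

```python
def cassandra_reachable(host="127.0.0.1", port=9042):
    """Return True if a Cassandra node answers on host:port, else False."""
    try:
        from cassandra.cluster import Cluster  # requires cassandra-driver
        cluster = Cluster([host], port=port)
        session = cluster.connect()
        row = session.execute("SELECT release_version FROM system.local").one()
        print("Cassandra release:", row.release_version)
        cluster.shutdown()
        return True
    except Exception as exc:  # driver missing or no running container
        print("Cassandra not reachable:", exc)
        return False

reachable = cassandra_reachable()
```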
Java#
To use PySpark one needs a Java installation.
Several vendors make their own Java versions.
Microsoft’s version of OpenJDK supports all major platforms.
After an OS update, a reinstall of Java may be needed.
Access to Java can be enabled system-wide using the environment variable JAVA_HOME, or per script using the Python os package.
Environment variables are system-dependent. If the installation of Java did not add the environment variable automatically, it can be set persistently in the system (in the Cassandra notebook we set these variables in the script instead):
Windows: search for Environment Variables via the Start button and select the option for your user account.
Linux and Mac OS: edit the ~/.bash_profile file, adding export JAVA_HOME=/opt/openjdk11 (or similar, depending on your installation).
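Setting the variable from within a script instead (as done in the Cassandra notebook) can be sketched as follows; /opt/openjdk11 is the example location from above and must be replaced by your actual installation path:

```python
import os

# Set JAVA_HOME for this process only; persists until the script exits.
# The path is an example and must match your actual Java installation.
java_home = "/opt/openjdk11"
os.environ["JAVA_HOME"] = java_home
print("JAVA_HOME =", os.environ["JAVA_HOME"])
```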
Spark#
Spark was originally made for POSIX systems with Hadoop. On Windows computers it therefore depends on a specialised setup or a set of drivers called winutils.
More than one GitHub repository maintains these drivers. We have tested the hadoop-3.3.1 version from Kontext.TECH with success.
At runtime, an environment variable can be set to point to the downloaded hadoop-x.y.z folder, e.g., in C:\Hadoop (not needed on Mac/Linux, see example in the Spark notebook):

```python
import os
os.environ["HADOOP_HOME"] = "C:/Hadoop/hadoop-3.3.1"
```
On some computers an additional Hadoop environment variable can silence a nuisance warning:
```python
os.environ["PYSPARK_HADOOP_VERSION"] = "without"
```
Spark is not very picky with regard to Python version, but may need help choosing/finding it. Setting environment variables for this (and the Hadoop-related ones above) can be done persistently, similar to JAVA_HOME, or at runtime (see example in the Spark notebook), e.g., using:

```python
os.environ["PYSPARK_PYTHON"] = "python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "python"
```
If this version of Python is not accepted, one can point it to a version installed in a conda environment.
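One way to point Spark at the Python of the active conda environment is to use sys.executable, which holds the full path of the interpreter running the current script:

```python
import os
import sys

# sys.executable is the full path of the interpreter running this script,
# i.e., the Python of the currently active (conda) environment.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
print("PYSPARK_PYTHON =", os.environ["PYSPARK_PYTHON"])
```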
Spark itself can be installed using: pip install pyspark (or similar).
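A quick smoke test of the installation might look like the following sketch, assuming pyspark is installed and a working Java installation is found:

```python
def pyspark_works():
    """Return True if a local SparkSession can be created and used."""
    try:
        from pyspark.sql import SparkSession  # requires pyspark
        spark = SparkSession.builder.appName("smoke-test").getOrCreate()
        n = spark.range(5).count()  # trivial job exercising the whole stack
        spark.stop()
        return n == 5
    except Exception as exc:  # pyspark missing or Java not found
        print("PySpark not available:", exc)
        return False

spark_ok = pyspark_works()
```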
Troubleshooting:
Firewalls, e.g., Norton, may block the connection between Spark and Cassandra. Adding an exception in the firewall or temporarily disabling it may help.