Simplifying PySpark Setup on Windows: A Practical Guide

Introduction

Porting a Linux project to a Windows 10 machine can be trickier than expected. As someone who recently faced this challenge, I understand the frustration it can cause. That’s why I’m sharing my experience and the solutions I discovered for setting up PySpark on Windows smoothly.

Backstory

Converting my Linux project to run on Windows took two days of trial and error. Unlike the straightforward process on Debian, Windows posed unexpected hurdles. After I installed PySpark via pip, warnings about missing libraries and binaries popped up:

WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Despite attempting to install Hadoop and Java, the issue persisted. It wasn’t until I stumbled upon a helpful link that I realized the Hadoop winutils binaries have to be added separately:

Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

Solution

Installing Libraries and Configuring Paths

To begin, I’ll walk you through the essential steps to overcome these challenges. Following Nehal B.’s comprehensive guide on installing Apache Spark on Windows up to step 7 covers nearly 90% of the setup, making it a solid starting point.
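
For reference, those steps boil down to installing Java, extracting Spark, dropping winutils.exe into a Hadoop folder, and pointing a few environment variables at them. Below is a minimal sketch of setting those variables from Python before Spark starts; the install locations (C:\spark, C:\hadoop, the JDK path) are hypothetical, so substitute wherever you actually extracted everything. Setting the same variables system-wide through Windows works just as well.

import os

# Hypothetical install locations; substitute wherever you extracted the archives.
os.environ["SPARK_HOME"] = r"C:\spark"
os.environ["HADOOP_HOME"] = r"C:\hadoop"  # must contain bin\winutils.exe
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk-11"

# winutils.exe has to be reachable on PATH for Hadoop's native file operations.
os.environ["PATH"] += os.pathsep + os.path.join(os.environ["HADOOP_HOME"], "bin")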

Adding Necessary Packages

Despite making progress, I hit more roadblocks, such as winutils.exe refusing to execute at all. Digging into the issue revealed a missing .dll, which I resolved by installing the Microsoft Visual C++ 2010 Service Pack 1 Redistributable Package MFC Security Update.
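
To confirm the fix took, you can invoke winutils.exe directly. Here’s a quick check, assuming HADOOP_HOME already points at the folder containing bin\winutils.exe: a healthy binary prints its usage text, while a still-missing runtime DLL keeps the process from launching at all.

import os
import subprocess

# A working winutils.exe prints its usage text; if the Visual C++ runtime
# is still missing, the process fails to launch with a DLL-load error.
winutils = os.path.join(os.environ["HADOOP_HOME"], "bin", "winutils.exe")
result = subprocess.run([winutils], capture_output=True, text=True)
print(result.stdout or result.stderr)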

Dynamically Setting PySpark Runtime Path

When I then tried to start a PySpark session, more error messages appeared:

answer = 'xro59', gateway_client = <py4j.clientserver.JavaClient object at 0x000001D7807835D0> target_id = 'o58', name = 'showString'
py4j.protocol.Py4JJavaError: An error occurred while calling o58.showString.

After exhaustive debugging, a Stack Overflow answer pointed the way: the findspark pip module configures the essential paths at runtime, ensuring PySpark starts cleanly.

import findspark

findspark.init()  # set SPARK_HOME and the py4j paths before pyspark is imported

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").getOrCreate()
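
As a final smoke test, trigger an action on a small DataFrame; this exercises the same showString call that failed in the traceback above:

# If the setup is correct, this prints five rows instead of raising Py4JJavaError.
df = spark.range(5)
df.show()  # show() goes through the showString call seen in the earlier error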

Conclusion

Setting up PySpark on Windows might seem like a daunting task, but with the right guidance, it’s totally doable. By following these steps, you’ll navigate the twists and turns of the setup process with ease. So, roll up your sleeves, dive in, and let’s PySparkle on Windows!
