HDFS local single-node container for testing
$ git clone git@github.com:sukumaar/hdfs-local-container.git
$ cd hdfs-local-container
# using hdfs-local as the image name; you can choose your own
$ docker build -t hdfs-local .

Or pull the prebuilt image instead of building it:

$ docker pull sukumaar/hdfs-local:latest

- The CONTAINER_NAME environment variable is required; its value should be the name of your container from your docker run command.
$ docker run -e CONTAINER_NAME=namenode \
-d --name namenode \
-p 9000:9000 -p 9870:9870 -p 9866:9866 -p 9864:9864 -p 9867:9867 \
--replace hdfs-local
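Once the container is up (it may take a few seconds), an optional sanity check is to confirm the NameNode web UI answers on the mapped port 9870; a minimal sketch:

$ curl -sf http://localhost:9870/ > /dev/null && echo "NameNode web UI is up"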
$ docker exec -it namenode /bin/bash -c "su hadoop"
hadoop@946b4517b87c:/$
hadoop@946b4517b87c:/$ cd ~
hadoop@946b4517b87c:~$ hadoop fs -ls
# create/upload a sample file on the container's local filesystem, for example data.csv
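# for example (hypothetical contents, any small CSV works):
hadoop@946b4517b87c:~$ printf 'id,name,department\n1,Alice,Engineering\n2,Bob,Marketing\n' > data.csv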
hadoop@946b4517b87c:~$ hadoop fs -put data.csv

- Do these steps if Spark is not on the same machine where the container is hosted (see the example ssh command after these steps):
- You need ssh access to the machine where the container is hosted/running.
- Forward these ports over ssh: 9000, 9870, 9866, 9864, 9867.
- If the Spark shell is running on the same machine, skip the previous step.
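A minimal sketch of that port forwarding, assuming a hypothetical remote host hdfs-host and remote user user (replace both with your own):

$ ssh -N -L 9000:localhost:9000 -L 9870:localhost:9870 -L 9866:localhost:9866 \
      -L 9864:localhost:9864 -L 9867:localhost:9867 user@hdfs-host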
- Spark sample code
scala> val df = spark.read.text("hdfs://localhost:9000/user/hadoop/data.csv")
df: org.apache.spark.sql.DataFrame = [value: string]
scala> df.show
+--------------------+
| value|
+--------------------+
|id,name,departmen...|
|1,John Doe,Engine...|
|2,Jane Smith,Mark...|
|3,Robert Brown,Sa...|
|4,Emily Davis,Eng...|
|5,Michael Wilson,...|
|6,Sophia Taylor,F...|
|7,David Miller,Ma...|
|8,Olivia Anderson...|
|9,Daniel Thomas,S...|
|10,Ava Martin,HR,...|
+--------------------+
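spark.read.text treats every line as a single value column. A hedged follow-up for the same spark-shell session (same HDFS path as above, assuming the first row is a header) parses the file into typed columns instead:

scala> val csvDf = spark.read.option("header", "true").option("inferSchema", "true").csv("hdfs://localhost:9000/user/hadoop/data.csv")
scala> csvDf.printSchema
scala> csvDf.show(5)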