This repository was archived by the owner on Jun 6, 2025. It is now read-only.

Enable providing own hadoop for pyspark notebook image #220

Description

@t92549

In the hdfs and Accumulo Dockerfiles, users can provide their own builds of Accumulo, ZooKeeper and Hadoop, which are used instead of downloading the official distributions inside the image:

# Allow users to provide their own builds of Accumulo, ZooKeeper and Hadoop
COPY ./files/ .
# Otherwise, download official distributions
RUN if [ ! -f "./accumulo-${ACCUMULO_VERSION}-bin.tar.gz" ]; then \
(wget -nv -O ./accumulo-${ACCUMULO_VERSION}-bin.tar.gz ${ACCUMULO_DOWNLOAD_URL} || wget -nv -O ./accumulo-${ACCUMULO_VERSION}-bin.tar.gz ${ACCUMULO_BACKUP_DOWNLOAD_URL}); \

This can save a lot of time with repeated builds.
This cannot be done, however, for Hadoop inside the pyspark notebook Dockerfile, which always downloads it:
ARG HADOOP_VERSION=3.2.2
ARG HADOOP_DOWNLOAD_URL="https://www.apache.org/dyn/closer.cgi?action=download&filename=hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz"
ARG HADOOP_BACKUP_DOWNLOAD_URL="https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz"
RUN cd /opt && \
(wget -nv -O ./hadoop-${HADOOP_VERSION}.tar.gz ${HADOOP_DOWNLOAD_URL} || wget -nv -O ./hadoop-${HADOOP_VERSION}.tar.gz ${HADOOP_BACKUP_DOWNLOAD_URL}) && \

It would be great if this were added to that Dockerfile as well.
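
For illustration, here is a minimal sketch of how the same pattern could look in the pyspark notebook Dockerfile, assuming a ./files/ directory is used for user-provided archives as in the hdfs and Accumulo Dockerfiles (the exact paths and layout here are assumptions, not the actual file):

ARG HADOOP_VERSION=3.2.2
ARG HADOOP_DOWNLOAD_URL="https://www.apache.org/dyn/closer.cgi?action=download&filename=hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz"
ARG HADOOP_BACKUP_DOWNLOAD_URL="https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz"
WORKDIR /opt
# Allow users to provide their own build of Hadoop (assumed ./files/ location)
COPY ./files/ .
# Otherwise, fall back to downloading an official distribution
RUN if [ ! -f "./hadoop-${HADOOP_VERSION}.tar.gz" ]; then \
(wget -nv -O ./hadoop-${HADOOP_VERSION}.tar.gz ${HADOOP_DOWNLOAD_URL} || \
wget -nv -O ./hadoop-${HADOOP_VERSION}.tar.gz ${HADOOP_BACKUP_DOWNLOAD_URL}); \
fi && \
tar -xzf ./hadoop-${HADOOP_VERSION}.tar.gz

A local hadoop-${HADOOP_VERSION}.tar.gz dropped into files/ would then be picked up by the COPY step and the download skipped on repeated builds; as with the hdfs image, files/ would need to exist (e.g. containing only a placeholder) for the COPY step to succeed.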

Metadata

Labels

Docker: Issue related to the Docker side of the project
good first issue: Small, lower complexity and doesn't require pre-existing Gaffer knowledge
