Installing Hail¶
Requirements¶
You’ll need:
- Java 8 JDK
- Spark 2.2.0 (Hail will work with other bug-fix versions of Spark 2.2.x, but it will not work with Spark 1.x.x, 2.0.x, or 2.1.x)
- Anaconda for Python 3
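To confirm that the Java and Anaconda requirements are met, you can check the installed versions from a shell. This assumes java and conda are already on your PATH; exact output varies by platform.
java -version     # should report a 1.8.x JDK
conda --version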
Installation¶
Running Hail locally with a pre-compiled distribution¶
Hail uploads distributions to Google Storage as part of our continuous integration suite. You can download a pre-built distribution from the links below. Make sure you download the distribution that matches your Spark version! We recommend updating about once a week, as features are added and improved regularly.
A pre-compiled distribution will be suitable for most users. If you’d like to use Hail with a different version of Spark, see Building your own JAR.
Unzip the distribution after you download it. Next, edit and copy the bash commands below to set up the Hail environment variables. You may want to add the export lines to the appropriate dot-file (we recommend ~/.profile) so that you don’t need to rerun these commands in each new session.
Un-tar the Spark distribution.
tar xvf <path to spark.tgz>
Here, fill in the path to the un-tarred Spark package.
export SPARK_HOME=<path to spark>
Unzip the Hail distribution.
unzip <path to hail.zip>
Here, fill in the path to the unzipped Hail distribution.
export HAIL_HOME=<path to hail>
export PATH=$PATH:$HAIL_HOME/bin/
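As a quick sanity check that PATH is set up correctly, you can confirm that the wrapper scripts shipped in the distribution's bin directory are now found; the names match the commands described later in this section.
command -v hail     # similarly for ihail and jhail; each should resolve to a path under $HAIL_HOME/bin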
To install Python dependencies, create a conda environment for Hail:
conda env create -n hail -f $HAIL_HOME/python/hail/environment.yml
source activate hail
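Once the environment is created and activated, a simple check is to list your conda environments and confirm that hail is present and active (the environment name comes from the -n flag above).
conda env list     # the active environment is marked with an asterisk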
Once you’ve set up Hail, we recommend that you run the Python tutorials to get an overview of Hail functionality and learn about the powerful query language. To try Hail out, run the commands below to start a Jupyter Notebook server in the tutorials directory.
cd $HAIL_HOME/tutorials
jhail
You can now click on the “01-genome-wide-association-study” notebook to get started!
In the future, if you want to run:
- Hail in Python use hail
- Hail in IPython use ihail
- Hail in a Jupyter Notebook use jhail
Hail will not import correctly from a normal Python interpreter, a normal IPython interpreter, or a normal Jupyter Notebook.
Building your own JAR¶
To use Hail with other versions of Spark 2, you’ll need to build your own JAR instead of using a pre-compiled distribution. To build against a different version, such as Spark 2.3.0, run the following command inside the directory where Hail is located:
./gradlew -Dspark.version=2.3.0 shadowJar
The Spark version in this command should match whichever version of Spark you would like to build against.
The SPARK_HOME environment variable should point to an installation of the desired version of Spark, such as spark-2.3.0-bin-hadoop2.7.
The version of the Py4J ZIP file in the hail alias must match the version in $SPARK_HOME/python/lib in your version of Spark.
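To find the exact Py4J version shipped with your Spark installation, you can list the ZIP file directly; this is just a quick check and assumes SPARK_HOME is already set.
ls "$SPARK_HOME"/python/lib/py4j-*-src.zip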
Running on a Spark cluster¶
Hail can run on any Spark 2.2 cluster. Google and Amazon, for example, offer managed Spark services with optimized performance and exceptional scalability to thousands of cores, without the overhead of installing and managing an on-premises cluster.
On Google Cloud Dataproc, we provide pre-built JARs and a Python package cloudtools to simplify running Hail, whether through an interactive Jupyter notebook or by submitting Python scripts.
For Cloudera-specific instructions, see Running on a Cloudera cluster.
For all other Spark clusters, you will need to build Hail from the source code.
Hail should be built on the master node of the Spark cluster with the following command, replacing 2.2.0 with the version of Spark available on your cluster:
./gradlew -Dspark.version=2.2.0 shadowJar archiveZip
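If the build succeeds, the JAR and the Python ZIP referenced by the environment variables below should exist; a quick way to verify this from the Hail directory is:
ls build/libs/hail-all-spark.jar build/distributions/hail-python.zip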
Python and IPython need a few environment variables to correctly find Spark and the Hail JAR. We recommend you set these environment variables in the relevant profile file for your shell (e.g. ~/.bash_profile).
export SPARK_HOME=/path/to/spark-2.2.0/
export HAIL_HOME=/path/to/hail/
export PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$HAIL_HOME/build/distributions/hail-python.zip"
export PYTHONPATH="$PYTHONPATH:$SPARK_HOME/python"
export PYTHONPATH="$PYTHONPATH:$SPARK_HOME/python/lib/py4j-*-src.zip"
## PYSPARK_SUBMIT_ARGS is used by ipython and jupyter
export PYSPARK_SUBMIT_ARGS="\
--jars $HAIL_HOME/build/libs/hail-all-spark.jar \
--conf spark.driver.extraClassPath=\"$HAIL_HOME/build/libs/hail-all-spark.jar\" \
--conf spark.executor.extraClassPath=./hail-all-spark.jar \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator
pyspark-shell"
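After setting these variables (for example by sourcing your profile file), you can sanity-check them by printing each PYTHONPATH entry and confirming it points to an existing file or directory; if the Py4J wildcard does not match the file in your Spark installation, substitute the actual filename.
echo "$PYTHONPATH" | tr ':' '\n'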
If these environment variables are set correctly, you can start an IPython shell that runs Hail backed by the cluster with the following command:
ipython
When using ipython, you can import hail and start interacting directly:
>>> import hail as hl
>>> mt = hl.balding_nichols_model(3, 100, 100)
>>> mt.aggregate_entries(hl.agg.mean(mt.GT.n_alt_alleles()))
You can also interact with Hail via a pyspark session, but you will need to pass the configuration from PYSPARK_SUBMIT_ARGS directly, as well as extra configuration parameters specific to running Hail through pyspark:
pyspark \
--jars $HAIL_HOME/build/libs/hail-all-spark.jar \
--conf spark.driver.extraClassPath=$HAIL_HOME/build/libs/hail-all-spark.jar \
--conf spark.executor.extraClassPath=./hail-all-spark.jar \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator
Moreover, unlike in ipython, pyspark provides a SparkContext via the global variable sc. For Hail to interact properly with the Spark cluster, you must tell Hail about this special SparkContext:
>>> import hail as hl
>>> hl.init(sc)
After this initialization step, you can interact as you would in ipython:
>>> mt = hl.balding_nichols_model(3, 100, 100)
>>> mt.aggregate_entries(hl.agg.mean(mt.GT.n_alt_alleles()))
It is also possible to run Hail non-interactively by passing a Python script to spark-submit. Again, you will need to explicitly pass several configuration parameters to spark-submit:
spark-submit \
--jars "$HAIL_HOME/build/libs/hail-all-spark.jar" \
--py-files "$HAIL_HOME/build/distributions/hail-python.zip" \
--conf spark.driver.extraClassPath="$HAIL_HOME/build/libs/hail-all-spark.jar" \
--conf spark.executor.extraClassPath=./hail-all-spark.jar \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator \
your-hail-python-script-here.py
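For reference, the script passed to spark-submit can be quite small. The sketch below writes a minimal example to the placeholder filename used above; it simply mirrors the interactive session shown earlier, and the result is printed to the driver’s stdout.
cat > your-hail-python-script-here.py <<'EOF'
# minimal sketch: initialize Hail, simulate a small dataset, and aggregate
import hail as hl
hl.init()
mt = hl.balding_nichols_model(3, 100, 100)
print(mt.aggregate_entries(hl.agg.mean(mt.GT.n_alt_alleles())))
EOF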
Running on a Cloudera cluster¶
These instructions explain how to install Spark 2 on a Cloudera cluster. You should work on a gateway node on the cluster that has the Hadoop and Spark packages installed on it.
Once Spark is installed, building and running Hail on a Cloudera cluster is exactly the same as above, except:
On a Cloudera cluster, when building a Hail JAR, you must specify a Cloudera version of Spark. The following example builds a Hail JAR for Cloudera’s 2.2.0 version of Spark:
./gradlew shadowJar -Dspark.version=2.2.0.cloudera
On a Cloudera cluster, SPARK_HOME should be set as:
SPARK_HOME=/opt/cloudera/parcels/SPARK2/lib/spark2
On Cloudera, you can create an interactive Python shell using pyspark:
pyspark --jars build/libs/hail-all-spark.jar \
  --py-files build/distributions/hail-python.zip \
  --conf spark.driver.extraClassPath="build/libs/hail-all-spark.jar" \
  --conf spark.executor.extraClassPath=./hail-all-spark.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator
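As a quick check that the Cloudera Spark 2 parcel is installed on the gateway node, you can list the SPARK_HOME directory given above; this parcel path is the usual default, but it may differ on your cluster.
ls /opt/cloudera/parcels/SPARK2/lib/spark2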
Common Installation Issues¶
BLAS and LAPACK¶
Hail uses BLAS and LAPACK optimized linear algebra libraries. These should load automatically on recent versions of Mac OS X and Google Dataproc. On Linux, these must be explicitly installed; on Ubuntu 14.04, run
apt-get install libatlas-base-dev
If the native libraries are not found, hail.log will contain the warnings:
Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
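You can check whether your log contains these warnings with a simple search, assuming hail.log is in your working directory:
grep "Failed to load implementation" hail.log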
See netlib-java for more information.