Hurricane / Whirlwind
Hurricane and Whirlwind are the SciClone subclusters built around Intel Xeon "Westmere-EP" processors. Their front-end is hurricane.sciclone.wm.edu, and they share the same startup modules file, .cshrc.rhel6-xeon.
Hardware
| | Front-end (hurricane / hu00) | GPU nodes (hu01-hu12) | Non-GPU nodes (wh01-wh44) | Large-memory nodes (wh45-wh52) |
|---|---|---|---|---|
| Model | HP ProLiant DL180 G6 | HP ProLiant SL390s G7 2U | Dell PowerEdge C6100 | Dell PowerEdge C6100 |
| Processor(s) | 2×4-core Intel Xeon E5620 | 2×4-core Intel Xeon X5672 | 2×4-core Intel Xeon X5672 | 2×4-core Intel Xeon X5672 |
| Clock speed | 2.4 GHz | 3.2 GHz | 3.2 GHz | 3.2 GHz |
| L3 cache | 12 MB | 12 MB | 12 MB | 12 MB |
| Memory | 16 GB | 48 GB 1333 MHz | 64 GB 1333 MHz | 192 GB 1066 MHz |
| GPUs | - | 2 × NVIDIA Tesla M2075, 1.15 GHz, 448 CUDA cores | - | - |
| Application network | QDR IB (hu??-ib) | QDR IB (hu??-ib) | QDR IB (wh??-ib) | QDR IB (wh??-ib) |
| System network | 10 GbE (hu00) | 1 GbE (hu??) | 1 GbE (wh??) | 1 GbE (wh??) |
| OS | RHEL 6.8 | CentOS 6.8 | CentOS 6.8 | CentOS 6.8 |
The Hurricane and Whirlwind subclusters share a single QDR InfiniBand switch, and can also communicate with the Rain subcluster via a single DDR (20 Gb/s) switch-to-switch link, with Hima via a single QDR (40 Gb/s) switch-to-switch link, and with Bora and Vortex via two QDR switch-to-switch links.
Main memory size works out to a generous 6-24 GB/core, meeting or exceeding that of SciClone's large-memory Rain nodes. Per-core InfiniBand bandwidth is 5 Gb/s per core (less 20% protocol overhead), also matching that of the existing Rain compute nodes. However, when the higher speed of the Xeon processors is taken into account, the bandwidth/FLOP is lower; when GPU acceleration is factored in, the communication-to-computation ratio could drop by a couple of orders of magnitude. Communication performance is therefore an important concern when designing multi-node parallel algorithms for this architecture.
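As a rough back-of-envelope check of these figures (Rain's four cores per DDR link and the ~12.8 GFLOP/s per-core peak are assumptions used here only for scale, not official specifications):
echo "scale=2; 40/8" | bc             # QDR: 40 Gb/s per node shared by 8 cores = 5 Gb/s per core
echo "scale=2; 20/4" | bc             # Rain DDR: 20 Gb/s shared by an assumed 4 cores = 5 Gb/s per core
echo "scale=3; (5*0.8/8)/12.8" | bc   # 5 Gb/s less 20% overhead = 0.5 GB/s per core; vs ~12.8 GFLOP/s per core -> ~0.04 bytes/FLOP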
TORQUE node specifiers
All access to compute nodes (for either interactive or batch work) is via the TORQUE resource manager, as described elsewhere. TORQUE assigns jobs to a particular set of processors so that jobs do not interfere with each other. The TORQUE properties for the hurricane and whirlwind nodes are:
hu01-hu08: c10, c10x, x5672, el6, compute, hurricane
hu09-hu12: c10a, c10x, x5672, el6, compute, hurricane
wh01-wh44: c11, c11x, x5672, el6, compute, whirlwind
wh45-wh52: c11a, c11x, x5672, el6, compute, whirlwind
This set of properties allows you to select different subsets of hurricane and whirlwind nodes.
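For example, to check which properties a particular node advertises, or to restrict a job to one of these subsets (here the hu09-hu12 group), commands along the following lines should work; the node name and node count are just illustrations:
pbsnodes hu09
qsub -l nodes=1:c10a:ppn=8 ...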
Considerations for CPU jobs
Although the Hurricane nodes additionally have GPUs, Hurricane and Whirlwind share the same CPU configuration and InfiniBand switch, so non-GPU parallel jobs can use them together as a single "metacluster" by requesting the TORQUE property named for their processor model, e.g.
qsub -l nodes=16:x5672:ppn=8 ...
To use only Whirlwind nodes, you would instead specify
qsub -l nodes=16:whirlwind:ppn=8 ...
If your memory requirements exceed the 8000 MB/core available on the regular Whirlwind nodes, restrict your job to the large-memory nodes, either by specifying a per-process memory requirement or by requesting the c11a property directly:
qsub -l nodes=4:whirlwind:ppn=8,pmem=24000mb ...
qsub -l nodes=4:c11a:ppn=8 ...
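For batch work, the same node specifications go on #PBS lines in a job script. A minimal sketch under stated assumptions is shown below; the job name, module, executable, process count, and MPI launch syntax are placeholders, so substitute whatever your application and MPI stack actually require:
#!/bin/tcsh
#PBS -N cpu_metacluster
#PBS -l nodes=16:x5672:ppn=8
#PBS -l walltime=12:00:00
#PBS -j oe
cd $PBS_O_WORKDIR
# placeholder module and executable; load whatever your code was built with
module load pgi/11.10
mpirun -machinefile $PBS_NODEFILE -np 128 ./my_mpi_app
Submit the script with qsub followed by its file name.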
Considerations for GPU jobs
Until we are able to install a GPU-aware job scheduler, there is no simple way to allocate GPU devices among multiple jobs running on the same node. To prevent device conflicts (which would result in either runtime errors or degraded performance), we recommend that GPU applications request an entire node, and then use as many of the GPU devices and Xeon cores as possible to avoid wasting resources. For example (the -n option below requests exclusive access to the allocated node):
qsub -n -l nodes=1:hurricane ...
In some cases it may be necessary to obtain an interactive shell on a GPU-enabled compute node in order to successfully compile GPU applications; this can be done with qsub -I. This limitation arises because the CUDA device driver and associated libraries cannot be installed on systems (such as the hurricane front-end node) which do not have resident GPU devices.
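As a sketch of that workflow (the walltime, source file, and output name are placeholders; -arch=sm_20 targets the Fermi-generation Tesla M2075):
qsub -I -l nodes=1:hurricane:ppn=8,walltime=1:00:00
module load cuda/7.0
nvcc -O3 -arch=sm_20 -o my_gpu_app my_gpu_app.cu
./my_gpu_app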
Compilers
Several compiler suites are available in SciClone's RHEL 6 / Xeon environment, including PGI 11.10, several versions of the GNU Compiler Collection (GCC), and Solaris Studio 12.3. NVIDIA's nvcc compiler for CUDA runs on top of GCC 4.4.7.
In some cases libraries and applications are supported only for a particular compiler suite; in other cases they may be supported across multiple compiler suites. For a complete list of available compilers, use the "module avail" command.
In most cases code generated by the commercial compiler suites (PGI and Sun Studio) will outperform that generated by the open-source GNU compilers, sometimes by a wide margin. There are exceptions, however, so we strongly encourage you to experiment with different compiler suites in order to determine which will yield the best performance for a given task. When a GNU compiler is required, we recommend GCC 4.7.0 since it has better support for the Nehalem architecture than earlier versions.
Note that well-crafted GPU programs written with CUDA, OpenCL, or compiler directives can vastly exceed the performance achievable with conventional code running on the Xeon processors, but the inverse is also true: problems that are ill-suited to the GPU architecture or which require a lot of data movement between main memory and GPU memory can run much more slowly than CPU-based code.
You can switch between alternative compilers by modifying the appropriate module load command in your .cshrc.rhel6-xeon file. The default configuration loads pgi/11.10. Because of conflicts with command names, environment variables, libraries, etc., attempts to load multiple compiler modules into your environment simultaneously may result in an error.
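For instance, to switch from the default PGI suite to GCC 4.7.0, the edit in .cshrc.rhel6-xeon would look roughly like this (a fresh login then picks up the change):
# in ~/.cshrc.rhel6-xeon, change
module load pgi/11.10
# to
module load gcc/4.7.0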
For details about compiler installation paths, environment variables, etc., use the "module show" command for the compiler of interest, e.g.,
module show pgi/11.10
module show gcc/4.7.0
module show solstudio/12.3
module show cuda/7.0
etc.
For proper operation and best performance, it's important to choose compiler options that match the target architecture and enable the most profitable code optimizations. The options listed below are suggested as starting points. Note that for some codes, these optimizations may be too aggressive and may need to be scaled back. Consult the appropriate compiler manuals for full details.
| Compiler | Language | Suggested options |
|---|---|---|
| Intel | C | icc -O3 -xSSE4.2 -align -finline-functions |
| Intel | C++ | icpc -std=c++11 -O3 -xSSE4.2 -align -finline-functions |
| Intel | Fortran | ifort -O3 -xSSE4.2 -align array64byte -finline-functions |
| GCC | C | gcc -march=westmere -O3 -finline-functions |
| GCC | C++ | g++ -std=c++11 -march=westmere -O3 -finline-functions |
| GCC | Fortran | gfortran -march=westmere -O3 -finline-functions |
| GCC 4.4.6 | | -O3 -march=core2 -m64 |
| GCC 4.7.0 | | -O3 -march=corei7 -m64 |
| PGI | | -fast -tp nehalem [-ta=nvidia,cc13,cuda4.0] -m64 -Mipa=fast [-Minfo=all] |
| Sun | | -fast -xchip=westmere -xarch=sse4_2 -xcache=32/64/8:256/64/8:12288/64/16 -m64 |
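As a concrete illustration of how these options combine into complete compile commands (app.f90 and the output names are placeholders; include the bracketed PGI options only when targeting the GPUs):
pgf90 -fast -tp nehalem -m64 -Mipa=fast -o app_pgi app.f90
ifort -O3 -xSSE4.2 -align array64byte -finline-functions -o app_intel app.f90
gfortran -O3 -march=corei7 -m64 -o app_gcc app.f90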