Today is Sunday, and Vancouver is sunny. It's been quite a while since I last wrote anything; it took me a couple of weeks to get my tax return filed. Anyway, I finally have some time to talk about my supercomputer: a cluster of 4 Raspberry Pis. Here is the cat /proc/cpuinfo output from the nodes:
```
processor       : 1
model name      : ARMv7 Processor rev 3 (v7l)
BogoMIPS        : 108.00
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xd08
CPU revision    : 3

processor       : 2
model name      : ARMv7 Processor rev 3 (v7l)
BogoMIPS        : 108.00
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xd08
CPU revision    : 3

processor       : 3
model name      : ARMv7 Processor rev 3 (v7l)
BogoMIPS        : 108.00
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xd08
CPU revision    : 3

Hardware        : BCM2835
Revision        : d03114
Serial          : 10000000bc6e6e05
Model           : Raspberry Pi 4 Model B Rev 1.4
```
```
processor       : 1
BogoMIPS        : 108.00
Features        : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd08
CPU revision    : 3

processor       : 2
BogoMIPS        : 108.00
Features        : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd08
CPU revision    : 3

processor       : 3
BogoMIPS        : 108.00
Features        : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd08
CPU revision    : 3

Hardware        : BCM2835
Revision        : d03114
Serial          : 10000000bc6e6e05
Model           : Raspberry Pi 4 Model B Rev 1.4
```
```
processor       : 1
model name      : ARMv7 Processor rev 3 (v7l)
BogoMIPS        : 108.00
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xd08
CPU revision    : 3

processor       : 2
model name      : ARMv7 Processor rev 3 (v7l)
BogoMIPS        : 108.00
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xd08
CPU revision    : 3

processor       : 3
model name      : ARMv7 Processor rev 3 (v7l)
BogoMIPS        : 108.00
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xd08
CPU revision    : 3

Hardware        : BCM2835
Revision        : c03111
Serial          : 100000006c0c9b01
Model           : Raspberry Pi 4 Model B Rev 1.1
```
```
processor       : 1
model name      : ARMv7 Processor rev 4 (v7l)
BogoMIPS        : 38.40
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4

processor       : 2
model name      : ARMv7 Processor rev 4 (v7l)
BogoMIPS        : 38.40
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4

processor       : 3
model name      : ARMv7 Processor rev 4 (v7l)
BogoMIPS        : 38.40
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4

Hardware        : BCM2835
Revision        : a02082
Serial          : 000000009fcc6a22
Model           : Raspberry Pi 3 Model B Rev 1.2
```
```
processor       : 1
model name      : ARMv7 Processor rev 4 (v7l)
BogoMIPS        : 38.40
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4

processor       : 2
model name      : ARMv7 Processor rev 4 (v7l)
BogoMIPS        : 38.40
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4

processor       : 3
model name      : ARMv7 Processor rev 4 (v7l)
BogoMIPS        : 38.40
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4

Hardware        : BCM2835
Revision        : a22082
Serial          : 000000003fc1b876
Model           : Raspberry Pi 3 Model B Rev 1.2
```
The cluster can of course be configured however you wish. A typical configuration is 1 master and 3 workers, but which node should be the master? Is it really a good idea to always designate the most powerful one as the master? In particular, my 4 Raspberry Pis are of different generations, and therefore of different computing capability.
3.1 Configure the Hostfile
It's always a good idea to create a hostfile on the master node. However, for the reasons mentioned above, there is no fixed priority among the nodes in my case, so I configured the hostfile on all 4 Raspberry Pis.
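For reference, a minimal hostfile for this setup might look like the following, assuming the node IP addresses used later in this post (192.168.1.253/.251/.249/.247) and 4 slots per Pi:

```
192.168.1.253 slots=4
192.168.1.251 slots=4
192.168.1.249 slots=4
192.168.1.247 slots=4
```

Each slots entry tells Open MPI how many processes it may place on that node, which is why 16 processes in total can be launched later.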
In order to run jobs across multiple nodes of the cluster, we need to set up SSH keys so that we don't have to type a password every time we log into another node. To that end, on each Raspberry Pi, generate an SSH key with ssh-keygen -t rsa and push the generated key onto the other 3 Raspberry Pis with ssh-copy-id. In the end, for a cluster of 4 Raspberry Pis, the file /home/pi/.ssh/authorized_keys on each node holds 3 authorized keys (one for each of the other 3 Raspberry Pis).
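The key exchange can be scripted. Here is a sketch to run once on each node, assuming the default pi user and the hostnames pi01 through pi04 seen in the outputs below (the empty passphrase is my assumption, for unattended MPI logins):

```shell
# generate a key pair on this node (skip if ~/.ssh/id_rsa already exists);
# -N "" sets an empty passphrase so mpiexec can log in non-interactively
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# push the public key to the other three nodes
for host in pi01 pi02 pi03 pi04; do
    # skip this node itself
    [ "$host" = "$(hostname)" ] && continue
    ssh-copy-id "pi@$host"
done
```

ssh-copy-id appends the public key to the remote node's ~/.ssh/authorized_keys, which is exactly where the 3 keys mentioned above end up.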
For a cluster of 4 Raspberry Pis, there are 4 × 4 = 16 CPU cores in total, so the maximum value for the -n argument is 16. Otherwise, you'll get the following error message:
```
pi@pi01:~ $ mpiexec -hostfile hostfile -n 20 hostname
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 20
slots that were requested by the application:

  hostname

Either request fewer slots for your application, or make more slots
available for use.
--------------------------------------------------------------------------
```
```
pi@pi01:~ $ mpiexec -hostfile hostfile -n 16 python Downloads/helloworld.py
Hello, World! I am process 1 of 16 on pi01.
Hello, World! I am process 5 of 16 on pi02.
Hello, World! I am process 6 of 16 on pi02.
Hello, World! I am process 7 of 16 on pi02.
Hello, World! I am process 4 of 16 on pi02.
Hello, World! I am process 15 of 16 on pi04.
Hello, World! I am process 12 of 16 on pi04.
Hello, World! I am process 13 of 16 on pi04.
Hello, World! I am process 14 of 16 on pi04.
Hello, World! I am process 2 of 16 on pi01.
Hello, World! I am process 0 of 16 on pi01.
Hello, World! I am process 3 of 16 on pi01.
Hello, World! I am process 9 of 16 on pi03.
Hello, World! I am process 10 of 16 on pi03.
Hello, World! I am process 11 of 16 on pi03.
Hello, World! I am process 8 of 16 on pi03.
```
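The helloworld.py script itself isn't listed in this post; a minimal mpi4py sketch that would produce output in that shape (my own reconstruction, assuming mpi4py is installed on every node) is:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD            # communicator spanning all launched processes
rank = comm.Get_rank()           # this process's id, 0 .. size-1
size = comm.Get_size()           # total number of launched processes
name = MPI.Get_processor_name()  # hostname of the node running this process

print("Hello, World! I am process %d of %d on %s." % (rank, size, name))
```

Note that the print order is nondeterministic: each rank writes independently, which is why the lines above arrive out of order.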
```
pi@pi01:~/Downloads/mpi4py-examples $ mpirun --hostfile ~/hostfile ./01-hello-world
Hello! I'm rank 1 from 16 running in total...
Hello! I'm rank 2 from 16 running in total...
Hello! I'm rank 3 from 16 running in total...
Hello! I'm rank 0 from 16 running in total...
Hello! I'm rank 6 from 16 running in total...
Hello! I'm rank 7 from 16 running in total...
Hello! I'm rank 4 from 16 running in total...
Hello! I'm rank 5 from 16 running in total...
Hello! I'm rank 12 from 16 running in total...
Hello! I'm rank 10 from 16 running in total...
Hello! I'm rank 11 from 16 running in total...
Hello! I'm rank 13 from 16 running in total...
Hello! I'm rank 9 from 16 running in total...
Hello! I'm rank 14 from 16 running in total...
Hello! I'm rank 8 from 16 running in total...
Hello! I'm rank 15 from 16 running in total...
```
Sometimes, without specifying the
parameter btl_tcp_if_include, the running program will
hang:
```
pi@pi01:~/Downloads/mpi4py-examples $ mpirun --np 16 --hostfile ~/hostfile 03-scatter-gather
------------------------------------------------------------------------------
 Running on 16 cores
------------------------------------------------------------------------------
After Scatter:
[0] [0. 1. 2. 3.]
[1] [4. 5. 6. 7.]
[pi03][[1597,1],8][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[1597,1],10]
[pi01][[1597,1],0][btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[1597,1],3]
[2] [ 8.  9. 10. 11.]
^C^Z
[1]+  Stopped                 mpirun --np 16 --hostfile ~/hostfile 03-scatter-gather
```
Please refer to the explanation in TCP: unexpected process identifier in connect_ack. Now, let's specify the parameter as --mca btl_tcp_if_include "192.168.1.251/24,192.168.1.249/24,192.168.1.247/24".
```
pi@pi01:~/Downloads/mpi4py-examples $ mpirun --hostfile ~/hostfile python ./09-task-pull.py
Master starting with 15 workers
I am a worker with rank 1 on pi01.
I am a worker with rank 2 on pi01.
I am a worker with rank 3 on pi01.
I am a worker with rank 4 on pi02.
I am a worker with rank 5 on pi02.
I am a worker with rank 6 on pi02.
I am a worker with rank 7 on pi02.
Sending task 0 to worker 2
Sending task 1 to worker 1
Sending task 2 to worker 3
Got data from worker 2
Sending task 3 to worker 2
Got data from worker 3
Sending task 4 to worker 3
Got data from worker 1
Got data from worker 2
Sending task 5 to worker 1
Sending task 6 to worker 2
Got data from worker 3
Sending task 7 to worker 3
Got data from worker 1
Got data from worker 2
Sending task 8 to worker 1
Sending task 9 to worker 2
Got data from worker 3
Sending task 10 to worker 3
Got data from worker 1
Got data from worker 2
Sending task 11 to worker 1
Sending task 12 to worker 2
Got data from worker 3
Sending task 13 to worker 3
Got data from worker 1
Got data from worker 2
Sending task 14 to worker 1
Sending task 15 to worker 2
Got data from worker 3
Sending task 16 to worker 3
Got data from worker 1
Got data from worker 2
Sending task 17 to worker 1
Sending task 18 to worker 2
Got data from worker 3
Sending task 19 to worker 3
Got data from worker 1
Sending task 20 to worker 1
Got data from worker 2
Sending task 21 to worker 2
Got data from worker 3
Sending task 22 to worker 3
Got data from worker 1
Sending task 23 to worker 1
Got data from worker 2
Got data from worker 3
Sending task 24 to worker 2
Sending task 25 to worker 3
Got data from worker 2
Got data from worker 1
Sending task 26 to worker 2
Got data from worker 3
Sending task 27 to worker 3
Got data from worker 2
Sending task 28 to worker 1
Sending task 29 to worker 2
Got data from worker 3
Sending task 30 to worker 3
Got data from worker 2
Got data from worker 1
Sending task 31 to worker 2
Got data from worker 2
Got data from worker 3
Worker 2 exited.
Worker 1 exited.
Worker 3 exited.
I am a worker with rank 15 on pi04.
I am a worker with rank 12 on pi04.
I am a worker with rank 8 on pi03.
I am a worker with rank 13 on pi04.
I am a worker with rank 9 on pi03.
I am a worker with rank 14 on pi04.
I am a worker with rank 10 on pi03.
I am a worker with rank 11 on pi03.
Worker 5 exited.
Worker 4 exited.
Worker 6 exited.
Worker 7 exited.
Worker 15 exited.
Worker 8 exited.
Worker 9 exited.
Worker 10 exited.
Worker 11 exited.
Worker 12 exited.
Worker 13 exited.
Worker 14 exited.
Master finishing
```
pi01

```
pi@pi01:~ $ mpiexec -n 1 python prime.py 100000
Find all primes up to: 100000
Nodes: 1
Time elasped: 214.86 seconds
Primes discovered: 9592
```
pi02
```
pi@pi02:~ $ mpiexec -n 1 python prime.py 100000
Find all primes up to: 100000
Nodes: 1
Time elasped: 212.2 seconds
Primes discovered: 9592
```
pi03
```
pi@pi03:~ $ mpiexec -n 1 python prime.py 100000
Find all primes up to: 100000
Nodes: 1
Time elasped: 665.24 seconds
Primes discovered: 9592
```
pi04
```
pi@pi04:~ $ mpiexec -n 1 python prime.py 100000
Find all primes up to: 100000
Nodes: 1
Time elasped: 684.64 seconds
Primes discovered: 9592
```
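The prime.py benchmark itself isn't listed in this post, but its result (9592 primes up to 100000) can be reproduced with a plain serial sieve. The sketch below is my own illustration of the computation, not the actual benchmark code:

```python
def count_primes(n):
    """Count the primes up to and including n with a Sieve of Eratosthenes."""
    if n < 2:
        return 0
    sieve = bytearray([1]) * (n + 1)  # sieve[k] == 1 means "k still possibly prime"
    sieve[0] = sieve[1] = 0
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            # cross out every multiple of i, starting at i*i
            sieve[i * i :: i] = bytearray(len(range(i * i, n + 1, i)))
    return sum(sieve)

if __name__ == "__main__":
    print("Primes discovered:", count_primes(100000))  # 9592, matching the runs above
```

This agrees with the "Primes discovered: 9592" line printed by every node.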
Clearly, each CPU core on pi01/pi02 is roughly 3 times faster than a core on pi03/pi04, which can be estimated directly from the BogoMIPS values: \[ \frac{108.00\ \text{(pi01/pi02)}}{38.40\ \text{(pi03/pi04)}} = 2.8125 \approx 3 \]
Experiment 1: run prime.py on all 4 nodes (16 cores):

```
pi@pi01:~ $ mpiexec -np 16 --hostfile hostfile --mca btl_tcp_if_include "192.168.1.251/24,192.168.1.249/24,192.168.1.247/24" python prime.py 100000
Find all primes up to: 100000
Nodes: 16
Time elasped: 42.22 seconds
Primes discovered: 9592
```
Experiment 2: run on only the 2 fastest nodes, with the hostfile reduced to:

```
192.168.1.253 slots=4
192.168.1.251 slots=4
```
```
pi@pi01:~ $ mpiexec -np 8 --hostfile hostfile --mca btl_tcp_if_include "192.168.1.251/24" python prime.py 100000
Find all primes up to: 100000
Nodes: 8
Time elasped: 29.56 seconds
Primes discovered: 9592
```
The results are telling:

- Computing on the cluster of 4 Raspberry Pis with 16 CPU cores is always faster than running on a single node with 4 cores: \[ 42.22 \le 50 \]
- Computing on the 2 fastest nodes is even faster than running on the whole cluster of 4 nodes, which clearly hints at the importance of Load Balancing: \[ 29.56 \le 42.22 \]
- Experiment 2 is roughly twice as fast as running on a single 4-core node of pi01/pi02: \[ \frac{52\ \text{(pi01/pi02)}}{29.56\ \text{(Experiment 2)}} \approx 2 \]
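One simple, untested way to act on that load-balancing hint with Open MPI is to weight each node's slots by its measured per-core speed (roughly 3:1 by BogoMIPS), so the scheduler places fewer processes on the slow Pi 3s. The IP-to-hostname mapping below is my assumption based on the addresses used earlier in this post:

```
# hypothetical hostfile weighting slots by per-core speed (~3:1 from BogoMIPS)
192.168.1.253 slots=4   # pi01 (fast)
192.168.1.251 slots=4   # pi02 (fast)
192.168.1.249 slots=1   # pi03 (slow)
192.168.1.247 slots=1   # pi04 (slow)
```

This only balances process counts, not actual work per process; proper load balancing is a topic of its own.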
That's it for this post. As for Load Balancing, I may talk about it some time in the future.