opal - Re: [Opal] optimizer sometime gets stuck

opal AT lists.psi.ch

Subject: The OPAL Discussion Forum

  • From: "Adelmann Andreas (PSI)" <andreas.adelmann AT psi.ch>
  • To: Philippe Piot <philippe.piot AT gmail.com>
  • Cc: "opal AT lists.psi.ch" <opal AT lists.psi.ch>
  • Subject: Re: [Opal] optimizer sometime gets stuck
  • Date: Thu, 27 May 2021 14:15:09 +0000
  • Accept-language: en-US, de-CH
  • Authentication-results: localhost; iprev=pass (psi-seppmail1.ethz.ch) smtp.remote-ip=129.132.93.141; spf=pass smtp.mailfrom=psi.ch; dmarc=skipped

Hello Philippe,

on Bebop most of the simulations were done by Nicole. Indeed, we had a lot of
"stability"-related problems; by that I mean that after a resubmission the run
was successful. Some were related to MPI environment variables
(you also set some). Needless to say, I also suspect
that there could be a deadlock in OPAL.

We needed the help of the computing consultant to sort out some of these 
issues. 

The submission script looks good to me.

So here are a few things to try out:

1. Does the job run when using fewer cores?
2. What about the KNL partition? The cores are slower but you need
    to use fewer nodes.
3. In case of a deadlock, maybe a computing guy can find out where in the code it occurs
    (this would need a build compiled with -O3 -g).
4. As a last resort we could try to debug this at our retreat. A configuration
    with a minimal number of cores that exhibits the problem would be interesting.
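
A minimal sketch of how items 1 and 4 could be approached: it halves the rank count, starting from the 8 x 36 = 288 ranks of the submission script quoted below, and prints the matching sbatch commands as a dry run. The script name submit.sh, the helper halving_schedule, and the stopping point of 9 ranks are assumptions for illustration, not part of the original setup.

```shell
#!/bin/bash
# Sketch only: enumerate smaller job sizes to look for the smallest
# configuration that still hangs. Nothing is submitted (dry run).

# Matches --ntasks-per-node=36 in the quoted submission script.
TASKS_PER_NODE=36

# Print "ntasks nodes" pairs, halving ntasks until it drops below min.
# Usage: halving_schedule START [MIN]   (hypothetical helper)
halving_schedule() {
    local ntasks=$1 min=${2:-1} nodes
    while [ "$ntasks" -ge "$min" ] && [ "$ntasks" -gt 0 ]; do
        # Round up to whole nodes.
        nodes=$(( (ntasks + TASKS_PER_NODE - 1) / TASKS_PER_NODE ))
        echo "$ntasks $nodes"
        ntasks=$(( ntasks / 2 ))
    done
}

# Dry run: print the commands instead of running them.
# "submit.sh" is a placeholder name for the submission script.
halving_schedule 288 9 | while read -r ntasks nodes; do
    echo "sbatch --nodes=$nodes --ntasks=$ntasks submit.sh"
done
```

Options given on the sbatch command line take precedence over the #SBATCH lines inside the script, so the script itself would not need editing; dropping the echo would submit the jobs for real.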

Does that make sense?

Cheers A
------
Dr. sc. math. Andreas (Andy) Adelmann
Head a.i. Labor for Scientific Computing and Modelling 
Paul Scherrer Institut OHSA/ CH-5232 Villigen PSI
Phone Office: xx41 56 310 42 33 Fax: xx41 56 310 31 91
Zoom ID: 470-582-4086 Password: AdA
Zoom Link: https://ethz.zoom.us/j/4705824086?pwd=dFcvT1pMMGY0bHg0dTNncUNZZTJkZz09

-------------------------------------------------------
Friday: ETH HPK G 28   +41 44 633 3076
============================================
The more exotic, the more abstract the knowledge, 
the more profound will be its consequences.
Leon Lederman 
============================================

On 27 May 2021, at 14:45, Philippe Piot <philippe.piot AT gmail.com> wrote:

Andreas, 
  Did you ever encounter this type of problem on bebop? This is the cluster I am using -- below is my input script in case you have a good suggestion. Thank you! -- Philippe. 

#!/bin/bash -l
#SBATCH -A Bright-Beams
#SBATCH --job-name=awa_optim
#SBATCH -o optim.%j.%N.out
#SBATCH -e optim.%j.%N.error
#SBATCH --time=18:00:00
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=36
#SBATCH --partition=bdwall

#
#export I_MPI_SLURM_EXT=0
#export I_MPI_FABRICS=shm:tmi

ulimit -s unlimited

export OPAL_EXE_PATH=/lcrc/project/Bright-Beams/software/opal/build_gcc/src
#
# cd $SLURM_SUBMIT_DIR
#
rm -rf   *.0 tmp *_0
#
# mkdir results tmp
#
# Setup My Environment
module load  gcc/7.1.0-4bgguyp
module load boost  # needs > 1.66
module load mpich
module load hdf5/1.10.5-fuzylbv # need parallel
module load libszip
module load gsl   #/2.4


# Run My Program

mpirun -n $SLURM_NTASKS $OPAL_EXE_PATH/opal awaDrive_optimEmit.in  --info 5

On Thu, May 27, 2021 at 7:41 AM Adelmann Andreas (PSI) <andreas.adelmann AT psi.ch> wrote:
Hi Philippe, I tend to agree with Jochem (I misinterpreted the output snippet in your original email).
 
Cheers A

On 27 May 2021, at 14:32, Philippe Piot <philippe.piot AT gmail.com> wrote:

<pilot.trace.0>




