Re: [Opal] MPI issue


  • From: Robert Nagler <nagler AT radiasoft.net>
  • To: Adelmann Andreas <andreas.adelmann AT psi.ch>
  • Cc: "opal AT lists.psi.ch" <opal AT lists.psi.ch>
  • Subject: Re: [Opal] MPI issue
  • Date: Thu, 13 Jul 2023 10:23:36 -0600

Hi Andreas,

Thanks. This only happens on Perlmutter. We didn't have these problems on Cori, but that was a different host and guest operating system.

We haven't seen this on our local systems. I will do more testing on Perlmutter both in containers and native and report back.

Rob


On Thu, Jul 13, 2023 at 1:56 AM Adelmann Andreas <andreas.adelmann AT psi.ch> wrote:
Hi Rob, I never got that error.

Is the problem arising only when you run OPAL from the container on Perlmutter, or
does it also manifest itself when compiling/running on your cluster?

Cheers A
------
Dr. sc. math. Andreas (Andy) Adelmann
Paul Scherrer Institut OHSA/D17 CH-5232 Villigen PSI
Phone Office: xx41 56 310 42 33 Fax: xx41 56 310 31 91
Zoom ID: 470-582-4086 Password: AdA
Zoom Link: https://ethz.zoom.us/j/4705824086?pwd=dFcvT1pMMGY0bHg0dTNncUNZZTJkZz09

-------------------------------------------------------
Friday: ETH HPK G 28   +41 44 633 3076
============================================
The more exotic, the more abstract the knowledge, 
the more profound will be its consequences.
Leon Lederman 
============================================

On 11 Jul 2023, at 21:23, Robert Nagler <nagler AT radiasoft.net> wrote:

We're trying to run Opal with the attached input on NERSC Perlmutter via Shifter (NERSC's container technology).

The first problem we ran into is that NERSC's Cray MPICH ABI DSOs do not include the C++ bindings, which were removed in MPI-3. We worked around this by building with mpicc (instead of mpicxx), so the link line doesn't pull in libmpic++.so. With that change, Opal loads on Perlmutter with our image.
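
For reference, a minimal C-only probe along these lines (just a sketch; the file name and build command are illustrative, not part of our actual build) can be compiled with mpicc to confirm that only the C bindings are needed and to show which MPI library the container actually resolves at runtime:

/* mpi_probe.c -- illustrative sanity check, not part of our build.
   Build (for example): mpicc mpi_probe.c -o mpi_probe */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* MPI_Get_library_version (MPI-3) reports which implementation is
       actually picked up at runtime, e.g. Cray MPICH vs. the MPICH
       built into the container image. */
    MPI_Get_library_version(version, &len);
    if (rank == 0)
        printf("MPI library: %s\n", version);

    MPI_Finalize();
    return 0;
}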

The current problem is this:
PMPI_Allreduce(497).....: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff418617a7, count=1, datatype=dtype=0x4c000133, op=MPI_LOR, comm=MPI_COMM_WORLD) failed
MPIR_LOR_check_dtype(92): MPI_Op MPI_LOR operation not defined for this datatype
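
Before digging into OPAL itself, a standalone reproducer of the failing call (MPI_Allreduce with MPI_IN_PLACE, count=1, MPI_LOR) might help pin down which datatypes this MPI accepts the operation for. This is only a sketch: the datatypes tested below (MPI_C_BOOL, MPI_INT) are guesses for illustration, not necessarily the handle 0x4c000133 from the log.

/* allreduce_lor.c -- illustrative reproducer for the MPI_LOR/datatype error.
   Build with mpicc and run under the same MPI library the container resolves. */
#include <mpi.h>
#include <stdio.h>

static int rank;

static void try_lor(MPI_Datatype dtype, const char *name)
{
    /* Backing storage large enough for any type tested here; only the
       returned error code matters, not the reduced value. */
    long long buf = 1;
    int err = MPI_Allreduce(MPI_IN_PLACE, &buf, 1, dtype, MPI_LOR, MPI_COMM_WORLD);
    if (rank == 0)
        printf("MPI_LOR over %-11s -> %s\n", name, err == MPI_SUCCESS ? "ok" : "FAILED");
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Return errors instead of aborting so each case can be reported. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    try_lor(MPI_C_BOOL, "MPI_C_BOOL"); /* guess at a 1-byte logical type */
    try_lor(MPI_INT,    "MPI_INT");    /* a type MPI_LOR is defined for per the standard */

    MPI_Finalize();
    return 0;
}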

We had switched to Opal 2022.1.0 and Trilinos 13.0.1 and updated other dependencies before we got this error. We are using Fedora 36 as the base container image, which ships gcc 12.2.1 and mpich-3.4.3.

I will debug this further, but I was wondering whether anyone has run into this issue and whether it is specific to 2022.1.0.

Thanks,
Rob

Robert Nagler
CTO | RadiaSoft LLC

<eic_test_wig.txt>



