
opal - Re: [Opal] optimizer sometime gets stuck

opal AT lists.psi.ch

Subject: The OPAL Discussion Forum


  • From: "Snuverink Jochem (PSI)" <jochem.snuverink AT psi.ch>
  • To: opal <opal AT lists.psi.ch>, "philippe.piot AT gmail.com" <philippe.piot AT gmail.com>
  • Subject: Re: [Opal] optimizer sometime gets stuck
  • Date: Thu, 27 May 2021 12:30:50 +0000

Dear Philippe


Unfortunately this seems to happen sometimes. There is an open issue about it:

https://gitlab.psi.ch/OPAL/src/-/issues/557


Can you perhaps post the opt/pilot.trace.0 file?


When I investigated this, my impression was that the MPI communication between the nodes is not always successful, and the optimisation then gets stuck. My preliminary conclusion is that the MPI communication between the nodes in OPAL needs to be made more robust, and that there is not much to do apart from resubmitting. I might try to work on that in the upcoming developer week. What could help is to run on a single machine with several cores instead of on several machines (if that is what you did).


Hope that helps,

Jochem


From: opal-request AT lists.psi.ch <opal-request AT lists.psi.ch> on behalf of Philippe Piot <philippe.piot AT gmail.com>
Sent: Thursday, May 27, 2021 1:52:02 PM
To: opal
Subject: [Opal] optimizer sometime gets stuck
 
Dear All,
   When running an optimization in OPAL, it sometimes happens that the optimizer gets stuck: the job is still active (shows as running) in the cluster queue, but none of the files (opt/pilot.trace.0) are updated anymore. It also usually happens before any of the .json outputs are written. Doing a cat of the stdout (which is no longer updated either) gives the output below. Could somebody point me to other diagnostics I could use to troubleshoot this issue? My problem is quite simple: I have two objectives, emit_x and emit_y, one "derived" objective, sqrt(emit_x*emit_y), and four constraints. Thank you for any suggestions. All the best,  -- Philippe.


--- tail of stdout
Ippl> CommMPI: Initialization complete.
Ippl> CommMPI: Parent process waiting for children ...
Ippl> CommMPI: Initialization complete.
Ippl> CommMPI: Parent process waiting for children ...
Ippl> CommMPI: Initialization complete.
Ippl> CommMPI: Parent process waiting for children ...
Ippl> CommMPI: Initialization complete.
Ippl> CommMPI: Parent process waiting for children ...
Ippl> CommMPI: Initialization complete.
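
The symptom described in this thread (the job still shows as running while opt/pilot.trace.0 is no longer updated) can be checked for automatically, so that a stuck run is noticed and resubmitted sooner. The following is only a minimal sketch in Python, not part of OPAL: the function names and the 30-minute threshold are assumptions made for illustration, and the only thing taken from the thread is the trace file path.

```python
import os
import time


def seconds_since_update(path, now=None):
    """Return the number of seconds since `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def is_stalled(path, max_idle_seconds, now=None):
    """Return True if `path` has not been modified for longer than `max_idle_seconds`."""
    return seconds_since_update(path, now) > max_idle_seconds


if __name__ == "__main__":
    trace = "opt/pilot.trace.0"  # trace file named in the thread
    max_idle = 30 * 60           # hypothetical threshold: 30 minutes without an update

    if not os.path.exists(trace):
        print(f"{trace} not found; has the optimizer started writing yet?")
    elif is_stalled(trace, max_idle):
        idle = seconds_since_update(trace)
        print(f"{trace} has not changed for {idle:.0f} s; the run may be stuck.")
    else:
        print(f"{trace} is still being updated.")
```

A suitable threshold depends on how long a single simulation in the optimization takes, so it should be chosen well above the longest expected gap between trace updates.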

    


