Crash with large systems #700

@aizvorski

Describe the bug
Systems larger than approximately 831-833 atoms always crash with a segmentation fault. The crash does not seem to depend on the type of system (I tried several, from one long linear molecule to many small molecules with different atoms, and all behave the same) or on the coordinates (molecules near each other in different orientations, or very far apart). It also does not seem to be related to the OpenMP stack size.
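Since the crash seems independent of the system type and coordinates, synthetic inputs of a chosen size may be enough to narrow down the threshold. The following is a minimal sketch of a generator for such inputs, not the script used to create the attached files. The function name `water_box_xyz`, the 4.0 Å grid spacing, and the water geometry (0.9572 Å, 104.52°) are assumptions made for illustration.

```python
import math

# Assumed gas-phase water geometry; not taken from the attached input files.
R_OH = 0.9572                   # O-H bond length in angstrom
HOH = math.radians(104.52)      # H-O-H angle


def water_box_xyz(n_molecules, spacing=4.0):
    """Return XYZ text for n_molecules waters placed on a simple cubic grid."""
    # Smallest cube edge (in molecules) that can hold n_molecules.
    side = 1
    while side ** 3 < n_molecules:
        side += 1

    # Hydrogen offsets relative to the oxygen, molecule lying in the xy plane.
    hx = R_OH * math.sin(HOH / 2)
    hy = R_OH * math.cos(HOH / 2)

    lines = [str(3 * n_molecules), f"{n_molecules} waters on a {side}^3 grid"]
    for i in range(n_molecules):
        x = (i % side) * spacing
        y = (i // side % side) * spacing
        z = (i // (side * side)) * spacing
        lines.append(f"O {x:12.6f} {y:12.6f} {z:12.6f}")
        lines.append(f"H {x + hx:12.6f} {y + hy:12.6f} {z:12.6f}")
        lines.append(f"H {x - hx:12.6f} {y + hy:12.6f} {z:12.6f}")
    return "\n".join(lines) + "\n"
```

Writing `water_box_xyz(277)` and `water_box_xyz(278)` to files gives inputs with 831 and 834 atoms, the same atom counts as the two files used in this report. The grid is not a physically meaningful structure; it only controls the number of atoms.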

To Reproduce
Using the provided water278.xyz file:
https://gist.github.com/aizvorski/641a987e7dfa89eba4ce241c68409768#file-water278-xyz

$ OMP_NUM_THREADS=1 OMP_MAX_ACTIVE_LEVELS=1 OMP_STACKSIZE=200G time -v /home/ubuntu/bin/xtb-6.5.1/bin/xtb water278.xyz --gfn 2 --chrg "0"
...
   * xtb version 6.5.1 (579679a) compiled by 'ehlert@majestix' on 2022-07-11
...
          ...................................................
          :                      SETUP                      :
          :.................................................:
          :  # basis functions                1668          :
          :  # atomic orbitals                1668          :
          :  # shells                         1112          :
          :  # electrons                      2224          :
          :  max. iterations                   250          :
          :  Hamiltonian                  GFN2-xTB          :
          :  restarted?                      false          :
          :  GBSA solvation                  false          :
          :  PC potential                    false          :
          :  electronic temp.          300.0000000     K    :
          :  accuracy                    1.0000000          :
          :  -> integral cutoff          0.2500000E+02      :
          :  -> integral neglect         0.1000000E-07      :
          :  -> SCF convergence          0.1000000E-05 Eh   :
          :  -> wf. convergence          0.1000000E-03 e    :
          :  Broyden damping             0.4000000          :
          ...................................................
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
xtb                000000000305452D  Unknown               Unknown  Unknown
xtb                0000000003271BC0  Unknown               Unknown  Unknown
xtb                000000000099DF21  xtb_disp_coordina         396  coordinationnumber.f90
xtb                00000000031D4B83  Unknown               Unknown  Unknown
xtb                0000000003186C16  Unknown               Unknown  Unknown
xtb                0000000003155085  Unknown               Unknown  Unknown
xtb                000000000099DCA0  xtb_disp_coordina         396  coordinationnumber.f90
xtb                000000000099B2C8  xtb_disp_coordina         340  coordinationnumber.f90
xtb                00000000008E7399  xtb_scf_mp_scf_.A         519  scf_module.F90
xtb                00000000006125A3  xtb_xtb_calculato         257  calculator.f90
xtb                000000000041800F  xtb_prog_main_mp_         580  main.F90
xtb                000000000042512B  MAIN__                     55  primary.f90
xtb                00000000004020EE  Unknown               Unknown  Unknown
xtb                0000000003273060  Unknown               Unknown  Unknown
xtb                0000000000401FD7  Unknown               Unknown  Unknown
Command exited with non-zero status 174
	Command being timed: "/home/ubuntu/bin/xtb-6.5.1/bin/xtb water278.xyz --gfn 2 --chrg 0"
	User time (seconds): 0.15
	System time (seconds): 0.03
	Percent of CPU this job got: 97%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.19
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 108560
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 28220
	Voluntary context switches: 1
	Involuntary context switches: 449
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 174

For comparison, an input file water277.xyz with one fewer water molecule (831 atoms instead of 834) succeeds:
https://gist.github.com/aizvorski/7b4215388491126090ba83b6ae4ab341#file-water277-xyz

$ OMP_NUM_THREADS=1 OMP_MAX_ACTIVE_LEVELS=1 OMP_STACKSIZE=200G time -v /home/ubuntu/bin/xtb-6.5.1/bin/xtb water277.xyz --gfn 2 --chrg "0"
...
   * xtb version 6.5.1 (579679a) compiled by 'ehlert@majestix' on 2022-07-11
...
          ...................................................
          :                      SETUP                      :
          :.................................................:
          :  # basis functions                1662          :
          :  # atomic orbitals                1662          :
          :  # shells                         1108          :
          :  # electrons                      2216          :
          :  max. iterations                   250          :
          :  Hamiltonian                  GFN2-xTB          :
          :  restarted?                       true          :
          :  GBSA solvation                  false          :
          :  PC potential                    false          :
          :  electronic temp.          300.0000000     K    :
          :  accuracy                    1.0000000          :
          :  -> integral cutoff          0.2500000E+02      :
          :  -> integral neglect         0.1000000E-07      :
          :  -> SCF convergence          0.1000000E-05 Eh   :
          :  -> wf. convergence          0.1000000E-03 e    :
          :  Broyden damping             0.4000000          :
          ...................................................

 iter      E             dE          RMSdq      gap      omega  full diag
   1  -1415.6386943 -0.141564E+04  0.204E-07    8.73       0.0  T
   2  -1415.6386943  0.886757E-11  0.119E-07    8.73   29040.2  T
   3  -1415.6386943 -0.106866E-10  0.207E-08    8.73  100000.0  T

   *** convergence criteria satisfied after 3 iterations ***

         #    Occupation            Energy/Eh            Energy/eV
      -------------------------------------------------------------
         1        2.0000           -0.7271272             -19.7861
       ...           ...                  ...                  ...
      1102        2.0000           -0.3682050             -10.0194
      1103        2.0000           -0.3664023              -9.9703
      1104        2.0000           -0.3625255              -9.8648
      1105        2.0000           -0.3584824              -9.7548
      1106        2.0000           -0.3570151              -9.7149
      1107        2.0000           -0.3556497              -9.6777
      1108        2.0000           -0.3359206              -9.1409 (HOMO)
      1109                         -0.0151621              -0.4126 (LUMO)
      1110                         -0.0061251              -0.1667
      1111                          0.0011029               0.0300
      1112                          0.0020212               0.0550
      1113                          0.0029399               0.0800
       ...                                ...                  ...
      1662                          0.4675880              12.7237
      -------------------------------------------------------------
                  HL-Gap            0.3207585 Eh            8.7283 eV
             Fermi-level           -0.1755413 Eh           -4.7767 eV

 SCC (total)                   0 d,  0 h,  0 min, 17.350 sec
 SCC setup                      ...        0 min,  0.037 sec (  0.211%)
 Dispersion                     ...        0 min,  0.080 sec (  0.462%)
 classical contributions        ...        0 min,  0.011 sec (  0.063%)
 integral evaluation            ...        0 min,  0.634 sec (  3.651%)
 iterations                     ...        0 min, 11.684 sec ( 67.342%)
 molecular gradient             ...        0 min,  4.016 sec ( 23.145%)
 printout                       ...        0 min,  0.889 sec (  5.125%)

         :::::::::::::::::::::::::::::::::::::::::::::::::::::
         ::                     SUMMARY                     ::
         :::::::::::::::::::::::::::::::::::::::::::::::::::::
         :: total energy           -1405.892124104588 Eh    ::
         :: gradient norm              0.203225946340 Eh/a0 ::
         :: HOMO-LUMO gap              8.728283439762 eV    ::
         ::.................................................::
         :: SCC energy             -1415.638694336316 Eh    ::
         :: -> isotropic ES            8.569566870483 Eh    ::
         :: -> anisotropic ES         -0.289563022977 Eh    ::
         :: -> anisotropic XC         -0.213130853940 Eh    ::
         :: -> dispersion             -0.253146647874 Eh    ::
         :: repulsion energy           9.734657198491 Eh    ::
         :: add. restraining           0.000000000000 Eh    ::
         :: total charge              -0.000000000003 e     ::
         :::::::::::::::::::::::::::::::::::::::::::::::::::::
...

           -------------------------------------------------
          | TOTAL ENERGY            -1405.892124104588 Eh   |
          | GRADIENT NORM               0.203225946340 Eh/α |
          | HOMO-LUMO GAP               8.728283439762 eV   |
           -------------------------------------------------

------------------------------------------------------------------------
 * finished run on 2022/10/02 at 00:43:41.395     
------------------------------------------------------------------------
 total:
 * wall-time:     0 d,  0 h,  0 min, 18.069 sec
 *  cpu-time:     0 d,  0 h,  0 min, 18.065 sec
 * ratio c/w:     1.000 speedup
 SCF:
 * wall-time:     0 d,  0 h,  0 min, 17.377 sec
 *  cpu-time:     0 d,  0 h,  0 min, 17.376 sec
 * ratio c/w:     1.000 speedup

normal termination of xtb
	Command being timed: "/home/ubuntu/bin/xtb-6.5.1/bin/xtb water277.xyz --gfn 2 --chrg 0"
	User time (seconds): 17.69
	System time (seconds): 0.37
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.07
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 594568
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 228439
	Voluntary context switches: 1
	Involuntary context switches: 483
	Swaps: 0
	File system inputs: 0
	File system outputs: 368
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

The crash does not appear to be caused by running out of memory or by too low a setting for OMP_STACKSIZE. The test machine has more than 200 GB of memory, and the maximum resident set size when the crash happens (reported by time -v) is just over 100 MB (108560 kB).
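One setting the measurements above do not cover is the stack limit of the process itself. OMP_STACKSIZE applies to threads created by the OpenMP runtime, while the initial thread uses the process stack, whose size is governed by the operating system limit. The sketch below prints that limit; it is only a diagnostic aid, not a confirmed explanation of the crash, and the helper names `format_limit` and `describe_stack_limit` are made up for this example. It relies on the Unix-only `resource` module.

```python
import resource


def format_limit(value):
    """Render an rlimit value (in bytes) as kB, or 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} kB"


def describe_stack_limit():
    """Return the soft and hard stack limits inherited from the launching shell."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return f"soft={format_limit(soft)} hard={format_limit(hard)}"


if __name__ == "__main__":
    print(describe_stack_limit())
```

Raising the soft limit in the launching shell (for example with the shell's `ulimit -s` builtin) before rerunning water278.xyz would show whether this limit is involved.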

Deliberately setting the stack size very low with the largest input system that succeeds, water277.xyz, gives:

  • OMP_STACKSIZE=1M OMP_NUM_THREADS=1 succeeds; the stack size does not seem to matter when there is only one thread
  • OMP_STACKSIZE=50M OMP_NUM_THREADS=2 succeeds
  • OMP_STACKSIZE=20M OMP_NUM_THREADS=2 fails; the exact failure seems non-deterministic: either a SIGSEGV in xtb_coulomb_klopm during the iterations, or "Command terminated by signal 11" after the iterations finish
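Because the failure with two threads looked non-deterministic, repeating each setting several times gives a clearer picture than a single run. The sketch below shows one way to script the three settings listed above. The helper names (`xtb_env`, `run_xtb`, `sweep`) and the default of three repeats are assumptions for illustration, and `xtb` is assumed to be on the PATH; the command-line flags are the ones used in this report.

```python
import os
import subprocess


def xtb_env(stacksize, threads):
    """Copy the current environment and set the OpenMP variables used in this report."""
    env = dict(os.environ)
    env["OMP_STACKSIZE"] = stacksize
    env["OMP_NUM_THREADS"] = str(threads)
    env["OMP_MAX_ACTIVE_LEVELS"] = "1"
    return env


def run_xtb(xyz, stacksize, threads, xtb="xtb"):
    """Run one single point and return the exit code.

    A negative value means the process was killed by that signal number;
    174 is the status reported after the forrtl SIGSEGV message above.
    """
    cmd = [xtb, xyz, "--gfn", "2", "--chrg", "0"]
    result = subprocess.run(
        cmd,
        env=xtb_env(stacksize, threads),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def sweep(xyz, settings, repeats=3, xtb="xtb"):
    """Run every (stacksize, threads) pair several times and collect the exit codes."""
    return {
        setting: [run_xtb(xyz, setting[0], setting[1], xtb) for _ in range(repeats)]
        for setting in settings
    }
```

For example, `sweep("water277.xyz", [("1M", 1), ("50M", 2), ("20M", 2)])` returns a dictionary mapping each setting to its list of exit codes, so an intermittent failure shows up as a mix of zero and non-zero entries.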

GDB backtrace:

$ OMP_NUM_THREADS=1 OMP_MAX_ACTIVE_LEVELS=1 OMP_STACKSIZE=200G gdb /home/ubuntu/bin/xtb-6.5.0/bin/xtb
(gdb) run water278.xyz --gfn 1 --chrg "0"

Program received signal SIGSEGV, Segmentation fault.
0x000000000099cf41 in xtb_disp_coordinationnumber_mp_ncoordlatp_.A ()
(gdb) bt
#0  0x000000000099cf41 in xtb_disp_coordinationnumber_mp_ncoordlatp_.A ()
#1  0x00000000031d3f83 in __kmp_invoke_microtask ()
#2  0x0000000003186016 in __kmp_fork_call ()
#3  0x0000000003154485 in __kmpc_fork_call ()
#4  0x000000000099ccc0 in xtb_disp_coordinationnumber_mp_ncoordlatp_.A ()
#5  0x000000000099a438 in xtb_disp_coordinationnumber_mp_getcoordinationnumberlp_ ()
#6  0x00000000008e6429 in xtb_scf_mp_scf_.A ()
#7  0x0000000000611d33 in xtb_xtb_calculator_mp_singlepoint_.A ()
#8  0x00000000004177f3 in xtb_prog_main_mp_xtbmain_.A ()
#9  0x000000000042492b in MAIN__ ()

Expected behaviour
No crash: the calculation for water278.xyz should terminate normally, as it does for water277.xyz.

Additional context

Using xtb 6.5.1 binary downloaded from https://github.com/grimme-lab/xtb/releases/download/v6.5.1/xtb-6.5.1-linux-x86_64.tar.xz
xtb --version gives version 6.5.1 (579679a) compiled by 'ehlert@majestix' on 2022-07-11

OS: Ubuntu 18.04.4 LTS
Hardware: AMD EPYC 7B13 CPU, 224GB RAM
(also tested on Ubuntu 20.04 LTS, Intel i7-10510U, 48GB RAM: same behavior)
(also tested on xtb-6.5.0 and 6.4.1: same)
