ai2-kit workflow cll-mlp-training
The CLL workflow is an improved version of the DPGEN workflow, designed to meet more complex potential training requirements and to allow sustainable code integration. It adopts a closed-loop learning mode to train MLP potentials automatically. In each iteration, the workflow uses labeled structures generated by first-principles methods to train multiple MLP models; these models are then used to explore new structures, which become training data for the next iteration. The iterations continue until the quality of the MLP models meets the predefined criteria. The configuration of each iteration can be updated according to the training needs to further improve training efficiency.
The main improvements of CLL workflow include:
- A more semantic configuration system, which supports selecting different software for each stage and setting different software configurations for different systems.
- A more robust checkpoint mechanism to reduce the impact of execution interruptions.
- Support for remote Python method execution, which avoids unnecessary data transfer and improves the efficiency and stability of execution on HPC clusters.
Currently, CLL workflow supports the following tools for potential training:
- Label: CP2K, VASP
- Train: DeepMD
- Explore: LAMMPS, LASP
- Select: Model deviation, Distinctive structure selection
Currently, the CLL workflow submits jobs through the HPC executor provided by ai2-kit, which supports running all calculations on a single HPC cluster. In the future, multi-cluster scheduling and support for different workflow engines, including DFlow, may be added according to needs.
- The Python version of the workflow execution environment and that of the HPC execution environment need to be consistent; otherwise remote execution will run into problems.
- The HPC execution environment needs to have ai2-kit installed. In general, the ai2-kit on the HPC does not need to be exactly the same version as the local one, but a large version gap may still cause problems, so it is recommended to use the same version of ai2-kit when conditions permit (see the version check below).
pip install -U ai2-kit
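As a quick sanity check, you can compare the local and remote environments before starting. The commands below are only a sketch: the login node and remote Python interpreter path are the example values used in executor.yml later in this document, so substitute your own.

# Local Python and ai2-kit versions
python --version
python -m pip show ai2-kit | grep -i ^version
# Remote Python and ai2-kit versions (example host and interpreter path, see executor.yml below)
ssh user01@login-01 '/home/user01/libs/conda/env/py39/bin/python --version'
ssh user01@login-01 '/home/user01/libs/conda/env/py39/bin/python -m pip show ai2-kit | grep -i ^version'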
The usage of CLL workflow is demonstrated through an example.
The data required for the execution of the workflow needs to be placed on the HPC cluster node in advance. Before starting the execution of the workflow, you need to prepare the following data:
- Initial structure or initial data set for potential training
- Initial structure for structure search
Assuming that you already have a trajectory h2o_64.aimd.xyz generated by AIMD, you can use the ai2-kit tool ase command line tool to prepare these data.
mkdir -p data/explore
# Extract training set from 0-900 frames, extract every 5 frames
ai2-kit tool ase read h2o_64.aimd.xyz --index ':900:5' - set_cell "[12.42,12.42,12.42,90,90,90]" - write data/training.xyz
# Extract validation set from 900- frames, extract every 5 frames
ai2-kit tool ase read h2o_64.aimd.xyz --index '900::5' - set_cell "[12.42,12.42,12.42,90,90,90]" - write data/validation.xyz
# Extract data for initial structure search, extract every 100 frames
ai2-kit tool ase read h2o_64.aimd.xyz --index '::100' - set_cell "[12.42,12.42,12.42,90,90,90]" - write_frames "./data/explore/POSCAR-{i:04d}" --format vasp
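Optionally, you can sanity-check the generated files before moving on. This is just a sketch; it assumes the ase Python package is available in your local environment (the ai2-kit tool ase commands above rely on it) and simply counts the frames in each output.

# Count the frames in the generated datasets
python -c "from ase.io import read; print(len(read('data/training.xyz', index=':')))"
python -c "from ase.io import read; print(len(read('data/validation.xyz', index=':')))"
ls data/explore/POSCAR-* | wc -l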
The configuration of the CLL workflow is in YAML format and can be split into multiple files in any way you like; ai2-kit will automatically merge them during execution. Moderate splitting makes the configuration files easier to maintain and reuse. In general, we can split the configuration into the following parts:
- artifact.yml: the data required by the workflow
- executor.yml: the parameters of the HPC executor
- workflow.yml: the parameters of the workflow software
Another way to build the configuration is to use the configuration files provided in the example directory as a reference to create your own workflow. For details, please refer to the documentation in the example directory.
We start with artifact.yml, which is used to configure the data required by the workflow. In this example, we need to configure three datasets: the training dataset, the validation dataset, and the dataset for structure search. The configuration of these three datasets is as follows:
.base_dir: &base_dir /home/user01/data/
artifacts:
  h2o_64-train:
    url: !join [*base_dir, training.xyz]
  h2o_64-validation:
    url: !join [*base_dir, validation.xyz]
    attrs:
      deepmd:
        validation_data: true  # Specify this dataset as the validation set, optional
  h2o_64-explore:
    url: !join [*base_dir, explore]
    includes: POSCAR*
    attrs:
      # If necessary, you can specify specific software configurations for specific systems here.
      # This example does not require this configuration, so it is empty.
      # lammps:
      #   plumed_config: !load_text plumed.in
      # cp2k:
      #   input_template: !load_text cp2k.inp
Here we use the custom !join tag provided by ai2-kit to simplify the data configuration; for example, !join [*base_dir, training.xyz] resolves to /home/user01/data/training.xyz. For related features, please refer to the TIPS document.
Next, we configure the executor.yml file, which holds the parameters of the HPC connection and the job submission templates for the software used in each stage.
executors:
  hpc-cluster01:
    ssh:
      host: user01@login-01  # Login node
      gateway:
        host: user01@jump-host  # Jump host (optional)
    queue_system:
      slurm: {}  # Use slurm as the job scheduling system
    work_dir: /home/user01/ai2-kit/workdir  # Working directory
    python_cmd: /home/user01/libs/conda/env/py39/bin/python  # Remote Python interpreter

    context:
      train:
        deepmd:  # Configure the deepmd job submission template
          script_template:
            header: |
              #SBATCH -N 1
              #SBATCH --ntasks-per-node=4
              #SBATCH --job-name=deepmd
              #SBATCH --partition=gpu3
              #SBATCH --gres=gpu:1
              #SBATCH --mem=8G
            setup: |
              set -e
              module load deepmd/2.2
              set +e

      explore:
        lammps:  # Configure the lammps job submission template
          lammps_cmd: lmp_mpi
          concurrency: 5
          script_template:
            header: |
              #SBATCH -N 1
              #SBATCH --ntasks-per-node=4
              #SBATCH --job-name=lammps
              #SBATCH --partition=gpu3
              #SBATCH --gres=gpu:1
              #SBATCH --mem=24G
            setup: |
              set -e
              module load deepmd/2.2
              export OMP_NUM_THREADS=1
              set +e

      label:
        cp2k:  # Configure the cp2k job submission template
          cp2k_cmd: mpiexec.hydra cp2k.popt
          concurrency: 5
          script_template:
            header: |
              #SBATCH -N 1
              #SBATCH --ntasks-per-node=16
              #SBATCH -t 12:00:00
              #SBATCH --job-name=cp2k
              #SBATCH --partition=c52-medium
            setup: |
              set -e
              module load intel/17.5.239 mpi/intel/2017.5.239
              module load gcc/5.5.0
              module load cp2k/7.1
              set +e
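Before launching the workflow, it can be worth checking that this executor configuration matches the cluster. The commands below are only a rough sketch using the example host names, partitions, and modules from the file above; adjust or skip them for your own environment.

# Check that the login node is reachable through the jump host
ssh -J user01@jump-host user01@login-01 hostname
# Check that the Slurm partitions used in the templates exist
ssh -J user01@jump-host user01@login-01 'sinfo -p gpu3,c52-medium'
# Check that the required modules are available (module usually needs a login shell)
ssh -J user01@jump-host user01@login-01 'bash -lc "module avail deepmd cp2k"'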
Finally, workflow.yml configures the parameters of the workflow itself:
workflow:
  general:
    type_map: [ H, O ]
    mass_map: [ 1.008, 15.999 ]
    max_iters: 2  # Specify the maximum number of iterations

  train:
    deepmd:  # deepmd parameter configuration
      model_num: 4
      # The data used in this example has not been labeled, so it needs to be configured here.
      # If there is an existing labeled deepmd/npy dataset, you can specify it here.
      init_dataset: [ ]
      input_template:
        model:
          descriptor:
            type: se_a
            sel:
              - 100
              - 100
            rcut_smth: 0.5
            rcut: 5.0
            neuron:
              - 25
              - 50
              - 100
            resnet_dt: false
            axis_neuron: 16
            seed: 1
          fitting_net:
            neuron:
              - 240
              - 240
              - 240
            resnet_dt: true
            seed: 1
        learning_rate:
          type: exp
          start_lr: 0.001
          decay_steps: 2000
        loss:
          start_pref_e: 0.02
          limit_pref_e: 2
          start_pref_f: 1000
          limit_pref_f: 1
          start_pref_v: 0
          limit_pref_v: 0
        training:
          #numb_steps: 400000
          numb_steps: 5000
          seed: 1
          disp_file: lcurve.out
          disp_freq: 1000
          save_freq: 1000
          save_ckpt: model.ckpt
          disp_training: true
          time_training: true
          profiling: false
          profiling_file: timeline.json

  label:
    cp2k:  # Specify the cp2k parameter configuration
      limit: 10
      # The data used in this example has not been labeled, so it needs to be configured here.
      # If there is an existing labeled dataset, this should be empty.
      # When this configuration is empty, the workflow will automatically skip the label phase
      # of the first iteration and start execution from the train phase.
      init_system_files: [ h2o_64-train, h2o_64-validation ]
      input_template: |
        &GLOBAL
          PROJECT DPGEN
        &END
        &FORCE_EVAL
          &DFT
            BASIS_SET_FILE_NAME /home/user01/data/cp2k/BASIS/BASIS_MOLOPT
            POTENTIAL_FILE_NAME /home/user01/data/cp2k/POTENTIAL/GTH_POTENTIALS
            CHARGE 0
            UKS F
            &MGRID
              CUTOFF 600
              REL_CUTOFF 60
              NGRIDS 4
            &END
            &QS
              EPS_DEFAULT 1.0E-12
            &END
            &SCF
              SCF_GUESS RESTART
              EPS_SCF 3.0E-7
              MAX_SCF 50
              &OUTER_SCF
                EPS_SCF 3.0E-7
                MAX_SCF 10
              &END
              &OT
                MINIMIZER DIIS
                PRECONDITIONER FULL_SINGLE_INVERSE
                ENERGY_GAP 0.1
              &END
            &END
            &LOCALIZE
              METHOD CRAZY
              MAX_ITER 2000
              &PRINT
                &WANNIER_CENTERS
                  IONS+CENTERS
                  FILENAME =64water_wannier.xyz
                &END
              &END
            &END
            &XC
              &XC_FUNCTIONAL PBE
              &END
              &vdW_POTENTIAL
                DISPERSION_FUNCTIONAL PAIR_POTENTIAL
                &PAIR_POTENTIAL
                  TYPE DFTD3
                  PARAMETER_FILE_NAME dftd3.dat
                  REFERENCE_FUNCTIONAL PBE
                &END
              &END
            &END
          &END
          &SUBSYS
            @include coord_n_cell.inc
            &KIND O
              BASIS_SET DZVP-MOLOPT-SR-GTH
              POTENTIAL GTH-PBE-q6
            &END
            &KIND H
              BASIS_SET DZVP-MOLOPT-SR-GTH
              POTENTIAL GTH-PBE-q1
            &END
          &END
          &PRINT
            &FORCES ON
            &END
          &END
        &END

  explore:
    lammps:
      timestep: 0.0005
      sample_freq: 100
      nsteps: 2000
      ensemble: nvt
      template_vars:
        POST_INIT: |
          neighbor 1.0 bin
          box tilt large
        POST_READ_DATA: |
          change_box all triclinic
      system_files: [ h2o_64-explore ]
      explore_vars:
        TEMP: [ 330, 430, 530 ]
        PRES: [1]
        TAU_T: 0.1  # Optional
        TAU_P: 0.5  # Optional

  select:
    model_devi:
      f_trust_lo: 0.4
      f_trust_hi: 0.6
      # Remove structurally similar configurations using clustering method.
      # If not needed, delete the following line.
      asap_options: {}

  update:
    walkthrough:
      # You can specify the parameter configuration to be used from the second iteration and beyond here.
      # The parameters configured here will override any configuration in the workflow section,
      # and can be adjusted according to the training strategy.
      table:
        - train:  # The training steps are 10000 in the second iteration
            deepmd:
              input_template:
                training:
                  numb_steps: 10000
        - train:  # The training steps are 20000 in the third iteration
            deepmd:
              input_template:
                training:
                  numb_steps: 20000
After completing the configuration, you can start the execution of the workflow:
ai2-kit workflow cll-mlp-training *.yml --executor hpc-cluster01 --path-prefix h2o_64-run-01 --checkpoint run-01.ckpt
The parameters in the above command are:
- *.yml specifies the configuration files. Multiple files can be given, and ai2-kit will merge them automatically, which is why the * wildcard can be used here.
- --executor hpc-cluster01 specifies the HPC executor to use; here it is the hpc-cluster01 executor configured in the previous section.
- --path-prefix h2o_64-run-01 specifies the remote working directory: a h2o_64-run-01 directory will be created under work_dir to store the execution results of the workflow.
- --checkpoint run-01.ckpt generates a checkpoint file locally to save the execution state of the workflow, so that execution can be resumed after an interruption (see the example below).
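If the run is interrupted (for example by a network failure or by stopping the local process), it can be resumed. The sketch below simply re-runs the same command with the same checkpoint file; according to the checkpoint mechanism described above, steps already recorded in run-01.ckpt should not be executed again.

# Resume an interrupted run: reuse the same configuration and the same checkpoint file
ai2-kit workflow cll-mlp-training *.yml --executor hpc-cluster01 --path-prefix h2o_64-run-01 --checkpoint run-01.ckpt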
To train an MLP model for FEP-based redox potential calculations, you need to make use of the fparam feature of deepmd to fit the PES of the initial (ini) and final (fin) states of the reaction. You also need to configure different label settings for the ini and fin states. The key configurations you need to pay attention to are shown below.
Explore artifacts are used as the initial structures for structure exploration. For example:
artifacts:
  explore-h2ox64:
    url: /path/to/h2ox64.xyz
    attrs:
      fep-ini:
        dp_fparam: 0
        cp2k:
          input_template: !load_text cp2k-ini.inp
      fep-fin:
        dp_fparam: 1
        cp2k:
          input_template: !load_text cp2k-fin.inp
The following example omits the common LAMMPS configuration and focuses only on the fep-redox specific part.
workflow:
  explore:
    lammps:
      mode: fep-redox
      template_vars:
        # The fparam option of the deepmd pair style.
        # The value of fparam should be consistent with the dp_fparam in the explore artifacts.
        # doc: https://docs.deepmodeling.com/projects/deepmd/en/latest/third-party/lammps-command.html#pair-style-deepmd
        FEP_INI_DP_OPT: fparam 0
        FEP_FIN_DP_OPT: fparam 1
https://docs.deepmodeling.com/projects/deepmd/en/master/train/train-input.html
TODO: example.
TODO
The configurations produced by the exploration stage are selected simply based on whether their maximum model deviation of forces falls within the range defined by f_trust_lo and f_trust_hi. However, such a screening process still leaves a large number of configurations that require labeling. The screening {cite}Guo2023checmate is improved by adding a clustering procedure, which removes structurally similar configurations. To enable this functionality, asap_options: {} is added in the workflow.yml above. For details of the clustering method, users are referred to the documentation of ASAP {cite}Cheng2020mapping.
If you use clustering methods in your research, please cite the following papers: {cite}Guo2023checmate,Cheng2020mapping
If you use LASP in your research, please cite the following paper: {cite}Huang2019lasp