This Python tool is designed to automate the extraction of quantum chemical descriptors from Gaussian calculation output files (.log) and perform subsequent multivariate statistical analysis.
It is specifically tailored for analyzing ionic liquid systems (Cation/Anion/Solvent interactions), extracting electronic, thermodynamic, and steric properties, and performing backward stepwise linear regression to model experimental properties (e.g., Conductivity).
- Batch Extraction: Automatically processes multiple molecules based on numerical indices found in filenames.
- Robust Parsing: Extracts SCF energies, orbital energies (HOMO/LUMO), Dipole moments, and Frequency data using regex.
- Descriptor Calculation: Computes DFT-based reactivity indices (Hardness, Softness, Electrophilicity, etc.).
- Steric Analysis: Calculates Sterimol parameters ($L, B_1, B_5$) using `morfeus` based on `.gjf` geometries.
- Complexation Energy: Automatic calculation of binding energies ($\Delta E$) for salts and solvent-cation complexes.
- Statistical Modeling: Performs automatic Backward Stepwise OLS Regression to identify significant descriptors correlated with experimental data.
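The regex-based parsing mentioned above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `parse_scf_and_frontier` is hypothetical, and it assumes a closed-shell calculation, so only the `Alpha` orbital lines are read. It takes the last `SCF Done` energy in the log, the last occupied eigenvalue as the HOMO, and the first virtual eigenvalue as the LUMO.

```python
import re

# Matches e.g. " SCF Done:  E(RB3LYP) =  -308.719071120     A.U. after   12 cycles"
SCF_RE = re.compile(r"SCF Done:\s+E\([^)]+\)\s*=\s*(-?\d+\.\d+)")
# Eigenvalues are printed in fixed-width columns and can run together
# (e.g. "-10.21345-10.21301"), so match each number rather than splitting on spaces.
NUM_RE = re.compile(r"-?\d+\.\d+")


def parse_scf_and_frontier(log_text):
    """Return (scf_energy, homo, lumo) in Hartree; missing values are None.

    Hypothetical helper: assumes a closed-shell job (Alpha orbitals only).
    """
    scf_matches = SCF_RE.findall(log_text)
    scf = float(scf_matches[-1]) if scf_matches else None  # last SCF cycle wins

    homo = lumo = None
    prev_was_occ = False
    for line in log_text.splitlines():
        if "Alpha  occ. eigenvalues" in line:
            values = NUM_RE.findall(line.partition("--")[2])
            if values:
                homo = float(values[-1])  # last occupied value seen so far
            prev_was_occ = True
        elif "Alpha virt. eigenvalues" in line:
            if prev_was_occ:
                # First virtual line directly after the occupied block.
                values = NUM_RE.findall(line.partition("--")[2])
                if values:
                    lumo = float(values[0])
            prev_was_occ = False
        else:
            prev_was_occ = False
    return scf, homo, lumo
```

If a log prints the orbital block more than once, the last block determines the returned HOMO and LUMO.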
Ensure you have Python installed along with the following libraries:

    pip install pandas statsmodels morfeus-py openpyxl

Note: The script also uses the standard libraries `os`, `re`, and `csv`.
The script relies on a strict file naming convention to associate files with a specific sample ID (integer n).
Place all files in a single directory. Replace n with the sample number (e.g., 1, 2, 10...):
| File Type | Naming Pattern | Description |
|---|---|---|
| Cation Optimization | `n-cation.log` | Output log for the isolated cation. |
| Cation Input | `n-cation.gjf` | Geometry file required for Sterimol calculation. |
| Salt Optimization | `n-salt.log` | Output log for the cation-anion salt complex. |
| Solvent Complex | `n-DME-M*.log` | E.g., `1-DME-M1.200.node1.log`. Used to find the lowest energy solvent complex. |
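
To make the naming convention concrete, the sketch below groups a directory listing by sample ID. It is an illustration under assumptions: the function name `group_files` and the dictionary keys are hypothetical, and the regular expression is one possible reading of the patterns in the table, not necessarily the one the script uses.

```python
import re
from collections import defaultdict

# <n>-cation.log / <n>-cation.gjf / <n>-salt.log / <n>-DME-M*.log
FILE_RE = re.compile(r"^(\d+)-(cation|salt|DME-M.*)\.(log|gjf)$")


def group_files(filenames):
    """Map each sample ID (int) to the files that belong to it."""
    groups = defaultdict(
        lambda: {"cation_log": None, "cation_gjf": None,
                 "salt_log": None, "solvent_logs": []}
    )
    for name in filenames:
        match = FILE_RE.match(name)
        if not match:
            continue  # e.g. data.xlsx or unrelated files
        n, kind, ext = int(match.group(1)), match.group(2), match.group(3)
        if kind == "cation":
            groups[n]["cation_log" if ext == "log" else "cation_gjf"] = name
        elif kind == "salt" and ext == "log":
            groups[n]["salt_log"] = name
        elif kind.startswith("DME-M") and ext == "log":
            groups[n]["solvent_logs"].append(name)
    return dict(groups)
```

In practice the input would be something like `os.listdir(data_folder)`. A sample whose `salt_log` is still `None` after grouping is the kind of case that leads to a missing-file warning.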
You must provide an Excel file named data.xlsx in the same directory. This file serves two purposes: providing the experimental target variable (for regression) and configuration for steric calculations.
The Excel file must contain the following columns (headers are case-sensitive):
| Column Name | Description | Example |
|---|---|---|
| `number` | The Sample ID corresponding to `n` in filenames. | `1` |
| `dependent variable` | The experimental value (target Y) to predict (e.g., Conductivity). | `5.4` |
| `sterimol axis atoms` | Atom indices for Sterimol calculation, separated by a comma. | `1,6` |
Example data.xlsx content:
| number | dependent variable | sterimol axis atoms |
|---|---|---|
| 1 | 8.23 | 1,6 |
| 2 | 7.45 | 1,5 |
| 3 | 9.10 | 2,7 |
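
A file like the one above could be read roughly as follows. This is a sketch, not the script's own loader: `parse_axis_atoms` and `load_config` are hypothetical names, and the code assumes the `sterimol axis atoms` cells are stored as text such as `1,6`. Whether the indices are 1-based and which atom plays which role is not stated here, so the sketch only splits the pair.

```python
REQUIRED_COLUMNS = {"number", "dependent variable", "sterimol axis atoms"}


def parse_axis_atoms(cell):
    """Turn a cell such as '1,6' into the tuple (1, 6)."""
    parts = [p.strip() for p in str(cell).split(",")]
    if len(parts) != 2:
        raise ValueError(f"expected two comma-separated indices, got {cell!r}")
    return int(parts[0]), int(parts[1])


def load_config(path="data.xlsx"):
    """Return {sample_id: {"y": float, "axis_atoms": (int, int)}}."""
    import pandas as pd  # imported lazily; reading .xlsx also needs openpyxl

    df = pd.read_excel(path)
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"data.xlsx is missing columns: {sorted(missing)}")
    return {
        int(row["number"]): {
            "y": float(row["dependent variable"]),
            "axis_atoms": parse_axis_atoms(row["sterimol axis atoms"]),
        }
        for _, row in df.iterrows()
    }
```

Because the headers are case-sensitive, the column check compares names exactly.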
The script extracts and calculates the following descriptors:
- Energies: HOMO, LUMO, HOMO-LUMO Gap.
- DFT Indices:
  - Chemical Hardness ($\eta$)
  - Chemical Softness ($\sigma$)
  - Chemical Potential ($\mu$)
  - Electronegativity ($\chi$)
  - Electrophilicity Index ($\omega$)
- Dipole Moment: Field-independent basis (Debye).
- Energies: Total SCF Energy, Kinetic Energy (KE), Nuclear Repulsion (N-N), Electron-Nuclear (E-N).
- Corrections: ZPE, Thermal Corrections to Energy, Enthalpy (H), and Gibbs Free Energy (G).
- Thermochemistry: Entropy ($S$), Heat Capacity ($C_v$).
- Binding Energies: $\Delta E$ for salt formation and solvent-cation interaction.
- Sterimol Parameters: $L$ (Length), $B_1$ (Min width), $B_5$ (Max width).
- Frequencies: Lowest vibrational frequency.
- Mass: Molecular mass.
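
For orientation, the reactivity indices and binding energies in this list can be computed from frontier orbital and SCF energies as sketched below. The definitions are the common Koopmans-type conventions ($\eta = (E_{LUMO} - E_{HOMO})/2$, $\mu = (E_{HOMO} + E_{LUMO})/2$, $\chi = -\mu$, $\sigma = 1/\eta$, $\omega = \mu^2 / 2\eta$) and $\Delta E = E_{complex} - \sum E_{fragments}$. These are assumptions: the script may use a different convention (for example $\sigma = 1/2\eta$) or add corrections, and the function names are hypothetical.

```python
def reactivity_indices(homo, lumo):
    """Conceptual-DFT indices from HOMO/LUMO energies (same unit in and out).

    Assumed conventions: eta = gap / 2, sigma = 1 / eta.
    """
    gap = lumo - homo
    eta = gap / 2.0                  # chemical hardness
    mu = (homo + lumo) / 2.0         # chemical potential
    return {
        "gap": gap,
        "hardness": eta,
        "softness": 1.0 / eta,
        "chemical_potential": mu,
        "electronegativity": -mu,
        "electrophilicity": mu ** 2 / (2.0 * eta),
    }


def binding_energy(e_complex, *fragment_energies):
    """Delta E = E(complex) - sum of E(fragments); negative means bound."""
    return e_complex - sum(fragment_energies)
```

Under these assumptions, the salt value would be `binding_energy(e_salt, e_cation, anion_energy)` and the solvent-cation value `binding_energy(e_complex, e_cation, E_DME_SOLVENT)`, with all energies in Hartree.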
Follow these steps to run the analysis:
Create a folder (e.g., D:\Research\GaussianData) and ensure it contains:
- All your `.log` and `.gjf` files named correctly (see Naming Convention).
- The `data.xlsx` file containing your experimental data and sterimol configs.
Open extract_gaussian_data.py in a text editor or IDE. Locate the main() function and update the data_folder variable to point to your directory:

    def main():
        # ...
        data_folder = r"D:\Research\GaussianData"  # <--- Update this path
        output_file = 'results.csv'
        # ...

Optional: If your system uses a different anion or solvent, update the energy constants at the top of the `main()` function:

    anion_energy = -459.54813049
    E_DME_SOLVENT = -308.71907112

Open your terminal or command prompt, navigate to the folder containing the Python script, and run:

    python extract_gaussian_data.py

The script provides real-time feedback in the console:
- Loading: It confirms that `data.xlsx` was loaded and how many Sterimol configs were found.
- Processing: It iterates through every group number found: `Processing group 1...`
- Fitting: Once extraction is done, it begins the Multivariate Linear Regression (Backward Elimination):
  - `Starting Multivariate Linear Fitting (Mode: backward)...`
  - `--- Round 1 Fitting ---`
  - `Descriptor Contribution...`
  - `Decision: Removing descriptor 'SM-LUMO' (Low contribution, P=0.85...)`
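
The backward elimination rounds reported in the console follow a simple loop: fit, find the descriptor with the largest p-value, drop it if it is above a threshold, and repeat. The sketch below shows that loop under assumptions. The names `backward_eliminate` and `make_ols_pvalue_fn` are hypothetical, and the 0.05 threshold is an illustrative default, not necessarily the script's. The statsmodels adapter uses `sm.OLS` with `sm.add_constant`.

```python
def backward_eliminate(columns, pvalue_fn, threshold=0.05):
    """Drop the least significant descriptor until all p-values <= threshold.

    pvalue_fn(cols) must return {column_name: p_value} for a model fitted
    on exactly those columns. Returns (kept_columns, removed_in_order).
    """
    kept = list(columns)
    removed = []
    while kept:
        pvalues = pvalue_fn(kept)
        worst = max(kept, key=lambda col: pvalues[col])
        if pvalues[worst] <= threshold:
            break  # every remaining descriptor is significant
        kept.remove(worst)
        removed.append((worst, pvalues[worst]))
    return kept, removed


def make_ols_pvalue_fn(df, target):
    """Adapter: p-values from an OLS fit of df[target] on the given columns."""
    import statsmodels.api as sm  # imported lazily; requires statsmodels

    def pvalue_fn(cols):
        X = sm.add_constant(df[cols], has_constant="add")
        fit = sm.OLS(df[target], X).fit()
        return fit.pvalues.drop("const").to_dict()

    return pvalue_fn
```

With a DataFrame loaded from `results.csv`, a call could look like `backward_eliminate(descriptor_columns, make_ols_pvalue_fn(df, "dependent variable"))`. Keeping the loop separate from the fitting function makes the elimination logic easy to check on its own.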
After execution, two new files will be generated in your working directory:
- `results.csv`: A comprehensive dataset containing every extracted descriptor for every molecule. This is your raw data for further analysis.
- `fitting_report.txt`: The final statistical summary of the best regression model found, including R-squared, F-statistic, and coefficients.
- `Warning: File not found ...`: The script cannot find a specific log file. Double-check that your files are named exactly `n-cation.log`, `n-salt.log`, etc.
- `Warning: Config file ... missing columns`: Your `data.xlsx` headers are likely incorrect. They must exactly match `number`, `dependent variable`, and `sterimol axis atoms`.
- `Error extracting HOMO/LUMO`: The script failed to parse the orbital energies. Ensure your Gaussian jobs included orbital printing (standard in optimization jobs) and finished successfully (`SCF Done`).
- Empty `fitting_report.txt`: If the regression fails, check if `results.csv` contains `NaN` values (blank cells). The regression tool removes columns containing any missing data.
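
To track down which descriptors are being dropped for missing data, a small standalone check of `results.csv` can help. The sketch below uses only the standard `csv` module. The function name `columns_with_missing` is hypothetical, and treating empty cells, `nan`, and `none` as missing is an assumption about how the script writes blanks.

```python
import csv

MISSING_TOKENS = {"", "nan", "none"}


def columns_with_missing(csv_file):
    """Return {column: [data row numbers]} for cells that look empty.

    csv_file is an open text file; data rows are numbered from 1.
    """
    bad = {}
    for row_number, row in enumerate(csv.DictReader(csv_file), start=1):
        for column, value in row.items():
            if column is None:
                continue  # surplus cells beyond the header; ignored here
            if value is None or value.strip().lower() in MISSING_TOKENS:
                bad.setdefault(column, []).append(row_number)
    return bad
```

A typical call would be `with open("results.csv", newline="") as f: print(columns_with_missing(f))`. Any column that appears in the result is a candidate for removal before fitting.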