Gaussian Output Data Extractor & QSAR Modeling Tool

Overview

This Python tool is designed to automate the extraction of quantum chemical descriptors from Gaussian calculation output files (.log) and perform subsequent multivariate statistical analysis.

It is tailored to ionic liquid systems (cation/anion/solvent interactions): it extracts electronic, thermodynamic, and steric properties, then runs backward stepwise linear regression to model an experimental property (e.g., conductivity).

Key Features

Batch Extraction: Automatically processes multiple molecules based on numerical indices found in filenames.
Robust Parsing: Extracts SCF energies, orbital energies (HOMO/LUMO), Dipole moments, and Frequency data using regex.
Descriptor Calculation: Computes DFT-based reactivity indices (Hardness, Softness, Electrophilicity, etc.).
Steric Analysis: Calculates Sterimol parameters ($L, B_1, B_5$) using morfeus based on .gjf geometries.
Complexation Energy: Automatically calculates binding energies ($\Delta E$) for salts and solvent-cation complexes.
Statistical Modeling: Performs automatic Backward Stepwise OLS Regression to identify significant descriptors correlated with experimental data.
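
To illustrate the regex-based parsing mentioned above, here is a minimal sketch that pulls the final SCF energy and the HOMO/LUMO energies out of log text. The patterns target the lines as they usually appear in Gaussian logs for closed-shell jobs; the function name `parse_log_text` and the patterns themselves are assumptions for illustration and may differ from what extract_gaussian_data.py actually uses.

```python
import re

# Patterns for lines as they usually appear in Gaussian logs (assumed, not the script's own).
SCF_RE = re.compile(r"SCF Done:\s+E\(\S+\)\s*=\s*(-?\d+\.\d+)")
OCC_RE = re.compile(r"Alpha\s+occ\. eigenvalues --(.*)")
VIRT_RE = re.compile(r"Alpha\s+virt\. eigenvalues --(.*)")
NUM_RE = re.compile(r"-?\d+\.\d+")


def parse_log_text(text):
    """Return the last SCF energy and the HOMO/LUMO of the last orbital block (Hartree)."""
    scf_energies = [float(x) for x in SCF_RE.findall(text)]
    occ, virt = [], []
    for line in text.splitlines():
        match = OCC_RE.search(line)
        if match:
            if virt:  # an occupied line after virtual lines starts a new block
                occ, virt = [], []
            occ.extend(float(x) for x in NUM_RE.findall(match.group(1)))
            continue
        match = VIRT_RE.search(line)
        if match:
            virt.extend(float(x) for x in NUM_RE.findall(match.group(1)))
    return {
        "scf": scf_energies[-1] if scf_energies else None,
        "homo": occ[-1] if occ else None,  # highest occupied orbital
        "lumo": virt[0] if virt else None,  # lowest virtual orbital
    }


# Synthetic excerpt with made-up values, for illustration only.
sample = """\
 SCF Done:  E(RB3LYP) =  -308.700000000     A.U. after   12 cycles
 Alpha  occ. eigenvalues --  -10.20000  -0.50000  -0.30000
 Alpha virt. eigenvalues --    0.05000   0.10000
 SCF Done:  E(RB3LYP) =  -308.719071120     A.U. after   10 cycles
 Alpha  occ. eigenvalues --  -10.21000  -0.51000  -0.25000
 Alpha virt. eigenvalues --    0.02000   0.11000
"""
result = parse_log_text(sample)
```

Taking the last SCF energy and the last orbital block matters for optimization jobs, where these lines can appear more than once per file.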

Prerequisites

Ensure you have Python installed along with the following libraries:

pip install pandas statsmodels morfeus-py openpyxl

Note: The script also uses standard libraries: os, re, csv.

File Directory & Naming Convention

The script relies on a strict file naming convention to associate files with a specific sample ID (integer n).

1. Gaussian Output Files

Place all files in a single directory. Replace n with the sample number (e.g., 1, 2, 10...):

| File Type | Naming Pattern | Description |
| --- | --- | --- |
| Cation Optimization | `n-cation.log` | Output log for the isolated cation. |
| Cation Input | `n-cation.gjf` | Geometry file required for the Sterimol calculation. |
| Salt Optimization | `n-salt.log` | Output log for the cation-anion salt complex. |
| Solvent Complex | `n-DME-M*.log` | E.g., `1-DME-M1.200.node1.log`. Used to find the lowest-energy solvent complex. |
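
The naming convention above can be checked before running the tool. The following sketch groups filenames by sample ID using regular expressions derived from the table; the `PATTERNS` dictionary and the `group_files` helper are hypothetical and are not taken from the script.

```python
import re
from collections import defaultdict

# Hypothetical patterns derived from the naming table; not the script's own.
PATTERNS = {
    "cation_log": re.compile(r"^(\d+)-cation\.log$"),
    "cation_gjf": re.compile(r"^(\d+)-cation\.gjf$"),
    "salt_log": re.compile(r"^(\d+)-salt\.log$"),
    "solvent_logs": re.compile(r"^(\d+)-DME-M.*\.log$"),
}


def group_files(filenames):
    """Map each integer sample ID to the files that belong to it."""
    groups = defaultdict(lambda: {"solvent_logs": []})
    for name in filenames:
        for role, pattern in PATTERNS.items():
            match = pattern.match(name)
            if not match:
                continue
            sample_id = int(match.group(1))
            if role == "solvent_logs":
                groups[sample_id][role].append(name)  # several complexes per sample
            else:
                groups[sample_id][role] = name
            break
    return dict(sorted(groups.items()))


files = [
    "1-cation.log", "1-cation.gjf", "1-salt.log",
    "1-DME-M1.200.node1.log", "10-cation.log", "notes.txt",
]
groups = group_files(files)
```

In practice the list would come from something like `os.listdir(data_folder)`. Files that match no pattern (here `notes.txt`) are ignored.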

2. Configuration & Experimental Data (`data.xlsx`)

You must provide an Excel file named data.xlsx in the same directory. This file serves two purposes: it provides the experimental target variable for the regression and the configuration for the steric calculations.

Required Data Format

The Excel file must contain the following columns (headers are case-sensitive):

| Column Name | Description | Example |
| --- | --- | --- |
| `number` | The Sample ID corresponding to `n` in filenames. | `1` |
| `dependent variable` | The experimental value (target Y) to predict (e.g., Conductivity). | `5.4` |
| `sterimol axis atoms` | Atom indices for Sterimol calculation, separated by a comma. | `1,6` |

Example data.xlsx content:

| number | dependent variable | sterimol axis atoms |
| --- | --- | --- |
| 1 | 8.23 | 1,6 |
| 2 | 7.45 | 1,5 |
| 3 | 9.10 | 2,7 |
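
As a rough illustration of how these rows might be consumed, the sketch below validates the three required headers and splits the `sterimol axis atoms` cell into a pair of integers. It works on a list of dictionaries, such as the one `pandas.read_excel("data.xlsx").to_dict("records")` would return; the helper name `parse_config_rows` is hypothetical, and the script's own loading code may differ.

```python
REQUIRED_COLUMNS = ("number", "dependent variable", "sterimol axis atoms")


def parse_config_rows(rows):
    """Return ({sample_id: target}, {sample_id: (atom1, atom2)}) from row dicts."""
    targets, axes = {}, {}
    for row in rows:
        missing = [col for col in REQUIRED_COLUMNS if col not in row]
        if missing:
            raise KeyError(f"missing columns: {missing}")
        sample_id = int(row["number"])
        targets[sample_id] = float(row["dependent variable"])
        # "1,6" -> (1, 6); int() tolerates surrounding spaces such as "1, 6"
        first, second = (int(part) for part in str(row["sterimol axis atoms"]).split(","))
        axes[sample_id] = (first, second)
    return targets, axes


# Rows mirroring the example data.xlsx content above.
rows = [
    {"number": 1, "dependent variable": 8.23, "sterimol axis atoms": "1,6"},
    {"number": 2, "dependent variable": 7.45, "sterimol axis atoms": "1,5"},
    {"number": 3, "dependent variable": 9.10, "sterimol axis atoms": "2,7"},
]
targets, axes = parse_config_rows(rows)
```

Because the lookup is by exact key, a header such as `Number` or `Dependent Variable` raises an error, which matches the case-sensitivity requirement stated above.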

Extracted Descriptors

The script extracts and calculates the following descriptors:

Electronic & Reactivity

Energies: HOMO, LUMO, HOMO-LUMO Gap.
DFT Indices:
- Chemical Hardness ($\eta$)
- Chemical Softness ($\sigma$)
- Chemical Potential ($\mu$)
- Electronegativity ($\chi$)
- Electrophilicity Index ($\omega$)
Dipole Moment: Field-independent basis (Debye).
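
The DFT indices listed above are commonly derived from the frontier orbital energies alone. The sketch below uses the usual finite-difference definitions ($\eta = (E_{LUMO} - E_{HOMO})/2$, $\mu = (E_{HOMO} + E_{LUMO})/2$, $\chi = -\mu$, $\sigma = 1/\eta$, $\omega = \mu^2 / 2\eta$). These are assumptions: conventions vary (for example, softness is sometimes defined as $1/2\eta$), and the formulas in the script may differ. The function name `reactivity_indices` is hypothetical.

```python
def reactivity_indices(homo, lumo):
    """Compute common conceptual-DFT indices from HOMO/LUMO energies.

    The results are in the same energy unit as the inputs
    (softness in the inverse unit). Definitions are assumed, see the text.
    """
    gap = lumo - homo
    hardness = gap / 2.0
    potential = (homo + lumo) / 2.0
    return {
        "gap": gap,
        "hardness": hardness,
        "softness": 1.0 / hardness,
        "chemical_potential": potential,
        "electronegativity": -potential,
        "electrophilicity": potential ** 2 / (2.0 * hardness),
    }


# Made-up orbital energies in Hartree, for illustration only.
indices = reactivity_indices(homo=-0.25, lumo=-0.05)
```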

Thermodynamic & Energetic

Energies: Total SCF Energy, Kinetic Energy (KE), Nuclear Repulsion (N-N), Electron-Nuclear (E-N).
Corrections: ZPE, Thermal Corrections to Energy, Enthalpy (H), and Gibbs Free Energy (G).
Thermochemistry: Entropy ($S$), Heat Capacity ($C_v$).
Binding Energies: $\Delta E$ for Salt formation and Solvent-Cation interaction.
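
A binding energy of this kind is typically the energy of the complex minus the sum of the energies of its isolated fragments. The sketch below assumes that definition, uses the anion and DME reference energies quoted in the configuration step of this README, and picks the lowest-energy solvent complex with `min`. Whether the script uses raw SCF energies or corrected energies is not stated here, and the cation, salt, and complex energies (as well as the second solvent filename) are made-up placeholders.

```python
HARTREE_TO_KCAL_MOL = 627.5095

# Reference energies quoted in this README (Hartree).
ANION_ENERGY = -459.54813049
E_DME_SOLVENT = -308.71907112


def binding_energy(e_complex, *e_fragments, to_kcal=False):
    """Delta E = E(complex) - sum of E(fragments); Hartree unless to_kcal is True."""
    delta = e_complex - sum(e_fragments)
    return delta * HARTREE_TO_KCAL_MOL if to_kcal else delta


# Placeholder energies (Hartree), not real results.
e_cation = -400.0
e_salt = -859.70
solvent_complexes = {
    "1-DME-M1.200.node1.log": -708.75,
    "1-DME-M2.200.node1.log": -708.76,
}

# Salt formation: cation + anion -> salt
delta_salt = binding_energy(e_salt, e_cation, ANION_ENERGY)

# Solvent-cation interaction, using the lowest-energy complex found.
best_file = min(solvent_complexes, key=solvent_complexes.get)
delta_solvent = binding_energy(solvent_complexes[best_file], e_cation, E_DME_SOLVENT)
```

A negative $\Delta E$ under this sign convention means the complex is lower in energy than the separated fragments.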

Structural & Steric

Sterimol Parameters: $L$ (Length), $B_1$ (Min width), $B_5$ (Max width).
Frequencies: Lowest vibrational frequency.
Mass: Molecular mass.
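
The lowest vibrational frequency is a quick sanity check on each optimized geometry: a negative value is how Gaussian reports an imaginary mode, meaning the structure is not a true minimum. A minimal sketch of extracting it from the `Frequencies --` lines of a frequency job follows; the helper name `lowest_frequency` is hypothetical and the script's own parsing may differ.

```python
import re

FREQ_RE = re.compile(r"Frequencies --(.*)")
NUM_RE = re.compile(r"-?\d+\.\d+")


def lowest_frequency(text):
    """Return the smallest frequency (cm^-1) in the log text, or None if absent."""
    values = [
        float(number)
        for line in FREQ_RE.findall(text)
        for number in NUM_RE.findall(line)
    ]
    return min(values) if values else None


# Synthetic excerpt with made-up values, for illustration only.
sample = """\
 Frequencies --     25.1234                48.5678                77.0001
 Frequencies --    101.5000               150.2500               200.7500
"""
lowest = lowest_frequency(sample)
```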

Detailed Usage Guide

Follow these steps to run the analysis:

Step 1: Prepare Your Directory

Create a folder (e.g., D:\Research\GaussianData) and ensure it contains:

All your .log and .gjf files named correctly (see Naming Convention).
The data.xlsx file containing your experimental data and sterimol configs.

Step 2: Configure the Script

Open extract_gaussian_data.py in a text editor or IDE. Locate the main() function and update the data_folder variable to point to your directory:

def main():
    # ...
    data_folder = r"D:\Research\GaussianData"  # <--- Update this path
    output_file = 'results.csv'
    # ...

Optional: If your system uses a different Anion or Solvent, update the energy constants at the top of the main() function:

    anion_energy = -459.54813049
    E_DME_SOLVENT = -308.71907112

Step 3: Run the Script

Open your terminal or command prompt, navigate to the folder containing the python script, and run:

python extract_gaussian_data.py

Step 4: Monitor Console Output

The script will provide real-time feedback in the console:

Loading: Confirms that data.xlsx was loaded and reports how many Sterimol configs were found.
Processing: Iterates through every group number found:

Processing group 1...

Fitting: Once extraction is done, the multivariate linear regression (backward elimination) begins:

Starting Multivariate Linear Fitting (Mode: backward)...
--- Round 1 Fitting ---
Descriptor Contribution...
Decision: Removing descriptor 'SM-LUMO' (Low contribution, P=0.85...)

Step 5: Check Outputs

After execution, two new files will be generated in your working directory:

results.csv: A comprehensive dataset containing every extracted descriptor for every molecule. This is your raw data for further analysis.
fitting_report.txt: The final statistical summary of the best regression model found, including R-squared, F-statistic, and coefficients.

Troubleshooting

Warning: File not found ...: The script cannot find a specific log file. Double-check that your files are named exactly n-cation.log, n-salt.log, etc.
Warning: Config file ... missing columns: Your data.xlsx headers are likely incorrect. They must exactly match number, dependent variable, and sterimol axis atoms.
Error extracting HOMO/LUMO: The script failed to parse the orbital energies. Ensure your Gaussian jobs included orbital printing (standard in optimization jobs) and finished successfully (SCF Done).
Empty fitting_report.txt: If the regression fails, check if results.csv contains NaN values (blank cells). The regression tool removes columns containing any missing data.
