Automated Network Optimizer (ANO) for Enhanced Prediction of Intrinsic Solubility in Drug-like Organic Compounds: A Comprehensive Machine Learning Approach

Overview

This repository presents a novel approach to predicting the aqueous solubility of drug-like organic compounds with our Automated Network Optimizer (ANO) framework. By integrating machine learning models with automated feature selection and hyperparameter optimization, we achieve state-of-the-art prediction accuracy for intrinsic solubility (logS, the base-10 logarithm of solubility in mol/L).

System Requirements

Dependencies

  • Python 3.12 or later
  • TensorFlow 2.15.0 (Linux/MacOS/WSL)
  • TensorFlow 2.15.0-GPU (Windows)
  • RDKit 2024.3.1
  • pandas 2.2.1
  • scikit-learn 1.4.1.post1
  • seaborn 0.13.2
  • matplotlib 3.8.3
  • optuna 3.5.0
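
The pinned versions above can be compared against an existing environment before running the notebooks. The following is a minimal sketch, not part of the repository: the helper name check_environment is hypothetical, and the dictionary keys assume the usual PyPI distribution names for each library.

```python
from importlib import metadata

# Pinned versions copied from the dependency list above.
# The keys are assumed PyPI distribution names; the list itself only
# names the libraries, not the packages to install.
PINNED = {
    "tensorflow": "2.15.0",
    "rdkit": "2024.3.1",
    "pandas": "2.2.1",
    "scikit-learn": "1.4.1.post1",
    "seaborn": "0.13.2",
    "matplotlib": "3.8.3",
    "optuna": "3.5.0",
}


def check_environment(pinned=PINNED):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_environment().items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty result means every listed package is installed at the pinned version. A mismatch is not necessarily an error, since nearby versions may also work.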

Repository Structure

Jupyter Notebooks

  1. 1_standard_ML.ipynb

    • Comprehensive evaluation of traditional ML approaches
    • Random Forest, XGBoost, and SVM implementations
    • Baseline performance metrics and comparative analysis
  2. 2_solubility_fingerprint_comparison.ipynb

    • Detailed analysis of molecular fingerprint methods
    • Evaluation of ECFP, MACCS, and custom fingerprints
    • Performance comparison across fingerprint types
  3. 3_ANO_with_feature_checker.ipynb

    • Implementation of ANO framework
    • Automated feature importance analysis
    • Real-time feature selection optimization
  4. 4_ANO_feature.ipynb

    • Optimal physicochemical feature search using ANO
  5. 5_ANO_structure.ipynb

    • Hyperparameter optimization using ANO
  6. 6_ANO_network_[fea_struc].ipynb

    • Network architecture optimization based on optimal physicochemical features
  7. 7_ANO_network_[struc_fea].ipynb

    • Network architecture optimization based on optimal hyperparameters
  8. 7_Solubility_final_HPO_proving.ipynb (bug fixes in progress)

    • Performance validation of final ANO model
  9. 8_solubility_xai.ipynb

    • Model explainability analysis
    • Permutation importance and SHAP evaluation
    • Correlation analysis between physicochemical features and logS
    • Implementation of Lipinski's Rule of 5
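
As an illustration of the Rule of 5 check listed for 8_solubility_xai.ipynb, the sketch below counts violations from precomputed descriptors. It is not the notebook's code: the Descriptors container and both function names are hypothetical, and the descriptor values are assumed to come from a toolkit such as RDKit. The thresholds are the standard ones (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed descriptors for one molecule (hypothetical container)."""
    mol_weight: float       # molecular weight in Da
    logp: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d: Descriptors) -> list[str]:
    """List every Rule of 5 criterion the molecule violates."""
    violations = []
    if d.mol_weight > 500:
        violations.append("molecular weight > 500 Da")
    if d.logp > 5:
        violations.append("logP > 5")
    if d.h_bond_donors > 5:
        violations.append("more than 5 hydrogen-bond donors")
    if d.h_bond_acceptors > 10:
        violations.append("more than 10 hydrogen-bond acceptors")
    return violations


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """Common convention: a compound passes with at most one violation."""
    return len(lipinski_violations(d)) <= max_violations


# Illustrative values only, not measurements of a real compound.
small = Descriptors(mol_weight=180.2, logp=1.2, h_bond_donors=1, h_bond_acceptors=4)
large = Descriptors(mol_weight=650.0, logp=6.3, h_bond_donors=2, h_bond_acceptors=12)
print(passes_rule_of_five(small), lipinski_violations(large))
```

Returning the list of violated criteria, rather than a single boolean, makes it easier to relate individual descriptors to logS in a correlation analysis.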

Core Python Modules

  • basic_model.py

    • Foundation architecture for fingerprint analysis
    • Modular design for easy extension
    • Comprehensive validation methods
  • feature_search.py

    • Feature search implementation for ANO (used in 4_ANO_feature.ipynb)
  • feature_selection.py

    • Feature selection implementation for ANO (used in 5_ANO_structure.ipynb, 6_ANO_network_[fea_struc].ipynb, 7_ANO_network_[struc_fea].ipynb)
  • learning_model.py

    • ANO learning model implementation
    • Used in the deep learning and feature optimization notebooks (3_ANO_with_feature_checker, 3_solubility_descriptor_deeplearning, 4_ANO_feature, 5_ANO_structure.ipynb, 6_ANO_network_[fea_struc].ipynb, 7_ANO_network_[struc_fea].ipynb)

Key Innovations

  • 49 carefully selected chemical descriptors for the target dataset
  • Fast, efficient selection of chemical descriptors and hyperparameters for machine learning models

Model Availability

Pre-trained models and complete results are available at: https://huggingface.co/arer90/ANO_solubility_prediction/tree/main

Version

Current Version: 1.0.2 (2024.11)

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this work in your research, please cite:

@article{ANO2024solubility,
  title={Prediction of intrinsic solubility for drug-like organic compounds using Automated Network Optimizer (ANO) for physicochemical feature and hyperparameter optimization},
  author={Chung, Young Kyu and Lee, Seung Jun and Lee, Jonggeun and Cho, Hyunwoo and Kim, Sung-Jin and Huh, June},
  journal={ChemRxiv},
  year={2024},
  doi={10.26434/chemrxiv-2024-mp291}
}