Welcome to my portfolio. Read on for a guided tour of some of my data projects.
In 2023 I completed The Data Incubator's (TDI) competitive Data Science Fellowship, an immersive training program that prepares people with strong quantitative and analytical backgrounds, often with advanced degrees in physics, mathematics, computer science, or engineering, for careers in data science. The curriculum covered machine learning, statistical analysis, data visualization, and big data technologies, among other topics. Fellowship work centered on hands-on projects that applied these methods to real-world problems. Modules included:
- Data wrangling with NumPy and Pandas
- Data analysis with SQL
- Machine learning with scikit-learn
- Natural Language Processing (NLP) and time series analysis
- Distributed computing with Spark (RDD and DataFrame/MLlib)
- Deep learning with Keras/TensorFlow
- A rigorous capstone project
DEZoomcamp is an educational initiative by DataTalksClub focused on teaching data engineering. Its curriculum includes several coding projects.
Project Link | Area | Project Description | Tools |
---|---|---|---|
ETL with PostgreSQL, Docker, and Terraform | ETL, containers and orchestration, resource provisioning | Demonstrated essential data engineering techniques using Docker. Developed the ETL process in a Jupyter notebook, with Pandas for data manipulation and SQLAlchemy for database creation and exploration. Orchestrated containers with Docker-Compose and provisioned resources on GCP with Terraform. | Google Cloud Platform, Pandas, SQLAlchemy, Jupyter, Docker, PostgreSQL, pgAdmin, Docker-Compose, Terraform |
MLZoomcamp is an educational initiative by DataTalksClub focused on teaching machine learning and data science. Its curriculum includes several coding projects.
Project Link | Area | Project Description | Libraries |
---|---|---|---|
Exploratory Data Analysis and Linear Regression | Data Analysis & Linear Regression | Demonstrated essential data analysis techniques and linear regression using Python in a Jupyter notebook. Used Pandas and NumPy for data manipulation, exploratory analysis, and the linear regression implementation. Calculated the regression weights by matrix inversion. | Pandas, NumPy |
Machine Learning for Regression | Machine Learning & Regression Analysis | Built a regression model predicting housing prices using the California Housing Prices dataset. Used Pandas, NumPy, Matplotlib, and Seaborn for data handling and visualization, and fit the regression by matrix inversion. | Pandas, NumPy, Matplotlib, Seaborn |
Classification with scikit-learn | Machine Learning & Classification | Recast a car price dataset as a classification problem: predicting whether a car's price is above the dataset's mean price ('above_average'). Used scikit-learn for classification modeling and NumPy, Pandas, Matplotlib, and Seaborn for data handling, visualization, and preprocessing. | scikit-learn, NumPy, Pandas, Matplotlib, Seaborn |
Model Evaluation | Machine Learning & Evaluation | Focused on model evaluation techniques: ROC-AUC, precision, recall, and F1 score. Recast a car price dataset as a binary classification problem: predicting whether a car's price is above the dataset's mean price ('above_average'). Used scikit-learn for data prep, exploratory analysis, logistic regression, cross-validation, and hyperparameter tuning. | scikit-learn, NumPy, Pandas, Matplotlib, Seaborn |
Model Deployment with Flask, Gunicorn, and Docker | Deployment & Machine Learning | Deployed a pre-trained ML model as a web service using Flask and Gunicorn, and containerized it with Docker. The service uses a scikit-learn model to predict credit probabilities for clients. | Pipenv, scikit-learn, Pickle, Flask, Gunicorn, Docker |
Decision Trees and Ensemble Learning | Machine Learning & Regression Analysis | Analyzed the California Housing Prices dataset from Kaggle to predict 'median_house_value'. Explored Decision Trees, Random Forests, and XGBoost for tree-based regression, and tuned hyperparameters to improve model performance. | Pandas, NumPy, Matplotlib, Seaborn, scikit-learn, XGBoost |
Image Classification with TensorFlow and Keras | Deep Learning & Computer Vision | Built an image classification model to differentiate between bees and wasps using TensorFlow and Keras. Used the "Bee or Wasp?" dataset from Kaggle and implemented Convolutional Neural Networks (CNNs) for the classification task. Explored data augmentation through image transformations. | NumPy, Pandas, Matplotlib, Seaborn, TensorFlow, Keras |
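
For readers unfamiliar with the "regression by matrix inversion" approach mentioned in the first two projects, here is a minimal sketch of the idea in NumPy. It is not the notebook code from those projects: the function name `fit_linear_regression` and the synthetic data are assumptions made for illustration, and the formula shown is the standard closed-form least-squares solution.

```python
import numpy as np


def fit_linear_regression(X, y):
    """Return (bias, weights) from the closed form w = (X^T X)^-1 X^T y.

    Hypothetical helper for illustration; assumes X^T X is invertible.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first weight acts as the bias term.
    X_b = np.column_stack([np.ones(len(X)), X])
    gram = X_b.T @ X_b
    w_full = np.linalg.inv(gram) @ X_b.T @ y
    return w_full[0], w_full[1:]


# Synthetic, noise-free data generated from y = 4 + 2*x1 - 1*x2.
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0], [5.0, 2.5]])
y = 4.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

bias, weights = fit_linear_regression(X, y)
print(bias, weights)
```

On real data, correlated features can make the matrix nearly singular, so a small regularization term on the diagonal or `np.linalg.lstsq` is usually a safer choice than a direct inverse.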