Update T1.md #4

Open
wants to merge 1 commit into
base: master
212 changes: 99 additions & 113 deletions source/_posts/T1.md
@@ -93,144 +93,130 @@ Users should be able to:


**<span style="color: #90EE90; font-size: 1.5rem;">AI</span>**
**<span style="color: #ADD8E6; font-size: 1rem;">Authors - R. Pranav and Jagaadhep U K</span>**

**<span style="color: #ADD8E6; font-size: 1rem;">Google Colab</span>**
**<span style="color: #ADD8E6; font-size: 1rem;">Authors - Shalini D and Jeba Rachel</span>**

**Google Colab is a free, browser-based notebook environment for writing and running Python code. You type code into a cell and see the results immediately. Colab suits learning and data science work because popular Python libraries such as Pandas and TensorFlow are supported, and notebooks can be shared so that several people can collaborate on a project. It also provides access to hardware accelerators (GPUs and TPUs), which speeds up computationally heavy tasks.**

**Note: Use Google Colab to complete the tasks provided. After completing the tasks, upload your Google Colab Notebook to your GitHub repository.**

<span style="color: #ADD8E6;">_References:_</span>
- [<span style="color: #55AAFF;">Google Colab tutorial</span>](https://www.youtube.com/watch?v=rsBiVxzmhG0)

<hr>

**<span style="color: #FF6363; font-size: 1rem;">Question 1</span>**

**<span style="color: #ADD8E6; font-size: 1rem;">NumPy</span>**

**NumPy is a popular Python library used for working with numbers and arrays. Think of it as a tool that helps you do math and handle large sets of numbers easily. It makes it simple to perform calculations on lists of numbers and matrices. NumPy is great for anyone who wants to do data analysis or scientific computing because it speeds up these tasks with its fast and powerful features.**

**<span style="color: #ADD8E6; font-size: 1rem;">DataSet</span>**

We use a dataset describing 15 students, each with five attributes: Height, Weight, Age, Average Grade, and Courses. The Python code below creates a NumPy array holding this dataset.

**Python code to create NumPy array for the task:**

```python
import numpy as np

# Creating a dataset with 15 students and 5 attributes
# (columns, in order: Height, Weight, Age, Avg Grade, Courses)
data = np.array([
    [170, 65, 19, 85, 5],
    [180, 75, 20, 90, 6],
    [160, 55, 18, 80, 4],
    [175, 70, 21, 88, 7],
    [155, 50, 19, 82, 5],
    [165, 62, 22, 89, 6],
    [178, 80, 23, 91, 7],
    [162, 58, 20, 78, 3],
    [172, 68, 19, 86, 5],
    [169, 66, 20, 84, 4],
    [171, 64, 22, 87, 6],
    [177, 72, 21, 90, 9],
    [174, 76, 24, 88, 8],
    [158, 52, 18, 75, 3],
    [164, 63, 19, 81, 4]
])

# Printing the dataset with student labels
print("Student\tHeight\tWeight\tAge\tAvg Grade\tCourses")
for index, student in enumerate(data):
    # The loop body must be indented, otherwise Python raises an IndentationError
    print(f"Student {index + 1}\t{student[0]}\t{student[1]}\t{student[2]}\t{student[3]}\t\t{student[4]}")
```

<span style="color: #ADD8E6;">_Objective:_</span>
- <span style="color: #FF6363;">Question 1.1 : </span>Find the Average Height of the Students

Explanation: You need to use the mean() function from NumPy to compute the average value of the height column in the dataset.
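
A minimal sketch of the column-mean idea, using a small made-up array rather than the task dataset. The name `toy` and its values are assumptions for illustration only; column 0 plays the role of height.

```python
import numpy as np

# Hypothetical 3-student array (not the task dataset); columns: height, weight, age
toy = np.array([
    [170, 65, 19],
    [180, 75, 20],
    [160, 55, 18],
])

# toy[:, 0] selects every row of column 0; np.mean averages those values
avg_height = np.mean(toy[:, 0])
print(avg_height)  # 170.0
```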

- <span style="color: #FF6363;">Question 1.2 : </span>Find the Age of the Oldest Student
---

Explanation: Use the max() function from NumPy to find the maximum value in the age column and determine the age of the oldest student.
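
As a hedged illustration of finding a maximum, the sketch below uses a hypothetical `ages` array (not the task dataset). `np.max(ages)` and the method form `ages.max()` return the same value.

```python
import numpy as np

# Hypothetical ages for four students (illustrative values only)
ages = np.array([19, 23, 18, 21])

# np.max returns the largest element of the array
oldest = np.max(ages)
print(oldest)  # 23
```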
**<span style="color: #FF6363; font-size: 1rem;">Task 1: Model Selection for Regression with Scikit-learn</span>**

- <span style="color: #FF6363;">Question 1.3 : </span>Find the Index of the Student Who Took the Most Courses
**<span style="color: #ADD8E6;">Objective:</span>**
Build a regression model to predict diabetes progression using Scikit-learn. This task will help you understand the fundamentals of selecting and evaluating regression models.

Explanation: Use the argmax() function from NumPy to locate the index of the maximum value in the number of courses column.
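
The sketch below shows how `argmax()` behaves on a small, made-up `courses` array (an assumption for illustration, not the task data). It returns a zero-based position, so add 1 if you want a "Student N" label like the one printed with the dataset; when several entries tie for the maximum, the first position is returned.

```python
import numpy as np

# Hypothetical course counts for four students (illustrative values only)
courses = np.array([5, 6, 9, 4])

# np.argmax gives the zero-based index of the largest value
top_index = np.argmax(courses)
print(top_index)                   # 2
print(f"Student {top_index + 1}")  # Student 3
```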
**<span style="color: #ADD8E6;">Dataset:</span>**
Diabetes Dataset (available in Scikit-learn).

- <span style="color: #FF6363;">Question 1.4 : </span>Find the Number of Students with an Average Grade Above 85
---

Explanation: Compare the average grade column against 85 to get a boolean mask (True where the grade is above 85), and then use the sum() function to count the True entries.
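
A small sketch of counting with a condition, assuming a hypothetical `grades` array rather than the task dataset. The comparison produces a boolean array, and summing it counts the `True` entries because `True` is treated as 1.

```python
import numpy as np

# Hypothetical average grades (illustrative values only)
grades = np.array([85, 90, 80, 88, 86])

# Element-wise comparison yields a boolean mask
above = grades > 85

# Summing a boolean array counts how many entries are True
count = np.sum(above)
print(count)  # 3
```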
**<span style="color: #ADD8E6;">Steps</span>**

1. **Load the Dataset**
- Use Scikit-learn to load the Diabetes dataset.
- Convert it into a Pandas DataFrame for preprocessing.

**Code:**
```python
from sklearn.datasets import load_diabetes
import pandas as pd
data = load_diabetes()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['progression'] = data.target
df.to_csv("diabetes.csv", index=False)
```

2. **Preprocess the Data**
   - Load the CSV file using Pandas.
   - Check for missing values and handle them if present.
   - Standardize numerical features using `StandardScaler`.

3. **Model Selection**
   - Compare multiple regression models, such as Linear Regression, Ridge Regression, and Random Forest Regressor.
   - Use cross-validation to evaluate model performance using Mean Squared Error (MSE).

4. **Train and Test the Best Model**
   - Split the data into training and testing sets (80-20 split).
   - Train the selected model on the training set.
   - Evaluate its performance on the test set.
---
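
The regression steps above could be sketched roughly as follows. This is one possible arrangement under stated assumptions, not a reference solution: it loads the data with `as_frame=True` instead of reading `diabetes.csv`, holds out the test set before cross-validating, wraps `StandardScaler` in a pipeline so scaling is fitted only on training folds, and uses arbitrary settings (`random_state=42`, `cv=5`, `alpha=1.0`, `n_estimators=50`).

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load features and target as pandas objects (assumption: no CSV round trip)
X, y = load_diabetes(return_X_y=True, as_frame=True)

# Count missing values across all feature columns
n_missing = int(X.isna().sum().sum())
print("missing values:", n_missing)

# 80-20 split; the test set is kept aside until the very end
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate models named in the task; hyperparameters are illustrative
candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# 5-fold cross-validated MSE on the training data for each candidate
cv_mse = {}
for name, model in candidates.items():
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(
        pipe, X_train, y_train, cv=5, scoring="neg_mean_squared_error"
    )
    cv_mse[name] = -scores.mean()  # scikit-learn reports negated MSE

# Pick the candidate with the lowest cross-validated MSE and refit it
best_name = min(cv_mse, key=cv_mse.get)
best_model = make_pipeline(StandardScaler(), candidates[best_name])
best_model.fit(X_train, y_train)

# Final evaluation on the held-out test set
test_mse = mean_squared_error(y_test, best_model.predict(X_test))
print(best_name, test_mse)
```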

- <span style="color: #FF6363;">Question 1.5 : </span>Calculate the Ratio of a Student's Age to Their Average Grade for Each Student
**<span style="color: #ADD8E6;">Deliverables</span>**

Explanation: Perform element-wise division of the age column by the average grade column to get the ratio for each student.
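
Element-wise division can be sketched with two short, hypothetical arrays (`ages` and `grades` are made-up values, not the task dataset). Dividing one NumPy array by another of the same shape pairs the entries up position by position.

```python
import numpy as np

# Hypothetical ages and average grades for three students
ages = np.array([20, 18, 22])
grades = np.array([80, 90, 88])

# The / operator divides matching positions: 20/80, 18/90, 22/88
ratio = ages / grades
print(ratio)
```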

<span style="color: #ADD8E6;">_References:_</span>
- [<span style="color: #55AAFF;">W3Schools</span>](https://www.w3schools.com/python/numpy/default.asp)
- [<span style="color: #55AAFF;">NumPy Documentation</span>](https://numpy.org/doc/stable/reference/arrays.ndarray.html)
- [<span style="color: #55AAFF;">GeeksforGeeks</span>](https://www.geeksforgeeks.org/numpy-tutorial/)
- [<span style="color: #55AAFF;">NumPy cheat sheet</span>](https://images.datacamp.com/image/upload/v1676302459/Marketing/Blog/Numpy_Cheat_Sheet.pdf)
- A Google Colab Notebook containing:
  - Code for data preprocessing, model comparison, and evaluation.
  - Comments explaining each step.
  - A summary of the best model and its MSE on the test set.

<hr>
---

**<span style="color: #FF6363; font-size: 1rem;">Question 2</span>**
**<span style="color: #FF6363; font-size: 1rem;">Task 2: Model Selection for Classification with Scikit-learn</span>**

**<span style="color: #ADD8E6; font-size: 1rem;">Pandas</span>**
**<span style="color: #ADD8E6;">Objective:</span>**
Build a classification model to predict iris species using Scikit-learn. This task focuses on understanding the process of selecting the best classification model.

**Pandas is a powerful open-source data analysis and manipulation library for Python. It provides data structures such as DataFrames and Series, which are built on top of NumPy arrays and designed to handle a wide range of data types and operations efficiently. Pandas is used extensively in data science and machine learning for tasks such as data cleaning, transformation, and analysis.**
**<span style="color: #ADD8E6;">Dataset:</span>**
Iris Dataset (available in Scikit-learn).

**<span style="color: #ADD8E6; font-size: 1rem;">DataSet</span>**
---

*We will use a dataset with 15 students, each having 5 attributes. Let's first convert the list into a Pandas DataFrame.*
```python
data = [
[170, 65, 19, 85, 5],
[180, 75, 20, 90, 6],
[160, 55, 18, 80, 4],
[175, 70, 21, 88, 7],
[155, 50, 19, 82, 5],
[165, 62, 22, 89, 6],
[178, 80, 23, 91, 7],
[162, 58, 20, 78, 3],
[172, 68, 19, 86, 5],
[169, 66, 20, 84, 4],
[171, 64, 22, 87, 6],
[177, 72, 21, 90, 9],
[174, 76, 24, 88, 8],
[158, 52, 18, 75, 3],
[164, 63, 19, 81, 4]
]
# Column names, in order: 'Height', 'Weight', 'Age', 'Avg_Grade', 'Courses'.
```
<span style="color: #ADD8E6;">_Objective:_</span>
- <span style="color: #FF6363;">Question 2.1 : </span>Create a Pandas DataFrame
**<span style="color: #ADD8E6;">Steps</span>**

1. **Load the Dataset**
- Use Scikit-learn to load the Iris dataset.
- Convert it into a Pandas DataFrame for preprocessing.

**Code:**
```python
from sklearn.datasets import load_iris
import pandas as pd
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['species'] = data.target
df.to_csv("iris.csv", index=False)
```

2. **Preprocess the Data**
   - Load the CSV file in Pandas.
   - Handle missing values if present.
   - Standardize numerical features using `StandardScaler`.
   - Encode the target variable using label encoding.

3. **Model Selection**
   - Compare multiple classification models, such as Logistic Regression, Decision Trees, and Support Vector Machines (SVM).
   - Use cross-validation to evaluate model performance using accuracy and F1-score.

4. **Train and Test the Best Model**
   - Split the data into training and testing sets (80-20 split).
   - Train the selected model on the training set.
   - Evaluate its performance on the test set.
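
The classification steps above might look roughly like the sketch below. Treat it as one possible arrangement rather than a reference solution. Assumptions: the data is loaded with `as_frame=True` instead of reading `iris.csv`; `load_iris` already returns integer class codes, so no separate label-encoding call appears; the F1-score is macro-averaged across the three classes; the split is stratified; and settings such as `random_state=42`, `cv=5`, and `max_iter=1000` are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Features as a DataFrame, target as integer class codes (0, 1, 2)
X, y = load_iris(return_X_y=True, as_frame=True)
n_missing = int(X.isna().sum().sum())

# 80-20 split, stratified so each species keeps its share in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Candidate models named in the task; hyperparameters are illustrative
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=42),
    "svm": SVC(),
}

# 5-fold cross-validation reporting accuracy and macro-averaged F1
cv_results = {}
for name, model in candidates.items():
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_validate(
        pipe, X_train, y_train, cv=5, scoring=["accuracy", "f1_macro"]
    )
    cv_results[name] = (
        scores["test_accuracy"].mean(),
        scores["test_f1_macro"].mean(),
    )

# Choose by cross-validated macro F1 (a design choice; accuracy also works)
best_name = max(cv_results, key=lambda n: cv_results[n][1])
best_model = make_pipeline(StandardScaler(), candidates[best_name])
best_model.fit(X_train, y_train)

# Final evaluation on the held-out test set
y_pred = best_model.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
test_f1 = f1_score(y_test, y_pred, average="macro")
print(best_name, test_acc, test_f1)
```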

Explanation: You need to understand how to convert a list of lists (or a NumPy array) into a DataFrame and assign column names.
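
A minimal sketch of building a DataFrame with named columns, assuming a hypothetical two-row list `rows` rather than the full task dataset. The `columns` argument assigns one name per inner-list position.

```python
import pandas as pd

# Hypothetical rows (illustrative values): height, weight, age
rows = [
    [170, 65, 19],
    [180, 75, 20],
]

# Each inner list becomes one row; `columns` names the positions in order
toy_df = pd.DataFrame(rows, columns=["Height", "Weight", "Age"])
print(toy_df.shape)  # (2, 3)
```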
---

- <span style="color: #FF6363;">Question 2.2 : </span>Describe the DataFrame
**<span style="color: #ADD8E6;">Deliverables</span>**

Explanation: The describe() function provides various summary statistics (mean, standard deviation, min, max, and percentiles) for numeric columns in the DataFrame.
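
To illustrate, the sketch below calls `describe()` on a tiny made-up DataFrame (the values are assumptions, not the task data). The result is itself a DataFrame indexed by statistic name, so individual statistics can be read with `.loc`.

```python
import pandas as pd

# Hypothetical three-student frame (illustrative values only)
toy_df = pd.DataFrame({"Age": [19, 20, 21], "Avg_Grade": [80, 85, 90]})

# describe() returns count, mean, std, min, quartiles and max per numeric column
summary = toy_df.describe()
print(summary)

# Rows of the summary are labelled by statistic name
print(summary.loc["mean", "Age"])  # 20.0
```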
- A Google Colab Notebook containing:
  - Code for data preprocessing, model comparison, and evaluation.
  - Comments explaining each step.
  - A summary of the best model and its accuracy and F1-score on the test set.

- <span style="color: #FF6363;">Question 2.3 : </span>Count the Number of Students in Each Age Group
---

Explanation: Use the value_counts() function to count occurrences of unique values in a column.
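
A short sketch of `value_counts()` on a hypothetical `ages` Series (made-up values, not the task data). The result maps each distinct value to the number of times it occurs, with the most frequent value first.

```python
import pandas as pd

# Hypothetical ages (illustrative values only)
ages = pd.Series([19, 20, 19, 21, 19, 20], name="Age")

# Distinct values become the index; counts are sorted in descending order
counts = ages.value_counts()
print(counts)
```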
**<span style="color: #ADD8E6;">References</span>**

- <span style="color: #FF6363;">Question 2.4 : </span>Filter the DataFrame
**General References:**
1. **Regression:**
- [Scikit-learn Regression Guide](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning)
- [StandardScaler Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

Explanation: Filtering allows you to extract specific rows from the DataFrame based on certain conditions.
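
Filtering can be sketched with a boolean condition inside square brackets. The frame and the thresholds below are hypothetical, chosen only to show the syntax. When combining conditions, wrap each one in parentheses and join them with `&` (and) or `|` (or).

```python
import pandas as pd

# Hypothetical three-student frame (illustrative values only)
toy_df = pd.DataFrame({"Height": [170, 180, 160], "Avg_Grade": [85, 90, 80]})

# Keep only the rows where the condition is True
tall = toy_df[toy_df["Height"] > 165]
print(len(tall))  # 2

# Two conditions combined with & (each in its own parentheses)
tall_and_high = toy_df[(toy_df["Height"] > 165) & (toy_df["Avg_Grade"] > 85)]
print(len(tall_and_high))  # 1
```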

- <span style="color: #FF6363;">Question 2.5 : </span>Calculate the Average Grade for Each Age Group
2. **Classification:**
- [Scikit-learn Classification Guide](https://scikit-learn.org/stable/supervised_learning.html#classification)
- [Cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html)

Explanation: The groupby() function in Pandas is used to group data based on one or more columns. After grouping, you can apply aggregation functions like mean() to these groups. In this task, you will group students by their age and then calculate the average grade for each age group.
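
The group-then-aggregate pattern can be sketched as follows, using a small made-up frame (values are assumptions, not the task dataset). Selecting a column after `groupby()` restricts the aggregation to that column.

```python
import pandas as pd

# Hypothetical four-student frame (illustrative values only)
toy_df = pd.DataFrame({"Age": [19, 19, 20, 20], "Avg_Grade": [80, 90, 70, 80]})

# Group rows by Age, then average Avg_Grade within each group
mean_by_age = toy_df.groupby("Age")["Avg_Grade"].mean()
print(mean_by_age)
```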
**Specific Techniques:**
- **Data Splitting:** [Train-Test Split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
- **Handling Missing Values:** [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)
- **Model Selection:** [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

<span style="color: #ADD8E6;">_References:_</span>
- [<span style="color: #55AAFF;">W3Schools</span>](https://www.w3schools.com/python/pandas/default.asp)
- [<span style="color: #55AAFF;">Pandas Documentation</span>](https://pandas.pydata.org/docs/reference/frame.html)
- [<span style="color: #55AAFF;">GeeksforGeeks</span>](https://www.geeksforgeeks.org/pandas-tutorial/)
- [<span style="color: #55AAFF;">Pandas cheat sheet </span>](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)


<hr>