My first encounter with Data Science!

Shaivi Ganatra
12 min read · Aug 7, 2020


Data Science Introduction (https://download.ir/wp-content/uploads/2019/12/Data-Science-Foundations-Fundamentals-cover.jpg)

“Without Data, you are just another person with an opinion!”

Truly, data is a very important resource in this world. Moreover, the world is creating enormous amounts of data each day! Data science, in layman's terms, is putting this data to use: capturing, cleaning, managing, and exploring it to do wonders with it.

What is Data Science? (http://clipart-library.com/clipart/analyze-cliparts_5.htm)

Data science is the field of study that combines domain expertise, programming skills, and knowledge of mathematics and statistics to extract meaningful insights from data.

After understanding the importance of data, the next step is the implementation!

Nowadays, heart diseases are among the most common illnesses in humans. One of them, a heart attack, is even believed by some to be a peaceful way to die! These diseases are so common that, for people older than 75, congestive heart failure occurs about 10 times more often than in younger adults.


According to the WHO, cardiovascular diseases (CVDs) are the number one cause of death globally, taking an estimated 17.9 million lives each year. CVDs account for an estimated 31% of all deaths worldwide.

How helpful it would be if one could predict whether a patient is suffering from heart disease, or whether they may develop it in the future.

This is where data science comes into the picture. Below is a walkthrough of a complete project that tries to predict whether a patient suffers from heart disease, by cleaning, managing, exploring, and analysing the dataset and then applying ML algorithms for prediction.

  • So, the very first step is to import the required libraries!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
  • The next step is to load the dataset on which analysis and predictions are to be done; for that, the read_csv function of the pandas library is used.

Pandas is a Python library with many helpful utilities for loading and working with structured data.

  • The output of read_csv is stored in the dataframe df.
df = pd.read_csv('C:/Users/New/Desktop/Data Science/Classification of Heart Disease Patients/heart_info.csv')
  • It is important to know the size of our dataset, i.e. the number of rows and columns; for that, df.shape of the pandas library is used.
df.shape
  • For any dataset, feature selection is one of the most important steps.

But the question is what are features?

Features in simple words are columns of the dataset. They are the basic building blocks of datasets.

The quality of the features in your dataset has a major impact on the quality and accuracy of the insights which will be gained after applying various ML Algorithms.

But are all columns called features? No; this depends on the motive and the results expected from the analysis of the dataset. Only those features which can give consistent output and accuracy should be selected.

  • To view the features of a dataset, df.info() is used, which returns information on all the features in the dataset. In case any trivial feature exists, df.drop('column_name') can be used to drop that specific column (a minimal example is sketched below).

Both df.info() and df.drop('column_name') are functions of the pandas library.

df.info()
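For illustration only, here is a minimal sketch of dropping a trivial column; 'patient_id' is a hypothetical column name and does not exist in this dataset.

#hypothetical example: drop a column that carries no predictive information
#errors='ignore' keeps this safe if the column is not actually present
df = df.drop(columns=['patient_id'], errors='ignore')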
  • Explanation of 14 Features:

age: age in years

sex: (1 = male; 0 = female)

cp: chest pain type

  • 0: Typical angina: chest pain related to decreased blood supply to the heart
  • 1: Atypical angina: chest pain not related to heart
  • 2: Non-anginal pain: typically esophageal spasms (non heart related)
  • 3: Asymptomatic: chest pain not showing signs of disease

trestbps: resting blood pressure (in mm Hg on admission to the hospital); anything above 130–140 is typically cause for concern

chol: serum cholesterol in mg/dl

  • serum = LDL + HDL + .2 * triglycerides
  • above 200 is cause for concern

fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

  • >126 mg/dL signals diabetes

restecg: resting electrocardiographic results

  • 0: Nothing to note
  • 1: ST-T Wave abnormality
  • 2: Possible or definite left ventricular hypertrophy

thalach: maximum heart rate achieved

exang: exercise induced angina (1 = yes; 0 = no)

oldpeak: ST depression induced by exercise relative to rest; looks at the stress of the heart during exercise (an unhealthy heart will stress more)

slope: the slope of the peak exercise ST segment

  • 0: Upsloping: better heart rate with exercise (uncommon)
  • 1: Flat: minimal change (typical healthy heart)
  • 2: Downsloping: signs of an unhealthy heart

ca: number of major vessels (0–3) colored by fluoroscopy

  • a colored vessel means the doctor can see the blood passing through
  • the more blood movement, the better (no clots)

thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

  • 1, 3: normal
  • 6: fixed defect: used to be a defect but ok now
  • 7: reversible defect: no proper blood movement when exercising

target: have disease or not (1=yes, 0=no) (= the predicted attribute)

  • The next step is viewing the dataset; for that we use the df.head() function from the pandas library. (By default, it shows the top 5 rows of the dataset. A user-defined number can be passed to the function, e.g. df.head(10).)
df.head()
  • Checking for null values in the dataset (null values can affect the accuracy of the model, so such values need to be handled)
df.isnull().sum()

As every feature returned 0, it means there are no null values in the dataset!
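Had any null values been present, one of the two common fixes sketched below could be applied; which option is appropriate depends on the data, so this is only an illustration and not a step from the original project.

#option 1: drop every row that contains at least one null value
df_clean = df.dropna()
#option 2: fill nulls in numeric columns with the column mean
df_filled = df.fillna(df.mean(numeric_only=True))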

Visuals always have a great impact on human eyes!


Data visualization plays an important role throughout data science.

It is also very important throughout the analysis and prediction process.

Visuals can be used both before and after applying ML algorithms.

When applied before, they help to understand the data and identify proper features.

After predictions, they can help to understand and compare crucial details like execution time and accuracy.

So, let’s visualize our dataset!

Before visualizing, we need to make some changes to the values. The reason is that, while applying algorithms and analysing data, the values in the input dataset are numeric, but for visualizing we need to convert that numeric data into human-understandable labels. (For more clarification, see the explanation of the features above.)

  • Conversion for visualizing data (a simple dictionary is defined, and values are swapped using df['column_name'] = df['column_name'].replace(conversion_dict))
#for target
conversion_dict = {1 : 'isHeartPatient', 0 : 'isNotHeartPatient'}
df['target'] = df['target'].replace(conversion_dict)
#for sex
conversion_dict = {1 : 'Male', 0 : 'Female'}
df['sex'] = df['sex'].replace(conversion_dict)
#for cp
conversion_dict = {0 : 'Typical', 1 : 'Atypical', 2 : 'Non-anginal', 3 : 'Asymptomatic'}
df['cp'] = df['cp'].replace(conversion_dict)
#for fbs
conversion_dict = {1 : 'fbs > 120 mg/dl', 0 : 'fbs < 120 mg/dl'}
df['fbs'] = df['fbs'].replace(conversion_dict)
#for exang
conversion_dict = {1 : 'induced angina', 0 : 'not induced angina'}
df['exang'] = df['exang'].replace(conversion_dict)
df.head()
  • Plotting various attributes against heart disease status

Sex vs Heart Disease Patients

df_plot = df.groupby(['target', 'sex']).size().reset_index().pivot(columns='target', index='sex', values=0)
df_plot.plot(kind='bar', stacked=True, color=['skyblue','orange'])
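As a side note (not part of the original code), the same stacked counts can be built with pd.crosstab, which some readers may find easier to follow:

pd.crosstab(df['sex'], df['target']).plot(kind='bar', stacked=True, color=['skyblue','orange'])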

Chest Pain Type vs Heart Disease Patients

df_plot = df.groupby(['target', 'cp']).size().reset_index().pivot(columns='target', index='cp', values=0)
df_plot.plot(kind='bar', stacked=True, color=['yellowgreen','violet'])

Fasting Blood Sugar vs Heart Disease Patients

df_plot = df.groupby(['target', 'fbs']).size().reset_index().pivot(columns='target', index='fbs', values=0)
df_plot.plot(kind='bar', stacked=True, color=['orange', 'yellowgreen'])

Exercise Induced Angina vs Heart Disease Patients

df_plot = df.groupby(['target', 'exang']).size().reset_index().pivot(columns='target', index='exang', values=0)
df_plot.plot(kind='bar', stacked=True, color=['gold', 'violet'])
  • Plotting the distribution of various attributes

thalach : maximum heart rate achieved

sns.distplot(df['thalach'],kde=True,bins=30,color='green')

chol : serum cholestoral in mg/dl

sns.distplot(df['chol'],kde=True,bins=30,color='red')

trestbps: resting blood pressure (in mm Hg on admission to the hospital)

sns.distplot(df['trestbps'],kde=True,bins=30,color='blue')

Number of people who have heart disease according to age

plt.figure(figsize=(15,6))
sns.countplot(x='age',data = df, hue = 'target',palette='cubehelix')
  • Now that the visualizations are done, we revert to numeric values for the predictions.
#for target
conversion_dict = {'isHeartPatient' : 1, 'isNotHeartPatient' : 0}
df['target'] = df['target'].replace(conversion_dict)
#for sex
conversion_dict = {'Male' : 1,'Female' : 0}
df['sex'] = df['sex'].replace(conversion_dict)
#for cp
conversion_dict = {'Typical' : 0,'Atypical' : 1,'Non-anginal' : 2,'Asymptomatic' : 3}
df['cp'] = df['cp'].replace(conversion_dict)
#for fbs
conversion_dict = {'fbs > 120 mg/dl' : 1, 'fbs < 120 mg/dl' : 0}
df['fbs'] = df['fbs'].replace(conversion_dict)
#for exang
conversion_dict = {'induced angina' : 1,'not induced angina' : 0}
df['exang'] = df['exang'].replace(conversion_dict)
df.head()
  • Next step is to split the dataset into a train and test set for making predictions.
x = df.drop('target',axis=1)
y = df['target']

x takes all features as input except the 'target' column, which holds the outcome for that specific row. y takes only the 'target' column.

  • This data is given as input to the function train_test_split() from sklearn library.

It takes x and y as input; test_size=0.20 means 20% of the whole dataset is allocated as test data and the rest is train data.

In layman's terms, train data is the data the algorithm learns from, and test data is the data on which the resulting model is evaluated. This way the predictions are checked for accuracy, and finally an accuracy score is generated.

Output of train_test_split() is stored in 4 parts (x_train, x_test, y_train, y_test).

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state=42)
  • The next step is preprocessing: scaling the features

StandardScaler is used to standardize features by removing the mean and scaling to unit variance. The idea behind StandardScaler is that it transforms your data such that its distribution has a mean of 0 and a standard deviation of 1.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(x_train)

x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
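As a quick sanity check (not shown in the original walkthrough), the scaled training data should now have per-feature means close to 0 and standard deviations close to 1:

print(x_train.mean(axis=0).round(2)) #approximately 0 for every feature
print(x_train.std(axis=0).round(2)) #approximately 1 for every feature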

Now as we are done with Data Cleaning and Preprocessing, we can apply ML Algorithms to our dataset!

  • Importing the required modules from the libraries imported above
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn import metrics
  • 1. Applying Logistic Regression
from sklearn.linear_model import LogisticRegression
  • Assign an instance of LogisticRegression() from sklearn.linear_model to logreg
logreg = LogisticRegression()
  • Now, x_train and y_train are passed to the function logreg.fit() from the sklearn library
logreg.fit(x_train, y_train)
  • Finally, prediction is performed on the x_test data using the function logreg.predict() from the sklearn library

begin = time.time() stores the starting time and end = time.time() stores the ending time. Taking the difference gives the execution time of the model.

begin = time.time()
y_pred = logreg.predict(x_test)
end = time.time()
lrExecTime = end - begin
print('Execution Time taken by LR : ',lrExecTime)
  • Printing predicted output
y_pred
  • Printing Confusion Matrix

Below is the explanation for output.

true positives (TP): These are cases in which we predicted yes (they have the disease), and they do have the disease.

true negatives (TN): We predicted no, and they don’t have the disease.

false positives (FP): We predicted yes, but they don’t actually have the disease. (Also known as a “Type I error.”)

false negatives (FN): We predicted no, but they actually do have the disease. (Also known as a “Type II error.”)

print(confusion_matrix(y_test, y_pred))
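For a binary target coded as 0/1, scikit-learn arranges this matrix with actual classes as rows and predicted classes as columns, so the four counts can be unpacked directly:

#rows are actual (0, 1), columns are predicted (0, 1)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('TN:', tn, ' FP:', fp, ' FN:', fn, ' TP:', tp)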
  • Printing Classification Report

The Precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The Recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.

The Support is the number of occurrences of each class in y_true.

print(classification_report(y_test, y_pred))
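The same quantities can also be computed individually; here is a minimal sketch using the corresponding scikit-learn helpers:

from sklearn.metrics import precision_score, recall_score, f1_score
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 score:', f1_score(y_test, y_pred))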
  • Printing Accuracy Score
print("Accuracy: ",metrics.accuracy_score(y_test, y_pred))
print('Accuracy Score: ',accuracy_score(y_test,y_pred))
lrAccuracy = round(accuracy_score(y_test,y_pred),5)*100
print('Using Logistic Regression we get an accuracy score of: ',
lrAccuracy,'%')

Logistic Regression gave 85.246% accuracy in 0.0004436969757080078 seconds.

  • 2. Applying Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
  • Assign an instance of GaussianNB() from sklearn.naive_bayes to gnb
gnb = GaussianNB()
  • Now, x_train and y_train are passed to the function gnb.fit() from the sklearn library
gnb.fit(x_train, y_train)
  • Finally, prediction is performed on the x_test data using the function gnb.predict() from the sklearn library
begin = time.time()
y_pred = gnb.predict(x_test)
end = time.time()
gnbExecTime = end - begin
print('Execution Time taken by GNB : ',gnbExecTime)
  • Printing predicted output
y_pred
  • Printing Confusion Matrix
print(confusion_matrix(y_test, y_pred))
  • Printing Classification Report
print(classification_report(y_test, y_pred))
  • Printing Accuracy Score
print("Accuracy: ",metrics.accuracy_score(y_test, y_pred))
print('Accuracy Score: ',accuracy_score(y_test,y_pred))
gnbAccuracy = round(accuracy_score(y_test,y_pred),5)*100
print('Using Gaussian Naive Bayes we get an accuracy score of: ',
gnbAccuracy,'%')

Gaussian Naive Bayes gave 86.885% accuracy in 0.0006594657897949219 seconds.

  • 3. Applying K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
  • Assign an instance of KNeighborsClassifier(n_neighbors=7) from sklearn.neighbors to classifier (the choice of 7 neighbors was finalized by trying different values and picking the most optimal one; a small sketch of such a search is shown after the code below).
classifier = KNeighborsClassifier(n_neighbors=7)
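A minimal sketch of how such a search over k might look; the range of values tried here (1 to 20) is an assumption, since the article does not list the candidates that were actually tested.

#try several values of k and keep the one with the highest test accuracy
best_k, best_score = None, 0
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    score = knn.score(x_test, y_test)
    if score > best_score:
        best_k, best_score = k, score
print('Best k:', best_k, 'with accuracy:', best_score)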
  • Now, x_train and y_train are passed to the function classifier.fit() from the sklearn library
classifier.fit(x_train, y_train)
  • Finally, prediction is performed on the x_test data using the function classifier.predict() from the sklearn library
begin = time.time()
y_pred = classifier.predict(x_test)
end = time.time()
knnExecTime = end - begin
print('Execution Time taken by KNN : ',knnExecTime)
  • Printing predicted output
y_pred
  • Printing Confusion Matrix
print(confusion_matrix(y_test, y_pred))
  • Printing Classification Report
print(classification_report(y_test, y_pred))
  • Printing Accuracy Score
print('Accuracy Score: ',accuracy_score(y_test,y_pred))
knnAccuracy = round(accuracy_score(y_test,y_pred),5)*100
print('Using k-NN we get an accuracy score of: ',
knnAccuracy,'%')
K-Nearest Neighbors gave 91.803% accuracy in 0.004843473434448242 seconds.
  • Below is a comparison of Execution Time for all 3 models!
labels = 'LR','GNB','KNN'
values = [lrExecTime,gnbExecTime,knnExecTime]
plt.bar(labels,values,color=['violet','skyblue','yellowgreen'])
plt.title('Execution Time Comparison')
plt.show()
  • Below is a comparison of Accuracy Score for all 3 models!
labels = 'LR','GNB','KNN'
values = [lrAccuracy,gnbAccuracy,knnAccuracy]
plt.bar(labels,values,color=['skyblue','gold','violet'])
plt.title('Accuracy Comparison')
plt.show()

The best accuracy was obtained by K-Nearest Neighbors: 91.803% in 0.004843473434448242 seconds.


So, this was my overall explanation of my journey and first experience with data science!

Lastly, I would conclude with Tim Berners-Lee’s quote :

Data is a precious thing and will last longer than the systems themselves!

🔗GitHub Link to Project!

🔗Kaggle Link to Project!
