BREAST CANCER DETECTION USING MACHINE LEARNING

In this blog, we are going to learn the following things:
  • How to predict whether a breast tumour is malignant (cancerous) or benign by using machine learning
  • Uploading our data using ipywidgets
  • Data understanding and visualization
  • Plotting various kinds of charts, such as heatmaps and pair plots, using seaborn and matplotlib
  • Different machine learning techniques like RandomForestClassifier, KNeighborsClassifier, SVC

The first thing we need is the data on which the machine learning will be done. You can download it from Kaggle.





The parameters given in our dataset and their meanings:
Attribute Information:

1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant
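
To make the field numbering above concrete, here is a small sketch that builds the 30 feature names from the ten base features and the three statistics. The snake_case names and the _mean, _se and _worst suffixes are an assumption for illustration, so check them against the header of your own CSV.

```python
# Ten base features, in the order listed above (names are illustrative).
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
statistics = ["mean", "se", "worst"]

# Fields 1 and 2 are the ID and the diagnosis, so features start at field 3.
fields = {}
field_number = 3
for stat in statistics:
    for feature in base_features:
        fields[field_number] = f"{feature}_{stat}"
        field_number += 1

print(len(fields))                         # number of feature fields
print(fields[3], fields[13], fields[23])   # the three radius fields
```

Each statistic fills a block of ten consecutive fields, which is why the three radius values sit exactly ten fields apart.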

Since we are not doctors, all you need to know is that malignant means the tumour is cancerous and benign means it is not.

The first thing we have to do is import the modules that we are going to use (we will import the machine learning modules later on).
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd


Upload our data into Jupyter Notebook

import ipywidgets as widgets
from IPython.display import display

uploader = widgets.FileUpload(
    accept='.csv',   # accepted file extension, e.g. '.txt', '.pdf', 'image/*', 'image/*,.pdf'
    multiple=False   # True to accept multiple files, else False
)

display(uploader)

We will see an upload button once we run this code.

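Show our data
import io

# On ipywidgets 7.x, uploader.value is a dict keyed by file name (8.x changed it to a tuple of dicts)
input_file = list(uploader.value.values())[0]
content = input_file['content']                  # raw bytes of the uploaded file
content = io.StringIO(content.decode('utf-8'))   # decode to text and wrap it as a file-like object
df = pd.read_csv(content)
df.head()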
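
If the decode step looks mysterious, this sketch reproduces it without the upload widget. The bytes literal is a made-up stand-in for the uploaded file's content, not real rows from the dataset, and demo_df is just an illustrative name.

```python
import io
import pandas as pd

# Hypothetical stand-in for input_file['content']: raw CSV bytes.
raw_bytes = b"id,diagnosis,radius_mean\n1,M,17.5\n2,B,12.1\n"

text = raw_bytes.decode("utf-8")   # bytes -> str
buffer = io.StringIO(text)         # str -> file-like object
demo_df = pd.read_csv(buffer)      # read_csv accepts file-like objects

print(demo_df.head())
```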

The code above shows the first five rows of our data in the Jupyter notebook, with every parameter and the values it contains.
Note: We have stored our data in a variable called df, but you can change the variable name as you like.

After that we will type df.describe(). For each parameter it shows the count of values, mean, standard deviation, minimum, 25%, 50% (the median), 75% and maximum. It tells us how spread out our data is.
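
Here is a minimal sketch of what that summary table looks like. The four values are made up, only to show the layout, and toy is an illustrative name.

```python
import pandas as pd

# Made-up values, only to show the layout of the summary table.
toy = pd.DataFrame({"radius_mean": [10.0, 12.0, 14.0, 20.0]})

summary = toy.describe()
print(summary)

# The 50% row is the median of the column.
print(summary.loc["50%", "radius_mean"] == toy["radius_mean"].median())
```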


df.shape # tells how many rows and columns are present in our data; in our case it is (569, 33)

df.info() # shows the data types, the non-null count of each column, the memory used by our data, and the number of rows and columns

df.isnull().sum() # counts the null values in each column. Note: null values are useless in this case and we need to remove them

After running the last command you will observe that the Unnamed: 32 column contains null values, so we will drop this column.

df = df.drop(columns='Unnamed: 32') # drop the empty column
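
As a variation, a column that is completely empty can also be dropped without typing its name. This is only a sketch on a toy frame with made-up values, using pandas' dropna with axis=1 and how='all'.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2, 3],
    "radius_mean": [17.5, 12.1, 14.3],
    "Unnamed: 32": [np.nan, np.nan, np.nan],   # entirely empty column
})

print(toy.isnull().sum())

# Drop every column in which all values are null.
cleaned = toy.dropna(axis=1, how="all")
print(list(cleaned.columns))
```

Columns that are only partly empty are kept by how='all', so this will not silently throw away real measurements.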


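Data correlation (in simple terms, how strongly one parameter moves together with another). We will use a heatmap to visualise it in colour.
corr = df.corr(numeric_only=True) # diagnosis is still text here, so skip non-numeric columns (numeric_only needs pandas 1.5 or newer)
plt.figure(figsize=(20,10))
sns.heatmap(corr, annot=True)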
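
A heatmap of 30 features is hard to read by eye, so it can help to list the strongly correlated pairs as text. The sketch below uses toy data (a perimeter derived from the radius, and an unrelated texture column), and the 0.9 threshold is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
radius = rng.uniform(5, 25, size=200)

# Toy data: perimeter is derived from radius, texture is unrelated noise.
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius,
    "texture": rng.normal(20, 4, size=200),
})

corr = toy.corr()

# Collect each pair once (upper triangle) when |correlation| is above 0.9.
strong_pairs = [
    (a, b, round(corr.loc[a, b], 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.9
]
print(strong_pairs)
```

On the real data the same loop can be run on the corr variable computed above.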


As we know, our diagnosis column contains two values: M = malignant, B = benign. But machine learning algorithms do not understand labels like M and B, so we have to convert them into integers before we can train the model (the parameters and the goal of this project are already defined at the beginning).

# one hot encoding
df = pd.get_dummies(data=df, drop_first=True) # replaces diagnosis with a single diagnosis_M column at the end of df (1/True = malignant, 0/False = benign)
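
To see what the encoding does, here is a sketch on four made-up labels. It also shows an explicit mapping as an alternative, which is not what the code above uses but makes the meaning of 1 and 0 visible.

```python
import pandas as pd

toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"]})

# Option 1: get_dummies with drop_first=True keeps only the diagnosis_M column.
dummies = pd.get_dummies(toy, drop_first=True)
print(list(dummies.columns))

# Option 2: an explicit mapping, which makes the meaning of 1 and 0 visible.
mapped = toy["diagnosis"].map({"M": 1, "B": 0})
print(mapped.tolist())

# Both encodings agree (astype(int) because newer pandas returns booleans).
print((dummies["diagnosis_M"].astype(int) == mapped).all())
```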

Just in case you want to plot more graphs and visualise the data, this simple command will do all the work and also create a report for you (my favourite command).

from pandas_profiling import ProfileReport # newer releases of this package are published as ydata-profiling
df.profile_report()


We can also make a pair plot in our Jupyter notebook by using this simple command

sns.pairplot(df)


We have done enough visualisation; now it is show time for our machine learning.
Keep in mind that this is a supervised machine learning classification project.

First we have to create the arrays that hold our features and our target
x = df.iloc[:,1:-1].values # all rows, every column except id (the first) and the diagnosis column (the last)
y = df.iloc[:,-1].values # all rows of the diagnosis column only
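
Selecting by position only works while id stays first and the diagnosis column stays last. As a sketch, the same arrays can be built by column name. The toy frame below has made-up values and only assumes the column names id and diagnosis_M.

```python
import pandas as pd

# Toy frame shaped like df after the encoding step (values are made up).
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "radius_mean": [17.5, 12.1, 14.3],
    "texture_mean": [10.4, 17.8, 21.2],
    "diagnosis_M": [1, 0, 1],
})

# Positional selection, as used above.
x_pos = toy.iloc[:, 1:-1].values
y_pos = toy.iloc[:, -1].values

# Name-based selection: does not depend on column order.
x_named = toy.drop(columns=["id", "diagnosis_M"]).values
y_named = toy["diagnosis_M"].values

print(x_pos.shape, y_pos.shape)
print((x_pos == x_named).all(), (y_pos == y_named).all())
```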


Splitting the dataset into training data and testing data. Notice that test_size is 0.2, which means 20% of all our data will be used as testing data. random_state=0 means that whenever we run the code we will get the same split. The four outputs are:
  • x_train: the features used to train the model
  • y_train: the results (diagnoses) that belong to x_train
  • x_test: the features held back for testing
  • y_test: the true results for x_test, which we compare with the model's predictions to see how accurate it is


from sklearn.model_selection import train_test_split 
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0)
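
To check what the split produces, this sketch runs it on placeholder arrays with the same shape as our data. The stratify argument is an optional extra that the line above does not use; it keeps the benign/malignant ratio the same in both parts. The names x_demo, y_demo, x_tr and so on are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the dataset's shape: 569 rows, 30 features,
# 357 benign (0) and 212 malignant (1). The feature values are zeros.
x_demo = np.zeros((569, 30))
y_demo = np.array([0] * 357 + [1] * 212)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(len(x_tr), len(x_te))   # rows in the training and test sets
print(y_te.mean())            # share of malignant cases in the test set
```

The test set has 114 rows, which is the total support you will see in the classification reports further down.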

The first machine learning model we are going to use is RandomForestClassifier


from sklearn.ensemble import RandomForestClassifier
rfc=RandomForestClassifier()
rfc.fit(x_train,y_train)


Here we have stored our RandomForestClassifier in a variable called rfc. The call rfc.fit(x_train, y_train) is where all the magic of machine learning happens: the model learns from the training data.
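
Once a forest is fitted, it can also tell us which features it relied on most. The sketch below is an assumption-laden example: it uses scikit-learn's bundled copy of the dataset so that it runs without the CSV, and in that copy the labels are 0 = malignant and 1 = benign, the opposite of our diagnosis_M column. The names forest and ranked are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# scikit-learn's bundled copy of the dataset (assumption: used here so the
# sketch runs without the CSV). Its labels are 0 = malignant, 1 = benign.
data = load_breast_cancer()
x_tr, x_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0
)

forest = RandomForestClassifier(random_state=0)
forest.fit(x_tr, y_tr)

# feature_importances_ holds one score per feature; the scores sum to 1.
ranked = sorted(
    zip(data.feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked[:5]:
    print(f"{name}: {score:.3f}")
```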

After training we have to see how good our model is
y_pred = rfc.predict(x_test)
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print("Training Score: ", rfc.score(x_train, y_train)*100)

              precision    recall  f1-score   support

           0       0.98      0.97      0.98        67
           1       0.96      0.98      0.97        47

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

[[65  2]
 [ 1 46]]
Training Score:  100.0

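We can see different metrics such as precision, recall, f1-score and accuracy, along with the confusion matrix of our model. The training score of 100% against 97% accuracy on the test data shows that the random forest fits the training data perfectly and still generalises well.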
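
The numbers in the report come straight from the confusion matrix. As a sketch, here is the arithmetic for class 1 (malignant) done by hand on the matrix printed above.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
cm = [[65, 2],
      [1, 46]]

tn, fp = cm[0]   # true class 0: predicted 0, predicted 1
fn, tp = cm[1]   # true class 1: predicted 0, predicted 1

precision_1 = tp / (tp + fp)                 # of all predicted 1, how many were right
recall_1 = tp / (tp + fn)                    # of all true 1, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that were right

print(round(precision_1, 2), round(recall_1, 2), round(accuracy, 2))
# prints: 0.96 0.98 0.97
```

These match the row for class 1 and the accuracy line in the report.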


Machine Learning With KNeighborsClassifier

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=7)

knn.fit(x_train, y_train)

y_pred = knn.predict(x_test)
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print("Training Score: ", knn.score(x_train, y_train)*100)
print(knn.score(x_test, y_test)) # accuracy on the test data

              precision    recall  f1-score   support

           0       0.96      0.96      0.96        67
           1       0.94      0.94      0.94        47

    accuracy                           0.95       114
   macro avg       0.95      0.95      0.95       114
weighted avg       0.95      0.95      0.95       114

[[64  3]
 [ 3 44]]
Training Score:  93.4065934065934
0.9473684210526315
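
KNeighborsClassifier compares samples by distance, so features with large values (such as area) can dominate features with small values (such as smoothness). A common remedy, not used in the code above, is to standardise the features first. The sketch below assumes scikit-learn's bundled copy of the dataset so that it runs without the CSV; in that copy the labels are 0 = malignant and 1 = benign. The names plain_knn and scaled_knn are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
x_tr, x_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0
)

# Same model as above, with and without standardised features.
plain_knn = KNeighborsClassifier(n_neighbors=7).fit(x_tr, y_tr)
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=7)
).fit(x_tr, y_tr)

print("plain :", plain_knn.score(x_te, y_te))
print("scaled:", scaled_knn.score(x_te, y_te))
```

Putting the scaler inside a pipeline means it is fitted on the training data only, so nothing from the test data leaks into the model.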








