Redefining Cancer Treatment with Machine Learning

Photo by National Cancer Institute

Over the past decades, cancer treatment has evolved continuously, and scientists have applied different techniques to detect types of cancer before they cause symptoms. Recent years have seen many breakthroughs in medicine, along with a large amount of data becoming available to medical researchers. With more data at hand, researchers have used machine learning to identify hidden patterns in complex data and predict outcomes for different cancer types.

Given the significance of personalized medicine and the growing application of ML techniques, we will tackle one such problem: distinguishing the mutations that contribute to tumor growth (drivers) from the neutral mutations (passengers).

Currently, this interpretation of genetic mutations is done manually. It is a very time-consuming task in which a clinical pathologist reviews and classifies every single genetic mutation based on evidence from text-based clinical literature. We need to develop a machine learning algorithm that, using this knowledge base as a baseline, automatically classifies genetic variations.

Given Gene, Variation and Text as features, we need to predict the Class variable (the target). It is a multi-class classification problem, and we will measure the performance of our models with the multi-class log-loss metric.
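
For reference, multi-class log loss averages −log(predicted probability of the true class) over all points, so confidently wrong predictions are punished heavily. A tiny illustration with scikit-learn's log_loss (the numbers below are made up purely to show the mechanics):

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy example: 3 points, 3 classes (values are purely illustrative).
y_true = [0, 1, 2]
y_prob = np.array([
    [0.7, 0.2, 0.1],   # confident and correct -> small penalty
    [0.3, 0.4, 0.3],   # unsure but correct    -> moderate penalty
    [0.6, 0.3, 0.1],   # confident and wrong   -> large penalty
])
print(log_loss(y_true, y_prob))  # lower is better; 0 would be a perfect model
```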

We will read the data, perform text preprocessing, split the data into train, test and cross-validation sets, train a random model as a baseline, train different ML models, compute the log-loss and the percentage of misclassified points, and then compare everything to find the best model.

Let's start coding!

Reading the data

Our data comes in two files with different separators, so we will read each file separately and then combine them on the "ID" column.
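
A minimal sketch of this step (the file names and the "||" separator are assumptions based on the usual layout of this dataset, so adjust them to your copy):

```python
import pandas as pd

# Variants: comma-separated with ID, Gene, Variation, Class columns (assumed file name).
variants = pd.read_csv("training_variants")

# Clinical text: "ID||Text" rows separated by "||" (assumed file name and format).
text = pd.read_csv(
    "training_text",
    sep=r"\|\|",
    engine="python",
    skiprows=1,
    names=["ID", "TEXT"],
)

# Combine both files on the ID column.
data = pd.merge(variants, text, on="ID", how="left")
print(data.shape)
```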

Text-Preprocessing and Feature Engineering

After reading the data we will do text preprocessing, which involves cleaning the text: removing stopwords, removing special characters if any, normalizing the text and converting all words to lowercase. During this process we found that some rows have no text at all, so we replace those NaN values with the corresponding Gene + Variation values.
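
A rough sketch of such a cleaning step, continuing from the merged data frame above (it assumes NLTK's English stopword list; the exact cleaning rules in the original code may differ):

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def clean_text(text):
    """Lowercase, drop special characters, normalize whitespace, remove stopwords."""
    text = str(text).lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # keep only letters, digits and spaces
    text = re.sub(r"\s+", " ", text).strip()   # normalize whitespace
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

data["TEXT"] = data["TEXT"].apply(clean_text)

# Rows with no text: fall back to Gene + Variation as a stand-in.
missing = data["TEXT"].isin(["", "nan"])
data.loc[missing, "TEXT"] = data.loc[missing, "Gene"] + " " + data.loc[missing, "Variation"]
```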

We will now split our data into train, test and cross-validation sets and check whether the distribution of the target variable is the same across all three.

Why should the distributions be the same? The distribution of the target variable should match across the splits so that during training the model encounters every class in roughly the same proportion as in the full dataset.
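
A stratified split does exactly that; a sketch with scikit-learn (split sizes and the random seed are illustrative):

```python
from sklearn.model_selection import train_test_split

y = data["Class"].values
X = data.drop(columns=["Class"])

# Carve out the test set first, then split the remainder into train and cross-validation,
# stratifying on the target so all three splits share the same class distribution.
X_train_cv, X_test, y_train_cv, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_cv, y_train, y_cv = train_test_split(
    X_train_cv, y_train_cv, test_size=0.2, stratify=y_train_cv, random_state=42
)
print(len(X_train), len(X_cv), len(X_test))
```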

Distribution of target variable

We will first build a random model as a baseline so that we can compare the performance of our other models against it.

How do we compute the log-loss for a random model in a multi-class setting? For every point in our test and cross-validation data we randomly generate one number per class (9 in our problem) and then normalize them so that they sum to one.
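
A sketch of this baseline (class labels 1 to 9 and the random seed are assumptions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, log_loss

def random_probabilities(n_points, n_classes=9, seed=42):
    """Give each point random class probabilities that sum to one."""
    rng = np.random.default_rng(seed)
    probs = np.zeros((n_points, n_classes))
    for i in range(n_points):
        row = rng.random(n_classes)
        probs[i] = row / row.sum()
    return probs

labels = np.arange(1, 10)                      # classes are labelled 1..9
cv_probs = random_probabilities(len(y_cv))
test_probs = random_probabilities(len(y_test))

print("Random model CV log-loss  :", log_loss(y_cv, cv_probs, labels=labels))
print("Random model test log-loss:", log_loss(y_test, test_probs, labels=labels))
print(confusion_matrix(y_cv, np.argmax(cv_probs, axis=1) + 1, labels=labels))
```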

In the above we first created an array with one slot per class label (size 9) for every point, filled it with randomly generated probabilities, plotted the confusion matrix and computed the log-loss.

We can see that our random model has a log-loss of roughly 2.4 on both the cross-validation and test data, so we need our models to beat this. Let's also check the precision and recall matrices for this model.

How do we interpret the precision and recall matrices above?

Precision
1. Take cell (1, 1), which has a value of 0.127: of all the points predicted to be class 1, only 12.7% actually belong to class 1.

2. For original class 4 and predicted class 2, of all the points our model predicted as class 2, 23.6% actually belong to class 4.

Recall

1. Cell (1, 1) has a value of 0.079, which means that of all the points that actually belong to class 1, our model predicted only 7.9% of them as class 1.

2. For original class 8 and predicted class 5 the value is 0.250, meaning that of all the points that actually belong to class 8, our model predicted 25% of them as class 5. (The sketch below shows how these matrices are derived from the raw confusion matrix.)
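
A possible derivation, continuing from the random-model probabilities above (a sketch; the original post renders these matrices as heatmaps):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def precision_recall_matrices(y_true, y_pred, labels):
    """Column-normalized (precision) and row-normalized (recall) confusion matrices."""
    C = confusion_matrix(y_true, y_pred, labels=labels).astype(float)
    precision_matrix = C / C.sum(axis=0, keepdims=True)   # each column (predicted class) sums to 1
    recall_matrix = C / C.sum(axis=1, keepdims=True)       # each row (actual class) sums to 1
    return precision_matrix, recall_matrix

labels = np.arange(1, 10)
P, R = precision_recall_matrices(y_cv, np.argmax(cv_probs, axis=1) + 1, labels)
# P[i, j]: of all points predicted as class j+1, the fraction that actually belong to class i+1.
# R[i, j]: of all points that actually belong to class i+1, the fraction predicted as class j+1.
```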

Logistic Regression

Performance of Logistic Regression on Cross-Validation
Confusion Matrix of Logistic Regression Model
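
As a rough sketch of how such a model could be trained and scored, here is a logistic regression using only TF-IDF features of the clinical text for brevity (the full feature set also includes Gene and Variation; the vectorizer size and regularization strength are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Illustrative features: TF-IDF over the cleaned clinical text.
vectorizer = TfidfVectorizer(max_features=1000)
X_train_text = vectorizer.fit_transform(X_train["TEXT"])
X_cv_text = vectorizer.transform(X_cv["TEXT"])

clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(X_train_text, y_train)

cv_probs_lr = clf.predict_proba(X_cv_text)
cv_pred_lr = clf.predict(X_cv_text)
print("CV log-loss       :", log_loss(y_cv, cv_probs_lr, labels=clf.classes_))
print("CV % misclassified:", 100 * (cv_pred_lr != y_cv).mean())
```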

Support Vector Machine

Performance of SVM on Cross-Validation

Comparison of all the models

We can see that Logistic Regression and Support Vector Machine perform better than the other models in terms of both log-loss and the percentage of misclassified points.

Feel free to connect with me on any of the platforms.

Check out my other articles as well.
