The SMS Spam Collection is a set of tagged SMS messages that have been collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as ham (legitimate) or spam.
Content
The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.
This corpus was collected from free or free-for-research sources on the Internet:
-> A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the actual spam message received. Identifying the text of spam messages in these claims is a hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: [Web Link].
-> A subset of 3,375 randomly chosen SMS ham messages from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the University. These messages were collected from volunteers who were made aware that their contributions would be made publicly available. The NUS SMS Corpus is available at: [Web Link].
-> A list of 450 SMS ham messages collected from Caroline Tag's PhD Thesis, available at [Web Link].
-> Finally, we have incorporated the SMS Spam Corpus v.0.1 Big. It has 1,002 SMS ham messages and 322 spam messages and it is publicly available at: [Web Link]. This corpus has been used in a number of academic studies.
Acknowledgements
The original dataset can be found here. The creators ask that, if you find the dataset useful, you reference the paper below and the web page http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ in your papers, research, etc.
We offer a comprehensive study of this corpus in the following paper. This work presents a number of statistics, studies and baseline results for several machine learning methods.
Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.
Inspiration
Can you use this dataset to build a prediction model that will accurately classify which texts are spam?
Goal
Let's find out which texts are spam with NLP!!
🔍 About NLP
NLP (Natural Language Processing) is a branch of AI whose goal is to make machines capable of understanding and producing human language. NLP has been around for decades, but it has recently seen an explosion in popularity thanks to pre-trained models (PTMs), which can be implemented with minimal effort and time on the part of NLP developers. This post will introduce you to different types of pre-trained machine learning models for NLP and discuss their usage in real-world examples.
T5
BERT
GPT
➕ Import Libraries
numpy for linear algebra
pandas for data processing and CSV file I/O
matplotlib for visualization
seaborn for statistical data visualization
In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style("whitegrid")
plt.style.use("fivethirtyeight")
🔭 Representing text as numerical data
📌 From the scikit-learn documentation:
Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length.
So, we have to convert the text into a matrix of token counts using CountVectorizer!
In [2]:
# example text for model training (SMS messages)
simple_train = ['call you tonight', 'Call me a cab', 'Please call me... PLEASE!']
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
In [4]:
# learn the 'vocabulary' of the training data (occurs in-place)
vect.fit(simple_train)
Out[4]:
CountVectorizer()
In [5]:
# examine the fitted vocabulary
vect.get_feature_names_out()
In this scheme, features and samples are defined as follows:
Each individual token occurrence frequency (normalized or not) is treated as a feature.
The vector of all the token frequencies for a given document is considered a multivariate sample.
A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.
We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
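The cell that actually builds the training document-term matrix is not shown above, but the later cells rely on it. A minimal sketch of the omitted step (assuming the vect fitted above):
# build the document-term matrix from the fitted vocabulary
simple_train_dtm = vect.transform(simple_train)
# dense view: one row per document, one column per vocabulary term
simple_train_dtm.toarray()
# line up the counts with the vocabulary for readability
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names_out())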
In [10]:
# check the type of the document-term matrix
type(simple_train_dtm)
Out[10]:
scipy.sparse.csr.csr_matrix
In [11]:
# examine the sparse matrix contents
print(simple_train_dtm)
As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).
For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.
In order to be able to store such a matrix in memory but also to speed up operations, implementations will typically use a sparse representation such as the implementations available in the scipy.sparse package.
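As a quick illustration on the tiny example above (a sketch, assuming simple_train_dtm from the earlier step):
# only the non-zero entries are actually stored in the CSR matrix
print(simple_train_dtm.shape)   # (documents, vocabulary terms)
print(simple_train_dtm.nnz)     # number of stored non-zero values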
In [12]:
# example text for model testing
simple_test = ["please don't call me"]
In order to make a prediction, the new observation must have the same features as the training observations, both in number and meaning.
In [13]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()
Out[13]:
array([[0, 1, 1, 1, 0, 0]])
In [14]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names_out())
Out[14]:
   cab  call  me  please  tonight  you
0    0     1   1       1        0    0
📋 Summary:
vect.fit(train) learns the vocabulary of the training data.
vect.transform(train) uses the fitted vocabulary to build a document-term matrix from the training data.
vect.transform(test) uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before).
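For example, the fitted vocabulary is just a token-to-column-index mapping (a small sketch):
# mapping learned by fit(); transform() simply ignores any token not in this dict
print(vect.vocabulary_)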
💾 Reading a text-based dataset into pandas
In [15]:
# read file into pandas using a relative path
sms = pd.read_csv("/kaggle/input/sms-spam-collection-dataset/spam.csv", encoding='latin-1')
# drop the empty extra columns, keeping only the label and message
sms.dropna(how="any", inplace=True, axis=1)
sms.columns = ['label', 'message']
sms.head()
Out[15]:
  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
🔍 Exploratory Data Analysis (EDA)
In [16]:
sms.describe()
Out[16]:
       label                 message
count   5572                    5572
unique     2                    5169
top      ham  Sorry, I'll call later
freq    4825                      30
The label column contains only two unique values: ham and spam.
In [17]:
sms.groupby('label').describe()
Out[17]:
       count  unique                                                 top  freq
label
ham     4825    4516                              Sorry, I'll call later    30
spam     747     653  Please call our customer service representativ...       4
There are 4,825 ham rows and 747 spam rows.
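A quick class-balance plot (a sketch that reuses the seaborn import from the setup cell):
# visualize the imbalance between ham and spam messages
sns.countplot(x='label', data=sms)
plt.title('Ham vs. spam message counts')
plt.show()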
In [18]:
# convert label to a numerical variable
# ham to 0 spam to 1
sms['label_num'] = sms.label.map({'ham':0, 'spam':1})
sms.head()
Out[18]:
  label                                            message  label_num
0   ham  Go until jurong point, crazy.. Available only ...          0
1   ham                      Ok lar... Joking wif u oni...           0
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...          1
3   ham  U dun say so early hor... U c already then say...          0
4   ham  Nah I don't think he goes to usf, he lives aro...          0
In [19]:
# add a column with the length of each message
sms['message_len'] = sms.message.apply(len)
sms.head()
Out[19]:
  label                                            message  label_num  message_len
0   ham  Go until jurong point, crazy.. Available only ...          0          111
1   ham                      Ok lar... Joking wif u oni...           0           29
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...          1          155
3   ham  U dun say so early hor... U c already then say...          0           49
4   ham  Nah I don't think he goes to usf, he lives aro...          0           61
👁️ Visualizing the length difference between ham and spam messages
We can see that spam messages are usually longer. I think this is because most spam tries to pack a lot of information into a single message.
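One quick way to check this is to compare the length statistics of both classes side by side (a minimal sketch):
# length statistics per class, plus per-class histograms of message length
print(sms.groupby('label').message_len.describe())
sms.hist(column='message_len', by='label', bins=50, figsize=(12, 4))
plt.show()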
In [21]:
sms[sms.label=='ham'].describe()
Out[21]:
       label_num  message_len
count     4825.0  4825.000000
mean         0.0    71.023627
std          0.0    58.016023
min          0.0     2.000000
25%          0.0    33.000000
50%          0.0    52.000000
75%          0.0    92.000000
max          0.0   910.000000
Let's find the message with the longest length (== 910)!
In [22]:
sms[sms.message_len == 910].message.iloc[0]
Out[22]:
"For me the love should start with attraction.i should feel that I need her every time around me.she should be the first thing which comes in my thoughts.I would start the day and end it with her.she should be there every time I dream.love will be then when my every breath has her name.my life should happen around her.my life will be named to her.I would cry for her.will give all my happiness and take all her sorrows.I will be ready to fight with anyone for her.I will be in love when I will be doing the craziest things for her.love will be when I don't have to proove anyone that my girl is the most beautiful lady on the whole planet.I will always be singing praises for her.love will be when I start up making chicken curry and end up makiing sambar.life will be the most beautiful then.will get every morning and thank god for the day because she is with me.I would like to say a lot..will tell later.."
📑 Text Pre-processing
Our main issue with our data is that it is all in text format (strings). The classification algorithms that we usually use need some sort of numerical feature vector in order to perform the classification task. There are actually many methods to convert a corpus to a vector format. The simplest is the bag-of-words approach, where each unique word in a text is represented by one number.
In this section we'll convert the raw messages (sequence of characters) into vectors (sequences of numbers).
As a first step, let's write a function that will split a message into its individual words and return a list. We'll also remove very common words ('the', 'a', etc.). To do this we will take advantage of the NLTK library. It's pretty much the standard library in Python for processing text and has a lot of useful features. We'll only use some of the basic ones here.
Let's create a function that will process the string in the message column; then we can just use apply() in pandas to process all the text in the DataFrame.
import string
from nltk.corpus import stopwords  # for removing stopwords
def text_process(mess):
"""
Takes in a string of text, then performs the following:
1. Remove all punctuation
2. Remove all stopwords
3. Returns the cleaned text as a single string
"""
STOPWORDS = stopwords.words('english') + ['u', 'ü', 'ur', '4', '2', 'im', 'dont', 'doin', 'ure']
# Check characters to see if they are in punctuation
nopunc = [char for char in mess if char not in string.punctuation]
# Join the characters again to form the string.
nopunc = ''.join(nopunc)
# Now just remove any stopwords
return ' '.join([word for word in nopunc.split() if word.lower() not in STOPWORDS])
In [24]:
sms.head()
Out[24]:
  label                                            message  label_num  message_len
0   ham  Go until jurong point, crazy.. Available only ...          0          111
1   ham                      Ok lar... Joking wif u oni...           0           29
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...          1          155
3   ham  U dun say so early hor... U c already then say...          0           49
4   ham  Nah I don't think he goes to usf, he lives aro...          0           61
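The cells below rely on a clean_msg column that is not created in any cell shown above; a minimal sketch of the omitted step, applying text_process to every message (this assumes the NLTK stopwords corpus is available, as in the function above):
# apply the cleaning function to every message and store the result
sms['clean_msg'] = sms.message.apply(text_process)
sms.head()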
The cleaned messages are noticeably shorter than the originals.
📊 Let's Tokenize!!
We will use Python's collections.Counter to count how often each word occurs in ham and spam messages.
from collections import Counter
words = sms[sms.label=='ham'].clean_msg.apply(lambda x: [word.lower() for word in x.split()])
ham_words = Counter()
for msg in words:
ham_words.update(msg)
print(ham_words.most_common(50))
words = sms[sms.label=='spam'].clean_msg.apply(lambda x: [word.lower() for word in x.split()])
spam_words = Counter()
for msg in words:
spam_words.update(msg)
print(spam_words.most_common(50))
Currently, we have the messages as lists of tokens (also known as lemmas), and now we need to convert each of those messages into a vector that scikit-learn's models can work with.
We'll do that in three steps using the bag-of-words model:
Count how many times each word occurs in each message (known as term frequency)
Weigh the counts, so that frequent tokens get lower weight (inverse document frequency)
Normalize the vectors to unit length, to abstract from the original text length (L2 norm)
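These three steps can also be combined into a single transformer. A minimal sketch using TfidfVectorizer, shown here only for illustration (the cells below instead use CountVectorizer followed by TfidfTransformer, which is equivalent):
from sklearn.feature_extraction.text import TfidfVectorizer
# tokenize, count, apply inverse-document-frequency weighting and L2-normalize in one step
tfidf_vect = TfidfVectorizer()
X_tfidf = tfidf_vect.fit_transform(sms.clean_msg)
print(X_tfidf.shape)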
In [30]:
# how to define X and y (from the SMS data) for use with CountVectorizer
X = sms.clean_msg
y = sms.label_num
print(X.shape)
print(y.shape)
(5572,)
(5572,)
In [31]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(4179,)
(1393,)
(4179,)
(1393,)
In [32]:
from sklearn.feature_extraction.text import CountVectorizer
# instantiate the vectorizer
vect = CountVectorizer()
vect.fit(X_train)
Out[32]:
CountVectorizer()
In [33]:
# learn training data vocabulary, then use it to create a document-term matrix
X_train_dtm = vect.transform(X_train)
In [34]:
# equivalently: combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)
In [35]:
# examine the document-term matrix
X_train_dtm
Out[35]:
<4179x7996 sparse matrix of type '<class 'numpy.int64'>'
with 34796 stored elements in Compressed Sparse Row format>
In [36]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm
Out[36]:
<1393x7996 sparse matrix of type '<class 'numpy.int64'>'
with 9971 stored elements in Compressed Sparse Row format>
In [37]:
from sklearn.feature_extraction.text import TfidfTransformer
# Tf means term-frequency while tf-idf means term-frequency times inverse document-frequency. This is a common term weighting scheme in information retrieval, that has also found good use in document classification.
tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(X_train_dtm)
tfidf_transformer.transform(X_train_dtm)
Out[37]:
<4179x7996 sparse matrix of type '<class 'numpy.float64'>'
with 34796 stored elements in Compressed Sparse Row format>
🤖 Building and evaluating a model
We will use multinomial Naive Bayes:
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
In [38]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
In [39]:
# train the model using X_train_dtm (timing it with an IPython "magic command")
%time nb.fit(X_train_dtm, y_train)
CPU times: user 3.14 ms, sys: 193 µs, total: 3.33 ms
Wall time: 3.27 ms
Out[39]:
MultinomialNB()
In [40]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)
In [41]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)
Out[41]:
0.9827709978463748
In [42]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)
Out[42]:
array([[1205, 8],
[ 16, 164]])
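Because the classes are imbalanced, accuracy alone can be misleading. A quick sketch of precision and recall for the spam class, using the metrics module already imported:
# precision: of the messages predicted as spam, how many really are spam
print(metrics.precision_score(y_test, y_pred_class))
# recall: of the actual spam messages, how many were caught
print(metrics.recall_score(y_test, y_pred_class))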
In [43]:
X_test.shape
Out[43]:
(1393,)
In [44]:
# print message text for false positives (ham incorrectly classified as spam)
# X_test[(y_pred_class==1) & (y_test==0)]
X_test[y_pred_class > y_test]
Out[44]:
2418 Madamregret disturbancemight receive reference...
4598 laid airtel line rest
386 Customer place call
1289 HeyGreat dealFarm tour 9am 5pm 95pax 50 deposi...
5094 Hi ShanilRakhesh herethanksi exchanged uncut d...
494 free nowcan call
759 Call youcarlos isare phones vibrate acting mig...
3140 Customer place call
Name: clean_msg, dtype: object
In [45]:
# print message text for false negatives (spam incorrectly classified as ham)
X_test[y_pred_class < y_test]
Out[45]:
4674 Hi babe Chloe r smashed saturday night great w...
3528 Xmas New Years Eve tickets sale club day 10am ...
3417 LIFE never much fun great came made truly spec...
2773 come takes little time child afraid dark becom...
1960 Guess Somebody know secretly fancies Wanna fin...
5 FreeMsg Hey darling 3 weeks word back Id like ...
2078 85233 FREERingtoneReply REAL
1457 CLAIRE havin borin time alone wanna cum 2nite ...
190 unique enough Find 30th August wwwareyouunique...
2429 Guess IThis first time created web page WWWASJ...
3057 unsubscribed services Get tons sexy babes hunk...
1021 Guess Somebody know secretly fancies Wanna fin...
4067 TBSPERSOLVO chasing us since Sept forå£38 defi...
3358 Sorry missed call lets talk time 07090201529
2821 ROMCAPspam Everyone around responding well pre...
2247 Back work 2morro half term C 2nite sexy passio...
Name: clean_msg, dtype: object
In [46]:
# example of false negative
X_test[5]
Out[46]:
'FreeMsg Hey darling 3 weeks word back Id like fun still Tb ok XxX std chgs send å£150 rcv'
We will compare multinomial Naive Bayes with logistic regression:
Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
In [53]:
# import and instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='liblinear')
In [54]:
# train the model using X_train_dtm
%time logreg.fit(X_train_dtm, y_train)
CPU times: user 14.6 ms, sys: 919 µs, total: 15.5 ms
Wall time: 14.8 ms
Out[54]:
LogisticRegression(solver='liblinear')
In [55]:
# make class predictions for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)