📨 SMS Spam Collection Dataset with NLP
✒️Description
Context
The SMS Spam Collection is a set of tagged SMS messages that have been collected for SMS spam research. It contains 5,574 SMS messages in English, each tagged as ham (legitimate) or spam.
Content
The file contains one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.
This corpus was collected from free or free-for-research sources on the Internet:
- A collection of 425 SMS spam messages manually extracted from the Grumbletext Web site, a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the actual spam message received. Identifying the text of spam messages in the claims is a hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: [Web Link].
- A subset of 3,375 randomly chosen ham messages from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the University. They were collected from volunteers who were made aware that their contributions would be made publicly available. The NUS SMS Corpus is available at: [Web Link].
- A list of 450 SMS ham messages collected from Caroline Tag's PhD Thesis, available at [Web Link].
- Finally, the SMS Spam Corpus v.0.1 Big was incorporated. It has 1,002 SMS ham messages and 322 spam messages and is publicly available at: [Web Link].

This corpus has been used in the academic research cited below.
Acknowledgements
The original dataset can be found here. The creators ask that, if you find the dataset useful, you cite the paper below and the web page http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ in your papers, research, etc.
We offer a comprehensive study of this corpus in the following paper. This work presents a number of statistics, studies and baseline results for several machine learning methods.
Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.
Inspiration
Can you use this dataset to build a prediction model that will accurately classify which texts are spam?
Goal
Let's find out which texts are spam with NLP!!
🔍 About NLP
NLP (Natural Language Processing) is a branch of AI whose goal is to make machines capable of understanding and producing human language. NLP has been around for decades, but it has recently seen an explosion in popularity thanks to pre-trained models (PTMs), which can be implemented with minimal effort and time on the part of NLP developers. This blog post introduces different types of pre-trained machine learning models for NLP and discusses their usage in real-world examples; a minimal usage sketch follows the list below.
- T5
- BERT
- GPT
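As a taste of how little code a pre-trained model needs, here is a minimal sketch using the Hugging Face transformers library. This is an illustration only and an assumption on my part: the library is not used in the rest of this notebook, and the zero-shot pipeline downloads a default pre-trained model on first use.
# minimal sketch (assumes the third-party `transformers` package is installed)
from transformers import pipeline
classifier = pipeline("zero-shot-classification")  # downloads a default pre-trained model
result = classifier("WINNER!! Claim your free prize now", candidate_labels=["spam", "ham"])
print(result["labels"][0])  # the label the pre-trained model considers most likely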
➕ Import Libraries
- numpy for linear algebra
- pandas for data processing & handling CSV data
- matplotlib for visualization
- seaborn for statistical data visualization
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style("whitegrid")
plt.style.use("fivethirtyeight")
🔭 Representing text as numerical data
📌 From the scikit-learn documentation:
Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length.
So, we have to convert the text into a matrix of token counts using CountVectorizer!
# example text for model training (SMS messages)
simple_train = ['call you tonight', 'Call me a cab', 'Please call me... PLEASE!']
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
# learn the 'vocabulary' of the training data (occurs in-place)
vect.fit(simple_train)
CountVectorizer()
# examine the fitted vocabulary
vect.get_feature_names()
['cab', 'call', 'me', 'please', 'tonight', 'you']
# examine the fitted vocabulary_ mapping (each token mapped to its column index)
vect.vocabulary_
{'call': 1, 'you': 5, 'tonight': 4, 'me': 2, 'cab': 0, 'please': 3}
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm
<3x6 sparse matrix of type '<class 'numpy.int64'>'
with 9 stored elements in Compressed Sparse Row format>
📲 The process of CountVectorizer
e.g., 'call you tonight' against the fitted vocabulary ['cab', 'call', 'me', 'please', 'tonight', 'you'] becomes [0, 1, 0, 0, 1, 1]: each position holds how many times the corresponding vocabulary word occurs in the sentence.
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())
cab | call | me | please | tonight | you
---|---|---|---|---|---
0 | 1 | 0 | 0 | 1 | 1
1 | 1 | 1 | 0 | 0 | 0
0 | 1 | 1 | 2 | 0 | 0
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()
array([[0, 1, 0, 0, 1, 1],
[1, 1, 1, 0, 0, 0],
[0, 1, 1, 2, 0, 0]])
📌 From the scikit-learn documentation:
In this scheme, features and samples are defined as follows:
- Each individual token occurrence frequency (normalized or not) is treated as a feature.
- The vector of all the token frequencies for a given document is considered a multivariate sample.
A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.
We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
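A quick illustration of that last point: because only counts are kept, two sentences containing the same words in a different order map to the same vector. A small sketch using the CountVectorizer imported above:
# word order is ignored: both sentences produce identical count vectors
pair = ['please call me', 'call me please']
vect_demo = CountVectorizer().fit(pair)
print(vect_demo.get_feature_names_out())    # ['call' 'me' 'please']
print(vect_demo.transform(pair).toarray())  # [[1 1 1]
                                            #  [1 1 1]]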
# check the type of the document-term matrix
type(simple_train_dtm)
scipy.sparse.csr.csr_matrix
# examine the sparse matrix contents
print(simple_train_dtm)
(0, 1) 1
(0, 4) 1
(0, 5) 1
(1, 0) 1
(1, 1) 1
(1, 2) 1
(2, 1) 1
(2, 2) 1
(2, 3) 2
📌 From the scikit-learn documentation:
As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).
For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.
In order to be able to store such a matrix in memory but also to speed up operations, implementations will typically use a sparse representation such as the implementations available in the scipy.sparse package.
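We can check this directly on the tiny example above: the sparse matrix stores only the 9 non-zero counts out of 3 x 6 = 18 cells. A quick sketch using the objects already defined:
# fraction of cells that are zero in the small document-term matrix
n_cells = simple_train_dtm.shape[0] * simple_train_dtm.shape[1]
sparsity = 1.0 - simple_train_dtm.nnz / n_cells
print(f"{simple_train_dtm.nnz} stored values, sparsity: {sparsity:.0%}")  # 9 stored values, sparsity: 50%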
# example text for model testing
simple_test = ["please don't call me"]
In order to make a prediction, the new observation must have the same features as the training observations, both in number and meaning.
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()
array([[0, 1, 1, 1, 0, 0]])
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())
cab | call | me | please | tonight | you
---|---|---|---|---|---
0 | 1 | 1 | 1 | 0 | 0
📋 Summary:
- vect.fit(train) learns the vocabulary of the training data
- vect.transform(train) uses the fitted vocabulary to build a document-term matrix from the training data
- vect.transform(test) uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before)
💾 Reading a text-based dataset into pandas
# read the file into pandas
sms = pd.read_csv("/kaggle/input/sms-spam-collection-dataset/spam.csv", encoding='latin-1')
# drop the mostly-empty 'Unnamed' columns (any column containing NaN), keeping only the label and text columns
sms.dropna(how="any", inplace=True, axis=1)
sms.columns = ['label', 'message']
sms.head()
label | message
---|---
ham | Go until jurong point, crazy.. Available only ...
ham | Ok lar... Joking wif u oni...
spam | Free entry in 2 a wkly comp to win FA Cup fina...
ham | U dun say so early hor... U c already then say...
ham | Nah I don't think he goes to usf, he lives aro...
🔍 Exploratory Data Analysis (EDA)
sms.describe()
 | label | message
---|---|---
count | 5572 | 5572
unique | 2 | 5169
top | ham | Sorry, I'll call later
freq | 4825 | 30
The label column contains only two unique values: ham and spam.
sms.groupby('label').describe()
label | count | unique | top | freq
---|---|---|---|---
ham | 4825 | 4516 | Sorry, I'll call later | 30
spam | 747 | 653 | Please call our customer service representativ... | 4
There are 4,825 ham rows and 747 spam rows.
# convert label to a numerical variable
# ham to 0 spam to 1
sms['label_num'] = sms.label.map({'ham':0, 'spam':1})
sms.head()
label | message | label_num
---|---|---
ham | Go until jurong point, crazy.. Available only ... | 0
ham | Ok lar... Joking wif u oni... | 0
spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1
ham | U dun say so early hor... U c already then say... | 0
ham | Nah I don't think he goes to usf, he lives aro... | 0
# add a column with the length (number of characters) of each message
sms['message_len'] = sms.message.apply(len)
sms.head()
label | message | label_num | message_len
---|---|---|---
ham | Go until jurong point, crazy.. Available only ... | 0 | 111
ham | Ok lar... Joking wif u oni... | 0 | 29
spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 | 155
ham | U dun say so early hor... U c already then say... | 0 | 49
ham | Nah I don't think he goes to usf, he lives aro... | 0 | 61
👁️ Visualizing and comparing the length of ham and spam messages
plt.figure(figsize=(12, 8))
sms[sms.label=='ham'].message_len.plot(bins=35, kind='hist', color='blue',
label='Ham messages', alpha=0.6)
sms[sms.label=='spam'].message_len.plot(kind='hist', color='red',
label='Spam messages', alpha=0.6)
plt.legend()
plt.xlabel("Message Length")
[Histogram: distribution of message lengths for ham (blue) and spam (red) messages]
We can see that spam messages are usually longer than ham messages. This is probably because spam messages try to pack a lot of information into a single text.
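A quick numeric check of what the histogram shows (a small sketch on the DataFrame above):
# average, median and maximum message length per class
sms.groupby('label').message_len.agg(['mean', 'median', 'max'])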
sms[sms.label=='ham'].describe()
 | label_num | message_len
---|---|---
count | 4825.0 | 4825.000000
mean | 0.0 | 71.023627
std | 0.0 | 58.016023
min | 0.0 | 2.000000
25% | 0.0 | 33.000000
50% | 0.0 | 52.000000
75% | 0.0 | 92.000000
max | 0.0 | 910.000000
Let's find the longest message (length == 910)!
sms[sms.message_len == 910].message.iloc[0]
"For me the love should start with attraction.i should feel that I need her every time around me.she should be the first thing which comes in my thoughts.I would start the day and end it with her.she should be there every time I dream.love will be then when my every breath has her name.my life should happen around her.my life will be named to her.I would cry for her.will give all my happiness and take all her sorrows.I will be ready to fight with anyone for her.I will be in love when I will be doing the craziest things for her.love will be when I don't have to proove anyone that my girl is the most beautiful lady on the whole planet.I will always be singing praises for her.love will be when I start up making chicken curry and end up makiing sambar.life will be the most beautiful then.will get every morning and thank god for the day because she is with me.I would like to say a lot..will tell later.."
📑 Text Pre-processing
Our main issue with our data is that it is all in text format (strings). The classification algorithms we usually use need some sort of numerical feature vector in order to perform the classification task. There are actually many methods to convert a corpus to a vector format. The simplest is the bag-of-words approach, where each unique word in a text is represented by one number.
In this section we'll convert the raw messages (sequence of characters) into vectors (sequences of numbers).
As a first step, let's write a function that splits a message into its individual words, removes very common words ('the', 'a', etc.), and returns the cleaned text. To do this we will take advantage of the NLTK library. It's pretty much the standard library in Python for processing text and has a lot of useful features; we'll only use some of the basic ones here.
Let's create a function that processes the strings in the message column; then we can just use apply() in pandas to process all the text in the DataFrame.
✂️ Removing punctuation and stopwords
We will use string.punctuation to strip punctuation and nltk.corpus.stopwords to remove common stopwords. https://mizykk.tistory.com/29
import string
from nltk.corpus import stopwords  # stopword list (common words to remove)
# Note: if the stopword list has not been downloaded yet, run `import nltk; nltk.download('stopwords')` first.
def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Removes all punctuation
    2. Removes all stopwords
    3. Returns the cleaned text as a single string
    """
    STOPWORDS = stopwords.words('english') + ['u', 'ü', 'ur', '4', '2', 'im', 'dont', 'doin', 'ure']
    # Check characters to see if they are in punctuation
    nopunc = [char for char in mess if char not in string.punctuation]
    # Join the characters again to form the string
    nopunc = ''.join(nopunc)
    # Now just remove any stopwords
    return ' '.join([word for word in nopunc.split() if word.lower() not in STOPWORDS])
sms.head()
label | message | label_num | message_len
---|---|---|---
ham | Go until jurong point, crazy.. Available only ... | 0 | 111
ham | Ok lar... Joking wif u oni... | 0 | 29
spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 | 155
ham | U dun say so early hor... U c already then say... | 0 | 49
ham | Nah I don't think he goes to usf, he lives aro... | 0 | 61
Next, we'll apply this cleaning function to every message; the cleaned messages should be noticeably shorter than the originals.
📊Let's Tokenize!!
We will use Python's collections.Counter to count how many times each word appears in the ham and spam messages.
sms['clean_msg'] = sms.message.apply(text_process)
sms.head()
label | message | label_num | message_len | clean_msg
---|---|---|---|---
ham | Go until jurong point, crazy.. Available only ... | 0 | 111 | Go jurong point crazy Available bugis n great ...
ham | Ok lar... Joking wif u oni... | 0 | 29 | Ok lar Joking wif oni
spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 | 155 | Free entry wkly comp win FA Cup final tkts 21s...
ham | U dun say so early hor... U c already then say... | 0 | 49 | dun say early hor c already say
ham | Nah I don't think he goes to usf, he lives aro... | 0 | 61 | Nah think goes usf lives around though
type(stopwords.words('english'))
list
from collections import Counter
words = sms[sms.label=='ham'].clean_msg.apply(lambda x: [word.lower() for word in x.split()])
ham_words = Counter()
for msg in words:
ham_words.update(msg)
print(ham_words.most_common(50))
[('get', 303), ('ltgt', 276), ('ok', 272), ('go', 247), ('ill', 236), ('know', 232), ('got', 231), ('like', 229), ('call', 229), ('come', 224), ('good', 222), ('time', 189), ('day', 187), ('love', 185), ('going', 167), ('want', 163), ('one', 162), ('home', 160), ('lor', 160), ('need', 156), ('sorry', 153), ('still', 146), ('see', 137), ('n', 134), ('later', 134), ('da', 131), ('r', 131), ('back', 129), ('think', 128), ('well', 126), ('today', 125), ('send', 123), ('tell', 121), ('cant', 118), ('ì', 117), ('hi', 117), ('take', 112), ('much', 112), ('oh', 111), ('night', 107), ('hey', 106), ('happy', 105), ('great', 100), ('way', 100), ('hope', 99), ('pls', 98), ('work', 96), ('wat', 95), ('thats', 94), ('dear', 94)]
words = sms[sms.label=='spam'].clean_msg.apply(lambda x: [word.lower() for word in x.split()])
spam_words = Counter()
for msg in words:
spam_words.update(msg)
print(spam_words.most_common(50))
[('call', 347), ('free', 216), ('txt', 150), ('mobile', 123), ('text', 120), ('claim', 113), ('stop', 113), ('reply', 101), ('prize', 92), ('get', 83), ('new', 69), ('send', 67), ('nokia', 65), ('urgent', 63), ('cash', 62), ('win', 60), ('contact', 56), ('service', 55), ('please', 52), ('guaranteed', 50), ('customer', 49), ('16', 49), ('week', 49), ('tone', 48), ('per', 46), ('phone', 45), ('18', 43), ('chat', 42), ('awarded', 38), ('draw', 38), ('latest', 36), ('å£1000', 35), ('line', 35), ('150ppm', 34), ('mins', 34), ('receive', 33), ('camera', 33), ('1', 33), ('every', 33), ('message', 32), ('holiday', 32), ('landline', 32), ('shows', 31), ('å£2000', 31), ('go', 31), ('box', 30), ('number', 30), ('apply', 29), ('code', 29), ('live', 29)]
🧮 Vectorization
Currently, we have the messages as cleaned strings of tokens, and we need to convert each of those messages into a numerical vector that scikit-learn's models can work with.
We'll do that in three steps using the bag-of-words model (a short sketch combining the steps follows the list):
- Count how many times each word occurs in each message (known as term frequency)
- Weigh the counts, so that frequent tokens get lower weight (inverse document frequency)
- Normalize the vectors to unit length, to abstract from the original text length (L2 norm)
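These three steps are bundled together in scikit-learn's TfidfVectorizer; below we apply CountVectorizer and TfidfTransformer separately, which is equivalent. A minimal sketch of the combined object:
# TfidfVectorizer = CountVectorizer (counting) + TfidfTransformer (idf weighting + L2 normalization)
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer()  # norm='l2' and use_idf=True are the defaults
example_dtm = tfidf_vect.fit_transform(['call you tonight', 'Call me a cab'])
print(example_dtm.shape)        # (2, number of distinct tokens)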
# define X and y (from the SMS data) for use with CountVectorizer
X = sms.clean_msg
y = sms.label_num
print(X.shape)
print(y.shape)
(5572,)
(5572,)
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(4179,)
(1393,)
(4179,)
(1393,)
from sklearn.feature_extraction.text import CountVectorizer
# instantiate the vectorizer
vect = CountVectorizer()
vect.fit(X_train)
CountVectorizer()
# learn training data vocabulary, then use it to create a document-term matrix
X_train_dtm = vect.transform(X_train)
# equivalently: combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)
# examine the document-term matrix
X_train_dtm
<4179x7996 sparse matrix of type '<class 'numpy.int64'>'
with 34796 stored elements in Compressed Sparse Row format>
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm
<1393x7996 sparse matrix of type '<class 'numpy.int64'>'
with 9971 stored elements in Compressed Sparse Row format>
from sklearn.feature_extraction.text import TfidfTransformer
# Tf means term-frequency while tf-idf means term-frequency times inverse document-frequency. This is a common term weighting scheme in information retrieval, that has also found good use in document classification.
tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(X_train_dtm)
tfidf_transformer.transform(X_train_dtm)
<4179x7996 sparse matrix of type '<class 'numpy.float64'>'
with 34796 stored elements in Compressed Sparse Row format>
🤖 Building and evaluating a model
We will use multinomial Naive Bayes:
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
# train the model using X_train_dtm (timing it with an IPython "magic command")
%time nb.fit(X_train_dtm, y_train)
CPU times: user 3.14 ms, sys: 193 µs, total: 3.33 ms
Wall time: 3.27 ms
MultinomialNB()
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)
0.9827709978463748
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)
array([[1205, 8],
[ 16, 164]])
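As a quick peek inside the fitted model, we can list the tokens whose probability under the spam class is highest relative to the ham class. A small sketch using the fitted nb and vect from above (class index 1 corresponds to spam, because label_num maps ham to 0 and spam to 1):
# tokens with the biggest gap between log P(token | spam) and log P(token | ham)
tokens = np.array(vect.get_feature_names_out())
spam_vs_ham = nb.feature_log_prob_[1] - nb.feature_log_prob_[0]
print(tokens[np.argsort(spam_vs_ham)[-10:]])  # ten most spam-indicative tokens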
X_test.shape
(1393,)
# print message text for false positives (ham incorrectly classified as spam)
# X_test[(y_pred_class==1) & (y_test==0)]
X_test[y_pred_class > y_test]
2418 Madamregret disturbancemight receive reference...
4598 laid airtel line rest
386 Customer place call
1289 HeyGreat dealFarm tour 9am 5pm 95pax 50 deposi...
5094 Hi ShanilRakhesh herethanksi exchanged uncut d...
494 free nowcan call
759 Call youcarlos isare phones vibrate acting mig...
3140 Customer place call
Name: clean_msg, dtype: object
# print message text for false negatives (spam incorrectly classified as ham)
X_test[y_pred_class < y_test]
4674 Hi babe Chloe r smashed saturday night great w...
3528 Xmas New Years Eve tickets sale club day 10am ...
3417 LIFE never much fun great came made truly spec...
2773 come takes little time child afraid dark becom...
1960 Guess Somebody know secretly fancies Wanna fin...
5 FreeMsg Hey darling 3 weeks word back Id like ...
2078 85233 FREERingtoneReply REAL
1457 CLAIRE havin borin time alone wanna cum 2nite ...
190 unique enough Find 30th August wwwareyouunique...
2429 Guess IThis first time created web page WWWASJ...
3057 unsubscribed services Get tons sexy babes hunk...
1021 Guess Somebody know secretly fancies Wanna fin...
4067 TBSPERSOLVO chasing us since Sept forå£38 defi...
3358 Sorry missed call lets talk time 07090201529
2821 ROMCAPspam Everyone around responding well pre...
2247 Back work 2morro half term C 2nite sexy passio...
Name: clean_msg, dtype: object
# example of false negative
X_test[5]
'FreeMsg Hey darling 3 weeks word back Id like fun still Tb ok XxX std chgs send å£150 rcv'
# calculate predicted probabilities for X_test_dtm (poorly calibrated)
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob
array([2.11903975e-02, 3.97831612e-04, 1.06470895e-03, ...,
1.31939653e-02, 9.99821127e-05, 6.04083365e-06])
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)
0.9774342768159751
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
pipe = Pipeline([('bow', CountVectorizer()),
('tfid', TfidfTransformer()),
('model', MultinomialNB())])
pipe.fit(X_train, y_train)
Pipeline(steps=[('bow', CountVectorizer()), ('tfid', TfidfTransformer()),
('model', MultinomialNB())])
y_pred = pipe.predict(X_test)
metrics.accuracy_score(y_test, y_pred)
0.9669777458722182
metrics.confusion_matrix(y_test, y_pred)
array([[1213, 0],
[ 46, 134]])
📊 Comparing models
We will compare multinomial Naive Bayes with logistic regression:
Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
# import and instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='liblinear')
# train the model using X_train_dtm
%time logreg.fit(X_train_dtm, y_train)
CPU times: user 14.6 ms, sys: 919 µs, total: 15.5 ms
Wall time: 14.8 ms
LogisticRegression(solver='liblinear')
# make class predictions for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)
# calculate predicted probabilities for X_test_dtm (well calibrated)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
y_pred_prob
array([0.01694418, 0.0152182 , 0.08261755, ..., 0.02198942, 0.00531726,
0.00679188])
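For reference, here is what predict_proba is doing: for a binary problem, the predicted probability of the positive class is the logistic (sigmoid) function applied to the linear decision score. A small sketch (it assumes the fitted logreg above and that SciPy is available):
from scipy.special import expit  # the logistic (sigmoid) function: 1 / (1 + exp(-z))
# predict_proba for class 1 equals the sigmoid of the decision score w·x + b
z = logreg.decision_function(X_test_dtm)
print(np.allclose(expit(z), logreg.predict_proba(X_test_dtm)[:, 1]))  # True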
# calculate accuracy
metrics.accuracy_score(y_test, y_pred_class)
0.9842067480258435
metrics.confusion_matrix(y_test, y_pred_class)
array([[1213, 0],
[ 22, 158]])
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)
0.9835714940001832
📑 Comparing the results
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(2)
model = ['Multinomial Naive Bayes model', 'LogisticRegression']
# recompute each model's AUC so the two bars don't reuse the same y_pred_prob
values = [metrics.roc_auc_score(y_test, nb.predict_proba(X_test_dtm)[:, 1]),
          metrics.roc_auc_score(y_test, logreg.predict_proba(X_test_dtm)[:, 1])]
plt.bar(x, values)
plt.xticks(x, model)
plt.show()
Logistic regression's AUC (0.984) is slightly higher than the multinomial Naive Bayes model's (0.977), but the two models perform almost identically.