URL Classification using Naive Bayes

Pavan Purohit
3 min read · May 6, 2021


Hey guys, welcome to my technical blog. I had been searching for a way to classify websites (URLs) into different categories for quite a few days. Unlike traditional URL classification, where the model depends on historical data or on continuously adding new URLs to the dataset, in this tutorial I would like to explain how to classify websites (URLs) into categories using web scraping and a machine learning algorithm.

Text classification is a supervised machine learning task that is widely used in natural language processing. A good text classifier can sort different URLs into categories. Some popular text classifiers today are the Naive Bayes classifier, linear classifiers, and Support Vector Machines. Let us explore the Naive Bayes classifier in this tutorial. We will implement it in two steps.

1) Data Collection & Web Scraping:

After a lot of research, I landed on a good Kaggle dataset for our implementation; you can find it here. This dataset mainly contains three columns: Category, Title, and Description. It also covers a good number of categories, but it is not sufficient on its own because of some local or desi words present on websites. One way to collect the text on a website is to scrape it. A convenient package for extracting data from an HTML page is BeautifulSoup, as shown below.

from bs4 import BeautifulSoup
import requests

url = "https://www.mondovo.com/keywords/technology-keywords"

# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text

# Parse the HTML content
soup = BeautifulSoup(html_content, "lxml")
# print(soup.prettify())  # print the parsed data of html

shoppingsites = []
for link in soup.find_all("td"):
    shoppingsites.append(link.text)
    # print("Title: {}".format(link.get("title")))
    # print("href: {}".format(link.get("href")))
shoppingsites

The above snippet downloads the page's raw HTML, and from it we can pull out the data of a particular tag such as a table, row, column, or div; all we need to do is replace 'td' in soup.find_all("td") with whichever tag we want to extract. The main code to extract the title and meta content with BeautifulSoup is as below:

def extract_url_details(url, count=0):
    from bs4 import BeautifulSoup
    import requests
    content = []
    title2 = []
    header = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.105 Safari/537.36'}
    res = requests.get(url, headers=header, timeout=30)
    if res.status_code == 200:
        try:
            soup = BeautifulSoup(res.content, "html.parser")
            title = soup.title.string
            title2.append(title)
            meta = soup.find_all('meta')
            for tag in meta:
                # keep only the description/keywords meta tags
                if 'name' in tag.attrs.keys() and tag.attrs['name'].strip().lower() in ['description', 'keywords']:
                    content.append(tag.attrs['content'])
        except Exception:
            content.append("no_details")
            title2.append("no_details")
    return title2, content
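
For example, calling the function on an arbitrary URL returns the collected title and meta text (the URL below is only an illustrative example, and the printed output will vary by page):

# Illustrative example: the URL is arbitrary
title, content = extract_url_details("https://www.python.org")
print(title)    # page <title> text, e.g. ['Welcome to Python.org']
print(content)  # meta description/keywords text, if any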

Let us apply some data cleaning techniques to the data obtained from web scraping, as below:

a) split the data into tokens.

b) remove stop words and special characters, and apply lemmatization and stemming.

def preprocess(sentence):
    import nltk
    from nltk.tokenize import RegexpTokenizer
    from nltk.stem import WordNetLemmatizer, PorterStemmer
    from nltk.corpus import stopwords
    import re
    # requires nltk.download('stopwords') and nltk.download('wordnet') to have been run once
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()
    sentence = str(sentence)
    sentence = sentence.lower()
    sentence = sentence.replace('{html}', "")
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', sentence)     # strip HTML tags
    rem_url = re.sub(r'http\S+', '', cleantext)  # strip URLs
    rem_num = re.sub('[0-9]+', '', rem_url)      # strip numbers
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(rem_num)
    filtered_words = [w for w in tokens if len(w) > 2 and w not in stopwords.words('english')]
    stem_words = [stemmer.stem(w) for w in filtered_words]       # stemming
    lemma_words = [lemmatizer.lemmatize(w) for w in stem_words]  # lemmatization
    return " ".join(lemma_words)

2) Applying the Naive Bayes Classifier:

import pandas as pd

recommendation_dataset = pd.read_csv('dmoz.csv')  # dataset from Kaggle
recommendation_dataset = recommendation_dataset[recommendation_dataset['words'].notna()]  # remove NaN rows from the dataset

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB  # importing multinomial Naive Bayes

# split into train and test data
X_train, X_test, y_train, y_test = train_test_split(recommendation_dataset['words'], recommendation_dataset['type'], random_state=5)

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# train the MultinomialNB model on the tf-idf features
classifier = MultinomialNB().fit(X_train_tfidf, y_train)

# url_content holds the preprocessed content or title of the URL you want to classify
result = classifier.predict(tfidf_transformer.transform(count_vect.transform([url_content])))[0]
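
Before moving on, it is worth checking how the model does on the held-out test set. Here is a minimal sketch that reuses the variables defined above:

from sklearn.metrics import accuracy_score

# score the classifier on the test data, using the same count -> tf-idf pipeline
X_test_tfidf = tfidf_transformer.transform(count_vect.transform(X_test))
print("Test accuracy:", accuracy_score(y_test, classifier.predict(X_test_tfidf)))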

Once the model is trained, pickle the fitted vectorizer, tf-idf transformer, and classifier. Loading them from the pickle file later is much faster than re-training every time, which is a big speed-up when you have a large dataset.

import pickle

# save the fitted vectorizer, tf-idf transformer, and classifier
with open('model.pkl', 'wb') as fout:
    pickle.dump((count_vect, tfidf_transformer, classifier), fout)

# later, load them back (from the file saved above)
with open('your_pickle_file.pkl', 'rb') as fin:
    vectorizer, tfidf, clf = pickle.load(fin)

result = clf.predict(tfidf.transform(vectorizer.transform([url_content])))[0]
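
Putting everything together, classifying a brand-new URL looks roughly like this. This is a sketch that assumes the extract_url_details and preprocess functions from step 1 and the pickled objects loaded above; the URL itself is just an example:

# End-to-end sketch: scrape -> clean -> classify (the URL is arbitrary)
new_url = "https://www.bbc.com/news/technology"
title, content = extract_url_details(new_url)
url_content = preprocess(" ".join(title + content))
category = clf.predict(tfidf.transform(vectorizer.transform([url_content])))[0]
print(new_url, "->", category)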

This is my first piece on Medium, so please help me improve with your comments. Thank you so much for giving your precious time.

Contributors to this article:

Pavan Purohit: linkedin.com/in/pavan-purohit-aa51ab108

Sunil Prakash: linkedin.com/in/sunil-prakash-04306a14b

Divya Khandekar: linkedin.com/in/divya-khandekar-3807671b6

Please find the full code on GitHub:

https://github.com/pavanpurohit47/Url-classification
