A Complete Guide to Web Scraping
What is Web Scraping?
Web scraping is the process of extracting the valuable data present on any kind of website, such as job-posting sites, eCommerce sites, and many more, through a programming language with minimal manual effort.
Why Web Scraping?
The keyword for this answer is DATA. In today’s era, data is the most important asset for any leading industry. Data can be used for analysis, for enhancing existing datasets, for advertising, and much more. At the same time, major companies and government IT rules don’t allow user data to be shared freely. Hence, web scraping comes into the picture. Just as a genuine user visits a website and looks at the data, a program (also called a bot) visits the website, extracts the data on it, and stores it in the local system.
Is Web Scraping illegal?
Many websites allow the data present on them to be scraped, and some don’t. This can be seen in the website’s robots.txt file. Later in this blog, we will see how to check the robots.txt file.
Why don’t websites want data to be scraped?
The reason is that websites publish their data for users (people). They don’t want their unique data to be downloaded and processed in any manner without authorization. Therefore, they try different ways to stop web scraping.
A few of the ways companies stop web scraping are:
- Using captchas: This is one of the most commonly used techniques to verify whether the visitor is a genuine user or a bot, i.e., a program. The problem with this technique is that it degrades the user experience and can be very irritating at times.
- Using cookies: This is a newer technique, and very few websites use it. Whenever a user visits the website in a browser, the site stores cookies automatically. Storing cookies from a program is more difficult, so when the site doesn’t find the expected cookies, it doesn’t show the data for the requested link.
- Limiting requests: For a genuine user, the number of requests in any given time frame stays fairly small. For example, in a minute a genuine user may make 5, 10, or at most 20 requests, but a program can make hundreds, and in some cases thousands, of requests in a minute. When the website sees too many requests from a single IP, it identifies the client as a robot and then blocks that IP or asks it to verify via a captcha.
- Updating the website regularly: Every website has unique HTML, so to scrape a website properly, the code must be tailored to that website. A single piece of code can only extract data in a particular format, so if the HTML is updated regularly, the code also needs to be updated accordingly.
These are a few of the many ways commonly used to stop Web Scraping.
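To avoid tripping the request-limiting defense described above, a scraper usually pauses between page fetches. A minimal sketch of the idea (the function name and delay bounds are illustrative assumptions, not from the original code):

```python
import random
import time


def polite_delay(min_delay=1.0, max_delay=3.0):
    """Sleep a random interval between requests so traffic looks human, not robotic."""
    delay = random.uniform(min_delay, max_delay)
    time.sleep(delay)
    return delay
```

Calling `polite_delay()` before each request keeps the per-minute request count in the range a genuine user would produce.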
There are many ways of doing web scraping, but the most common are using Selenium or using two major Python libraries (requests and BeautifulSoup).
What is Selenium?
Selenium is an open-source tool that automates web browsers. To use Selenium, a browser driver must be downloaded. The code can be written in any supported language, such as Python, Java, etc.
Here we will see two ways to scrape data from a website.
For illustration, let’s scrape the job postings on the StatusNeo website (https://statusneo.freshteam.com/jobs).
Before jumping to code, as mentioned above, we should check the robots.txt file of the website.
Read and obey robots.txt
To see the robots.txt file, the URL to visit is statusneo.com/robots.txt (base URL + /robots.txt).
Rules defined are:
- No Access
This means no part of the website can be scraped and one should stay away.
- Full Access
This means any part of the website can be scraped.
- Partial Access
This means a few parts of the website are off-limits and the rest can be scraped.
When we inspect statusneo.com/robots.txt, it doesn’t stop us from scraping the job-posting URL.
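These rules can also be checked programmatically with Python’s standard-library robots.txt parser. The rules and URLs below are hypothetical examples of partial access, not statusneo.com’s actual file:

```python
from urllib import robotparser

# Hypothetical robots.txt content illustrating partial access
rules = """
User-agent: *
Disallow: /admin/
Allow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Paths outside /admin/ may be fetched; /admin/ may not
print(rp.can_fetch("*", "https://example.com/jobs"))    # True
print(rp.can_fetch("*", "https://example.com/admin/"))  # False
```

In practice, `rp.set_url("https://statusneo.com/robots.txt")` followed by `rp.read()` fetches and parses the live file instead of the inline rules.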
To start with Selenium, a driver is required. There are two ways to get the driver:
- Downloading through code:
```python
driver = webdriver.Chrome(ChromeDriverManager().install())
```
- Downloading manually from:
We will use the first method to install the driver.
Let’s jump to the coding part.
```python
# IMPORT NECESSARY PACKAGES
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from tqdm.auto import tqdm
import pandas as pd
```
```python
# Install the driver
driver = webdriver.Chrome(ChromeDriverManager().install())
```
```python
# Specify the search URL
search_url = "https://statusneo.freshteam.com/jobs"
driver.get(search_url)
```
```python
# Get the header
driver.find_element(by=By.XPATH, value="/html/body/header/div/nav/h4").text
```
```python
# For job postings
job_list = driver.find_element(by=By.CLASS_NAME, value="job-role-list")
job_list = job_list.find_elements(by=By.TAG_NAME, value="a")
```
```python
# Getting data from each post
job_title = []
job_desc = []
job_location = []
job_link = []
for job in tqdm(job_list):
    row_details = job.find_element(by=By.CLASS_NAME, value="row")
    job_title.append(row_details.find_element(by=By.CLASS_NAME, value="job-title").text)
    job_desc.append(row_details.find_element(by=By.CLASS_NAME, value="job-desc").text)
    job_location.append(row_details.find_element(by=By.CLASS_NAME, value="location-info").text)
    job_link.append(job.get_attribute("href"))
```
```python
# Replacing "\n" with a space
job_location = [i.replace("\n", " ") for i in job_location]
```
```python
# Storing data in a DataFrame
data = pd.DataFrame()
data['Job Title'] = job_title
data['Job Description'] = job_desc
data['Job Location'] = job_location
data.head()
```
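The scraped DataFrame can be saved for later analysis. A minimal sketch with stand-in data (the rows and filename are assumptions, not real scraped output):

```python
import pandas as pd

# Stand-in rows in the same shape as the DataFrame built above
data = pd.DataFrame({
    "Job Title": ["Data Engineer"],
    "Job Description": ["Build and maintain data pipelines"],
    "Job Location": ["Gurugram, Haryana, India"],
})

# index=False keeps the row index out of the file
data.to_csv("statusneo_jobs.csv", index=False)
```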
Using Python packages (requests and BeautifulSoup):
```python
# Import necessary packages
import requests
from bs4 import BeautifulSoup
from tqdm.auto import tqdm
import pandas as pd
```
```python
# Fetch the page
search_url = "https://statusneo.freshteam.com/jobs"
page = requests.get(search_url)
```
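Some sites reject the default requests User-Agent. If the GET above doesn’t return status 200, sending a browser-like header often helps; the header string here is only an illustration, and the request is prepared locally rather than sent:

```python
import requests

# A browser-like User-Agent string (illustrative, not a real browser build)
headers = {"User-Agent": "Mozilla/5.0 (compatible; demo-scraper/1.0)"}

# Prepare the request locally to show the header is attached (no network call)
prepared = requests.Request(
    "GET", "https://statusneo.freshteam.com/jobs", headers=headers
).prepare()
print(prepared.headers["User-Agent"])
```

In the live scrape, the same `headers` dict would simply be passed as `requests.get(search_url, headers=headers)`.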
```python
# Parse the HTML
soup = BeautifulSoup(page.content, "html.parser")
```
```python
# Get the header
soup.find("h4", class_="brand-text").text
```
```python
# For job postings
job_list = soup.find("div", class_="job-role-list")
job_list = job_list.find_all("a")
```
```python
# Getting data from each post
# (class names match the ones used in the Selenium version above)
job_title = []
job_desc = []
job_location = []
job_link = []
for job in tqdm(job_list):
    job_title.append(job.find("div", class_="job-title").text)
    job_desc.append(job.find("div", class_="job-desc").text)
    job_location.append(job.find("div", class_="location-info").text)
    job_link.append(job.attrs['href'])
```
```python
# Collapse newlines and runs of whitespace into single spaces
job_desc = [" ".join(i.split()) for i in job_desc]
job_location = [" ".join(i.split()) for i in job_location]
```
```python
# Storing data in a DataFrame
data = pd.DataFrame()
data['Job Title'] = job_title
data['Job Description'] = job_desc
data['Job Location'] = job_location
data.head()
```
**Note: This code worked as of 13th June 2022. If the HTML of the website changes, the paths have to be updated accordingly, and the whole code might need to be updated as well.