A Complete Guide to Web Scraping
What is Web Scraping?
Web scraping is the process of extracting the valuable data present on any kind of website, such as job-posting sites, eCommerce sites, and many more, through a programming language with minimal manual effort.
Why Web Scraping?
The keyword for this answer is DATA. In today’s era, data is the most important asset for any leading industry. Data can be used for analysis, for enhancing existing datasets, for advertising, and much more. At the same time, major companies and government IT rules don’t allow user data to be shared freely. Hence, web scraping comes into the picture. Just as a genuine user visits a website and looks at the data, a program (also called a bot) visits the website, extracts the data on it, and stores it in the local system.
Is Web Scraping illegal?
Many websites allow the data present on them to be scraped, and some don’t. This can be seen in the website’s robots.txt file. Later in this blog, we will see how to check the robots.txt file.
Why don’t websites want data to be scraped?
The reason is that websites publish their data for users (people). They don’t want their unique data to be downloaded and processed in any manner without authorization. Therefore, they try different ways to stop web scraping.
A few of the ways companies stop web scraping are:
- Using captchas: This is one of the most commonly used techniques to verify whether the visitor is a genuine user or a bot, i.e., a program. The problem with this technique is that it degrades the user experience and can be very irritating at times.
- Using cookies: This is a newer technique, and very few websites use it. Whenever a user visits the website in a browser, the site stores cookies automatically. Storing cookies from a program is more difficult, so when the site doesn’t find the expected cookies, it doesn’t show the data for the requested link.
- Limiting requests: For a genuine user, the number of requests in any given time frame stays fairly small. For example, in a minute a genuine user may make 5, 10, or at most 20 requests, but a program can make hundreds, and in some cases thousands, of requests in a minute. When the website sees too many requests from a single IP, it identifies the client as a robot and then blocks that IP or asks it to verify via a captcha.
- Updating the website regularly: Every website has unique HTML, so to scrape a website properly, the code must be tailored to that website. A single piece of code can only extract data in a particular format, so if the HTML is updated regularly, the code also needs to be updated accordingly.
These are a few of the many ways commonly used to stop Web Scraping.
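To avoid tripping the request-limiting defense described above, a scraper usually pauses between page fetches. A minimal sketch of the idea (the function name and delay bounds are illustrative assumptions, not from the original code):

```python
import random
import time


def polite_delay(min_delay=1.0, max_delay=3.0):
    """Sleep a random interval between requests so traffic looks human, not robotic."""
    delay = random.uniform(min_delay, max_delay)
    time.sleep(delay)
    return delay
```

Calling `polite_delay()` before each request keeps the per-minute request count in the range a genuine user would produce.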
There are many ways of doing web scraping, but the most common are using Selenium or using two major Python libraries (requests and BeautifulSoup).
What is Selenium?
Selenium is an open-source tool that automates web browsers. To use Selenium, a browser driver must be downloaded. The code can be written in any supported language, such as Python, Java, etc.
Here we will see two ways to scrape data from a website.
For illustration, let’s scrape the job postings on the StatusNeo website (https://statusneo.freshteam.com/jobs).
Before jumping to code, as mentioned above, we should check the robots.txt file of the website.
Read and obey robots.txt
To see the robots.txt file, the URL to visit is statusneo.com/robots.txt (base URL + /robots.txt).
Rules defined are:
- No Access
This means no part of the website can be scraped and one should stay away.
- Full Access
This means any part of the website can be scraped.
- Partial Access
This means a few parts of the website are off-limits and the rest can be scraped.
When we inspect statusneo.com/robots.txt, it doesn’t stop us from scraping the job-posting URL.
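These rules can also be checked programmatically with Python’s standard-library robots.txt parser. The rules and URLs below are hypothetical examples of partial access, not statusneo.com’s actual file:

```python
from urllib import robotparser

# Hypothetical robots.txt content illustrating partial access
rules = """
User-agent: *
Disallow: /admin/
Allow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Paths outside /admin/ may be fetched; /admin/ may not
print(rp.can_fetch("*", "https://example.com/jobs"))    # True
print(rp.can_fetch("*", "https://example.com/admin/"))  # False
```

In practice, `rp.set_url("https://statusneo.com/robots.txt")` followed by `rp.read()` fetches and parses the live file instead of the inline rules.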
To start with Selenium, a driver is required. There are two ways to get the driver:
- Downloading through code:
```python
driver = webdriver.Chrome(ChromeDriverManager().install())
```
- Downloading manually from:
We will use the first method to install the driver.
Let’s jump to the coding part.
```python
# IMPORT NECESSARY PACKAGES
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from tqdm.auto import tqdm
import pandas as pd
```
```python
# Install the driver
driver = webdriver.Chrome(ChromeDriverManager().install())
```
```python
# Specify the search URL
search_url = "https://statusneo.freshteam.com/jobs"
driver.get(search_url)
```
```python
# Get the header
driver.find_element(by=By.XPATH, value="/html/body/header/div/nav/h4").text
```
```python
# For job postings
job_list = driver.find_element(by=By.CLASS_NAME, value="job-role-list")
job_list = job_list.find_elements(by=By.TAG_NAME, value="a")
```
```python
# Getting data from each post
job_title = []
job_desc = []
job_location = []
job_link = []
for job in tqdm(job_list):
    row_details = job.find_element(by=By.CLASS_NAME, value="row")
    job_title.append(row_details.find_element(by=By.CLASS_NAME, value="job-title").text)
    job_desc.append(row_details.find_element(by=By.CLASS_NAME, value="job-desc").text)
    job_location.append(row_details.find_element(by=By.CLASS_NAME, value="location-info").text)
    job_link.append(job.get_attribute("href"))
```
```python
# Replacing "\n" with a space
job_location = [i.replace("\n", " ") for i in job_location]
```
```python
# Storing data in a DataFrame
data = pd.DataFrame()
data['Job Title'] = job_title
data['Job Description'] = job_desc
data['Job Location'] = job_location
data.head()
```
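The scraped DataFrame can be saved for later analysis. A minimal sketch with stand-in data (the rows and filename are assumptions, not real scraped output):

```python
import pandas as pd

# Stand-in rows in the same shape as the DataFrame built above
data = pd.DataFrame({
    "Job Title": ["Data Engineer"],
    "Job Description": ["Build and maintain data pipelines"],
    "Job Location": ["Gurugram, Haryana, India"],
})

# index=False keeps the row index out of the file
data.to_csv("statusneo_jobs.csv", index=False)
```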
Using Python packages (requests and BeautifulSoup):
```python
# Import necessary packages
import requests
from bs4 import BeautifulSoup
from tqdm.auto import tqdm
import pandas as pd
```
```python
# Fetch the page
search_url = "https://statusneo.freshteam.com/jobs"
page = requests.get(search_url)
```
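Some sites reject the default requests User-Agent. If the GET above doesn’t return status 200, sending a browser-like header often helps; the header string here is only an illustration, and the request is prepared locally rather than sent:

```python
import requests

# A browser-like User-Agent string (illustrative, not a real browser build)
headers = {"User-Agent": "Mozilla/5.0 (compatible; demo-scraper/1.0)"}

# Prepare the request locally to show the header is attached (no network call)
prepared = requests.Request(
    "GET", "https://statusneo.freshteam.com/jobs", headers=headers
).prepare()
print(prepared.headers["User-Agent"])
```

In the live scrape, the same `headers` dict would simply be passed as `requests.get(search_url, headers=headers)`.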
```python
# Parse the HTML
soup = BeautifulSoup(page.content, "html.parser")
```
```python
# Get the header
soup.find("h4", class_="brand-text").text
```
```python
# For job postings
job_list = soup.find("div", class_="job-role-list")
job_list = job_list.find_all("a")
```
```python
# Getting data from each post
# (class names match the ones used in the Selenium version above)
job_title = []
job_desc = []
job_location = []
job_link = []
for job in tqdm(job_list):
    job_title.append(job.find("div", class_="job-title").text)
    job_desc.append(job.find("div", class_="job-desc").text)
    job_location.append(job.find("div", class_="location-info").text)
    job_link.append(job.attrs['href'])
```
```python
# Collapse newlines and runs of whitespace into single spaces
job_desc = [" ".join(i.split()) for i in job_desc]
job_location = [" ".join(i.split()) for i in job_location]
```
```python
# Storing data in a DataFrame
data = pd.DataFrame()
data['Job Title'] = job_title
data['Job Description'] = job_desc
data['Job Location'] = job_location
data.head()
```
**Note: This code worked as of 13th June 2022. If the HTML of the website changes, the paths have to be updated accordingly, and the whole code might need to be updated as well.