Web scraping with Python (demo included)

Saroj Humagain
4 min readDec 3, 2018

Web scraping is a technique to extract large amounts of data from websites and save it to a local file on your computer or to a database.

Why Web Scraping?
Scraping is basically extracting data from different websites. There are many ways to collect data from the web. Suppose we want to build a system that compares the prices of products or services that different enterprises offer. Here, we need data from the merchants. One way to collect it is to copy prices from websites, paste them into our local system, and compare them. That would be a very tedious job and consume a lot of time as well. So what can we do? We can simply scrape the prices from the different websites. Apart from this, there are many other use cases of web scraping. Some of them are as follows:

  • E-commerce portal: We can build our own website and then scrape products from retailer or manufacturer websites. We can scrape prices, ratings, images, and any other specifications.
  • Market research: Web scraping can provide the important information you need to identify and analyze a market, its needs, and the competition.
  • For marketing: Web scraping can be used to gather contact details of businesses or individuals from websites like yellopages.com and linkedin.com. Details like email addresses, phone numbers, and website URLs can help you in marketing.
  • For research: Data is an integral part of any research, be it academic, scientific, or marketing research. Web scraping can help you gather structured data from multiple sources on the internet with ease.

Also, we scrape because many websites do not provide an API, which leaves web scraping as the only alternative for collecting their information. I personally used to have one question about web scraping: is it legal? The answer is that scraping is fine as long as you are not causing considerable damage to the target website and you are doing it responsibly. One should be careful while scraping, because there is a fine line between collecting information and stealing it.

What is web scraping?

Web scraping is a technique where we automate the procedure of collecting a large amount of data, instead of manually copying it from the website, and structure it in a preferred format.

Basic steps in web scraping:

  1. Document load: To scrape a website, we first need to load the document, which is an HTML document.
  2. Parsing: It is the process of interpreting the document so that searching it becomes possible.
  3. Extraction: In this step, we extract whatever we need, like names, prices, or any other element from the HTML document.
  4. Transformation: We need to transform the extracted data into a useful format.
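The four steps above can be sketched in a few lines of Python. To keep the sketch self-contained, an inline HTML snippet stands in for the downloaded page; in a real run, step 1 would fetch the page over HTTP instead.

```python
import csv
import io
from bs4 import BeautifulSoup

# 1. Document load: normally fetched over HTTP; here an inline HTML
#    snippet stands in for the downloaded page.
html = """
<html><body>
  <div class="product"><span class="name">Phone A</span><span class="price">100</span></div>
  <div class="product"><span class="name">Phone B</span><span class="price">150</span></div>
</body></html>
"""

# 2. Parsing: build a searchable tree from the raw HTML.
soup = BeautifulSoup(html, "html.parser")

# 3. Extraction: pull out the names and prices we care about.
rows = [(p.select_one(".name").text, p.select_one(".price").text)
        for p in soup.select(".product")]

# 4. Transformation: write the extracted data out as CSV.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "price"])
writer.writerows(rows)
print(buffer.getvalue())
```

The class names and the two-column output are, of course, just an illustration; the same load-parse-extract-transform shape applies to any page and any target format.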

Web scraping can be done with many different programming languages, but I prefer Python because it has libraries for almost everything, and string manipulation is very easy in Python. That makes Python the go-to language for many developers. There are a lot of scraping libraries in Python; some of them are

  1. Pattern
  2. Scrapy
  3. Mechanize
  4. Beautiful Soup
  5. Requests

Beautiful Soup and Requests are the most used libraries for web scraping. Beautiful Soup is used for pulling data out of HTML and XML files; it provides a very simple way of navigating, searching, and modifying the parse tree. Requests is a library which is widely used for sending and receiving information over HTTP.
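Here is a minimal sketch of those three Beautiful Soup operations (navigating, searching, modifying) on a tiny inline document; in practice the HTML would come from `requests.get(url).text`.

```python
from bs4 import BeautifulSoup

html = "<html><body><h1 id='title'>Deals</h1><a href='/a'>A</a><a href='/b'>B</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Navigating: walk down the tree by tag name.
heading = soup.body.h1.text          # "Deals"

# Searching: find_all collects every matching tag.
links = [a["href"] for a in soup.find_all("a")]

# Modifying: the parse tree can be edited in place.
soup.h1.string = "Hot Deals"
print(heading, links)
```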

For the demo, we are going to scrape some data from RedDoko. Let's scrape the following web page and extract all Samsung products.

First, we have to inspect the webpage by pressing Ctrl+Shift+I. You will get the following result.

Now, we will import the BeautifulSoup and Requests libraries. The following piece of code is self-explanatory.
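A sketch of what such a scraper looks like is below. Note that the CSS class names (`product-item`, `product-name`, `product-price`) are assumptions for illustration, not RedDoko's real markup; inspect the actual page (Ctrl+Shift+I) and substitute the selectors you find there.

```python
import csv
from bs4 import BeautifulSoup

def scrape_products(html):
    """Extract (name, price) pairs for Samsung products from a listing page.
    The selectors below are hypothetical; adapt them to the real site."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for item in soup.select(".product-item"):
        name = item.select_one(".product-name").get_text(strip=True)
        price = item.select_one(".product-price").get_text(strip=True)
        if "Samsung" in name:          # keep only Samsung products
            products.append((name, price))
    return products

def save_csv(rows, path):
    """Write the scraped rows to a CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "price"])
        writer.writerows(rows)

# In the real demo the HTML comes from the site, e.g.:
#   import requests
#   html = requests.get(url).text   # url = the RedDoko listing page
# Here a sample fragment keeps the sketch runnable offline.
sample = """
<div class="product-item"><span class="product-name">Samsung Galaxy S9</span>
  <span class="product-price">Rs. 84,900</span></div>
<div class="product-item"><span class="product-name">iPhone X</span>
  <span class="product-price">Rs. 1,29,000</span></div>
"""
print(scrape_products(sample))
```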

And the CSV file containing all this data looks like

If you want to go through it line by line, here is the GitHub link; the code is written in a Jupyter notebook.

