This post focuses on using Python to make sense of HTML documents that you retrieve from a web server.
Look at the example pictured below. It shows code that retrieves a web page and prints out its content.
You can see HTML tags in the document. These tags are rendered on a web page to give it structure. Learning HTML is a whole other topic.
However, what you will focus on here is parsing through the content using Python, and looking for certain elements within the content.
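For readers who prefer text to a picture, here is a minimal sketch of that kind of retrieve-and-print code, using the standard library's urllib.request. The function names and the placeholder URL are assumptions made for illustration; this is not necessarily the exact code in the picture.

```python
from urllib.request import urlopen


def decode_lines(fhand):
    """Decode each raw line from a file-like handle and strip trailing whitespace."""
    return [line.decode("utf-8", errors="replace").rstrip() for line in fhand]


def fetch_lines(url):
    """Open a URL and return the document as a list of text lines."""
    with urlopen(url) as fhand:  # fhand behaves like an open file of bytes
        return decode_lines(fhand)


def print_page(url):
    """Print a page line by line, HTML tags included."""
    for line in fetch_lines(url):
        print(line)


# Example (needs network access; the URL is only a placeholder):
# print_page("https://example.com/")
```

The handle returned by urlopen yields bytes, which is why each line is decoded before it is printed.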
In the above picture, look at the string highlighted in purple. This string represents a link to another web page.
You can create a loop that parses out each of these link strings, opens the page it points to, and stores the result in a “fhand” variable. In principle, this type of loop could continue until it has opened and printed all the content on the internet.
Realistically, your computer would get drained of its memory long before your loop completed parsing through all the web links on the internet, but this concept outlines the beginning of a web crawler. A web crawler employs what is referred to as web scraping.
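The loop described above can be sketched roughly as follows. This is one possible approach, not the only one: it assumes links appear as href attributes on <a> tags, pulls them out with the standard library's html.parser, and caps the number of pages so the loop actually ends. The names LinkCollector, extract_links, fetch, and crawl are made up for this sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return absolute URLs for all links found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return [urljoin(base_url, link) for link in collector.links]


def fetch(url):
    """Download a page and return its text (assumes UTF-8 content)."""
    with urlopen(url) as fhand:
        return fhand.read().decode("utf-8", errors="replace")


def crawl(start_url, max_pages=10, fetch_page=fetch):
    """Visit pages breadth-first and stop after max_pages pages."""
    queue = deque([start_url])
    seen = {start_url}  # every URL already queued, to avoid repeats
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except (OSError, ValueError):
            continue  # skip pages that fail to load
        visited.append(url)
        for link in extract_links(html, url):
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

The max_pages limit and the seen set are what keep this sketch from running until memory is exhausted: the first bounds the work, and the second stops the crawler from revisiting pages that link to each other.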
The Power of Web Scraping
Web scraping gives you great power. You are literally able to make a copy of the web, or part of it, given enough memory.
Some web servers employ shields, such as a CAPTCHA, to ward off automated programs like Python scripts that scrape their site. However, a Python script can often get past the simpler shields of this kind. On the other hand, some servers do not care if you scrape their pages.
Why Web Scrape HTML Documents
You can see that there are many reasons why you may want to scrape the web. You could write Python code that checks for new apartments on Craigslist, for example. You could write Python code to pull social data.
Web scraping provides a way to pull data when a site offers no application programming interface (API).
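As a rough illustration of the apartment example, the sketch below compares the listing links found in a page against the ones already seen and reports only the new ones. The "/apa/" link pattern and the function name are hypothetical; a real site will have its own page structure, so the pattern would need to be adapted.

```python
import re

# Hypothetical pattern: assumes each listing is a link whose path
# contains "/apa/". Adjust this to match the site you are scraping.
LISTING_LINK = re.compile(r'href="([^"]*/apa/[^"]*)"')


def find_new_listings(html, seen):
    """Return listing links in html that are not yet in seen, and record them."""
    new = []
    for link in LISTING_LINK.findall(html):
        if link not in seen:
            seen.add(link)
            new.append(link)
    return new
```

Calling find_new_listings on each freshly downloaded page with the same seen set returns only the listings that have appeared since the last check.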
Some websites have rules regarding web scraping. Facebook, for example, does not allow it. Facebook also does not display data publicly; you have to be logged in to see anything. So if you did try to scrape their site, your code would have to log you in first, and then Facebook could easily tell that it’s you doing the scraping.
What next? See how this BeautifulSoup example makes it easy to scrape HTML.