Use Python for Web Scraping

This post demonstrates how to write Python for web scraping.

Learning the HTTP protocol in full is fairly complex, but it is simple to use from Python. The picture demonstrates how to make an HTTP request in Python with a socket.

[Figure: HTTP request]

The line starting with “mysock.connect” is what reaches out across the internet from the socket and connects it to an endpoint, that is, a host and a port.

A server must be listening at that endpoint, or else your code will fail with an error right there at the third line. A crucial difference between a socket and a file opened for reading is that you can both send and receive data with a socket.

Because you are using the HTTP protocol and you established the socket connection, it is your responsibility to make the first communication.

The line starting with “mysock.send” makes the first communication by sending a GET request. Once the server answers the GET request, you can scrape the data you want from the response.
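To make the failure case concrete, here is a minimal sketch of the connection step. The helper name `open_connection`, the timeout value, and the error message are assumptions for illustration and are not taken from the pictured code.

```python
import socket


def open_connection(host, port, timeout=5.0):
    """Return a connected TCP socket, or None if the connection fails.

    `open_connection` is a hypothetical helper used only for illustration.
    """
    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.settimeout(timeout)  # avoid waiting forever on an unreachable host
    try:
        # This is the step that fails when no server is listening.
        mysock.connect((host, port))
    except OSError as err:
        # Refused connections, timeouts, and DNS failures are all OSError subclasses.
        mysock.close()
        print(f"Could not connect to {host}:{port}: {err}")
        return None
    return mysock
```

Without the `try`/`except`, the same failure would stop the program with a traceback at the `connect` call.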
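To show what that first message looks like, here is a small sketch that builds the bytes of a GET request. The helper name `build_get_request` and the host and path values are placeholders for illustration; the pictured code may format its request differently.

```python
def build_get_request(host, path="/"):
    """Build a minimal HTTP/1.0 GET request as bytes (hypothetical helper)."""
    request = (
        f"GET {path} HTTP/1.0\r\n"  # request line: method, path, protocol version
        f"Host: {host}\r\n"         # names the site being requested
        "\r\n"                      # blank line marks the end of the headers
    )
    # Sockets send bytes, so the text is encoded before it is passed to send.
    return request.encode("ascii")


# Placeholder host and path, used only to show the shape of the request.
print(build_get_request("example.com", "/index.html"))
```

The returned bytes are what you would pass to “mysock.send”, or to `sendall`, which keeps sending until the whole message has been written.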

The while loop receives data up to 512 bytes at a time. If fewer than 512 bytes are available, you still receive them. When the result contains less than one byte, meaning it is empty, the server has closed the connection and the loop ends. Running this program should return the following data:

[Figure: web scraping output]
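The receive loop described above can be sketched as follows. The helper name `read_all` is an assumption, and the demonstration uses a local pair of connected sockets in place of a real web server so that it runs offline.

```python
import socket


def read_all(sock, chunk_size=512):
    """Collect everything the peer sends until it closes the connection."""
    chunks = []
    while True:
        data = sock.recv(chunk_size)  # returns at most chunk_size bytes
        if len(data) < 1:             # empty result: the peer closed the connection
            break
        chunks.append(data)
    return b"".join(chunks)


# Offline demonstration: a connected pair of local sockets stands in for
# the client and the server.
server_side, client_side = socket.socketpair()
server_side.sendall(b"x" * 1000)
server_side.close()                   # closing signals the end of the data
body = read_all(client_side)
client_side.close()
print(len(body))
```

Because 1000 is not a multiple of 512, the final chunk is shorter than the others, and the loop ends only once an empty result arrives.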

Make the HTTP Request Easier

The previous example showed that it is fairly simple to make an HTTP request with Python. A library called “urllib” makes it even easier.

[Figure: urllib in Python]

The urllib library works like an extra application layer that makes a URL look as if it were just a file.

You can see that using urllib is similar to using a handle to open and read a file.
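A brief sketch of that file-like usage follows. The helper name `read_lines` and the URL in the comment are placeholders for illustration; the pictured example may use a different address.

```python
import urllib.request


def read_lines(url):
    """Return the decoded lines of the resource at `url` (hypothetical helper)."""
    lines = []
    # urlopen returns a file-like object, so it can be iterated line by line.
    with urllib.request.urlopen(url) as handle:
        for line in handle:
            # Each line arrives as bytes; decode it and drop the trailing newline.
            lines.append(line.decode().rstrip())
    return lines


# Example call (placeholder URL, requires network access):
# for line in read_lines("http://example.com/"):
#     print(line)
```

Compared with the socket version, there is no connect, send, or receive loop to write; urllib handles those steps for you.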

