How To a Code HTTP Client Using Socket - Python

In this tutorial we'll be writing a very simple HTTP client.
HTTP (or Hyper Text Transfer Protocol) is a text-based client-server protocol that runs over TCP.
Plain HTTP runs over TCP port 80.
It should be noted that plain HTTP is mostly deprecated for security reasons. Today, sites should use HTTPS, the secure version of HTTP. HTTPS secures HTTP by merely running the HTTP protocol through a Transport Layer Security (TLS) layer In order to code a http client we need to have a basic understanding of the protocol specification.

HTTP request and response

What You Should Know

It is suggested that you've the basic knowledge of the socket module

HTTP Request type

  • GET: Is used requesting a document from the server.
  • POST: Is used sending data from the client to the server.
  • HEAD: It's just like GET request but the response is just response header instead of receiving the both response header and response body.
  • PUT: Is used to send a document to the web server.
  • DELETE: Is used to request that the web server should delete a document or resource.
  • TRACE: Is used to request diagnostic information from web proxies.
  • CONNECT: Is sometimes used to initiate an HTTP connection through a proxy server.
  • OPTIONS: Is used to ask which HTTP request types are supported by the server for a given resource.

HTTP Request Format

If you open your web browser and navigate to http://www.example.com/page1.htm, your browser will need to send an HTTP request to the web server at www.example.com. That HTTP request may look like this:

  


GET /page1.htm HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
Host: example.com
Connection: close

From the above request example, the GET means we want to get a document from the server, and the document path on the server is /page1.html.
and the HTTP/1.1 is protocol and the version(1.1).
The User-Agent header is the name of the software requesting the page.
The Host header is the name of the host we're requesting the page from.
The Connection header tells the server what to do after finishing request, the value of this field can be Keep-Alive or close which means to keep the connection alive or close the connection after sending response respectively.

HTTP Response Format

Like the HTTP request, the HTTP response also consists of a header part and a body part.
Also similar to the HTTP request, the body part is optional.
Most HTTP responses do have a body part, though. The server at www.example.com could respond to our HTTP request with the following reply:

  


HTTP/1.1 200 OK
Cache-Control: max-age=604800
Content-Type: text/html; charset=UTF-8
Date: Sun, 22 May 2022 16:46:09 GMT
Etag: "1541025663+gzip"
Expires: Sun, 29 May 2022 16:46:09 GMT
Last-Modified: Fri, 09 Aug 2013 23:54:35 GMT
Server: ECS (ord/5730)
Vary: Accept-Encoding
X-Cache: HIT
Content-Length: 1270
<!doctype html>
<html>
<head>
<title>Example Domain</title>
...

The 200 from the above response is the status code which means everything is OK or request successful.
The Content-Length is the size of the document we're receiving.
In HTTP we use CRLF('\r\n') to indicate the end of each header, we use double CRLF('\r\n\r\n') to indicate the end of the request headers or response headers.
Now that we have some basics of how this protocol works let's make a HTTP request from our python program.

First we'll import the socket module and define a class


  


import socket

class HTTPClient:

    def __init__(self, url):
        self.url = url
        self.buffer_size = 4096
        self.__parse_url__()

In the constructor, the user needs to pass in the URL for this connection.
and we define a buffer_size which is the maximum size of data to be received, and then we called self.__parse_url__() to break-down the url, the method defination is next
. The next function will be used to break-down the URL into:

  • Protocol
  • Hostname
  • Port
  • Path
  • Query string
This is an example:

a URL example
  


# function for parsing the URL
def __parse_url__(self):

    # Get the protocol,e.g http, https etc
    # If no protocol specified we raised an exception
    url = self.url
    try:
        self.protocol, url = url.split('://')
    except ValueError:
        raise Exception('Invalid URL')

    # we make sure that the protocol is http
    # otherwise we raise an exception saying protocol is not supported
    if self.protocol != 'http':
        raise Exception(f'Protocol not supported: {self.protocol}')

    # Get the hostname and port, example www.google.com:80
    hostname_port = url.split('/')[0]

    # Python will raises a ValueError if there's no enough
    # values to unpack, which means there's no port number
    # so we'll just default to 80

    try:
        self.hostname, self.port = hostname_port.split(':')
    except ValueError:
        self.hostname = hostname_port
        self.port = 80

    # get the path
    try:
        self.path = url.split('/')[1]
        self.path = '/' + self.path
    except IndexError:
        self.path = '/'

    # remove any query string from our path
    if '?q=' in self.path:
        self.path = self.path.split('?q=')[0]

    # get the query string (if available)
    try:
        self.query = url.split('?q=')[1]
    except IndexError:
        self.query = ''

    # remove the '#' from the query string if it contains any
    if '#' in self.query:
        self.query = self.query.split('#')[0]


Next we'll define another method for establishing connection to the server.


  


# function for connecting to the server
def __connect__(self):

    # resolve hostname to ip
    ip = socket.gethostbyname(self.hostname)

    # we create a IPv4/IPv6 dual stack socket
    # for connection
    self.socket = socket.create_connection((ip, self.port), timeout=5)

Next we'll define our last function for sending and receiving request and response respectively.
This function will return the response from the server.


  


# function for sending a HTTP get request
def perform_request(self):

    # first we connect to the server
    self.__connect__()

    # we contruct our request header
    request_headers = \
        f"GET /{self.path}{self.query} HTTP/1.1\r\n" \
        f"Host: {self.hostname}:{self.port}\r\n" \
        "User-Agent: Python http/0.1\r\n" \
        "Connection: close\r\n" \
        "\r\n"

    # send the request headers
    self.socket.send(request_headers.encode())

    self.response = ''
    # receive response
    while True:
        data = self.socket.recv(self.buffer_size)

        if len(data) == 0:
            # finished receiving data
            break
        self.response = self.response + data.decode()

    # return the response
    return self.response

Full Source Code


  


import socket

class HTTPClient:

    def __init__(self, url):
        self.url = url
        self.buffer_size = 4096
        self.__parse_url__()


    # function for parsing the URL
    def __parse_url__(self):

        # Get the protocol,e.g http, https etc
        # If no protocol specified we raised an exception
        url = self.url
        try:
            self.protocol, url = url.split('://')
        except ValueError:
            raise Exception('Invalid URL')

        # we make sure that the protocol is http
        # otherwise we raise an exception ssaying protocol is not supported
        if self.protocol != 'http':
            raise Exception(f'Protocol not supported: {self.protocol}')

        # Get the hostname and port, example www.google.com:80
        hostname_port = url.split('/')[0]

        # Python will raises a ValueError if there's no enough
        # values to unpack, which means there's no port number
        # so we'll just default to 80

        try:
            self.hostname, self.port = hostname_port.split(':')
        except ValueError:
            self.hostname = hostname_port
            self.port = 80

        # get the path
        try:
            self.path = url.split('/')[1]
            self.path = '/' + self.path
        except IndexError:
            self.path = '/'

        # remove any query string from our path
        if '?q=' in self.path:
            self.path = self.path.split('?q=')[0]

        # get the query string (if available)
        try:
            self.query = url.split('?q=')[1]
        except IndexError:
            self.query = ''

        # remove the '#' from the query string if it contains any
        if '#' in self.query:
            self.query = self.query.split('#')[0]


    # function for connecting to the server
    def __connect__(self):

        # resolve hostname to ip
        ip = socket.gethostbyname(self.hostname)

        # we create a IPv4, IPv6 dual stack socket
        # for connection
        self.socket = socket.create_connection((ip, self.port), timeout=5)


    # function for sending a HTTP get request
    def perform_request(self):

        # first we connect to the server
        self.__connect__()

        # we contruct our request header
        request_headers = \
            f"GET /{self.path}{self.query} HTTP/1.1\r\n" \
            f"Host: {self.hostname}:{self.port}\r\n" \
            "User-Agent: Python http/0.1\r\n" \
            "Connection: close\r\n" \
            "\r\n"

        # send the request headers
        self.socket.send(request_headers.encode())

        self.response = ''
        # receive response
        while True:
            data = self.socket.recv(self.buffer_size)

            if len(data) == 0:
                # finished receiving data
                break
            self.response = self.response + data.decode()

        # return the response
        return self.response

We can use our class to parse URL as a bonus:

  


http_client = HTTPClient("http://www.google.com/search?q=python+socket")

print(f"protocol: {http_client.protocol}")
print(f"hostname: {http_client.hostname}")
print(f"port: {http_client.port}")
print(f"path: {http_client.path}")
print(f"query: {http_client.query}")
print(f"url: {http_client.url}")

Here is what I got after running the code:

URL break-down

It's finally time to make some request!


  


http_client = HTTPClient("http://www.example.com/")
response = http_client.perform_request()
headers_body = response.split("\r\n\r\n")
print(f"Response headers:\n{headers_body[0]}")
print(f"Response body:\n{headers_body[1]}")

After performing the request, This is what I get.


HTTP response from http://www.example.com

Share