QueerCoded/how-do-websites-on-the-internet-work

Introduction to web technology

forest c67ea80bfa Explain how its different in chrome		2025-04-19 23:18:54 +00:00
How do Websites on the Internet Work?

There is a lot happening when you visit a website. Technologists split this complex process up into different "Layers of Abstraction" in order to understand it.

Typically the digital machinery of one layer enables the next layer, and so on, and they're stacked up like Legos to form a whole cohesive tool or experience like a website.

This document is intended to be sort of like a textbook: You can go through the "exercises" and gain exposure to many different concepts, but you can also return here and search for things as a quick reference.

The CSS (Cascading Style Sheets) and JavaScript of the website
- CSS: Color, fonts, layout, animation
- JavaScript: Interaction, buttons, transformation, live chats, progress bars, graphs
The HTML (HyperText Markup Language) of the website
- Structure, content, text, images, accessibility features, forms
The Web Browser
- Program which runs on the web-surfer's computer or phone 🏄
- Provides a platform for HTML, CSS, and JavaScript to run on
HTTP (HyperText Transfer Protocol)
- World Wide Web protocol for communication between web browsers and web servers
- The two lifelong friends, always seen together: Request and Response
- Transfers files: HTML, CSS, and JavaScript, or any other file.
  - HTML, CSS, JavaScript, and all other "source code" files are just plain text files, like .txt.
TLS (Transport Layer Security)
- Used to be called SSL (Secure Sockets Layer)
- Encryption. Created to facilitate buying things online, now almost ubiquitous.
- HTTP running on top of TLS is referred to as HTTPS🔒
DNS (Domain Name System)
TCP (Transmission Control Protocol)
IP/NAT (Internet Protocol and Network Address Translation)
- IP defines networks and gives each computer an IP address on one or more networks.
  - IPv4 Addresses look like this: 12.34.56.78
  - IPv6 Addresses look like this: 2600:1406:bc00:53::b81e:94ce
- There is only one Public network: The Internet.
- Unlimited number of private networks, or LANs (Local Area Network)
- The vast majority of computers we interact with are on LANs. They can only talk to the Internet through a NAT, which is usually built into a Router.
  - NATs are like one-way valves: You can talk to the internet, and the internet can respond. But someone on the internet can't directly connect to your computer on a LAN ^{footnote 1}
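
To make the "one-way valve" idea a bit more concrete, here is a toy sketch of the bookkeeping a NAT does. Everything in it is made up for illustration: `createNat`, `outbound`, `inbound`, the LAN address 192.168.1.20, and the starting port 40000 are all assumptions, and a real NAT tracks more details than this (such as the protocol and the destination).

```javascript
// A toy NAT. All names and numbers here are hypothetical.
function createNat(publicIp) {
  const table = new Map(); // public port -> { lanIp, lanPort }
  let nextPort = 40000;

  return {
    // A computer on the LAN sends something out: remember who sent it,
    // and replace its private address with the NAT's public one.
    outbound(lanIp, lanPort) {
      const publicPort = nextPort++;
      table.set(publicPort, { lanIp, lanPort });
      return { sourceIp: publicIp, sourcePort: publicPort };
    },
    // Something arrives from the internet: deliver it only if it matches
    // a connection that a LAN computer opened earlier.
    inbound(publicPort) {
      return table.get(publicPort) ?? null;
    },
  };
}

const nat = createNat('12.34.56.78');
const sent = nat.outbound('192.168.1.20', 51000);

console.log(sent);                         // the internet only sees the public address
console.log(nat.inbound(sent.sourcePort)); // the response finds its way back to the LAN
console.log(nat.inbound(9999));            // unsolicited traffic has nowhere to go: null
```

The last line is the valve: nobody on the LAN opened a connection through port 9999, so there is no entry in the table and nothing to forward the traffic to.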

But how can I see these things / interact with them?

ℹ️ The following guide uses screenshots taken from the Firefox web browser. Chrome and other browsers also have developer tools, and there may be some differences in your browser.

`HTML` (HyperText Markup Language) and `CSS` (Cascading Style Sheets)

Try right-clicking on this text and choosing Inspect Element or Inspect (Q). This should open the web browser developer tools!
If you don't see that option, look up "how to open developer tools in xyz browser"
You can modify all of this stuff!
- You can modify existing CSS rules or add new ones.
  - Try making the text red and bold by clicking on element { ... in the CSS area and adding:
    - color: red;
    - font-weight: bold;
- You can modify any of the HTML: Just right-click it and choose Edit as HTML
  - Try modifying the text inside the <li> tag (List Item tag).

`JavaScript`

Once you have the web browser developer tools open, you can:
- Try running some JavaScript on the current page using the Console.
  - Navigate to the Console Tab and type in alert("Hello!"), then press enter.
  - It should pop up an alert box on the page!
  - This kind of console is often called a REPL (Read-Evaluate-Print Loop), pronounced "reh pl".
    - If you make a mistake and the code contains invalid syntax or throws an error when executed, the Console will display an error.
    - For example, if you run alert(Hello) without the quotation marks around "Hello", it is interpreted as a reference to a variable named Hello instead of a literal value of a string of characters. Thus the ReferenceError: Hello is not defined.
- You can also use a different tool, the Step-by-Step Debugger, to watch the page's JavaScript execute one line at a time.
  - I thought this post did a good job of explaining what it's like to use a debugger 😆
  - Navigate to the Debugger tab (in Chrome, the tab is called Sources)
  - In the hierarchy pane on the left, open up Main Thread -> git.cyberia.club -> QueerCoded/how-... -> main
  - You may have to refresh the page.
  - Scroll down to Line 22, right below the opening <script> tag.
  - Click on the line number to add a Breakpoint
  - Refresh the page again, and the debugger should have paused the rendering and execution of code on the page. The page should be all white, and if it's wide enough you will see this debugger widget:
  - In the Debugger tab of the developer tools, you should see the current execution state paused on your breakpoint.
  - This code doesn't do anything exciting, it's just setting up error handlers and configuration for other JavaScript which will load later, but we can still step through it one line at a time and examine things as we go.
  - To use the Debugger and the Console at the same time, you'll need to show the split console (Under the ... menu in the top right):
  - Try running window.config in the Console. The resulting value that gets printed should be undefined.
  - Now press the Step Over button a few times, until the line which sets window.config has been executed.
  - If you run window.config again, now it should show you that there is an object there.
  - Don't forget to scroll back up and click on the breakpoint on line 22 again to clear it so it won't trigger every time you visit this page in the future.
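
If you want to poke at the ReferenceError from the Console steps above without clicking through alert boxes, here is a small sketch you can paste into the Console. `say` and `tryToRun` are hypothetical helpers invented for this example; `say` stands in for `alert` so that the snippet also runs outside of a browser.

```javascript
// say is a made-up stand-in for alert: it just hands the message back.
function say(message) {
  return message;
}

// Run a snippet and report either its result or the error it threw.
function tryToRun(snippet) {
  try {
    return snippet();
  } catch (error) {
    return error.name + ': ' + error.message;
  }
}

console.log(tryToRun(() => say("Hello"))); // a string literal: works fine
console.log(tryToRun(() => say(Hello)));   // no quotes: Hello is looked up as a variable
```

The second line should report a ReferenceError. The exact wording of the message after the error name can differ a little between browsers.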

Combining `Inspect Element` and the `Console`

There is a neat trick built into web browsers' developer tools which makes it easy to use JavaScript on any element you can see on the page.
The last element you inspected with right-click -> Inspect Element is stored in a special variable called $0.
So you can use JavaScript to modify elements as well: Just run something like $0.style.color = 'blue'; in the Console.
- Or, if horror is more your thing, you can make your HTML scream endlessly with: setInterval(() => $0.textContent += 'A', 10);
- The developer console may require you to type 'allow pasting' before you can paste these code snippets in.
  - This is a small warning that browser developers put there to help prevent scammers from manipulating people into running malicious code via the developer tools.

`HTTP` (HyperText Transfer Protocol)

Basic anatomy of an HTTP Request/Response pair:
- Request
  - Request Method: commonly GET, POST, PUT, or DELETE
  - Request Path: /files/index.html
  - Headers. For example:
    - Host: example.com
      - Host is the only required header for all requests.
    - Authorization: Basic dXNlcm5hbWU6cGFzc3dvcmQ=
  - Body
    - GET and DELETE requests are not supposed to have request bodies.
    - Request bodies are optional for POST and PUT requests.
    - Usually used for uploading files or other information, like when you submit a form or send a message.
    - A request body can be any data, it's essentially a file.
- Response
  - Status Code. Common examples:
    - 200 (Ok / Success)
    - 302 (Redirect)
    - 404 (File not Found)
    - 500 (Internal Server Error)
  - Headers. For example:
    - Content-Type: text/plain
      - This header would instruct the browser to render the response as if it was a plain text file.
    - Location: https://example.com/
      - This header would be used in conjunction with a 302 Redirect status to tell the browser to redirect to a different URL.
  - Body
    - Almost all responses have a body, although technically the body is still optional.
    - Again, the body is essentially just a file, the contents of a file. It could be anything.
ℹ️ NOTE: The total size of the HTTP Request and Response Headers, including the request path etc, typically must fit within a limited size chosen by the server (often a few kilobytes). The protocol itself puts no limit on the size of the request and response body, though individual servers may.
- Usually a Content-Length header is provided on any request or response which has a body, which informs the client/server about how many bytes of data to expect.
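
Here is a rough sketch of what that anatomy looks like as text, assuming the older HTTP/1.1 format (HTTP/2 carries the same pieces in a binary encoding instead). `buildRequest` and `parseResponse` are hypothetical helpers written for this example, not part of any library, and they skip the error handling a real implementation would need.

```javascript
// Build the text of an HTTP/1.1 request. Every line ends with \r\n, and an
// empty line separates the headers from the (optional) body.
function buildRequest(method, path, headers, body = '') {
  const headerLines = Object.entries(headers).map(([name, value]) => `${name}: ${value}`);
  return [`${method} ${path} HTTP/1.1`, ...headerLines, '', body].join('\r\n');
}

// Split the text of an HTTP/1.1 response into status code, headers, and body.
function parseResponse(text) {
  const separator = text.indexOf('\r\n\r\n');
  const head = text.slice(0, separator).split('\r\n');
  const body = text.slice(separator + 4);
  const status = Number(head[0].split(' ')[1]);
  const headers = {};
  for (const line of head.slice(1)) {
    const colon = line.indexOf(':');
    // Header names are case-insensitive, so store them in lowercase.
    headers[line.slice(0, colon).trim().toLowerCase()] = line.slice(colon + 1).trim();
  }
  return { status, headers, body };
}

const request = buildRequest('GET', '/files/index.html', { Host: 'example.com' });
console.log(JSON.stringify(request)); // JSON.stringify makes the \r\n line endings visible

const response = parseResponse(
  'HTTP/1.1 302 Found\r\nLocation: https://example.com/\r\nContent-Length: 0\r\n\r\n'
);
console.log(response.status, response.headers.location); // 302 https://example.com/
```
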
Once you have the web browser developer tools open, you can see HTTP Requests and Responses in the Network tab.
The network tab only records and displays HTTP traffic that happens after it was opened. So you probably want to refresh the page after opening it.
It should show a list of requests like this:
Clicking on one of them will open up a panel on the right which shows details about the request.
- The request headers and response headers are both shown on this panel.
- ℹ️ NOTE: the Authorization and Cookie headers may contain sensitive information -- these are where your login session is stored!
- I cut off the following screenshot right before the part where it displayed the value of the i_like_gitea cookie which contains my login session 😉
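
To see why these headers are sensitive, here is a small sketch of how an Authorization: Basic header is put together. The value is only base64 *encoded*, not encrypted, so anyone who can see the header can read the password. `encodeBasicAuth` and `decodeBasicAuth` are hypothetical helpers for this example; `btoa` and `atob` are built into browsers and recent versions of Node.

```javascript
// Join the username and password with a colon, then base64-encode the result.
function encodeBasicAuth(username, password) {
  return 'Basic ' + btoa(username + ':' + password);
}

// Reverse the process: no key or secret is needed to do this.
function decodeBasicAuth(headerValue) {
  const decoded = atob(headerValue.replace(/^Basic /, ''));
  const colon = decoded.indexOf(':');
  return { username: decoded.slice(0, colon), password: decoded.slice(colon + 1) };
}

const header = encodeBasicAuth('username', 'password');
console.log(header); // Basic dXNlcm5hbWU6cGFzc3dvcmQ=
console.log(decodeBasicAuth(header));
```

This is one reason why it matters that HTTP runs inside TLS: the encryption layer is what keeps these headers private while they travel across the network.
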
There is a lot of useful information on this Headers tab of the request/response details pop-out pane.
- For Requests: Login / Auth info like Authorization and Cookie
- Date and Time of the Request/Response.
- Metadata, like Content-Type, Content-Length, etc.
The Request Body and the Response Body each get their own tab:

`TLS` (Transport Layer Security)

Click on the Lock icon in the URL bar of your web browser. (The way this works is different for every browser)
In Firefox, it's 🔒 -> Connection Secure -> More Information -> View Certificate
- The three tabs you see here represent the cryptographic chain of trust that goes from our git.cyberia.club certificate all the way to the ISRG Root X1 certificate, which comes pre-installed in the trusted certificate stores of most operating systems and web browsers.
When connecting to a server, a TLS client validates the following:
- That the server's certificate was signed by a trusted CA root certificate.
- That the server's certificate has the domain name that the user requested.
- That the server's certificate isn't expired.
There is a cool website called badssl.com which maintains examples of the failure cases for each of these validations!
- A self-signed certificate: https://self-signed.badssl.com/
- A certificate with the wrong domain name: https://wrong.host.badssl.com/
- An expired certificate: https://expired.badssl.com/
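
The three validations can be sketched as plain checks on a simplified certificate. This is a loose illustration, not how a real TLS client works: `validateCertificate` and the shape of the `certificate` object are assumptions made up for this example, and a real client verifies cryptographic signatures along the whole chain rather than comparing names. The sample values are borrowed from the cyberia.club certificate shown in the curl output later in this document.

```javascript
// certificate is a hypothetical, heavily simplified object.
function validateCertificate(certificate, requestedHost, trustedRoots, now) {
  const problems = [];
  if (!trustedRoots.includes(certificate.root)) {
    problems.push('not signed by a trusted root');
  }
  if (!certificate.hostnames.includes(requestedHost)) {
    problems.push('wrong domain name');
  }
  if (now < certificate.notBefore || now > certificate.notAfter) {
    problems.push('expired or not yet valid');
  }
  return problems;
}

const trustedRoots = ['ISRG Root X1'];
const certificate = {
  hostnames: ['cyberia.club'],
  root: 'ISRG Root X1',
  notBefore: new Date('2025-03-14T16:13:25Z'),
  notAfter: new Date('2025-06-12T16:13:24Z'),
};
const april = new Date('2025-04-08T00:00:00Z');
const july = new Date('2025-07-01T00:00:00Z');

// Everything checks out: an empty list of problems.
console.log(validateCertificate(certificate, 'cyberia.club', trustedRoots, april));
// Like wrong.host.badssl.com
console.log(validateCertificate(certificate, 'example.com', trustedRoots, april));
// Like expired.badssl.com
console.log(validateCertificate(certificate, 'cyberia.club', trustedRoots, july));
// Like self-signed.badssl.com
console.log(validateCertificate({ ...certificate, root: 'Myself' }, 'cyberia.club', trustedRoots, april));
```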

`DNS` (Domain Name System)

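DNS translates a domain name like example.com into the IP addresses of the computers that serve it. Open up your terminal and run nslookup example.com. The output shows which DNS server answered (here 8.8.8.8) and the addresses it returned.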

forest@debian:~$ nslookup example.com
Server:         8.8.8.8
Address:        8.8.8.8#53

Non-authoritative answer:
Name:   example.com
Address: 23.192.228.84
Name:   example.com
Address: 23.192.228.80

`TCP` (Transmission Control Protocol)

TCP connections have the following elements:
- Source IP Address:
  - Who's trying to connect?
- Source Port Number:
  - This is usually picked by the OS from a range of temporary (ephemeral) port numbers.
  - The source port is used by NATs, routers, and other computers to tell connections apart and route responses back to the correct one.
- Destination IP Address:
  - What computer are you trying to connect to?
- Destination Port Number:
  - What service on that computer are you trying to talk to?
  - Different ports are used for different services by convention. Here are some common ones:
    - 22 SSH (Secure Shell)
    - 53 DNS
    - 80 Plain Unencrypted HTTP
    - 443 HTTP inside TLS (https)
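
Here is a small sketch of how those four elements fit together. `destinationPort` and `connectionKey` are hypothetical helpers for this example, and the source addresses and port numbers in it are made up. The `URL` class is built into browsers and Node.

```javascript
// The conventional ports from the list above, keyed by URL scheme.
const DEFAULT_PORTS = { 'http:': 80, 'https:': 443 };

// If the address doesn't name a port, fall back to the convention.
function destinationPort(address) {
  const url = new URL(address);
  return url.port === '' ? DEFAULT_PORTS[url.protocol] : Number(url.port);
}

// The four elements together identify one connection.
function connectionKey(c) {
  return `${c.sourceIp}:${c.sourcePort} -> ${c.destinationIp}:${c.destinationPort}`;
}

console.log(destinationPort('http://localhost'));        // 80
console.log(destinationPort('https://cyberia.club'));    // 443
console.log(destinationPort('http://example.com:8080')); // 8080

// Two connections from the same computer to the same service differ only
// in their source port, which is how responses get matched to the right one.
const first = { sourceIp: '192.168.1.20', sourcePort: 51000, destinationIp: '69.61.2.178', destinationPort: 443 };
const second = { ...first, sourcePort: 51001 };
console.log(connectionKey(first));
console.log(connectionKey(second));
```
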
Open up your terminal and run curl -v localhost
- Since you probably aren't running a web server on your laptop (YET! 😉), curl won't be able to connect to anything.
- You will see a TCP error: Connection refused

forest@debian:~$ curl -v localhost
*   Trying 127.0.0.1:80...
* connect to 127.0.0.1 port 80 failed: Connection refused

There are multiple ways for TCP to fail. It can also time out (Connect Timeout, Read/Write Timeout, Idle Timeout, etc.)
- We can see a Connect Timeout by trying to connect to something which is blocked by a firewall.
- Try connecting to example.com on a different port like 8080. It should just hang while it tries to connect.
- I added the -4 flag to my command to force it to use IPv4 only, just because it makes the output cleaner and easier to see.
```
forest@debian:~$ curl -v -4 example.com:8080
*   Trying 23.192.228.84:8080...
```
- If you waited long enough, you would see: connect to 23.192.228.84 port 8080 failed: Connection timed out
- Press Ctrl-C to cancel the command and get your command prompt back.
Yet another TCP failure mode: You can try to connect to an address which neither your computer, nor your router, nor the internet knows how to get to.
This address might be on a different private network, and you can't get there from here.
In this case you will see a no route to host error:

forest@debian:~$ curl -v 10.69.4.20
*   Trying 10.69.4.20:80...
* connect to 10.69.4.20 port 80 failed: No route to host
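
If you end up writing code that opens TCP connections, these same three failures usually show up as short error names from the operating system (for example, Node.js puts them in the `code` property of the error). Here is a small lookup table that ties those names back to the curl messages above. `CONNECT_FAILURES` and `explainConnectFailure` are hypothetical names for this example; the error names themselves are the standard ones.

```javascript
// Operating system error names, matched to the curl messages shown above.
const CONNECT_FAILURES = {
  ECONNREFUSED: {
    curlMessage: 'Connection refused',
    meaning: 'The computer answered, but no service is listening on that port.',
  },
  ETIMEDOUT: {
    curlMessage: 'Connection timed out',
    meaning: 'Nothing answered at all, for example because a firewall dropped the packets.',
  },
  EHOSTUNREACH: {
    curlMessage: 'No route to host',
    meaning: 'The network could not find a path to that address.',
  },
};

function explainConnectFailure(code) {
  return CONNECT_FAILURES[code] ?? {
    curlMessage: code,
    meaning: 'Not one of the three failures covered here.',
  };
}

console.log(explainConnectFailure('ECONNREFUSED').meaning);
console.log(explainConnectFailure('EHOSTUNREACH').curlMessage); // No route to host
```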

Seeing it all together at once with `curl`

First, we see that the cyberia.club domain name resolves to 69.61.2.178.

forest@debian:~$ curl -v https://cyberia.club
*   Trying 69.61.2.178:443...

We see that TCP starts, and it succeeds in connecting to 69.61.2.178. It uses port 443 because that's the default port for HTTPS.

* Connected to cyberia.club (69.61.2.178) port 443 (#0)

Next, we see that TLS starts up inside the TCP connection, and the server's certificate is verified:

* ALPN: offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=cyberia.club
*  start date: Mar 14 16:13:25 2025 GMT
*  expire date: Jun 12 16:13:24 2025 GMT
*  subjectAltName: host "cyberia.club" matched cert's "cyberia.club"
*  issuer: C=US; O=Let's Encrypt; CN=E6
*  SSL certificate verify ok.

Finally, the HTTP Request is fired off (wrapped inside the TLS session, which is wrapped inside the TCP connection)
- We see the Method GET, the request path /, and three headers, the required Host header plus user-agent and accept.

> GET / HTTP/2
> Host: cyberia.club
> user-agent: curl/7.88.1
> accept: */*
>

And the server's HTTP Response comes back. First the HTTP Status Code (200, aka "OK"), and the Headers:

< HTTP/2 200 
< accept-ranges: bytes
< alt-svc: h3=":443"; ma=2592000
< content-type: text/html; charset=utf-8
< etag: "d8xhd0qredlh240"
< last-modified: Fri, 04 Apr 2025 01:57:55 GMT
< server: Caddy
< vary: Accept-Encoding
< content-length: 2736
< date: Tue, 08 Apr 2025 07:09:49 GMT
<

After the headers, we see the Response Body which contains our HTML.

<!doctype html>
<html lang="en">
<head>
        <title>Cyberia Computer Club</title>
        <meta charset="utf-8">
        <meta name="viewport" content="width=device-width,initial-scale=1.0">
        <meta name="Description" content="Cyberia Computer Club">
        <link rel="stylesheet" href="/cyberia.css">
        <link rel="icon" href="/favicon.ico">
</head>

<body>
        <nav>
                <a href="/">Cyberia</a>
                <a href="/culture">Culture</a>
                <a href="https://blog.cyberia.club/read">Blog</a>
                <a href="/calendar">Calendar</a>
                <a href="/donate">Donate</a>
                <a href="/matrix">Matrix (Chat)</a>
                <a href="/mumble">Mumble</a>
                <br>

                <a class="external" href="https://capsul.org">Capsul</a>
                <a class="external" href="https://nullhex.com">Nullhex</a>
                <a class="external" href="https://git.cyberia.club">Git</a>
                <a class="external" href="https://wiki.cyberia.club">Wiki</a>
                <a class="external" href="https://layerze.ro/">Layer Zero (Twin Cities HQ)</a>
        </nav>

        <main><h1>Cyberia Computer Club</h1>
<p>A kind and amazing hacker collective centered in Minnesota, with global friends.</p>
<pre aria-label="ASCII art of Iwakura Lain saying 'close the old world, open the next'">
          _..--------.._
      ,-''              `-.
    ,'                     `.
  ,                         \
  /                           \
/          '.                 \
'          /  ||               ;
;       n /|  |/         |     |
|      / v    /\/`-'vv\'.|\    ,
:    /v`,---         ----.^.   ;
'   |  /  .`,        ,  .`\|   ;
|  n|  '.__/         \ ___/|\  ;
` | `                      | \/|
\ \ \                     | /\/
'; `-\          `'       /|/ |'
  `    \       -          /|  |
  `    `.              .' |  |
    v,_   `;._     _.-;    |  /
      `'`\|-_`'-''__/^'^' | |
              \-v-/        | |
    cL0s3 th3 o1d w0rld   | /
        0p3n th3 n3xt     ||
                          ||
                          |,
</pre>
....

How can I make this stuff myself?

I haven't written this section yet.

Where can I learn more?

I've only just barely scratched the surface, but hopefully opened up enough questions that you can find room to explore.

In my opinion, MDN (Mozilla Developer Network) has the best documentation for these things:

I often just search Google for "MDN <insert name of thing here>" and almost never regret it 🙂

We use Let's Encrypt as our CA (Certificate Authority) for TLS. They have some nice docs:

https://letsencrypt.org/docs/

Wikipedia has a neat diagram which explains the fundamentals behind Diffie-Hellman key exchange, the public-key technique that TLS typically uses to agree on an encryption key.

https://en.wikipedia.org/wiki/Diffie%E2%80%93Hellman_key_exchange#General_overview

If you are curious about the intersection between the networking stuff like TCP, the Operating System, and application code (like Python, Go, Java, C, Rust, etc), you may be interested in a couple blog articles I wrote about it:

Footnotes

^{footnote 1} History of IPv4, IPv6, and NATs:
- It wasn't supposed to be this way.
- When the internet was originally created, it was planned to always be public: every device would get a public address.
- However, there are only about 4.3 billion (2^32) IPv4 addresses total.
  - A lot of them end up being unoccupied for various techno-social reasons, even though they are in short supply.
- IPv6 was created to try to remedy the situation. It has enough address space to give each grain of sand on earth a unique address.
  - However, IPv6 still hasn't been fully deployed, and much of the internet still depends on IPv4.
  - NATs were created and deployed instead because they were cheaper and easier, and the people deploying them didn't care about the idealistic vision of the internet: They just wanted it to work for consumers and make money right now.
^{footnote 2} Here is the full (overly detailed) diagram:
