The Centralized Web: Location-based addressing
Before we dig into how linking data works on the decentralized web, let's take a minute to examine how we're used to accessing data on the centralized web.
Location addressing with URLs
URLs (Uniform Resource Locators) are the addresses we give each other for data on the centralized web (the web most of us use every day). They make it possible for us to create links and connect data on the web, so that we can, for example, embed images and videos in a webpage. (The web would be pretty lousy without links!) However, URLs are based on the location where data is stored, not on the contents of the resource stored there. We call this location addressing, and it presents us with some problems.
Most of us have a lot of experience with URLs, and we make some assumptions about them based on our experience. When we see https://www.puppies.com/puppy.jpg, for example, we'll probably guess from the file name and extension that the data stored at that location is an image of a beagle (in JPEG format), but we can't verify this from the URL alone. There may very well be a photo of a baby hiding at puppy.jpg, or even worse, an adorable kitty!
Through the domain name (puppies.com), a URL indicates which authority we should go to for the data. Because links such as https://www.puppies.com/puppy.jpg are location-based, the data they reference is centralized on the authorities that host it (here, puppies.com). We make assumptions about these authorities (or domains), just as we do with file names. For example, we might assume that a file hosted at puppies.com is safer to open than one hosted at evilhacker.com, but we can't be sure of it.
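To make this concrete, here is a minimal sketch that pulls apart the example URL using Python's standard library. The helper name describe_location is hypothetical and exists only for illustration. Notice that everything we learn comes from the address itself: the authority, the path, and a file type guessed purely from the extension. Nothing here looks at the bytes actually stored at that location.

```python
import mimetypes
from posixpath import basename
from urllib.parse import urlsplit


def describe_location(url: str) -> dict:
    """Report what a URL tells us about where data lives (hypothetical helper)."""
    parts = urlsplit(url)
    # The type is guessed from the file extension only; the content is never read.
    guessed_type, _ = mimetypes.guess_type(parts.path)
    return {
        "scheme": parts.scheme,           # how to talk to the authority
        "authority": parts.hostname,      # who we must ask for the data
        "path": parts.path,               # where the authority keeps it
        "filename": basename(parts.path),
        "guessed_type": guessed_type,     # an assumption, not a verified fact
    }


info = describe_location("https://www.puppies.com/puppy.jpg")
for key, value in info.items():
    print(f"{key}: {value}")
```

Whatever the guessed type says, the host is free to serve a kitten photo (or anything else) at that path, and the URL would look exactly the same.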
Ultimately, the contents of a file (puppy.jpg) hosted on the centralized web have no direct relationship with its location-based address (on puppies.com). If we see a picture of an adorable puppy and are told it's stored on the web, there's no way for us to guess the URL that would lead us to the image. We cannot determine the domain or the filename just by looking at the puppy image.
Trust and efficiency on the centralized web
Since we can't verify what content resides at a particular URL and are dependent on central authorities (and human kindness) to label things as they really are, it's easy for us to get tricked by malicious actors.
It's also easy for thousands of people to store exactly the same photo of that adorable beagle, but all on different domains and with different filenames, leading to a lot of redundancy. Let's face it, even on our own laptops most of us have accidentally saved the same document as download.pdf and download(01).pdf without realizing it, or saved iterations of the same term paper over and over again with v1 or 2019-10-10 added to the title. The web is a confusing mess of data that's saved multiple times at different URLs, and there's no easy way to tell which items are identical to each other.
There must be a better way!
Now that we understand the problems with linking data by location (using URLs), let's see how content addressing, which uses hashes, can solve them.
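As a small preview, the sketch below shows how a hash of the content can reveal that two differently named copies are identical, something their URLs alone can't tell us. It is only an illustration under stated assumptions: SHA-256 is chosen here for convenience, the example.org and example.net URLs and the byte strings are made up, and real content-addressing systems may use different hash functions and encodings.

```python
import hashlib


def content_digest(data: bytes) -> str:
    """Return a SHA-256 hex digest that depends only on the bytes themselves."""
    return hashlib.sha256(data).hexdigest()


# Made-up stand-ins for file contents; real files would be read from disk or the network.
photo = b"pretend these bytes are a JPEG of a beagle"
hosted = {
    "https://www.puppies.com/puppy.jpg": photo,
    "https://example.org/images/dog-final-v2.jpg": photo,    # same bytes, different name
    "https://example.net/puppy.jpg": b"actually a kitten",  # same name, different bytes
}

# Group URLs by the digest of what they hold rather than by where they live.
by_digest = {}
for url, data in hosted.items():
    by_digest.setdefault(content_digest(data), []).append(url)

duplicates = [urls for urls in by_digest.values() if len(urls) > 1]
print(duplicates)
```

The two copies of the beagle photo end up in the same group even though their domains and filenames differ, while the file that reuses the name puppy.jpg for different bytes does not.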