How small is the World Wide Web, really?

We are constantly told that the WWW is huge, and that we must rely on large corporations to navigate its immensity. But is this true?

Pollen grains

As people will tell you, the World Wide Web is big. The so-called ‘indexed’ Web, i.e. the documents you can access through a search engine, comes to around 45 billion pages (we are talking here about ‘informative’ pages, not repeats of other pages, links to blank calendar pages and so on, which currently come to a dazzling 60 trillion pages). At an average of 1.6MB per document, this means we would need 72 billion MB of storage space, or 68,665TB, to store a backup of the indexed Web.
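The arithmetic can be checked in a few lines. This is only a sketch using the figures quoted above; it assumes binary units (1TB = 1024 x 1024 MB), which is the convention that reproduces the 68,665TB figure, and the function name is made up for illustration.

```python
# Figures taken from the text; binary units are an assumption.
PAGES = 45_000_000_000   # pages in the indexed Web
AVG_PAGE_MB = 1.6        # average document size in MB
MB_PER_TB = 1024 * 1024  # assumed: 1 TB = 1024 * 1024 MB

def backup_size_tb(pages, avg_mb):
    """Storage, in TB, needed to hold `pages` documents of `avg_mb` MB each."""
    return pages * avg_mb / MB_PER_TB

size_tb = backup_size_tb(PAGES, AVG_PAGE_MB)
print(f"{size_tb:,.0f} TB")  # 68,665 TB
```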

Let’s now assume that the lovely English town of Shrewsbury (population: 71,715) decided to provide every single one of its inhabitants with a 1TB hard drive and asked each of them to store a fraction of the Web. This would nicely solve our backup problem, supplying us with 71,715TB – more than we need. Shrewsbury would stand ready to repair any loss of data on the Web. The task now actually sounds manageable.

Backing up the Web might not be our most urgent problem, of course, so perhaps we should think about another application: let’s take Internet search. Search needs a queryable representation of all the web pages out there (a so-called ‘index’ which is, at the most basic level, a list of words appearing in those pages). The index for a page is typically smaller than the actual page: the rather sophisticated – and therefore storage-intensive – representation I am currently using in my work comes to an average of 440K per document. So the index for 45 billion web pages would take up 45,000,000,000 * 440K = 18,440TB. That’s a job for the pretty village of Mansfield Woodhouse in Nottinghamshire (population: 18,574), again at 1TB per inhabitant.
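The same kind of check works for the index. Again a sketch with the numbers from the text, assuming binary units (1TB = 1024^3 KB) and one 1TB drive per inhabitant; the variable names are illustrative only.

```python
# Figures taken from the text; binary units are an assumption.
PAGES = 45_000_000_000     # pages in the indexed Web
INDEX_KB_PER_PAGE = 440    # average index size per document, in KB
KB_PER_TB = 1024 ** 3      # assumed: 1 TB = 1024^3 KB
VILLAGE_POPULATION = 18_574  # Mansfield Woodhouse, one 1TB drive each

index_tb = PAGES * INDEX_KB_PER_PAGE / KB_PER_TB
spare_tb = VILLAGE_POPULATION - index_tb

print(f"index: {index_tb:,.0f} TB")  # index: 18,440 TB
print(f"spare: {spare_tb:,.0f} TB")  # spare: 134 TB
```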

So the Web is perhaps not as big as we thought. Why not take it into our own hands? The advantages of transferring Web indexing and search to individuals are immense. Privacy, of course, is a big point. Doing search yourself, or in collaboration with other individuals, means fewer opportunities for a large organisation to track your searches and record your IP, location, etc. It doesn’t mean that your Internet traffic becomes invisible – accessing a web page does mean sending data over a cable that can be spied upon – but it makes a huge difference whether an observer also knows the search terms that led you to a particular page. For instance, whether you are visiting the site of a green tea producer after typing ‘Japanese teas’ or ‘cancer prevention’ does make a difference to the information you make available about yourself.

The other main advantage of being in control of your Web searches is the quality of results. Large search engines rely on advertising revenue to grow and stay profitable. They may have political agendas. They may also, simply, have a notion of an ideal search algorithm which doesn’t agree with yours. All this affects your daily travels on the Internet. You are often forced onto a motorway when you would rather have gone off the beaten track.

Of course, we may be able to store the Web but that doesn’t mean that the job of indexing and searching it is either fast or easy. So one first question to ask is whether we – as unique individuals with specific histories, cultures and interests – really need the whole of it. To take a simple example, I will probably not start browsing Mandarin websites until I have gone on a Chinese course. For various other reasons, I am also unlikely to visit scuba-diving and knitting sites this year. A quick glance at my browser history for the last three years shows that I have visited around 32,000 unique web pages. Around 6,000 of those are results from search engines, so we’re really talking 26,000 documents, or fewer than 9,000 a year. Let’s be generous and give me an allocation of 10,000 documents a year; that’s 0.0000222% of the indexed Web, or 4.2GB worth of index. So I could have my whole searchable index for the year on my laptop’s hard drive, with still plenty of space for my holiday pictures, and be entirely self-sufficient.
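For the record, here is the personal allocation worked out explicitly. It is a sketch using only the figures above (45 billion indexed pages, 440K of index per page, 10,000 pages a year) and assumes binary units (1GB = 1024 x 1024 KB). As a percentage, the share comes out at about 0.0000222%.

```python
# Figures taken from the text; binary units are an assumption.
INDEXED_PAGES = 45_000_000_000
INDEX_KB_PER_PAGE = 440
MY_PAGES_PER_YEAR = 10_000

# Share of the indexed Web, expressed as a percentage.
share_percent = MY_PAGES_PER_YEAR / INDEXED_PAGES * 100
# Size of a personal index, in GB (1 GB = 1024 * 1024 KB).
index_gb = MY_PAGES_PER_YEAR * INDEX_KB_PER_PAGE / 1024 ** 2

print(f"{share_percent:.7f}% of the indexed Web")  # 0.0000222% of the indexed Web
print(f"{index_gb:.1f} GB of index")               # 4.2 GB of index
```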

The problem, of course, is that I cannot exactly predict which pages I will want to visit tomorrow or next week. So it is no good thinking I’ll do a short indexing job on January 1st and thereby provide for all my indexing needs in the next twelve months. I might of course be clever and attempt to guess which domains I will visit. My browsing history shows that 70% of the web documents I visited come from a mere 400 individual domains, with Wikipedia and Stack Overflow high up on the list. Indexing those would keep me going for a little while. But still. What if I suddenly develop an unexpected passion for parrots – or scuba-diving, for that matter – and have no relevant page in my index?
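A figure like ‘70% of documents from 400 domains’ is easy to compute from an exported browsing history. The sketch below is not the script behind the numbers above; the function name and the four-entry history are made up for illustration, and it assumes the history is simply a list of URLs.

```python
from collections import Counter
from urllib.parse import urlparse

def domain_coverage(urls, top_n):
    """Fraction of visited URLs that belong to the top_n most visited domains."""
    domains = Counter(urlparse(u).netloc.lower() for u in urls)
    if not domains:
        return 0.0
    covered = sum(count for _, count in domains.most_common(top_n))
    return covered / sum(domains.values())

# Toy history, invented for the example.
history = [
    "https://en.wikipedia.org/wiki/Shrewsbury",
    "https://en.wikipedia.org/wiki/Parrot",
    "https://stackoverflow.com/questions/1",
    "https://example.org/green-tea",
]
print(domain_coverage(history, top_n=1))  # 0.5
print(domain_coverage(history, top_n=2))  # 0.75
```

Running it over a real history with top_n=400 would give the kind of coverage figure quoted above.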

This is where a collective effort becomes indispensable. Let’s now imagine a world where browsing the Web involves a background indexing task. That is, when I visit the Wikipedia article on Shrewsbury, my computer produces an index of that page and stores it on my hard drive. (I do exactly this, by the way: on my 4GB Ubuntu VirtualBox, it takes around 9s to index a simple representation of each document and 1 minute to get a complex linguistic analysis. I never notice it.) Let’s also say that some friendly parrot experts out there have been indexing pages about parrots for a while and are ready to share their index with me. It would only take a clever search algorithm to find those people and use their data. This is the idea behind the PeARS project.
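To make the idea concrete, here is a toy version of ‘index what you browse, then borrow what others have indexed’. It is emphatically not how PeARS works, and far simpler than the representation mentioned earlier: it assumes the most basic kind of index (a map from words to the pages containing them), and every function name and the example.org URL are invented for the sketch.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase a text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index_page(index, url, text):
    """Record, for every word on the page, that it appears at `url`."""
    for word in set(tokenize(text)):
        index[word].add(url)

def merge(index, shared):
    """Fold an index shared by someone else into our own."""
    for word, urls in shared.items():
        index[word] |= set(urls)

def search(index, query):
    """Return the pages that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))

# My own browsing fills my index in the background...
my_index = defaultdict(set)
index_page(my_index, "https://en.wikipedia.org/wiki/Shrewsbury",
           "Shrewsbury is a market town in Shropshire")

# ...and a parrot expert shares theirs (made-up data).
parrot_index = {"parrots": {"https://example.org/parrots"},
                "care": {"https://example.org/parrots"}}
merge(my_index, parrot_index)

print(search(my_index, "market town"))
print(search(my_index, "parrots care"))
```

The hard part, finding the right people to borrow from, is exactly what this sketch leaves out.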

If we only index what we browse, of course, we may never have a full picture of the searchable content of the Internet. I have also consciously ignored the fact that the Web is dynamic: new pages appear constantly, while others change. The general point I wish to make, however, is that it wouldn’t take that many individuals to achieve a fair coverage of our daily searches and become a little more self-sufficient. The World Wide Web – and even more so, my World Wide Web – is really quite small. We should act on this realisation.