Copyright vs privacy

I have recently realised that a system that enforces copyright at all costs is simply not technologically compatible with respecting people’s privacy. Let me explain.

I have been in the process of completely redesigning PeARS, my prototype for a distributed search engine. I have streamlined the algorithm and made it so that the search process entirely happens on the user’s local machine. This is important to me because it means that my search queries – the things that right now are salient in my life – will not be stored by a third party. Just to be clear: when I type ‘cute cats’ in a centralised search engine such as Google, my query gets sent to that company for processing. My words, ‘cute cats’, travel to some server somewhere in the world which then does some number-crunching over a huge, privately stored index, and returns the list of potentially relevant pages to my browser. That is, I have advertised to that server that I am currently in the mood for cute cats. This may not seem so bad, but if you consider the number and range of queries you send every day to your favourite search engine, you may start realising the scale of what they know about you, from your holiday plans to your medical worries and the state of your relationship.

The reason we use search engines is that they provide us with an efficient way to search the 45 billion web pages available out there. They do so by building ‘indexes’, i.e. compact representations of the words that appear on each document on the Internet. A typical type of index is the so-called ‘positional index’ which, for any given word, records which documents contain that word and where exactly it appears within the relevant pages. Here is an example with two mini-documents containing just one sentence:

Document 1: The dog sleeps.
Document 2: Dogs are great.

It is possible to build a positional index for those two documents, which would look like this:

the:1[1]
dog:1[2];2[1]
sleep:1[3]
be:2[2]
great:2[3]

The representation above tells us that ‘the’ appears in document 1 in position 1, that ‘dog’ appears in document 1 in position 2 and in document 2 in position 1, etc. Note that it is very easy to reconstruct the original documents from the index: we just have to retrieve the words that appear in a given document (for instance, ‘the’, ‘dog’ and ‘sleep’ for document 1) and check their respective positions on the page. The reconstruction is exact up to word inflection, since this particular index stores base forms such as ‘sleep’ and ‘be’. So whenever we build a positional index for web pages that are not under a suitable license (like Creative Commons), it is unclear whether we can legally share that index with the world. This is not a problem for large search engines, which are keen anyway to receive your query and process it privately on their servers, but it becomes a problem if you want to run a search algorithm on machine A using an index on machine B.
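To make the reconstruction point concrete, here is a minimal Python sketch that builds the positional index for the two mini-documents and then recovers each document from the index alone. The function names are mine, and the documents are assumed to be already tokenised and reduced to base forms, as in the example index. It is an illustration of the idea, not the code of any real search engine.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each token to {doc_id: [positions]}, using 1-based positions."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, tok in enumerate(tokens, start=1):
            index[tok].setdefault(doc_id, []).append(pos)
    return dict(index)


def reconstruct(index, doc_id):
    """Rebuild a document's token sequence using nothing but the index."""
    placed = []
    for tok, postings in index.items():
        for pos in postings.get(doc_id, []):
            placed.append((pos, tok))
    # Sorting by position puts the words back in their original order.
    return [tok for pos, tok in sorted(placed)]


# Assumption: input is pre-tokenised and lemmatised, matching the example
# index above ('sleep' and 'be' rather than 'sleeps' and 'are').
docs = {
    1: ["the", "dog", "sleep"],
    2: ["dog", "be", "great"],
}

index = build_positional_index(docs)
doc1 = reconstruct(index, 1)
doc2 = reconstruct(index, 2)
print(doc1)  # ['the', 'dog', 'sleep']
print(doc2)  # ['dog', 'be', 'great']
```

Anyone holding the index can run the second function, which is exactly why sharing a positional index amounts to sharing the text.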

The goal of PeARS is to allow people to search the web locally and keep their queries on their own computer, using a distributed network of indexes stored on other people’s websites. Say I am looking for cute cats: I need to access a record of which web pages mention the words ‘cute cats’, or ‘sweet felines’, or whatever. I rely on the fact that someone, somewhere – let’s say Jean in Shrewsbury, England – has indexed lots of cute cat pages and is willing to share that information with me. The problem is, Jean cannot make her index public because, in some cases, this might violate copyright. To get around this issue, PeARS gives up on indexing the document’s words and instead uses compressed meaning representations, which are strings of numbers. Like this:

document 1:190.826601 39.598095 39.237190 111.530819 86.312570 78.477238 59.353659 70.806475 49.927581 […]

document 2:75.154387 19.989724 35.188764 33.998277 65.646423 18.362618 32.714728 57.627935 25.953652 […]

There is no way to reconstruct the original document from such strings, which ensures that the author’s copyright is not infringed. So far so good.
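For readers wondering how one can search over strings of numbers at all, here is a rough Python sketch. It assumes that the query is turned into a vector of the same kind on the user's own machine and that documents are ranked by cosine similarity. Both are assumptions made for illustration, not a description of what PeARS actually does, and the URLs and three-number vectors are invented toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_vec, doc_vecs):
    """Return document identifiers, most similar to the query first."""
    scored = [(cosine(query_vec, vec), url) for url, vec in doc_vecs.items()]
    return [url for score, url in sorted(scored, reverse=True)]


# Hypothetical shared index: what Jean would publish, numbers only.
shared_index = {
    "example.org/kittens": [0.9, 0.1, 0.0],
    "example.org/tax-forms": [0.0, 0.2, 0.9],
}

# Hypothetical query vector, computed locally so the query never leaves
# the user's machine.
query_vec = [1.0, 0.0, 0.1]

ranking = rank(query_vec, shared_index)
print(ranking)
```

The point of the design is that only the numeric index travels over the network; the words of the query and the comparison itself stay local.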

Happy enough with my current prototype, I was at the stage of adding niceties to the search results page when I realised I had encountered a major problem. Without access to the actual words in the documents, it is impossible to provide the search engine user with page snippets to complement the results. So what was I to do?

Well, nothing. I could of course retrieve the top 20 pages returned by my search on the fly, quickly build an index for them and show the relevant snippets, but that would really be a waste of resources, not to mention the unacceptable processing time. For now, I have given up on snippets and I am displaying instead some ‘word clouds’ that give an impression of what is on the page, without having to store the whole text in a public location. But I am playing around with the idea of having a positive discrimination feature for pages with a decent license (that is, a Creative Commons license or equivalent). When PeARS does its indexing job, say on Jean’s computer, it can check whether the document it’s indexing can be redistributed, and if so, produce a word index for it. If not, some basic word cloud or something similar will have to suffice. People like to make a good appearance on search engine results. I wonder whether the positive discrimination feature might encourage them to publish their documents under a friendly license.
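The license check could look something like the following Python sketch. Everything in it is hypothetical: the license labels, the tiny stopword list, the crude tokeniser and the function names are mine, and a word cloud is reduced here to the most frequent non-stopwords on the page.

```python
from collections import Counter

# Hypothetical labels for licenses that allow redistribution.
REDISTRIBUTABLE = {"cc-by", "cc-by-sa", "cc0"}
# Tiny illustrative stopword list, not a real one.
STOPWORDS = {"the", "a", "are", "is", "and", "of"}


def tokenise(text):
    """Very rough tokeniser: split on whitespace, strip punctuation."""
    words = (w.strip(".,!?;:").lower() for w in text.split())
    return [w for w in words if w]


def index_document(text, license_label, cloud_size=3):
    """Full word index for friendly licenses, a word cloud otherwise."""
    tokens = tokenise(text)
    if license_label in REDISTRIBUTABLE:
        positions = {}
        for pos, tok in enumerate(tokens, start=1):
            positions.setdefault(tok, []).append(pos)
        return {"kind": "word_index", "positions": positions}
    # Not redistributable: keep only a few frequent words, no positions.
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    top = [word for word, _ in counts.most_common(cloud_size)]
    return {"kind": "word_cloud", "words": top}


open_entry = index_document("Dogs are great.", "cc-by")
closed_entry = index_document(
    "The dog sleeps. The dog dreams.", "all-rights-reserved", cloud_size=1
)
print(open_entry)
print(closed_entry)
```

Only the first kind of entry keeps enough information to show snippets, which is where the incentive to choose a friendly license would come from.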

Regardless of the final solution, I think this illustrates the tension between copyright and privacy. The point is that, by withholding the rights on some information, we force people to publicise their need for such information in a way that I think is not compatible with basic privacy rights. In real life, when I go to the library, I don’t need to stand in front of the building and shout that I am looking for a book about, say, curing migraines. I am not even forced to talk to the librarian. I just go to the relevant shelf and help myself. Why should it be different online?
