The double footprint of Big Data
Privacy and sustainability, two of the Internet's greatest challenges
On 5 February 2021, the investigative journalists Charlie Warzel and Stuart A. Thompson published an article on The New York Times website in which they explained that someone had provided them with data on thousands of Donald Trump supporters who had staged a violent assault on the Capitol in Washington on 6 January (Warzel & Thompson, 2021). Although mobile phone numbers are not associated with an identity, the data had made it possible to trace many of these people back to their homes and even to find out their names, addresses, and social media accounts.
Two years earlier, the two authors had already received location data for more than 12 million US citizens, which allowed them to track their movements. But for the Capitol storming, each location was associated with a mobile advertising ID, which is unique to each user and linked to their smartphone or tablet. Simply cross-referencing this ID with other databases yielded an immense amount of information about each person.
Many people may find the possibility of using an ID to prosecute crime useful and even necessary. But, as the two journalists explain, this ID is used by many companies, institutions, and entities, including banks and investment funds, which can thus obtain a great deal of information about any citizen.
Our daily activity makes this possible. Almost everyone browses the Internet and makes searches, and a large majority of citizens make an online purchase at some point. But even if we never do any of this, we are all included in many databases, from municipal registers to sports or cultural organisations, from medical services to banks and tax records.
Moreover, many of us are also active on social networks, which generates a great deal of information. As Mat Travizano explained to Entrepreneur in September 2018 (Travizano, 2018), Facebook had enough personal data, on average, for 400,000 Word documents per user, and Google could fill about three million for each Internet user.
How did we get to this situation where these companies have so much information about us? What impact does it have on our privacy and security?
Our data, a business model
To understand this, we have to go back to the end of the 1990s. Many technology companies, the so-called dotcoms, had been created. Expectations for growth were high and probably over-optimistic. And venture capital took off, which caused a meteoric rise on the stock market for companies that, for the most part, did not even have a business model. They were gaining market share, but many of them provided free services and therefore did not generate profits. By the year 2000, the situation of most of these companies was very delicate, if not desperate, and they had much more debt and promise than real value.
And it was then that companies like Google saw a potential way to monetise their services: data. Although smartphones were not very widespread and the capacity of computers and algorithms to process large amounts of data was limited, the use of information as a raw material proved to be enough. Thus, Google made $19 million in 2000, but by 2001 it was making $86 million; in 2002, $440 million; in 2003, $1.5 billion; and in 2004, $3.2 billion. That is an increase of some 3,590 % in the three years from 2001 to 2004. In 2020 it made $146.92 billion in advertising alone.
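The growth figure cited above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the rounded revenue figures quoted in the text; the function name `pct_increase` is purely illustrative, and because the inputs are rounded the result differs slightly from the percentage cited.

```python
# Rounded Google revenue figures quoted in the text, in millions of US dollars.
revenue = {2000: 19, 2001: 86, 2002: 440, 2003: 1_500, 2004: 3_200}


def pct_increase(old: float, new: float) -> float:
    """Percentage increase from `old` to `new`."""
    return (new - old) / old * 100


from_2001 = pct_increase(revenue[2001], revenue[2004])
from_2000 = pct_increase(revenue[2000], revenue[2004])

print(f"2001 -> 2004: {from_2001:,.0f} %")
print(f"2000 -> 2004: {from_2000:,.0f} %")
```

With these rounded inputs, the rise from 2001 to 2004 comes out at roughly 3,600 %, in line with the figure cited; measured from the 2000 figure, the rise is far larger.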
«In 2018, Facebook had enough personal data, on average, for 400,000 Word documents per user»
The key was personalised advertising. With all the data it had accumulated, Google could find out what users were interested in. And if a company wanted to send ads, Google could target them anonymously so that, instead of indiscriminate campaigns, the ads would go to the groups most likely to pay attention to them. They could offer golf equipment to people who were interested in golf, trips to the Caribbean to people who were looking for information about Caribbean beaches, and campsites to people who had just bought a caravan.
At first glance, this does not seem like a bad thing. Many people thought, and still think, that companies that send them personalised advertising are doing them a favour, because they select the advertising that might be of interest to them. And if they still do not want it, they lose nothing. An inconvenience, at most.
The problem arises when the information is not only used to send personalised advertising, but also to decide whether the user is allowed to take out health or life insurance, whether they are creditworthy enough for a loan or a mortgage, or even whether they are worth hiring.
Data is so valuable that many companies are already targeting it as a goal, regardless of whether they make fridges or cars or rent flats or vehicles. In November 2018, Ford’s CEO announced the goal of monetising and selling the data they were collecting from the 100 million people who drive their vehicles (Sadowski, 2020, p. 31). Apart from their habits and routes, it might be possible to find out whether they were driving carefully enough or on safe enough roads, which is of great interest to insurance companies.
We have already talked about this on another occasion (Duran, 2018), but the possibilities of knowing things about us are growing immeasurably. Moreover, Google, Facebook, and Amazon, which control most of the global digital advertising market (it is estimated that in 2018 Google and Facebook together controlled 84 % of the global market, excluding China), know who we are, because we have to sign in to use their services.
According to 2018 data, Amazon had $6.72 billion in advertising revenue. The figure seems small compared to Facebook’s $38.37 billion and Google’s $83.68 billion. Then again, most of Amazon’s revenue should in theory come from selling products, not ads. Yet advertising gives it a higher margin than sales.
Dangerous inferences and photo identification
And that is without mentioning data leaks or sales, important as they are. In early April 2021, the personal data of more than 533 million Facebook users was leaked. Beyond that, with millions of data points and increasingly sophisticated algorithms, many more inferences can be made, which may or may not be accurate and might touch on more intimate aspects of our lives. This is what happened to Ángel Cuevas, professor at the Carlos III University in Madrid. He was at a work meeting in Barcelona and received a Facebook advertisement inviting him to connect with the gay community and book accommodation «for people like you».
Cuevas wondered how Facebook had come to conclude that he was gay when he had never provided any information about his sexual orientation; presumably the company had inferred this, correctly or otherwise, from other data. But he also wondered why the company allowed advertisers to send him commercials based on his hypothetical sexual orientation.
After this event, he and his team researched and discovered that in Saudi Arabia, for example, some 500,000 people were labelled as homosexual. In countries where that can mean a prison sentence or even a death sentence, this can be very dangerous, because even if Facebook does not reveal identities, it is not that difficult to identify specific individuals in relatively small communities (García et al., 2018). In fact, picking a random person from anywhere in the world and trying to find out as much as possible about them can serve as a practical exercise for advanced data processing students (Véliz, 2020, p. 65).
Certainly, the European General Data Protection Regulation, in full force since May 2018, provides protection to citizens, and to allow a website to have access to our data we have to give explicit consent. But it is also true that the conditions of that famous «I accept» button are often very complicated and if we want access to a service or a website, we tend to say yes to everything that is proposed to us. Banks have even been sanctioned for serious offences, such as not providing customers with the possibility of using services without agreeing to provide data. Moreover, non-European companies may not always respect this regulation. In April 2021, a lawsuit was announced in the UK against TikTok, a Chinese short video sharing app, for allegedly illegally collecting personal data of millions of children, such as phone numbers, location, and even biometric data. In 2019, the Chinese firm was fined $5.7 million by the US Federal Trade Commission for misuse of such data.
In addition, when we enter a website, we may actually be giving data to numerous websites with which the first one has agreements. These are the famous third-party cookies, which allow them to follow our trail and use the data. This and many other privacy problems are exposed in the report «Tot el que sabem de tu», broadcast in November 2020 on the TV3 programme 30 minuts (Duran et al., 2020).
The same report described the case of the artist and researcher Joana Moll. When she was developing her project The dating brokers, she discovered, by chance, that client data from dating websites were being sold on the Internet. Moll bought one million user profiles from all over the world for 136 euros. The package included some 600,000 profiles of men and some 300,000–400,000 profiles of women, with e-mail addresses, usernames, dates of birth, sexual orientation, very detailed descriptions of physique and personality, whether or not they had children, smoked, took drugs, etc.
But she also discovered that these websites belonged to companies that were, in turn, part of larger business groups. Because consent was given to transfer data to all the companies in the group, someone who had entered a dating website could end up providing sensitive data to more than 700 companies.
The potential uses of personal data are manifold (Sadowski, 2020; Véliz, 2020). And there is probably nowhere to hide, as demonstrated by Clearview, a case that was also discussed in the aforementioned report. Based on the image of any person and thanks to facial recognition, this US app can find other photographs of the same person, even from years ago, hosted on other sites, and the URLs of the websites where they can be viewed. Using the app, a user can check those websites and build a profile of any individual.
Clearview’s database has some three billion images collected without the knowledge of the people in them. The company already faces several lawsuits, but while the judges’ opinion is important, what we want to emphasise here is that there are more and more technologies every day that make it possible to find a great deal of information on anyone, however discreet they are, because we can all be tagged in photographs, often without even knowing about it.
Server business and emissions
We have already explained that Amazon obtains a large part of its revenue from advertising. In fact, of the $280 billion it made in 2019 – with $11.5 billion in profits – 50.37 % came from online sales. That’s just over half; the other half came from elsewhere. About 6 % came from physical shops, but more important was the 19.17 % from third-party sales. Amazon provides others with the digital infrastructure and takes a significant percentage of the sales. In fact, in its own sales, Amazon keeps prices tight and sets small margins, yet with a clear strategy: collecting quickly from customers and paying suppliers over a longer period of time. This generates enormous cash flow.
In 2019, 12.48 % of their revenue came from Amazon Web Services (AWS). This is a service that allows companies to keep their files and programs in the cloud so that they do not have to spend money on their own servers or databases.
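To get a feel for what these percentages mean in absolute terms, the shares can be turned into amounts. This is only a rough sketch based on the rounded figures quoted above; the variable names are illustrative, and the remainder is simply whatever the listed segments do not cover, since the text does not itemise it.

```python
# Amazon's 2019 revenue and segment shares as quoted in the text.
total_revenue_bn = 280  # billions; the text gives a rounded total
shares_pct = {
    "online sales": 50.37,
    "physical shops": 6.0,  # "about 6 %" in the text, so approximate
    "third-party sales": 19.17,
    "Amazon Web Services": 12.48,
}

# Convert each percentage share into an approximate amount in billions.
amounts_bn = {
    name: total_revenue_bn * pct / 100 for name, pct in shares_pct.items()
}

# Share not covered by the segments listed above (not itemised in the text).
remainder_pct = 100 - sum(shares_pct.values())

for name, amount in amounts_bn.items():
    print(f"{name:>20}: {amount:6.1f} bn ({shares_pct[name]:.2f} %)")
print(f"{'remainder':>20}: {total_revenue_bn * remainder_pct / 100:6.1f} bn "
      f"({remainder_pct:.2f} %)")
```

On these figures, AWS alone accounts for roughly 35 billion, and about 12 % of revenue comes from sources other than the four segments named.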
This looks like one of the big businesses of the future. It is estimated that in 2019 there were 45 zettabytes of data in the cloud – 45 billion trillion bytes, equivalent to more than 7 billion years of high-definition video. By 2025, there are expected to be 175 zettabytes. This will represent a market worth around 600 billion euros. And AWS currently controls just over 40 % of this market. Microsoft’s Azure had 29.4 % and Google Cloud had 3 % in 2019.
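The scale of these numbers, and the pace of growth they imply, is easier to grasp when written out. The sketch below assumes the decimal (SI) definition of the zettabyte, 10^21 bytes, and derives the compound annual growth rate implied by the two estimates; the growth rate is a back-of-the-envelope figure, not one given in the sources.

```python
ZETTABYTE = 10**21  # bytes, using the decimal (SI) definition

data_2019_zb = 45   # estimate for 2019 quoted in the text
data_2025_zb = 175  # forecast for 2025 quoted in the text

# 45 ZB written out in bytes.
bytes_2019 = data_2019_zb * ZETTABYTE

# Compound annual growth rate implied by going from 45 to 175 ZB in six years.
years = 2025 - 2019
growth_factor = data_2025_zb / data_2019_zb
annual_growth = growth_factor ** (1 / years) - 1

print(f"{bytes_2019:.2e} bytes in 2019")
print(f"x{growth_factor:.2f} overall, about {annual_growth:.1%} per year")
```

In other words, the forecast amounts to the volume of stored data growing by around a quarter every year.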
Amazon’s massive servers illustrate the great vision of its founder and CEO, Jeff Bezos. While talking about the cloud makes many people think that data just wanders around in the atmosphere until someone captures it on their computer, this cloud is made up of physical structures, which store programs and data, and thousands of kilometres of fibre optics through which the data travel. If we send an e-mail to our next-door neighbour, by the time it reaches him it might have passed through a server near the North Pole. The servers – computers running programs accessible from different points in the network – are always looking for the most suitable paths through the fibre network and, when we download a song or make an online purchase, we have little idea which parts of the planet the bits that made it possible have passed through.
«A study by the University of Bristol estimated that in 2016, YouTube video viewing produced 11.13 million tonnes of CO2»
However, that involves a lot of energy consumption and CO2 emissions, even if they do not take the form of fumes coming out of a chimney. When we type on our computer, tablet, or mobile phone, it is running and consuming energy, as is the device where someone will receive the message, and also the web server. And the servers are running 24 hours a day, because the Internet never sleeps. We are not aware of it, but the Internet is always available and we can move through millions of websites or be active on any social network at any time because these large server farms are active.
Providing this service is good business because many companies, even large ones, no longer spend on their own infrastructures, but rent them – even Netflix is an AWS customer.
But regardless of who makes money, the business is bad for the environment. Processors have become more efficient, but the amount of information is growing exponentially. Some forecasts suggest that by 2030, information and communication technologies as a whole will consume 21 % of global electricity (Stern, 2020). Almost 40 % of the energy consumption of data centres is due to cooling. No matter how efficient they are, servers have thousands of processors, so they heat up and this heat has to be dissipated.
One option is to install the servers in very cold areas. Facebook has one in Luleå, in north-eastern Sweden, where it can also make use of large amounts of hydroelectric power, save costs, and reduce CO2 emissions. But energy consumption is still very high. And if we talk about CO2 equivalent emissions, we will understand the magnitude of this impact. Some studies estimate annual Internet emissions of 1 billion tonnes, equivalent to 2.8 % of total emissions – more than the aviation sector, which is responsible for 2 %.
How much CO2 does an e-mail generate?
According to a study by the British energy company OVO, Britons send 64 million unnecessary emails every day, the kind that only contain the equivalent of a «hello» or «thank you» (Tweedale, 2021). Each email produces one gram of CO2 emissions. Unnecessary emails are therefore responsible for 23,475 tonnes of carbon dioxide a year (other reports place the figure at 16,433 tonnes). That sounds like a lot of emissions and is equivalent to 22 round-trip flights between London and New York. But the UK’s total annual emissions were 435.2 million tonnes in 2019, so stopping unnecessary mailings would only reduce them by about 0.005 %, or about 0.004 % on the lower estimate (although it all adds up, of course).
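The arithmetic behind these figures is simple enough to reproduce. The sketch below uses only the numbers quoted above; the constant and function names are illustrative, and the total derived from the rounded inputs differs slightly from the quoted annual figure.

```python
EMAILS_PER_DAY = 64_000_000    # unnecessary emails per day (OVO figure)
GRAMS_CO2_PER_EMAIL = 1        # grams of CO2 per email, as quoted
UK_EMISSIONS_TONNES = 435.2e6  # UK total for 2019, as quoted
GRAMS_PER_TONNE = 1_000_000

# Annual emissions implied by the two rounded inputs above.
annual_tonnes = EMAILS_PER_DAY * GRAMS_CO2_PER_EMAIL * 365 / GRAMS_PER_TONNE


def share_of_uk(tonnes: float) -> float:
    """Share of the UK's 2019 emissions, in percent."""
    return tonnes / UK_EMISSIONS_TONNES * 100


print(f"implied by inputs: {annual_tonnes:,.0f} t "
      f"-> {share_of_uk(annual_tonnes):.4f} %")
for quoted in (23_475, 16_433):
    print(f"quoted figure:     {quoted:,} t -> {share_of_uk(quoted):.4f} %")
```

Whichever total is used, the result is a few thousandths of one percent of the UK's annual emissions.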
But there are many more practices to consider. A study by the University of Bristol estimated that in 2016, YouTube video viewing produced 11.13 million tonnes of CO2 (Preist et al., 2019). Compared to global emissions (some 35 billion tonnes) this again seems small. But they are equivalent to those produced by a city like Frankfurt or Glasgow or by countries like Luxembourg or Zimbabwe in the same year. And to YouTube we have to add all the music streaming, all the series, films, and documentaries on platforms, as well as all videogames. Or doing a search in a search engine, translating a text, sending photographs or presentations, etc. The more complex the material, the more bits and more emissions.
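The same kind of comparison can be made against global emissions. This is a minimal sketch using the rounded figures quoted in the text (11.13 million tonnes for YouTube, 1 billion tonnes for the Internet as a whole, and some 35 billion tonnes worldwide); since the global total is itself approximate, the percentages are indicative only.

```python
GLOBAL_EMISSIONS_T = 35e9  # "some 35 billion tonnes", as quoted

# Emission estimates quoted in the text, in tonnes of CO2 per year.
sources_t = {
    "YouTube viewing (2016)": 11.13e6,
    "Internet as a whole": 1e9,
}

# Share of global emissions for each source, in percent.
shares_pct = {
    name: tonnes / GLOBAL_EMISSIONS_T * 100 for name, tonnes in sources_t.items()
}

for name, pct in shares_pct.items():
    print(f"{name:>24}: {pct:.3f} % of global emissions")
```

YouTube viewing comes out at about three hundredths of one percent of the global total, while the estimate for the Internet as a whole lands close to the 2.8 % cited earlier.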
The problem is not just current emissions, but the prospects for growth. According to calculations by the International Energy Agency, bitcoin transactions cause as many emissions as Nigeria or Uruguay. According to the Digiconomist platform, they are even higher, exceeding those of Colombia and Bangladesh. According to an article published in Nature Communications in early April 2021 (Jiang et al., 2021), at the current rate, in China, by 2024, the entire process surrounding bitcoin transactions and validation will produce as many greenhouse gas emissions as Italy or the Czech Republic. Domestically, these emissions would rank in the top ten of 182 cities and 42 industrial sectors in China.
That is because making bitcoin a secure virtual currency requires complex calculations to ensure reliability while maintaining privacy, based on the so-called blockchain. But if we consider that bitcoin is just one of many virtual currencies and that it now represents only 0.4 % of the money in circulation, we can guess what the impact might be in a few years’ time.
Are there solutions that do not involve reducing the use of digital tools? For a start, perhaps we will have to learn not to waste resources and to use them more rationally. At the same time, we can hope that technologists will find more sustainable solutions and that energy will increasingly come from renewable sources. A Google search for «green Internet» shows that the topic is of interest: it returns 5,360,000,000 results. The search alone has already produced more emissions, and visiting a few of these sites will generate even more. If there are viable and efficient solutions on any of them, we can consider those emissions acceptable.
Duran, X. (2018). Tot allò que saben de nosaltres: Es pot navegar amb privacitat per l’oceà del ‘big data’? Mètode, 99, 4–9. https://metode.cat/revistes-metode/article/tot-allo-que-saben-de-nosaltres.html
Duran, X., Bonet, X. (authors), & Solà, C. (director). (2020, 8 November). Tot el que sabem de tu [TV show episode]. In C. Fernández (Producer), 30 minuts. Corporació Catalana de Mitjans Audiovisuals. https://www.ccma.cat/tv3/alacarta/30-minuts/tot-el-que-sabem-de-tu/video/6067763/
García, D., Mitike Kassa, Y., Cuevas, A., Cebrián, M., Moro, E., Rahwan, I., & Cuevas, R. (2018). Analyzing gender inequality through large-scale Facebook advertising data. PNAS, 115(27), 6958–6963. https://doi.org/10.1073/pnas.1717781115
Jiang, S., Li, Y., Lu, Q., Hong, Y., Guan, D., Xiong, Y., & Wang, S. (2021). Policy assessments for the carbon emission flows and sustainability of Bitcoin blockchain operation in China. Nature Communications, 12, 1938. https://doi.org/10.1038/s41467-021-22256-3
Preist, C., Schien, D., & Shabajee, P. (2019). Evaluating sustainable interaction design of digital services: The case of YouTube. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3290605.3300627
Sadowski, J. (2020). Too smart: How digital capitalism is extracting data, controlling our lives, and taking over the world. MIT Press.
Stern, E. (2020, 27 October). Núvols i aire fred per a la computació. Divulcat. https://www.enciclopedia.cat/divulcat/nuvols-i-aire-fred-per-a-la-computacio
Travizano, M. (2018, 28 September). The tech giants get rich using your data. What do you get in return? Entrepreneur. https://www.entrepreneur.com/article/319952
Tweedale, A. (2021). The carbon footprint of the internet: What’s the environmental impact of being online? Ovo Blog. https://www.ovoenergy.com/blog/green/the-carbon-footprint-of-the-internet.html
Véliz, C. (2020). Privacy is power. Bantam Press.
Warzel, C., & Thompson, S. A. (2021). They stormed the Capitol. Their apps tracked them. The New York Times. https://www.nytimes.com/2021/02/05/opinion/capitol-attack-cellphone-data.html