Friday 26 September 2014

BIG DATA: A Powerful New Resource for the 21st Century

by Dirk Helbing

This chapter is a free translation of an introductory article on "Big Data - Zauberstab und Rohstoff des 21. Jahrhunderts", originally published in Die Volkswirtschaft - Das Magazin für Wirtschaftspolitik (5/2014).

Abstract


Information and communication technology (ICT) is the economic sector that is developing most rapidly in the USA and Asia and that generates the greatest value added per employee. Big Data - the algorithmic discovery of hidden treasures in large data sets - creates new economic value. This development is increasingly understood as a new technological revolution. Switzerland could establish itself as a data bank and Open Data pioneer in Europe and become a leading hub for information technologies.


What is Big Data?

When the social media portal WhatsApp, with its 450 million users, was recently sold to Facebook for $19 billion, almost half a billion dollars was paid per employee. "Big Data" is changing our world. The term, coined more than 15 years ago, refers to data sets so big that one can no longer cope with them using standard computational methods. Big Data is increasingly referred to as the oil of the 21st century. To benefit from it, we must learn to "drill" and "refine" data, i.e. to transform them into useful information and knowledge. The global data volume doubles every 12 months. Therefore, in just two years, we produce as much data as in the entire history of humankind.

Tremendous amounts of data have been created by four technological innovations:
  • the Internet, which enables our global communication
  • the World Wide Web, a network of globally accessible websites that evolved after the invention of the Hypertext Transfer Protocol (HTTP) at CERN in Geneva,
  • the emergence of social media such as Facebook, Google+, WhatsApp, or Twitter, which have created social communication networks, and
  • the emergence of the "Internet of Things", which allows sensors and machines to connect to the Internet as well. Soon there will be more machines than human users on the Internet.


Data sets bigger than the largest library

Meanwhile, the data sets collected by companies such as eBay, Walmart or Facebook reach the size of petabytes (1 million billion bytes) - one hundred times the information content of the largest library in the world, the U.S. Library of Congress. The mining of Big Data opens up entirely new possibilities for process optimization, the identification of interdependencies, and decision support. However, Big Data also comes with new challenges, which are often characterized by four criteria:
  • volume: the file sizes and number of records are huge,
  • velocity: the data evaluation often has to be done in real time,
  • variety: the data are often very heterogeneous and unstructured,
  • veracity: the data may be incomplete, unrepresentative, and contain errors.

Therefore, completely new algorithms, i.e. new computational methods, had to be developed. Because it is inefficient for Big Data processing to load all relevant data into a shared memory, the processing must take place locally, where the data reside, on potentially thousands of computers. This is accomplished with massively parallel computing approaches such as MapReduce or Hadoop. Big Data algorithms detect interesting interdependencies in the data ("correlations"), which may be of commercial value, for example, between weather and consumption or between health and credit risks. Today, even the prosecution of crime and terrorism is based on the analysis of large amounts of behavioral data.
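To make the pattern concrete, here is a minimal sketch of the two MapReduce phases, run locally with Python's standard multiprocessing module rather than on an actual Hadoop cluster; the text chunks and the word-count task are invented for the example:

```python
# A minimal, local sketch of the MapReduce pattern: map workers process
# the chunks where they live, and a reduce step merges the partial results.
# Real systems (e.g. Hadoop) run the same two phases on thousands of machines.
from collections import defaultdict
from multiprocessing import Pool

def map_phase(chunk):
    """Map: turn one local data chunk into partial (key, value) counts."""
    counts = defaultdict(int)
    for word in chunk.split():
        counts[word] += 1
    return counts

def reduce_phase(partials):
    """Reduce: merge the partial counts from all workers."""
    total = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)

if __name__ == "__main__":
    chunks = ["big data big value", "data beats opinion", "big big data"]
    with Pool(processes=3) as pool:       # one worker per data chunk
        partials = pool.map(map_phase, chunks)
    print(reduce_phase(partials))         # {'big': 4, 'data': 3, ...}
```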


What do applications look like?

Big Data applications are spreading like wildfire. They facilitate personalized offers, services and products. One of the greatest successes of Big Data is automatic speech recognition and processing. Apple's Siri understands you when you ask for a Spanish restaurant, and Google Maps can lead you there. Google Translate interprets foreign languages by comparing them with a huge collection of translated texts. IBM's Watson computer even understands human language. It can not only beat experienced quiz show players, but also take care of customer hotlines - often better than humans can. IBM has recently decided to invest $1 billion to further develop and commercialize the system.

Of course, Big Data plays an important role in the financial sector. Approximately seventy percent of all financial market transactions are now made by automated trading algorithms, and in just one day, the entire money supply of the world is traded. Such quantities of money also attract organized crime, so financial transactions are scanned by Big Data algorithms for anomalies in order to detect suspicious activities. The company BlackRock uses similar software, called "Aladdin", to speculate successfully with funds amounting to several times the gross domestic product (GDP) of Switzerland.

Box 1:
To get an overview of current ICT trends, it is worthwhile to look at Google, with its more than 50 software platforms. The company invests nearly $6 billion in research and development annually. Within just one year, Google has introduced self-driving cars, invested heavily in robotics, and started the Google Brain project to add intelligence to the Internet. Through the purchase of Nest Labs, Google has also invested $3.2 billion in the "Internet of Things". Furthermore, Google X is reported to have around 100 secret projects in the pipeline.


The potential is great...

No country today can afford to ignore the potential of Big Data. The additional economic potential of Open Data alone - i.e. of data sets that are made available to everyone - is estimated by McKinsey at 3,000 to 5,000 billion dollars globally each year [2]. This can benefit almost all sectors of society. For example, energy production and consumption can be better matched with "smart metering", and energy peaks can be avoided. More generally, new information and communication technologies allow us to build "smart cities". Resources can be managed more efficiently and the environment protected better. Risks can be recognized and avoided earlier, thereby reducing the unintended consequences of decisions and identifying opportunities that would otherwise be missed. Medicine can be better adapted to patients, and disease prevention may become more important than curing diseases.


... but also the implicit risks

Like all technologies, Big Data also implies risks. The security of digital communication has been undermined. Cyber crime, including data, identity and financial theft, is quickly spreading and taking on ever greater dimensions. Critical infrastructures such as energy, financial and communication systems are threatened by cyber attacks and could, in principle, be made dysfunctional for an extended period of time.

Moreover, while common Big Data algorithms are used to reveal optimization potentials, their results may be unreliable or may not reflect causal relationships. Therefore, a naive application of Big Data algorithms can easily lead to wrong conclusions. The error rate in classification problems (e.g. the distinction between "good" and "bad" risks) is often substantial. Issues such as wrong decisions or discrimination must be taken seriously. Therefore, one must find effective procedures for quality control. In this context, universities will likely play an important role. One must also find effective mechanisms to protect privacy and the right of informational self-determination, for example, by applying the Personal Data Purse concept [1].


The digital revolution creates an urgency to act

Information and communication technologies are going to change most of our traditional institutions: our educational system (personalized learning), science (Data Science), mobility (self-driving cars), the transport of goods (drones), consumption (see Amazon and eBay), production (3D printers), the health system (personalized medicine), politics (more transparency), and the entire economy (with co-producing consumers, so-called prosumers). Banks are losing more and more ground to algorithmic trading and to alternative payment systems such as Bitcoin, PayPal and Google Wallet. Moreover, a substantial part of the insurance business now takes place in financial products such as credit default swaps. For the economic and social transformation into a "digital society", we may have just 20 years. This is an extremely short time period, considering that the planning and construction of a road often requires 30 years or more.

The foregoing implies an urgent need for action on the technological, legal and socio-economic levels. Some years ago, the United States started a Big Data research initiative amounting to 200 million dollars, followed by further substantial investments. In Europe, the FuturICT project (www.futurict.eu) has developed concepts for the digital society within the context of the EU flagship competition. Other countries have already started to implement this concept; Japan, for example, has recently launched a $100 million 10-year project at the Tokyo Institute of Technology. In addition, numerous other projects exist, particularly in the military and security sectors, which often have multiples of the budgets mentioned above.


Switzerland can become a European driver of innovation for the digital era

Switzerland is well positioned to benefit from the digital age. However, it is not sufficient to refine and build upon already existing technologies; Switzerland must produce the new inventions that will shape the digital age. The World Wide Web was invented in Switzerland, and the largest civil Big Data competence in the world exists at CERN, yet the USA and Asian countries have led the commercialization of Big Data to date. With the NSA controversy, the ubiquity of wireless communication sensors, and the "Internet of Things", a new opportunity is now emerging.

With targeted support of ICT activities at its universities, Switzerland could take the lead in Europe's research and development. Swiss academia has excelled with the scientific coordination of three out of six finalists of the EU FET flagship competition.
At the moment, however, the focus is only on the digital modeling of the human brain and on robotics. From 2017 onwards, the ETH Domain plans to invest increasingly in Data Science, the emerging research field centered on the scientific analysis of data.

In view of the fast development of the ICT sector, the huge economic potential, and the transformative power of these technologies, prioritized, broad and substantial financial support is a matter of Swiss national interest. With its basic democratic values, legal framework and ICT focus, Switzerland is well prepared to become Europe's innovation driver for the digital age.

Box 2:
How will the digital revolution change our economy and society? How can we use this as an opportunity for us and reduce the related risks? For illustration, it is helpful to recall the factors that enabled the success of the automobile age: the invention of cars and of systems of mass production; the construction of public roads, gas stations, and parking lots; the creation of driving schools and driver licenses; and last but not least, the establishment of traffic rules, traffic signs, speed controls, and traffic police.
What are the technological infrastructures and the legal, economic and societal institutions needed to make the digital age a big success? This question would set the agenda of the Innovation Alliance. A partial answer is already clear: we need trustworthy, transparent, open, and participatory ICT systems, which are compatible with our values. For example, it would make sense to establish the emergent "Internet of Things" as a Citizen Web. This would enable self-regulating systems through real-time measurements of the state of the world, which would be possible with a public information platform called the "Planetary Nervous System". It would also facilitate a real-time measurement and search engine: an open and participatory "Google 2.0."


To protect privacy, all data collected about individuals should be stored in a Personal Data Purse and, given informed consent, processed in a decentralized way by third-party Trustable Information Brokers, allowing everyone to control the use of their sensitive data. A Micro-Payment System would allow data providers, intellectual property right holders, and innovators to get rewards for their services. It would also encourage the exploration of new and timely intellectual property right paradigms ("Innovation Accelerator"). A pluralistic, User-centric Reputation System would promote responsible behavior in the virtual (and real) world. It would even enable the establishment of a new value exchange system called "Qualified Money," which would overcome weaknesses of the current financial system by providing additional adaptability.
A Global Participatory Platform would empower everyone to contribute data, computer algorithms and related ratings, and to benefit from the contributions of others (either free of charge or for a fee). It would also enable the generation of Social Capital such as trust and cooperativeness, using next-generation User-controlled Social Media. A Job and Project Platform would support crowdsourcing, collaboration, and socio-economic co-creation. Altogether, this would build a quickly growing Information and Innovation Ecosystem, unleashing the potential of data for everyone: business, politics, science, and citizens alike.

Further Reading

[1] Y.-A. de Montjoye, E. Shmueli, S. S. Wang, and A. S. Pentland (2014) openPDS: Protecting the Privacy of Metadata through SafeAnswers, PLoS ONE 9(7): e98790.

[2] McKinsey & Company (2013) Open data: Unlocking innovation and performance with liquid information.


Tuesday 23 September 2014

Creating ("Making") a Planetary Nervous System as Citizen Web

by Dirk Helbing

The goal of the Planetary Nervous System is to create an open, public, intelligent software layer on top of the "Internet of Things" as the basic information infrastructure for the emerging digital societies of the 21st century.

After the development of the computer, the Internet, the World Wide Web, smartphones and social media, the evolution of our global information and communication systems will now be driven by the "Internet of Things" (IoT). Based on wirelessly connected sensors and actuators,[1] it will connect "things" (such as machines, devices, gadgets, robots, sensors, and algorithms) with things, and things with people.

Already now, more things than people are connected to the Internet. In ten years' time, something like 150 billion sensors are expected to be connected to the IoT. Given such masses of sensors everywhere around us -- sensors in our coffee machine, our fridge, our toothbrush, our shoes, our fire alarm etc. -- the IoT could easily turn into a dystopian surveillance nightmare if largely controlled by one company or by the state. For the IoT to be successful, people need to be able to trust the new information and communication system, and they need to be able to exert their right of informational self-determination, which also requires the possibility to protect privacy.

Most likely, the only way to establish such a trustable, privacy-respecting IoT is to build it as a Citizen Web. Citizens would deploy the sensors in their homes, gardens, and offices themselves, and they would decide what sensor information to open up (i.e. decrypt), for whom, and for how long. In other words, the citizens would be in control of the information streams. A software platform such as the open Personal Data Store (openPDS)[2] would allow everyone to manage access to the personal data produced by the IoT.
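As a thought experiment, the access logic could look like the following minimal Python sketch; the class and method names are hypothetical illustrations of the idea, not the actual openPDS API:

```python
# Illustrative sketch of citizen-controlled access to IoT sensor streams:
# the owner grants per-recipient, time-limited access; everything else
# stays closed. All names are invented for the example.
import time

class SensorStream:
    def __init__(self, name):
        self.name = name
        self.grants = {}          # recipient -> expiry timestamp
        self.readings = []        # would be stored encrypted in a real system

    def grant(self, recipient, duration_s):
        """The owner decides who may read this stream, and for how long."""
        self.grants[recipient] = time.time() + duration_s

    def revoke(self, recipient):
        self.grants.pop(recipient, None)

    def read(self, recipient):
        """Only recipients with an unexpired grant get the data."""
        if self.grants.get(recipient, 0) > time.time():
            return self.readings
        raise PermissionError(f"{recipient} has no access to {self.name}")

stream = SensorStream("living-room temperature")
stream.readings.append((time.time(), 21.5))
stream.grant("energy-advisor", duration_s=3600)   # one hour of access
print(stream.read("energy-advisor"))              # allowed
stream.revoke("energy-advisor")                   # the owner changes her mind
```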

What are the benefits of having an "Internet of Things"?

  • One can perform real-time measurements of the (biological, technological, social and economic) world around us
  • This information can be turned into (real-time) maps of our world[3] and serve as compasses for decision-makers, enabling them to make better decisions and take more effective actions that consider externalities
  • One can build self-organizing and self-regulating systems, based on real-time feedback and adaptation[4]
Uses of these kinds will be enabled by a software layer that we call the "Planetary Nervous System" (PNS) or just "Nervous". It offers new possibilities that will allow humanity to overcome some long-standing problems (such as systemic instabilities or "tragedies of the commons" like environmental degradation), and to change the world for the better.

Basic Elements of the Planetary Nervous System
We will build two variants of the Planetary Nervous System App for smart devices such as smartphones: Nervous and Nervous+. While Nervous would not save original sensor data, Nervous+ would potentially do so. Nervous is intended for users who are concerned about their personal data, while Nervous+ offers additional functionality for people who are happy to share data of all kinds. Hence, users can choose the system they prefer (a small sketch of this difference follows the component list below).

The main components needed to build the system are:
  • Sensor kits and smartphones to measure the environment
  • Algorithms and filters to encrypt information or degrade it such that it is no longer sensitive[5]
  • An ad hoc network / mesh net (e.g. FireChat) to enable direct communication between wirelessly communicating sensors
  • A server architecture to collect, manage and process data
  • A data analytics layer, and possibly a search engine and a Collective Intelligence/Cognitive Computing layer on top
  • An open Personal Data Store (such as openPDS) to empower users to exercise their right of informational self-determination
  • An app-store-like Global Participatory Platform (GPP) to share data, algorithms, and ratings
  • An editor allowing non-expert users to combine inputs and outputs in playful, creative ways
  • A multi-dimensional reputation and micro-payment system
  • A project platform to allow the Nervous community to coordinate and self-organize its activities and projects

Both Planetary Nervous System Apps would offer a rich Open Data stream accessible to everyone. They would build something like a "real-time data streaming Wikipedia", allowing people and companies to build services and products on top. The PNS is hence an attempt to enable and catalyze new creative jobs at a time when the digital revolution is expected to eliminate about 50% of today's conventional jobs.
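The following toy sketch (with invented function names and data) illustrates the intended difference between the two variants: both report the same aggregate, but only Nervous+ retains the raw samples:

```python
# A toy sketch of the privacy difference between the two app variants:
# Nervous forwards only derived, aggregate values and discards raw samples;
# Nervous+ may additionally retain the raw data, with the user's consent.
import statistics

def nervous_reading(raw_samples):
    """Privacy-preserving variant: aggregate, then forget the raw data."""
    summary = {"mean": statistics.mean(raw_samples), "n": len(raw_samples)}
    raw_samples.clear()              # the original samples are not kept
    return summary

def nervous_plus_reading(raw_samples, store):
    """Data-sharing variant: same summary, but raw samples are retained."""
    store.extend(raw_samples)        # kept, with consent, for richer services
    return {"mean": statistics.mean(raw_samples), "n": len(raw_samples)}

samples = [42.1, 44.8, 43.5]         # e.g. noise levels in dB(A)
print(nervous_reading(list(samples)))
archive = []
print(nervous_plus_reading(list(samples), archive))
print(f"raw samples retained by Nervous+: {len(archive)}")
```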


Creating a public good, and business and non-profit opportunities for everyone by maximum openness, transparency, and participation
The main goal of the PNS project is to create a public good, namely the basic information infrastructure for the emerging digital societies of the 21st century. Besides providing Open Data streams, the Planetary Nervous System may nevertheless offer some premium services to people and/or institutions who pay for the services or have qualified to receive them for free (such as committed scientists or citizens). "Qualification" means contributions made to the components of the Planetary Nervous System, but also a responsible use of the information services. In this way, we want to reduce malicious uses of the powerful functionality of Nervous+ as much as possible.

The profits created by the PNS would be managed, for example, by a benefit corporation committed to improving social and/or environmental conditions. The largest share of the profits should be used to support the science, research and development advancing the PNS and the services built on top of it. Profits created with inventions of the PNS shall also be used to support the PNS project.

As the PNS project wants to grow a public good for everyone, the Planetary Nervous System project is committed to opening up its source code, to the extent that this is not expected to create security issues or dangers to human rights. Depending on the competitive situation the PNS is in, publication may happen with a delay (usually less than 2 years). To minimize delays, we will create incentives for early sharing.

The goal of this strategy is to catalyze an open information and innovation ecosystem. Others will be able to use our codes (and other people's open source codes), modify them and share them back. The same will apply to data, Apps, and other contributions. In this way, the Nervous community will benefit maximally from contributions of other Nervous members, and everyone can build on functionality that has been created by others.

Contributions of volunteers will be acknowledged by mentioning the respective creators by name (unless they prefer to stay anonymous or pseudonymous). In addition, contributions will be rewarded with ratings, reputational values, or scores, which may later be used to get access to premium services. These would include larger query or data volumes ("power users"), earlier access to code that will be publicly released with a delay, or further benefits. The PNS project may also hand out medals or prizes for outstanding contributions, or highlight them in social or public media.

The role of Citizen Science
For the Planetary Nervous System to be successful, it is crucial to grow a large community of users, but the underlying logic of sharing, bottom-up involvement and informational self-determination demands that everyone be encouraged to contribute to the creation of the system itself. The system would hence be built similarly to Wikipedia or OpenStreetMap. In fact, the success of OpenStreetMap is based on the contributions of 1.5 million volunteers worldwide.

This is why the Nervous project wants to engage with Citizen Science to grow the Planetary Nervous System as a Citizen Web. As a basis for citizen engagement, the Nervous Team will provide (a) kits containing sets of sensors[6] and actuators (e.g. a basic kit and several extension kits) and (b) a GPP portal, where people can download (and upload) algorithms ("Apps"), which will run on the sensors and thereby produce certain kinds of functionality.

The Citizen Science community will be engaged in certain measurement tasks (e.g. "measure the noise distribution in your city as a function of time", or "measure data enabling weather predictions"). It will also be engaged to come up with innovative ways to use sensor data and turn them into outputs (i.e. to produce new code or modify existing code, thereby creating new Apps). For this, the PNS team will provide tools (such as an editor) allowing non-expert users to transform inputs into outputs in playful, creative ways. Playfulness, fun and reputation are hence offered in exchange for contributing to the development and spread of the PNS. As a result, we will get new measurement procedures for science, and adaptive feedback processes to create self-regulating systems.
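As an illustration of such a measurement task, the following hypothetical Python sketch aggregates volunteers' noise readings into a simple noise map per district and hour; all numbers are invented:

```python
# Hypothetical sketch of a Citizen Science measurement task: volunteers'
# phones report noise levels, which the platform aggregates into a noise
# map per city district and hour of the day.
from collections import defaultdict
from statistics import mean

readings = [                       # (district, hour, noise in dB(A))
    ("Old Town", 8, 62.0), ("Old Town", 8, 65.5),
    ("Old Town", 22, 48.0), ("Harbour", 8, 71.3),
]

noise_map = defaultdict(list)
for district, hour, db in readings:
    noise_map[(district, hour)].append(db)

for (district, hour), values in sorted(noise_map.items()):
    print(f"{district}, {hour:02d}:00 -> {mean(values):.1f} dB(A) "
          f"from {len(values)} volunteers")
```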


The Planetary Nervous System is already being built; see @NervousNet on Twitter. To join our development team as a volunteer and help build the system in a similar way as Linux, Wikipedia or OpenStreetMap were created, please contact Dirk Helbing: dhelbing (at) ethz.ch







[1] Actuators are devices that can induce change, for example, motors.
[2] see http://openpds.media.mit.edu/ and http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0098790
[3] Such maps may include environmental maps depicting environmental changes and who causes them, resource maps visualizing resources and who uses them, as well as risk maps, crisis maps, or conflict maps.
[4] Smart city applications are often of this kind.
[5] For example, one may average over personal data and throw away the original data. Microphone data may be degraded by using a lowpass filter, such that spoken words can no longer be understood and essentially only a noise-level measurement remains. Access to unfiltered video or microphone data would require explicit approval, similar to accepting a call. Data shared with a restricted circle will be encrypted.
[6] Such sensors may measure temperature, noise, humidity, health, or anything else.

Monday 22 September 2014

BIG DATA SOCIETY: Age of Reputation or Age of Discrimination?

By Dirk Helbing

If we want Big Data to create societal progress, more transparency and participatory opportunities are needed to avoid discrimination and to ensure that Big Data is used in a scientifically sound, trustworthy, and socially beneficial way.

Have you ever "enjoyed" an extra screening at the airport because you happened to sit next to someone from a foreign country? Have you been surprised by a phone call offering a special service or product, because you visited a certain webpage? Or do you feel your browser reads your mind? Then welcome to the world of Big Data, which mines the tons of digital traces of our daily activities such as web searches, credit card transactions, GPS mobility data, phone calls, text messages, Facebook profiles, cloud storage, and more. But are you sure you are getting the best possible product, service, insurance or credit contract? I am not.

Like every technology, Big Data has side effects. Even if you are not concerned about losing your privacy, you should be worried about one thing: discrimination. A typical application of Big Data is to distinguish different kinds of people: terrorists from normal people, good from bad insurance risks, honest tax payers from those who don't declare all their income... You may ask, isn't that a good thing? Maybe on average it is, but what if you are wrongly classified? Have you checked the information the Internet has collected under your name, or gone through the list of pictures Google stores about you? Even scarier than how much is known about you is the fact that there is quite some information in between that does not fit. So, what if you are stopped by border control just because you have a similar name as a criminal suspect? If that happens, it may traumatize you for quite some time.

Where does the problem originate? Normally, the groups of people to be distinguished overlap -- their data points are not well separated. Therefore, mining Big Data comes with the statistical problem of false positives and false negatives [1]. That is, some people get an unintended advantage, while others suffer an unfair disadvantage -- an injustice that is hard to accept. Even under the overly optimistic assumption that the data mining algorithm has an accuracy of 99.9%, when applied to 200 million people it will still wrongly classify hundreds of thousands of them, who may then experience a wrong treatment. In medicine, the approach of mass screenings is therefore highly controversial [2]. Are you willing to sacrifice your breast or prostate for a wrongly diagnosed cancer? Probably not, but it happens more often than you think.
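A quick back-of-the-envelope calculation illustrates the point; the population size matches the text, while the 1-in-100,000 prevalence is an invented illustration of the base-rate effect:

```python
# Even a 99.9% accurate classifier mislabels a huge number of people, and
# when the target group is rare, most "positives" are innocent.
population = 200_000_000
error_rate = 0.001                      # 99.9% accuracy
print(f"misclassified people: {population * error_rate:,.0f}")   # 200,000

# Base-rate effect: assume (hypothetically) 1 in 100,000 is a true target.
prevalence = 1e-5
true_pos  = population * prevalence * (1 - error_rate)   # targets caught
false_pos = population * (1 - prevalence) * error_rate   # innocents flagged
print(f"flagged, actually targets:  {true_pos:,.0f}")     # ~2,000
print(f"flagged, actually innocent: {false_pos:,.0f}")    # ~200,000
```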

Similarly, tens of thousands of honest people are unintentionally mixed up with terrorists. So, how can you be sure you are getting your loan for fair conditions, and do not have to pay a higher interest rate, just because someone in your neighborhood defaulted? Can you still afford to live in an easy-going multi-cultural quarter, or do you have to move to another neighborhood to get a reasonable loan? And what about the tariff of your health insurance? Will you have to pay more, just because your neighbors do not go jogging? Will we have to put pressure on our facebook friends, colleagues, and neighbors, just to avoid possible future discrimination? And what would be the features that play out positively or negatively? How much Coke on our credit card bill will be acceptable to our health insurance? Is it ok to drink a glass of wine, or better not? What about another cup of coffee or tea? Can we still eat meat, or will we get punished for it with higher monthly rates? Would there be a right way of living at all, or would just everyone be discriminated for some behavior, while perhaps getting rewards for others? The latter is surely the case.

This might be fine if everybody benefited on average, but unfortunately this is rather unlikely. Some would be lucky and others unlucky, i.e. inequality would grow. But similar to stock markets, it would be difficult to tell in advance who would benefit and who would lose. This is not just because of the random distribution of individual properties, but also because the parameters of the data mining algorithms can be determined only with limited accuracy. However, even tiny parameter changes may produce dramatically different results (a fact known as "sensitivity" or the "butterfly effect") [3]. In other words, while the miners of Big Data may pretend to take more scientific, better and fairer decisions, the results will often contain a considerable amount of arbitrariness. Many data miners probably don't know about this or don't care. But the fact that lots of algorithms produce outputs without warning of their limitations creates a dangerous overconfidence in their results. Moreover, note that the choice of the model can be even more critical than the choice of parameters [4]. That's basically why people say: "Don't believe any statistics you haven't produced yourself."
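The following small numerical experiment (with synthetic risk scores) illustrates this sensitivity: a 1% shift of a single decision threshold already reclassifies thousands of people out of a million:

```python
# Sensitivity illustration: shifting a decision threshold by a tiny amount
# reclassifies thousands of people whose scores sit near the boundary.
import random

random.seed(42)
scores = [random.gauss(0.0, 1.0) for _ in range(1_000_000)]   # risk scores

threshold = 1.0                  # "bad risk" if score > threshold
perturbed = 1.01                 # a 1% change in one model parameter

flipped = sum(threshold < s <= perturbed for s in scores)
print(f"people whose classification flips: {flipped:,}")      # roughly 2,400
```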

The problem is reminiscent of the experiences made with financial innovations. People used models without sufficiently questioning their validity. It was discovered too late that financial innovations may have negative effects and destabilize the markets. One example is the excessive use of credit default swaps, which package risks in ways that buyers no longer seem to understand. The consequence was a financial meltdown that the public will be paying for, for at least another decade or two. It is no wonder that trust in the financial system dropped dramatically, with serious economic implications (no trust means no lending). This time, we should not make the same mistakes, but rather use Big Data in a trustworthy, transparent, and beneficial way. To reap the benefits of personalized medicine, for example, we need to make sure that personal medical data will not be used to the disadvantage of patients who are willing to share their data in favor of creating a public good -- a better understanding of diseases and how to cure them.

In fact, we have worked hard to overcome the discrimination of people based on gender, race, religion, or sexual orientation. Should we now extend discrimination to hundreds or even thousands of variables, just because Big Data allows us to do so? Probably not! But how can we protect ourselves from such discrimination? In order to prevent the information age from becoming an age of discrimination fueled by Big Data, we need informational justice. This includes establishing (1) suitable quality standards, as exist for medical drugs, (2) proper testing, and (3) fair compensation schemes. Otherwise people will quickly lose trust in Big Data. This requires us to decide what collateral damage for individuals would be considered tolerable. Moreover, we need to distinguish between "healthy" and "toxic" innovations, where "healthy" means innovations that produce long-term benefits for the economy and society (see the Information Box below). That is, the overall benefit should be bigger than the disadvantage caused by false positives, such that the corresponding individuals can be compensated for unfair treatment.

There are two fundamentally different ways to ensure a "healthy" use of Big Data and allow victims of discrimination to defend their interests. The classical approach would be to create a dedicated government agency or institution that establishes detailed regulations, in particular quality standards, certification procedures, and effective punitive schemes for violations. But there is a second approach -- one that I believe could be more effective for companies and citizens than complicated legal and executive procedures. This framework would be based on next-generation reputation systems creating feedback loops that support self-regulation.

How would such a next-generation reputation system work? The proposal is to establish a Global Participatory Platform [5], i.e. a public store for models and data. It would work a bit like an app store, but people and companies could upload not only apps. They could also upload data sets, algorithms (e.g. statistical methods, simulation models, or visualization tools), and ratings. Everybody could use these contributions for free or for a fee, and annotate them with user feedback. It would be as if we could submit not only queries to Google, but also the algorithms that determine the answers. In this way, we could better control the quality of the results extracted from the data.

So, assume we stored all data collected about individuals in a data bank (for reasons of data security, decentralized and encrypted storage would be preferable). Moreover, assume that everyone could submit algorithms to be run on these data sets. The algorithms would be able to perform certain operations within the bounds of privacy laws and other regulations. For example, they could generate aggregate information and statistics, while privacy-invasive queries violating user consent would not be executed. Moreover, if executable files of the algorithms used by insurance and other companies working with Big Data were uploaded as well, scientists and citizens could judge their statistical properties and verify that undesirable discrimination effects stay below commonly accepted thresholds. This would ensure that quality standards are met and continuously improved.
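A minimal sketch of such a query gate might look as follows; the minimum cohort size of 10 and the record layout are invented policy choices for the example:

```python
# Toy sketch of the idea that submitted algorithms may compute aggregate
# statistics but not extract individual records: queries over too-small
# cohorts are refused, since they could identify individuals.
MIN_COHORT = 10

records = [{"age": 30 + i % 40, "risk": (i % 7) / 10} for i in range(1000)]

def run_query(selector, aggregate):
    """Run an aggregate query; refuse it if the cohort is too small."""
    cohort = [r for r in records if selector(r)]
    if len(cohort) < MIN_COHORT:
        raise PermissionError("cohort too small: query could identify individuals")
    return aggregate(cohort)

avg_risk = run_query(lambda r: r["age"] > 50,
                     lambda c: sum(r["risk"] for r in c) / len(c))
print(f"average risk, age > 50: {avg_risk:.3f}")
```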

The advantages of such a transparent and participatory approach are manifold for business, science, and society alike: (1) results can be verified or falsified, thereby uncovering possible methodological issues, (2) the quality of Big Data algorithms and data will increase more quickly, (3) "healthy" innovation and economic profits will be stimulated, (4) the level of trust in the algorithms, data and conclusions will increase, and (5) an "information ecosystem" will grow, creating an enormous number of new business opportunities and fully unleashing the potential of Big Data.

I fully agree with the US Consumer Data Privacy Bill of Rights [6] stating that “Trust is essential to maintaining the social and economic benefits that networked technologies bring to the United States and the rest of the world.” A report on personal data as a new asset class, published by the World Economic Forum, therefore suggests a “New Deal on Data” [7]. This includes establishing a data ecosystem that creates a balance between the interest of companies, citizens, and the state. Important elements of this would be: transparency, more control by citizens over their personal data, and the ability for individuals to participate in the value generated with their personal data.

This has implications for the design of the Global Participatory Platform I am proposing. Data collected about individuals would be stored in a personal data purse. Individuals could add to and comment on the data, have them corrected if factually wrong, and determine who could use them for what kind of purpose, in line with the regulations regarding privacy and self-determination. When personal data are used, both the user and the company that collected the data would earn a small amount, triggering micropayments. Finally, to keep the misuse of data and malicious applications at a low level, there would be a reputation system, which would act like a social immune system.
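Here is a toy sketch of how a personal data purse with consent checks and micropayments could be wired together; all class names, purposes, and payment amounts are invented:

```python
# Illustrative sketch of a "personal data purse": data use requires the
# owner's consent per requester and purpose, and each permitted use
# triggers a micropayment to the data subject.
class DataPurse:
    def __init__(self, owner):
        self.owner = owner
        self.records = {}            # field -> value (owner may correct it)
        self.consent = {}            # (requester, purpose) -> allowed?
        self.earnings = 0.0

    def set_consent(self, requester, purpose, allowed):
        self.consent[(requester, purpose)] = allowed

    def use(self, requester, purpose, field):
        if not self.consent.get((requester, purpose), False):
            raise PermissionError(f"{self.owner} has not consented to this use")
        self.earnings += 0.01        # micropayment to the data subject
        return self.records[field]

purse = DataPurse("Alice")
purse.records["postcode"] = "8092"
purse.set_consent("insurer-X", "tariff calculation", allowed=True)
print(purse.use("insurer-X", "tariff calculation", "postcode"))
print(f"Alice earned so far: {purse.earnings:.2f}")
```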

Reputation and recommender systems are quickly spreading all over the Web. People can rate products, news, and comments. In exchange, Amazon, eBay, TripAdvisor and many other platforms offer recommendations. Such recommendations are beneficial not only for the user, who tends to get a better service, but also for a company offering the product or service, as a higher reputation allows it to charge a higher price [8]. However, it is not good enough to leave it to a company to decide what recommendations we get, because then we don't know how much we are being manipulated. We want to look at the world from our own perspective, based on our own values and quality criteria. It would be terrible if everyone ended up reading the same books and listening to the same music. Therefore, it is important that recommender systems do not undermine socio-diversity.

Diversity is an important factor for innovation, social well-being, and societal resilience [9]. It deserves to be protected in the very same way as biodiversity. Modern societies need a complex interaction pattern of diverse people and ideas, not average people who all do the same things. The socio-economic misery in many countries of the world is clearly correlated with a loss of socio-economic diversity. While some level of norms and standardization appears to be favorable, too much homogeneity turns out to be bad. This also implies that we need to be careful about discriminating against people who are different -- such discrimination may undermine socio-diversity.

Today's personalized recommender systems endanger socio-diversity as well. They are manipulating people’s opinions and decisions, thereby imposing a certain perspective and value system on them. This can seriously undermine the “wisdom of crowds” [10], which is central to the functioning of democracies. The "wisdom of crowds" requires independent information gathering and decision-making -- a principle not sufficiently respected by most recommender systems [11].

How could we, therefore, build "pluralistic" reputation and recommender systems, which support socio-economic diversity and are also less prone to manipulation attempts? First, one should distinguish three kinds of user feedback: facts (linked to information allowing one to check them), advertisements (if there is a personal benefit for posting them), and opinions (all other feedback). Second, user feedback could be given in an anonymous, pseudonymous, or personally identifiable way. Third, users should be able to choose among many different reputation filters and recommender algorithms. Just imagine: we could set up the filters ourselves, share them with our friends and colleagues, modify them, and rate them. For example, we could have filters recommending the latest news, the most controversial stories, the news our friends are interested in, or a surprise filter. So, we could choose among a set of filters that we find most useful. Considering credibility and relevance, the filters would also put a stronger weight on information sources we trust (e.g. the opinions of friends or family members), and neglect information sources we do not want to rely on (e.g. anonymous ratings). For this, users would rate information sources as well, i.e. other raters. Therefore, spammers would quickly lose reputation and, with it, their influence on the recommendations made.
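The following toy example shows the core mechanism of such a pluralistic filter: each user weights rating sources by personal trust, so two users receive different rankings of the same items (all ratings and trust values are invented):

```python
# Toy "pluralistic filter": items are scored by a trust-weighted average
# of the sources' ratings, so each user's trust profile yields a
# personal ranking of the very same content.
ratings = {   # source -> {item: rating in [0, 1]}
    "friends":   {"story A": 0.9, "story B": 0.4},
    "anonymous": {"story A": 0.2, "story B": 0.8},
}

def personal_ranking(trust):
    """Score items by the trust-weighted average of the sources' ratings."""
    scores = {}
    for source, weight in trust.items():
        for item, rating in ratings[source].items():
            scores[item] = scores.get(item, 0.0) + weight * rating
    total = sum(trust.values())
    return sorted(((s / total, item) for item, s in scores.items()),
                  reverse=True)

print(personal_ranking({"friends": 1.0, "anonymous": 0.1}))   # favours A
print(personal_ranking({"friends": 0.1, "anonymous": 1.0}))   # favours B
```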

In sum, the system of personal reputation filters would establish an "information ecosystem", in which increasingly good filters would evolve through modification and selection, thereby steadily enhancing our ability to find meaningful information. The pluralistic reputation values of companies and their products (e.g. insurance contracts or loan schemes) would then give a rather differentiated picture, which can also help companies to develop better customized and more successful products.

In conclusion, I believe it is high time to create suitable institutions for the emerging Big Data Society of the 21st century. In the past, societies have created institutions such as public roads, parks, museums, libraries, schools, universities, and more. But information is a special resource: it does not diminish when shared, and it can be shared as often as we like. In fact, our culture results from what we share. At the moment, however, the world of data is highly proprietary and fragmented. It is as if every individual owned a few words but had to pay for using all the others, while some words could not be used at all for proprietary reasons. Obviously, such a situation is not efficient and does not make sense in an age where data are increasingly important. Business and politics have pushed hard to remove barriers to the free trade of goods -- it is now time to remove the obstacles to the global use of data. Providing access to Big Data would unleash the power of information for business, politics, science and citizens. Access to Big Data is surely needed for science to provide a good service to society [12,13]. In the past, reading and writing were privileges that came with personal advantages. But public schools opened literacy to everyone, thereby boosting the development of modern service societies. In the very same way, we could now boost the emerging digital society by promoting digital literacy and by investing in transparent, secure, participatory and trustworthy information and communication systems [14]. The benefits for our societies would be huge!

References


[2] G. Gigerenzer, W. Gaissmaier, E. Kurz-Milcke, L.M. Schwartz, and S. Woloshin (2008) Helping doctors and patients make sense of health statistics, Psychological Science in the Public Interest 8(2), 53-96.

[3] I. Kondor, S. Pafka, and G. Nagy (2007) Noise sensitivity of portfolio selection under various risk measures. Journal of Banking & Finance 31(5), 1545-1573.

[4] T. Siegfried (2010) Odds are, it's wrong, Science News 177(7), p. 26ff, see http://www.sciencenews.org/view/feature/id/57091/description/Odds_Are_Its_Wrong; J.P.A. Ioannidis (2005) Why most published research findings are false, PLoS Medicine 2(8): e124.

[5] S. Buckingham Shum, K. Aberer, A. Schmidt, S. Bishop, P. Lukowicz et al. Towards a global participatory platform (2012) Democratising open data, complexity science and collective intelligence. EPJ Special Topics 214, 109-152.

[6] The White House (2012) Consumer data privacy in a networked world: A framework for protecting privacy and promoting innovation in the global digital economy, see http://www.whitehouse.gov/sites/default/files/privacy-final.pdf

[7] World Economic Forum (2011) Personal Data: The Emergence of a New Asset Class, see www3.weforum.org/docs/WEF_ITTC_PersonalDataNewAsset_Report_2011.pdf

[8] W. Przepiorka (2013) Buyers pay for and sellers invest in a good reputation: More evidence from eBay, The Journal of Socio-Economics 42, 31-42.

[9] S.E. Page (2007) The Difference (Princeton University Press, Princeton).

[10] J. Lorenz, H. Rauhut, F. Schweitzer, and D. Helbing (2011) How social influence can undermine the wisdom of crowd effect. Proceedings of the National Academy of Sciences of the USA 108(28), 9020-9025.

[11] T. Zhou, Z. Kuscsik, J-G. Liu, M. Medo, J.R. Wakeling, and Y-C. Zhang (2010) Solving the apparent diversity-accuracy dilemma of recommender systems. Proceedings of the National Academy of Sciences of the USA 107, 4511-4515.

[12] B.A. Huberman (2012) Big data deserve a bigger audience, Nature 482, 308.

[13] F. Berman and V. Cerf (2013) Who will pay for public access of research data? Science 341, 616-617.

[14] D. Helbing (2013) Economics 2.0: The natural step towards a self-regulating, participatory market society, Evolutionary and Institutional Economics Review 10(1), 3-41.


Information Box: How to define quality standards for data mining


Assume that the individuals in a population of N people fall into one of two classes. Let us consider people of kind 1 "desirable" (e.g. honest citizens, good insurance risks) and people of kind 2 "undesirable" (criminals, bad insurance risks, etc.). We denote the number of people classified as kind 1 and kind 2 by N1 and N2, respectively. Let the rate of false positives, i.e. of individuals who face unjustified discrimination, be α, and the rate of false negatives be β. Then the actual number of people of kind 1 is (1-β)N1+αN2, and the actual number of people of kind 2 is (1-α)N2+βN1. Furthermore, assume that the classification creates an advantage A>0 for people classified as kind 1, but a disadvantage -D<0 for people classified as kind 2. Then each falsely positive classified person suffers a double disadvantage of -(A+D), because he or she should have received the advantage A but suffers the disadvantage -D instead. This will be considered unfair and will call the legitimacy of the procedure into question. False negatives, in contrast -- those who are classified "desirable" but are in fact "undesirable" -- enjoy a double advantage of (A+D). They may also create an extra damage E>0 to society. Overall, the classification produces a gain of G=N1[(1-β)A+β(A+D)] for individuals classified as kind 1 and a cost of C=N2[(1-α)D+α(D+A)] for individuals classified as kind 2. The overall benefit to society would be B=G-C-E. Unfortunately, there is no guarantee that it is positive.
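To make these quantities concrete, here is a small worked example with invented numbers, which also checks the criteria B>0 and B>U proposed at the end of this box:

```python
# Worked example of the box's formulas with purely illustrative numbers.
N1, N2 = 900_000, 100_000    # people classified as kind 1 / kind 2
alpha, beta = 0.05, 0.05     # false positive / false negative rates
A, D, E = 10.0, 50.0, 1_000_000.0   # advantage, disadvantage, extra damage

G = N1 * ((1 - beta) * A + beta * (A + D))    # gain to those classified kind 1
C = N2 * ((1 - alpha) * D + alpha * (D + A))  # cost to those classified kind 2
U = alpha * N2 * (A + D)                      # total unjust burden on false positives
B = G - C - E                                 # overall benefit to society

print(f"G = {G:,.0f}, C = {C:,.0f}, U = {U:,.0f}, B = {B:,.0f}")
print(f"B > 0: {B > 0}, B > U: {B > U}")      # the proposed quality criteria
```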

To demonstrate this, let us consider a business application of Big Data in which the economic profit P (e.g. from selling cheaper insurance contracts to people of kind 1) is a fraction f of the gain, i.e. P=fG. If applied to many people, the application may be profitable even if the fraction f<1 is quite small. Moreover, from the point of view of a company, discrimination may be rewarding even if it has an overall disadvantage for people (i.e. even if the overall benefit B is negative). This is because a company typically cares about its own profits and its customers, but not about everybody else. Clearly, if some insurance contracts get cheaper, others will have to become more expensive. In the end, people with high risks will no longer be offered insurance, or only at an unaffordable price, so some victims of accidents may not be compensated at all for their damage.

Even if B is positive, the profit P may be smaller than the unjust disadvantage U=αN2(A+D), which is the total price that false positives have to pay. Such a business model would create a situation that I will call a "discrimination tragedy," where citizens have to pay the price for economic profits, even though they are not getting a good service in exchange.


It is, therefore, in the public interest to establish binding standards for the "healthy" use of Big Data algorithms, regulating the required predictive power and the acceptable values of α, D, B and U. A cost-benefit analysis suggests demanding B>0 (there is a benefit) and B>U (the benefit is high enough to compensate for unjust treatments). Moreover, αN2 and D should stay below acceptable thresholds. Today, these values are often unknown, which means we have no idea what economic and societal benefits or damages are actually created by current applications of Big Data.