If Mark Zuckerberg’s hearings on data privacy taught us anything, it’s that U.S. lawmakers either don’t understand how technology works or they’re unable to express their thoughts, questions, and concerns in technical terms.
However, lawmakers across the world most certainly understand the issues and impact that technology is having and will likely continue to have on our society. The problem is that if they want to regulate technology appropriately, they need a complete understanding of data and algorithms. What is currently being asked of them is the overwhelmingly difficult task of identifying with a variety of technical roles, the work associated with these roles and the ethics and consequences they engender. It is hard to blame them for their lack of acculturation in data. It does not, however, excuse them from their duty to the populations they represent.
Most developers don’t touch data, and thus lack this understanding. When developers build applications, they are assembling functional code with the objective of ‘making something work.’ The field of data is vastly different. The aspect with which developers are most familiar involves data collection or user tracking, known better to developers as analytics. While many developers are familiar with analytics, they either don’t implement these systems or favor solutions, like Google Analytics, that block access to raw data.
Data also involves analysis work. In most companies data scientists are responsible for evaluating raw data. Most data scientists can model data, but don’t necessarily grasp the consequences of their models. In a commercial context, data scientists leverage a set of tools that help them find patterns in information so they can be exploited generally to increase revenue or engagement. The factors they control for thus may or may not account for the side effects of their models.
Product managers and data scientists are often aligned on metrics. Most product managers are goal-oriented, often choosing to ignore the secondary impacts of product decisions on their userbase in favor of increased revenue and retention. They leverage data models, and the way the results of those models are displayed, to intentionally influence user behavior.
In order for reasonable and effective policy to emerge through the legislative process, our elected officials must be able to empathize with the problems faced by technical workers, while setting boundaries and limitations in accordance with societal ethics.
While there are many voices advocating for change or for limitations, the perceived problems surrounding data privacy can be separated into three distinct notions:
1) the breadth of data capture that enables high performance, high accuracy predictive algorithms to be deployed;
2) once gathered, control over access to stored information; and
3) unchecked (perhaps unknown) biases that may result from an algorithm’s prediction.
Below are 12 of the hundreds of questions and follow-up questions that could help lawmakers draw parallels between the technical nature of data and algorithms and the societal impacts that unregulated companies may be propagating across the population. An explanation of why each question is so important follows the list.
THE 12 QUESTIONS THAT WOULD MAKE ZUCK NERVOUS
- Can you explain how Facebook prioritizes the types of information it collects on its users and how the process of defining which data to collect works?
- Can you explain the extent to which Facebook uses demographic data in algorithms for the purposes of collecting, storing, distributing or otherwise manipulating content?
- Can you explain the extent to which Facebook uses behavioral data in algorithms for the purposes of collecting, storing, distributing or otherwise manipulating content?
- If a user chooses to delete his or her profile, is the data the user generated during his or her usage permanently deleted?
- Does Facebook have the ability to continue collecting information on users that have closed or deleted their accounts or individuals that have never created accounts on Facebook? If so, does Facebook do so and how does this work?
- If user data is permanently deleted, how could you provide proof that it has been deleted? Could this proof be provided by a third party in your opinion?
- If a user shares data with a third party application, is this third party application technically (not legally) able to share a user’s information with a fourth party? How and why?
- Has Facebook considered an opt-in approach to data collection? What do you believe the ramifications of such a policy would have on Facebook?
- Can you explain how Facebook’s content algorithms are conceived and the goals for which they are optimized? Specifically, can you explain how algorithms that distribute content to users work and the metrics for which they are optimized?
- Does Facebook use deep learning in any of its algorithms that distribute content to or collect and process data on users? If yes, can you describe what they do?
- Does Facebook distinguish good quality from poor quality content? If so, can you tell us the variables that are considered? Please be explicit and quantify the attributes you refer to. If Facebook does not track the quality of content, why has it chosen not to?
- To what extent is Facebook able to evaluate the extent to which an algorithm biases or may bias a user?
As a society, we are far from dealing with the ethical questions that have arisen as a result of the innovations of the 21st century. It is difficult to fathom how the political system will be able to catch up. In Europe, the GDPR is a great start. It will regulate data privacy in many relevant and groundbreaking ways. In my opinion, more than the explicit regulations it accounts for, the GDPR will open the door to debates on data privacy that will begin to take place across European courtrooms. Judges are not lawmakers. They are likely to encounter the same issues involving an adequate understanding of technology. However, the nature of the judicial system offers judges an environment of thoughtful deliberation. They will ultimately be the ones to assess the goals and implications of case decisions on policy and the consequences of case outcomes on guaranteed rights and freedoms.
Answers to the questions, whether obtained through testimony in the U.S., Europe or elsewhere, would contribute to an improved understanding by policymakers. Even so, lawmakers still face an uphill battle: these questions require attention to detail as well as public support. In order to serve as guardians of public interest, lawmakers must develop a more refined understanding of the technical aspects of data privacy and the mechanisms by which data is collected and crunched.
In my opinion, the lack of agility of the American political system (and to a lesser extent the European one), the illiteracy of the general public towards the data they create online and an aging political class will likely see policy lag increasingly behind the technological revolution. We are only just beginning to address individuals’ rights regarding the data they produce as we enter an age where *how* that data is used will be just as important, if not more so.
Can you explain how Facebook prioritizes the types of information it collects on its users and how the process of defining which data to collect works?
Mark Zuckerberg may or may not have a detailed operational vision of how this process works. However, it is likely that Facebook uses a set of proprietary tools to capture event data (i.e. behavioral information) for every possible interaction on its platform. Tools like Segment and Mixpanel allow their customers to mimic GAFAM-style (Google, Apple, Facebook, Amazon and Microsoft) data capture, meaning that the tech giants likely pursue a similar approach, at a much larger scale.
These analytics events may or may not have secondary descriptive characteristics. Secondary descriptive characteristics (i.e. properties) link an interaction to a static piece of information.
For example, a user updating a relationship status would trigger Facebook’s data capture system to log the update. The update takes the form of an interaction with the platform. An interaction in analytics is known as an event. Events track users’ behavior on a given screen, page or feature.
In this case, the behavioral analytics event would record that a specific user’s relationship status was updated. A property, or qualifying piece of information, of this interaction would be the ID (and thus the name and other identifying information) of a second user: the person with whom the user who updated her profile has declared a relationship. This secondary characteristic represents a piece of demographic information that is a consequence of behavioral information.
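To make the distinction between an event and its properties concrete, here is a minimal sketch of what such a record might look like. The schema, the field names and the `track_relationship_update` helper are all hypothetical; nothing here reflects Facebook’s actual internal format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass
class AnalyticsEvent:
    """One logged interaction. Hypothetical schema, for illustration only."""
    user_id: str                       # who performed the interaction
    name: str                          # what happened (the behavioral part)
    properties: Dict[str, Any] = field(default_factory=dict)  # qualifying details
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def track_relationship_update(user_id: str, partner_id: str, status: str) -> AnalyticsEvent:
    """Build the event a tracker might log when a relationship status changes."""
    return AnalyticsEvent(
        user_id=user_id,
        name="relationship_status_updated",
        # The partner's ID is a property: static, identifying information
        # captured as a side effect of a behavioral event.
        properties={"new_status": status, "partner_user_id": partner_id},
    )


event = track_relationship_update("user_123", "user_456", "in_a_relationship")
```

Note that the second user appears in the log without having done anything: the property is recorded because of someone else’s behavior.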
Most static or demographic information is collected through the collection of behavioral analytics. The number of events (i.e. points of data capture) will likely be directly correlated to the number of possible interactions on the platform, from liking and commenting to searching and scrolling. Facebook goes so far as tracking users across domains it doesn’t own and even tracks users that don’t have accounts on Facebook via cookies and snippets. In this case, it would be interesting for Zuckerberg to explain why this is necessary, how this occurs, how transparent this is to end users and what his opinion is on the ethics of it.
In turn, every interaction is associated with a corresponding piece of content or a feature. For example, when a user likes content, they are in fact reacting to an image, text, a video, etc. Facebook’s algorithms correlate behavior to content in order to provide more “relevant” content during future user sessions. Understanding what Facebook considers when collecting information on content could open the door to other questions.
This core question could be important because it allows politicians to recognize the breadth of data capture in which Facebook engages. It also confirms what types of data Facebook captures and by which means.
Can you explain the extent to which Facebook uses demographic data in algorithms for the purposes of collecting, storing, distributing or otherwise manipulating content?
While the concept of demographic and behavioral data isn’t formally defined in data science or in the field of technology more generally, dividing data into two parts can be helpful. Understanding that Facebook has information on *who an individual is* (as opposed to what he or she does) is an important factor to consider in how algorithms ingest data and refine their future predictions based on feedback and validation loops.
Let’s take an extreme example. This example is solely for the purposes of illustrating a potentially high level of gravity and is in no way true or verified. Imagine that Facebook uses an image recognition algorithm to distinguish unique individuals for the purpose of suggesting tags in photos or for any other unknown internal reason. It is extremely likely that this algorithm processes data such as the hexadecimal color of each pixel.
Because a group of pixels may indicate an individual’s skin color (a property understood by humans, but a parameter that is not explicitly expressed in the image recognition algorithm), there may be a level of ambiguity regarding how the algorithm predicts that an individual is a specific person or how the algorithm clusters similar individuals. From a human ethical perspective, one may consider that, by its nature, the use of this information in the context of image recognition could potentially contribute to demographic bias in its current application or in a known or unknown future application.
At the moment, the decision to leverage such an algorithm is subject to Facebook’s internal policy. It is perhaps in the public’s general interest to stop, control, or permit Facebook to leverage these types of algorithmic features or control the context in which they are used. In other words, policymakers should contemplate whether Facebook or any other company should use specific demographic information (whether programmed intentionally or unintentionally) to profile individuals in such a way.
Can you explain the extent to which Facebook uses behavioral data in algorithms for the purposes of collecting, storing, distributing or otherwise manipulating content?
It is important to consider the nature of the data Facebook may collect. While demographic information identifies the traits of an individual, behavioral information defines how they comport themselves relative to a given stimulus – in this case, the display of a page and its content. Almost unquestionably no two users have the same behavior over time on a given service.
An organization like Facebook may rely on a user’s behavior to profile users as much as or more than on demographic or descriptive information about that individual user. In such a case, it becomes significantly more difficult to associate a bias with a type of person, unless correlations between behavior and demographic information can be identified.
Whether this is the case or not, ensuring transparency into the use of behavioral information can help shape policy regarding the extent to which bias can be monitored and controlled with respect to the types of data being used in an algorithm.
If a user chooses to delete his or her profile, is the data the user generated during his or her usage permanently deleted?
Because tech companies currently manage their own data policies without third party audit imposed through a legal framework, companies may not delete user data despite a user’s decision to close his or her account.
Zuckerberg has said explicitly that user data is deleted when a user deletes his or her account. However, at the moment, there is little transparency about whether the deletion has actually taken place. And even if companies and their CEOs claim to delete user information permanently, proof of deletion is usually not provided. This question merits a finer understanding of how Facebook defines the word ‘deletion’ as well as how Facebook could use the data of a deleted user.
If a company like Facebook chooses not to delete a user’s data, it might be important for the public to understand the perceived value and purpose that saving it serves. Data may be used to categorize other users based on commonalities with the data from a deleted user. For example, a model that classifies, segments or clusters new users or new user behavior based on existing user behavior of active or inactive users, could make use of this data in order to determine the specific content to serve new or existing users.
One argument Zuckerberg may favor is that a user’s data may be worth saving in case that user returns to the platform in the future. As a result, Facebook’s algorithms that create “personalized” experiences would not have to fully reprofile the existing user. They would simply base predictions about content and interests on the existing data trail, providing a way to bootstrap content recommendations.
For all intents and purposes, the refusal to delete a person’s data may equate to a denial of the right to be forgotten. This so-called right has yet to be legally defined as a right in the US. In Europe, however, the right to be forgotten is gaining public and legal traction. The public must consider the advantages and disadvantages that such a policy would have on the record of history, factual accuracy and the expression of opinion as much as an individual’s control of their own privacy and the data they knowingly and unknowingly produce.
Does Facebook have the ability to continue collecting information on users that have closed or deleted their accounts or individuals that have never created accounts on Facebook? If so, does Facebook do so and how does this work?
Much of the data collected by companies with an online presence is gathered through cookies. Cookies allow a company to track unidentified individuals (i.e. visitors) across a company’s site. Cookies are often used to define and implement acquisition strategies based on the particular behavior of an unknown visitor or group of visitors.
Some companies like Facebook reach further than their own domains by offering their clients and partner sites bits of code (i.e. snippets) intended to optimize advertising. These bits of code simultaneously allow Facebook and similar companies to unify user tracking across proprietary domains and those of third parties. Whether Facebook should be able to evaluate user behavior on its own site as well as on those of its advertisers and partners, with or without the knowledge of visitors and end users, may be an additional question in the privacy debate.
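The mechanism can be illustrated with a toy model. In this sketch, a hypothetical `Tracker` class stands in for the server that receives calls from a snippet embedded on several unrelated sites; the cookie ID, site names and pages are invented, and real tracking systems are considerably more elaborate.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class Tracker:
    """Toy stand-in for a server that receives calls from embedded snippets."""

    def __init__(self) -> None:
        # cookie ID -> list of (site, page) visits reported by the snippets
        self.visits: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def record(self, cookie_id: str, site: str, page: str) -> None:
        """Called each time a page carrying the snippet is loaded."""
        self.visits[cookie_id].append((site, page))

    def sites_seen(self, cookie_id: str) -> Set[str]:
        """All sites on which the same browser cookie was observed."""
        return {site for site, _ in self.visits.get(cookie_id, [])}


tracker = Tracker()
# The same browser, identified only by a cookie, visits two unrelated sites.
tracker.record("cookie_abc", "news.example", "/politics")
tracker.record("cookie_abc", "shop.example", "/running-shoes")
profile = tracker.sites_seen("cookie_abc")
```

No account and no name are involved at any point: the cookie alone is enough to join the visits into a single browsing profile.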
Much of the online advertising industry is based on this technology, so if lawmakers favor user privacy outright over deregulation of this type of tracking, then the online advertising world will likely take a very heavy financial hit.
If user data is permanently deleted, how could you provide proof that it has been deleted? Could this proof be provided by a third party in your opinion?
In the case that legislators favor the right to be forgotten, then they must have an enforcement or audit mechanism in place to protect end users. While users can request their data, and in some jurisdictions, specify the precise data points they would like to permanently delete, no proof or verification that their request has been carried out can be provided.
This question goes to the heart of auditing data lakes and warehouses of companies that capture and leverage end user data. While the concept of audits in the world of information can be applicable in a variety of contexts, in practice, validation and proof of data deletion is likely to be at the forefront of the data rights debate.
As a result, if lawmakers accept this principle, they must begin exploring the human and mechanical means by which such proof can be provided. Regardless of the proof and validation systems that governments decide to deploy, these systems will likely be scaled to other operations involving data. They must thus be conceived for flexibility in the face of governance.
This idea touches on the use of blockchains. While blockchains are largely associated with cryptocurrencies, they are simply ledgers that allow multiple parties access to a log of agreed-upon events. Blockchain ledgers are designed to be immutable: each entry is cryptographically linked to the one before it, so an attempt to tamper with or modify a past entry is detectable by the other participants. So long as the participating actors in a blockchain agree that an event has taken place, it cannot practically be reverted, modified or deleted.
Whether or not current blockchains represent a long-term means of addressing this particular question, lawmakers should be familiar with how and why blockchains are pertinent. It will serve them not only in this instance, but in the debates, discussions and vital decisions about data, how it is used and the fundamental ways it will affect our societies, cultures and economies.
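The property that makes such a ledger hard to alter quietly can be shown in a few lines. This is a minimal, single-machine sketch of a hash chain, assuming invented event names; it leaves out everything that makes a real blockchain distributed (multiple parties, consensus, signatures) and is not a proposal for an actual audit system.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(payload: Dict, previous_hash: str) -> str:
    """Hash an entry together with the hash of the entry before it."""
    body = json.dumps({"payload": payload, "prev": previous_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(ledger: List[Dict], payload: Dict) -> None:
    """Add an entry that is linked to the current end of the ledger."""
    previous_hash = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append({
        "payload": payload,
        "prev": previous_hash,
        "hash": entry_hash(payload, previous_hash),
    })


def is_intact(ledger: List[Dict]) -> bool:
    """Recompute every hash; any edit to a past entry breaks the chain."""
    previous_hash = GENESIS
    for entry in ledger:
        if entry["prev"] != previous_hash:
            return False
        if entry["hash"] != entry_hash(entry["payload"], entry["prev"]):
            return False
        previous_hash = entry["hash"]
    return True


ledger: List[Dict] = []
append(ledger, {"event": "deletion_requested", "user": "user_123"})
append(ledger, {"event": "deletion_confirmed", "user": "user_123"})
intact_before = is_intact(ledger)

# Someone quietly rewrites history...
ledger[0]["payload"]["event"] = "nothing_happened"
intact_after = is_intact(ledger)
```

Here `intact_before` is true and `intact_after` is false: the rewrite is detectable by anyone who holds a copy of the hashes. A ledger of this kind shows that an event was recorded and not altered afterwards; whether the recorded event truly happened still has to be established by other means.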
If a user shares data with a third party application, is this third party application technically (not legally) able to share a user’s information with a fourth party? How and why?
The answer to this question is without hesitation ‘yes.’ When a Facebook user decides to offer access to their data to a third party, the access to that data is granted (i.e. authorized) by a Facebook user. When the application begins reading that data, it is effectively copying that information to its own servers.
Once the data is on a third party’s servers, that actor can use it however it likes. While Facebook may cite policies in their terms and conditions about the use of this data, there is no formal recourse imposed by government on the mishandling of private information by third parties.
The ‘how’ question is perhaps the most important factor. Cambridge Analytica was able to extract data on 87 million users through a base of only 270,000 who participated in Cambridge Analytica’s survey. How is it possible that a third party could access such a breadth of information?
Facebook provides third parties with access to an API (i.e. a plug into its database). Facebook controls this access by permitting a fixed number of types of requests for data. Some of this data might concern a user and his or her activity, while other data might concern his or her friends. The extent to which Cambridge Analytica was able to access data from second degree connections is wholly limited and governed by Facebook.
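A toy example shows how a permission that extends to friends multiplies the reach of a single consent. The friendship graph and the `reachable_profiles` function below are invented for illustration and do not correspond to any real Facebook API call.

```python
from typing import Dict, Set

# Toy friendship graph: user -> set of friends (hypothetical data).
FRIENDS: Dict[str, Set[str]] = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice", "erin"},
    "carol": {"alice"},
    "dave": {"alice", "frank"},
    "erin": {"bob"},
    "frank": {"dave"},
}


def reachable_profiles(consenting_users: Set[str], include_friends: bool) -> Set[str]:
    """Profiles an app could read, given who installed it and what the platform permits."""
    profiles = set(consenting_users)
    if include_friends:
        # The platform, not the app, decides whether this branch exists.
        for user in consenting_users:
            profiles |= FRIENDS.get(user, set())
    return profiles


installed = {"alice"}  # only Alice took the survey
own_data_only = reachable_profiles(installed, include_friends=False)
with_friends = reachable_profiles(installed, include_friends=True)
```

With one consenting user, the app reads one profile in the first case and four in the second. On the figures cited above, 87 million profiles from 270,000 participants works out to roughly 320 profiles per participant.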
The ‘why’ aspect can be addressed in many ways, and it would be up to Zuckerberg to choose his reasons for allowing access to data in the ways Facebook does. Zuckerberg and Facebook likely never thought that data from the platform would be exploited so successfully in such a malicious way. However, Facebook’s APIs are intentionally designed to allow for the depth of access exploited by Cambridge Analytica.
Has Facebook considered an opt-in approach to data collection? What do you believe the ramifications of such a policy would have on Facebook?
Facebook may have considered an opt-in policy at some point in its history, but has never deployed one that covers the entirety of the data a user produces. Of course, it is in Facebook’s interest to leverage information on user behavior in addition to demographic information, over which users have some control.
Facebook currently allows users to control the data that other users on Facebook can access. It also provides information about what third parties use when sharing data with them. However, when Facebook launches new features, it rarely favors an opt-in approach, instead favoring an opt-out strategy (e.g. private profiles), and in some cases offering no option at all (e.g. some automated Facebook notifications).
The ramifications of an opt-in policy would be drastic for Facebook. The amount of data it or any other company collects would be drastically reduced. Mark Zuckerberg and every other Silicon Valley executive would agree that one of the most viable business models in tech would all but cease to exist.
Ultimately such a policy translates to lower overall engagement and less time on apps, meaning fewer eyeballs, fewer ads and lower revenue. Such a policy would effectively destroy Facebook’s business model, potentially forcing Facebook to move towards a pay-to-play business model. And who would pay for that?
Can you explain how Facebook’s content algorithms are conceived and the goals for which they are optimized? Specifically, can you explain how algorithms that distribute content to users work and the metrics for which they are optimized?
Every algorithm processes a set of inputs and renders an output. In predictive modeling, algorithms can take many forms, from simple linear regressions to deep learning neural networks. There is a wide range of complexity in algorithms, but the vast majority are conceived to manage inputs and produce outputs.
Algorithms in the enterprise context are built with purpose. They are goal-oriented. For example, an algorithm that distributes advertising content could optimize for the value of a campaign. While value is subjective, one could define it as a function of the relative demand for a volume of ad impressions (i.e. ad displays) at an expected click through rate.
In order to optimize an ad campaign or calculate its expected future performance, regardless of the complexity of the predictive model, the input parameters might include:
- the past, current and expected available volume of displayable content
- the past, current and expected volume of demand over a period for a given audience or audience parameters
- the expected performance of content relative to a given audience or audience parameters
- the actual rate of performance of content relative to a given audience or audience parameters
- the price paid for a given combination of content and audience or audience parameters at a given time
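As a rough sketch of how a few of these inputs might combine, the example below scores each content and audience pairing by expected revenue (impressions × click-through rate × price per click) and ranks them. The formula, the field names and the numbers are assumptions chosen for illustration; an actual ad system would weigh many more factors and is not public.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Placement:
    """One content/audience combination. All fields are illustrative."""
    name: str
    available_impressions: int   # expected volume of displayable slots
    expected_ctr: float          # expected click-through rate, between 0 and 1
    price_per_click: float       # price paid for this content/audience pair


def expected_value(p: Placement) -> float:
    """The goal being optimized: here, simply expected revenue."""
    return p.available_impressions * p.expected_ctr * p.price_per_click


def rank_placements(placements: List[Placement]) -> List[Placement]:
    """Order placements from most to least valuable under that goal."""
    return sorted(placements, key=expected_value, reverse=True)


placements = [
    Placement("video_ad_young_adults", 100_000, 0.020, 0.50),
    Placement("banner_ad_parents", 250_000, 0.004, 0.80),
]
best = rank_placements(placements)[0]
```

The point of the sketch is that the ranking follows entirely from the chosen objective. Swap `expected_value` for a function that rewards time on site or shares, and the same inputs produce a different ordering of what users see.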
A full explanation from Zuckerberg should entail the data that these algorithms take into account and the parameters for which they are optimized. This discussion could help nourish an understanding of the need to address transparency within the mechanisms that treat data, in addition to the collection and storage of this information.
Does Facebook use deep learning in any of its algorithms that distribute content to or collect and process data on users? If yes, can you describe what they do?
Deep learning algorithms are composed of one or multiple neural networks that process data, but offer little means of tracing how a given decision was reached. Rather, the performance of deep learning models is measured on the accuracy of the output relative to a benchmark.
Facebook likely uses deep learning models for extracting meaning from images, text, video and other types of content. It would likely be difficult for Zuckerberg to know the full extent to which Facebook uses, deploys or experiments with deep learning.
Because the decision-making of deep learning models cannot readily be deciphered, Zuckerberg would likely only be able to describe what they are applied to, why, and what they are optimized for.
Does Facebook distinguish good quality from poor quality content? If so, can you tell us the variables that are considered? Please be explicit and quantify the attributes you refer to. If Facebook does not track the quality of content, why has it chosen not to?
This question is subjective, but it allows Zuckerberg to show how deeply he has contemplated this topic. If his response is shallow, then he is either not offering his real views or he has not considered the idea enough. The latter would be difficult to fathom, however. The quality of content is at the heart of a content-based social platform, and even more so since the Cambridge Analytica revelations.
If Zuckerberg uses this moment to offer his candid thoughts on the topic, he would likely point to his revenue and engagement metrics. These metrics are driven by perhaps dozens or more discernible factors over which Facebook may have control.
For factors that are more nuanced such as the veracity of the content, the reputation of its creator and the level of third party automation (e.g. bots) involved in its dissemination, Facebook has only made passing efforts to address these notions. The company has never taken a public standpoint or made these aspects transparent to end-users.
Nevertheless, the level of optimization of these factors within the company’s content distribution algorithms remains largely unknown. Understanding how Facebook and its peers control content quality may also have an impact on our society and should likely be made open to public scrutiny and discussion.
To what extent is Facebook able to evaluate the extent to which an algorithm biases or may bias a user?
Let’s start with the obvious. Bias is a loaded term. In the US, the types of discrimination that are not permitted are codified and can serve as a guide for qualifying bias on Facebook or any other digital service. In the US, it is legal to collect and use data that identify individuals on the basis of race, religion, ethnicity, nationality, gender, etc. that could be used to measure bias and discrimination. In Europe, however, while types of discrimination are explicit in many countries, data cannot be collected about these characteristics, precisely to avoid enacting bias.
In order for bias to even be addressed it must be quantified in some way. Nevertheless, the factors that are accounted for in the quantification of bias are up for debate, and therefore there is no consistent, homogeneous legal framework to quantify bias. These decisions are currently being left up to private companies. Furthermore, even if bias were quantifiable homogeneously across all digital and non-digital services, the thresholds that determine whether bias is expressed or not could arguably be arbitrary and irrelevant depending on the service.
By way of an example, an overly simplified way of controlling for bias in the US may be to segment users by, say, sexual preference. Facebook has the ability to monitor whether or not an individual piece of content is distributed in higher or lower volumes to users with a particular sexual preference relative to a random control group. If the deviation from the control group is higher than some threshold, say 5%, then Facebook’s content distribution algorithm could theoretically adjust to account for its own bias.
Monitoring and acting on bias in this way is nuanced. Not only is the information on which the decisions about bias are based sensitive, but the core business of advertising platforms like Facebook is audience segmentation. Notably in the case of audience segmentation for paid content, there may be correlations between criteria selected by advertisers and the sexual preference of a given user. So then how, and to what extent, should Facebook prevent advertising to a set of audience parameters correlated to bias?
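The check described above can be sketched in a few lines. This example assumes the 5% threshold is read as a relative difference in delivery rate between a segment and a random control group; the function names and the numbers are invented, and a serious monitor would also need to test for statistical significance.

```python
def delivery_rate(times_shown: int, audience_size: int) -> float:
    """Share of an audience that was shown a given piece of content."""
    if audience_size <= 0:
        raise ValueError("audience_size must be positive")
    return times_shown / audience_size


def exceeds_bias_threshold(segment_rate: float, control_rate: float,
                           threshold: float = 0.05) -> bool:
    """True when the segment deviates from control by more than the threshold.

    The gap is measured relative to the control rate (an assumption;
    an absolute gap would be an equally defensible choice).
    """
    if control_rate == 0:
        return segment_rate != 0
    relative_gap = abs(segment_rate - control_rate) / control_rate
    return relative_gap > threshold


control = delivery_rate(times_shown=2_000, audience_size=10_000)  # random control group
segment = delivery_rate(times_shown=1_700, audience_size=10_000)  # segment under review
flagged = exceeds_bias_threshold(segment, control)
```

In this invented case the segment sees the content 15% less often than the control group, so `flagged` is true. Every choice in the sketch, including which segments to compare, relative versus absolute gaps and the threshold itself, is exactly the kind of decision currently left to private companies.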
How can a legal framework control for algorithmic bias? This question does not have an easy answer, but it most certainly deserves as much contemplation in political spheres as does data privacy and an individual’s right to their private data.