Big data is a perfect representation of the difficulty in governing innovation. A complex web of technologies creates a seemingly endless chain of questions that require regulatory attention, if not answers. How should personal data be collected? What should be collected? How much consent should be required for the collection? How much of that consent should be based on knowing the end-use of the data? Do the companies who collect the data even understand how it is used? Do the people who wrote the algorithms which analyze it even know?
This last question has become particularly interesting and difficult to answer as machine learning’s ability to process big data removes the requirement of explicitly programming the decision of what to do with the information. Rather, only an objective function (which, in the case of the private sector, is typically profit) and a method programmed to optimize that function are required. This is referred to as a "black box": things go into it, things come out of it, but the transmutation itself is inscrutable.
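To make the idea concrete, here is a minimal sketch (in Python, using scikit-learn and entirely made-up data, so the feature meanings are hypothetical) of how little of the decision logic is written by hand: only the training data and an objective to optimize are supplied, and the learned rules end up spread across hundreds of trees rather than in readable code.

```python
# Minimal sketch of the "black box" idea: no decision rules are written by hand.
# The only things specified are the training data and an objective to optimize
# (GradientBoostingClassifier minimizes a log-loss objective by default).
# The data below is synthetic and the feature meanings are hypothetical.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # e.g., purchase history, location, engagement
y = (X @ rng.normal(size=5) > 0).astype(int)   # e.g., "clicked the ad" / "repaid the loan"

model = GradientBoostingClassifier().fit(X, y)

# Inputs go in, predictions come out; the learned rules are distributed across
# an ensemble of trees and are not directly readable as an explicit policy.
print(model.predict(X[:5]))
```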
Facebook made a few headlines when it patented an algorithm that could allow lenders, when looking at someone’s credit score, to also look at the scores of those in their friend network on the platform, and to bolster or lower the score accordingly. This was not even particularly new technology: Facebook filed for the patent in 2012, but it went unnoticed until this year. Similarly, the Chicago Police Department used algorithms based on a Yale sociologist’s work to create a list of 400 people considered at high risk of committing a violent crime. ProPublica found that the geographic variables in the Princeton Review’s pricing model for its SAT tutoring packages resulted in higher prices being charged in heavily Asian areas. There are many other examples of how big data inadvertently discriminates on the basis of gender, race, and socio-economic background. Some of the ways this manifests initially appear somewhat benign, such as only 11% of the Google Image search results for “C.E.O.” being pictures of women, but even that provides an unintentionally misleading portrait of the position when you consider that 27% of C.E.O.s in the US are women.
Governments appear to be aware of the need for some oversight, particularly since many of the affected groups are legally protected in these situations. The Executive Office of the President of the United States put out a document entitled “Big Data: Seizing Opportunities, Preserving Values” in May of 2014. In it, the advisory committee tasked with examining the effects of big data on the American way of life wrote: “The increasing use of algorithms to make eligibility decisions must be carefully monitored for potential discriminatory outcomes for disadvantaged groups, even absent discriminatory intent” (p. 47). If we agree that algorithms which discriminate need some form of oversight, the next question, naturally, is how?
A group of computer scientists* focused on discriminatory algorithms has proposed one method to address them in the US. Their paper examines these algorithms through a legal framework, using a theory of US anti-discrimination law called disparate impact. To summarize a highly complex piece of legal theory in the briefest way possible: disparate impact is used to guard against unintended discrimination against a protected class, such as race (protected classes are defined by statute). Disparate impact occurs when a protected class experiences an adverse effect; again, this is not an intentional effort to harm the class but an indirect outcome of some policy or, in the age of big data, an algorithm. (Additionally, for the discrimination caused by disparate impact to actually be illegal, the practice producing it must not be a provable, necessary requirement in the context in which it occurs.) ProPublica also has a good explanation of the legalities of disparate impact in the context of new technology.
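As a rough illustration of how disparate impact is commonly quantified, the sketch below computes the ratio of favorable-outcome rates between a protected group and everyone else, the “80% rule” of thumb that the authors build their formal measure on. The column names and the toy data are hypothetical, chosen only to show the calculation.

```python
# Sketch of the "80% rule": compare the rate of favorable outcomes for a
# protected group against everyone else. A ratio below 0.8 is the usual
# warning sign for disparate impact. Column names and data are hypothetical.
import pandas as pd

def disparate_impact_ratio(df, outcome_col, group_col, protected_value):
    """Ratio of favorable-outcome rates: protected group vs. everyone else."""
    protected = df[df[group_col] == protected_value]
    others = df[df[group_col] != protected_value]
    return protected[outcome_col].mean() / others[outcome_col].mean()

# Example: hiring decisions (1 = hired), with 'group' as the protected attribute.
df = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "hired": [ 1,   0,   0,   0,   1,   1,   1,   0 ],
})
print(disparate_impact_ratio(df, "hired", "group", "A"))  # 0.33, well below 0.8
```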
The authors put forth a mathematical way to determine how well an algorithm can predict one of these protected classes (a “protected attribute”) from the data it uses (the other “attributes”). They then introduce methods of transforming datasets so that the algorithm in question cannot predict the protected attribute, while preserving the other data well enough for the algorithm to function with an acceptable degree of accuracy. These methods are then put into practice on real-life datasets from actual disparate impact cases. This has the potential to be a particularly effective approach because neither the test for whether an algorithm could cause disparate impact, nor the remedy to prevent it from doing so, relies on access to the algorithm itself (algorithms tend to be proprietary and therefore difficult to obtain). Instead, both focus on the dataset the algorithm uses. The authors note that the paper explores only numerical attributes, and that other types (e.g., categorical attributes) may prove more of a challenge.
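A hedged sketch of the intuition behind that test: if the non-protected attributes can predict the protected attribute well, then an algorithm trained only on those attributes can still produce disparate impact, so simply dropping the protected column is not enough. The synthetic data below and the use of plain cross-validated accuracy are my own simplifications; the paper itself uses a more careful error measure.

```python
# Sketch of the certification intuition: how well can the remaining attributes
# predict the protected attribute? High predictability means the protected
# attribute "leaks" through proxies. Data and attribute names are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 2000
protected = rng.integers(0, 2, size=n)                     # protected class label
proxy = protected * 1.5 + rng.normal(size=n)               # correlated proxy (e.g., ZIP-level income)
unrelated = rng.normal(size=n)                             # attribute with no relationship

X = np.column_stack([proxy, unrelated])
score = cross_val_score(LogisticRegression(), X, protected, cv=5).mean()

# A score well above chance (0.5 here) suggests the dataset admits disparate
# impact even if the protected attribute itself is never given to the algorithm.
print(f"Predictability of protected attribute: {score:.2f}")
```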
Discrimination by credit lenders, law enforcement and employers is not new, but big data has, intentionally or not, enabled new ways to obfuscate it, hidden in rows of code, and new rows of code may be the only way to catch it.
*The group has also compiled a list of further reading on the topic of algorithmic discrimination here: http://fairness.haverford.edu/