Data science: Because data isn’t just for scientists
In February 2013, a 31-year-old man was arrested while driving a car in La Crosse County, Wisconsin. Eric Loomis pleaded guilty to eluding an officer and was sentenced to six years in prison. Before his sentencing, Loomis received a score from an algorithm called COMPAS, a program developed by software company Northpointe to score the risk of recidivism for parole candidates.
Loomis’s COMPAS score, which purported to show he was at high risk of committing another crime, played a part in his six-year sentence. Loomis later challenged the use of the algorithm in court, saying it violated his right to due process. Loomis lost the case and COMPAS remains in use throughout the country.
COMPAS is an example Dan Runfola, the director of William & Mary’s new data science program, uses to demonstrate the ethical issues with relying on predictive data to calculate recidivism rates. Northpointe has never released the algorithm behind its risk assessment software, but an investigation by ProPublica suggested race was a factor.
“If we are trying to predict the risk of a criminal recidivating and going back to prison, and race is quantitatively a good predictor for that, should we use it?” Runfola said. “William & Mary students need to be ready to critically answer these questions.”
Ethics is a persistent theme throughout the new program, which was designed to give students in any discipline the computational chops to create and analyze giant data sets.
“I see our job as preparing the next generation of students to be able to understand the advantages and limitations of computation, but also to consider the implications,” said Runfola, who also serves as AidData’s senior geospatial scientist and is an assistant professor of applied science. “The ethical dilemmas in data are enormous, everywhere and hidden at the same time.”
Since the data science program began accepting students this past fall, it has made a strong case that computational applications are valuable to students with a wide array of interests. The program has already amassed 17 declared minors and 15 self-designed majors (an official major in data science is still in the works). A total of 251 students are enrolled in data science courses this spring, a 32 percent increase from the fall semester, and nearly every discipline is represented in the student body.
“It is exceeding expectations,” said Dennis Manos, vice provost for research at William & Mary. Data science is the most fully developed in a series of educational initiatives spearheaded by Manos that are aimed at integrating data analysis, engineering and design into the university’s liberal arts curriculum. A second initiative, the new physics track in Engineering, Physics and Applied Design (EPAD), will begin accepting students next semester.
“My appreciation of it at this point is that the courses are to capacity or beyond,” Manos said of the data science program. “Which is all we can ask for. If you hang out a shingle and people come, you're happy.”
So far, the program has done far more than hang out a shingle; it has served as a kind of pipeline for research laboratories on campus. Researchers who collect massive amounts of data have come to lean on the data science program to train students for work in the lab.
“We have affiliate faculty that are in departments that don’t have a depth of computational skills. Those faculty have a need,” Runfola said. “They want students in their labs helping them do research, but they can’t because the students don’t yet have the skills that they need.”
Margaret Saha, Chancellor Professor of Biology, says a full half of her lab is involved in the new program. In her field, technological innovations have made data collection exponentially easier in recent years. For example, a routine RNA sequencing run yields about 60 million reads, with roughly 300 bases in each read, she said.
“How do we take all of this data and turn it into knowledge?” Saha said. “That’s where data science comes in. It is addressing this major problem and huge need in biology.”
Saha concedes that not all biology students are initially interested in learning how to write programs or manage massive data sets, but the way data science is pitched, as just another tool in their biology toolbox, helps them “see the light.”
Manos says that is exactly the point of data science. It is meant to attract students who have strong interests outside of computer science and quantitative analysis. He says the endgame for students should not be understanding coding architecture or the underpinnings of Boolean algebra, but “using these skills as tools to find some long-term benefit or solution they’re being asked to provide.”
“I like to jokingly refer to data science’s teaching model as entrapment,” Runfola said. “A lot of the students we are serving would never consider themselves computer science people. We want to show them that they’re wrong.”