Society should do more to encourage promising students to become the next generation of data scientists, according to one speaker at last week's O'Reilly Strata Conference in Santa Clara, Calif.
Rachel Schutt, a senior research scientist at analytics startup Johnson Research Labs in New York, told Strata attendees that a shortage of data scientists is looming at a time when the world needs more individuals to take up a still-evolving profession. "Why do we need next-gen data scientists? The short answer is that we have a large number and large space of problems that need to be solved and there is a talent shortage of people with the skill sets to solve them," she said. "Universities aren't necessarily producing those people in some cases."
Data scientists of the future will work on solutions to problems in many industries and fields of study, Schutt said, including biomedical research, informatics, disease prediction, government, education and urban planning, among others. Next-gen data scientists will also help improve the lives of citizens, and help business and society gain a deeper understanding of human beings and their behavior. "In order to work with this data to solve problems, data scientists need a hybrid skill set in computer science, statistics, utilization, engineering," she said. "But we face a talent shortage, and so we need to train the future generation to sign up."
A former senior statistician at Google Research, Schutt is also an assistant adjunct professor at Columbia University, where she taught a course titled "Introduction to Data Science" during the fall 2012 semester. She said that another benefit of encouraging students to take up the profession is that it will lead to a clearer definition of just what the "data scientist" role entails.
"Data science needs to be defined in a more deep way to rigorously merit the term 'science,'" she said, "and getting more trained people into the profession will also help actually define the scope of data science in a natural and rigorous way."
Data scientist role covers many bases
The exact definition of "data scientist" is still difficult to pinpoint because those in the profession currently tackle a wide variety of tasks and boast varying skills -- and not all have the words "data" and "scientist" in their titles. In an effort to help define the current state of data science, Schutt brainstormed a list of the types of things that data scientists are doing today. She presented the list to the crowd at Strata.
According to Schutt, data scientists often do a great deal of exploratory data analysis to create data visualizations for reporting purposes. They spend a great deal of time using data to come up with unique business insights and metrics. They help companies make important data-driven decisions, and they tend to be skilled users of big data technologies like Hadoop, MapReduce, Hive and Pig. They often are hackers and usually boast proficiency in R, Python, C, Java and other programming languages. They write patents and act like data detectives in their efforts to uncover useful information.
Data scientists spend time predicting future behavior based on what has been seen in the past. They write their findings and predictions in reports, presentations and journals. They are good at creating algorithms and statistical models, and they know how to encourage machines to learn over time. They ask good questions, and they have developed the ability to make inferences from data. They often have good instincts when it comes to correlating the relationships between data, they know how to design and analyze experiments, and they try to establish causality when they investigate an issue.
Data scientists usually have training in such quantitative fields as statistics or mathematics, but they might also need to develop their communication skills so they can relate their findings to nontechnical business professionals. Companies embarking on data science initiatives might also find that they need a team to be effective, because each data scientist is different.
"There is variability among data scientists in terms of their backgrounds and in terms of their ability," Schutt explained. "No one data scientist can actually do all this stuff."
An aspiring data scientist weighs in
It's easy to understand why it might be difficult to encourage students to take up data science, according to conference attendee Ramesh Sampath, an aspiring data scientist who listened to Schutt speak at the conference.
Sampath, a statistics and data mining student at Texas A&M University who does consulting work on the side, said the profession is not clearly defined yet and that can present challenges to up-and-comers.
"If you want to be a Web developer, you learn JavaScript, you learn some HTML tags, and you are a Web developer," Sampath said. "But that's not as simple when you talk in terms of a 'data scientist,' because the skills are so wide-ranging."
Sampath's advice for other students or aspiring data scientists is to avoid the temptation to jump right in and start analyzing humongous data sets with Hadoop and related tools. Instead, students should begin by looking at small data sets and getting a feel for how different data points relate to one another. "Just try to look curiously at small data sets and then try to understand the data," he said. "Just explore the data with charting and graphing. I think charting and graphing is great because it shows the relationships between data."
After gaining a better understanding of the data, Sampath said, it's time to start thinking about whether those relationships can be modeled to create forecasts or predictions. He added that any data scientist wannabe should also spend a great deal of time learning how to program. "Be a hacker," he said, "because I think we have moved away from saying that we don't have to be very good technically. You have to really track and explore things and not delegate that work to somebody else."