Data jobs often get lumped together. However, there are significant differences between a data scientist and a data engineer. When management conflates the two roles, companies can run into problems with team efficiency, system performance, scalability and getting new analytics and AI models into production.
The biggest problem companies run into is hiring data scientists when they need more data engineers, said Jesse Anderson, managing director of the Big Data Institute.
Data scientist is the hot new title with all the hype around AI, he said. Data engineers play a bigger role behind the scenes, so their efforts tend not to be as visible to, or as well understood by, management and hiring managers.
Additionally, data scientists tend to have a higher profile in providing actionable information and insights to management, so executives want more of them. The problem is the average company needs far more data engineers setting things up behind the scenes for data scientists to be productive.
Anderson suggests one of the most important things companies can do is ensure that the data engineering team gets proper credit when new insights are discussed with management. This helps ensure proper investment in building up the data engineering team.
Experiments vs. software projects
Data scientists are far more comfortable thinking about probability and uncertainty, whereas data engineers tend to be better at completing projects.
"There are many ways to describe the difference as the roles have evolved over the last two years, but I find it useful to think of it like this: data scientists run experiments, and data engineers run software projects," said Andy LaMora, global director of data, analytics and AI at Topcoder, a crowdsourced coding service.
Data scientists use a trained understanding of the math and theory underlying the analytics tools of data science, including discrete math, linear algebra and graph theory, to apply the right models and evaluation metrics to a problem.
Data engineers are typically data storage and transformation professionals, solving problems around how incredibly large or fast data sets are maintained and rendered useful.
Both roles required
"Most cloud-native-type companies need five data engineers for each data scientist to get the data into the form and location needed for good data science," said Jason Preszler, head data scientist at Karat, a technical hiring service. "Without both roles, the data [that] companies are easily collecting is just sitting around or underutilized."
Anderson estimates that about 30%-50% of companies have the correct ratio. He often runs into companies that task data scientists with data engineering jobs.
"It leads to technical debt," Anderson said.
Some of the problems he has run into have been data scientists writing applications that don't scale in production. Often data scientists aren't even aware of their limitations.
In one case, a team of data scientists was trying to scale an image analytics algorithm and reached out to another data scientist who was an expert in the algorithm. He couldn't scale it either. A data engineer was able to look at the problem in a different way and figured out how to scale the data processing infrastructure instead of the algorithm to achieve the desired result.
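One way to picture that kind of fix is to leave the per-image function untouched and change only how the work is distributed. The sketch below is a hypothetical illustration, not the code from Anderson's example: `analyze_image`, `run_batches`, `chunked` and the chunk size are assumed names and values, and a local thread pool stands in for whatever cluster or processing framework a team would actually use.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_batches(
    analyze: Callable[[T], R],
    items: Sequence[T],
    executor: Executor,
    chunk_size: int = 100,
) -> List[R]:
    """Apply the unchanged `analyze` function to every item, one chunk per task.

    Results come back in the same order as `items`.
    """
    def process_chunk(chunk: Sequence[T]) -> List[R]:
        return [analyze(item) for item in chunk]

    results: List[R] = []
    # Executor.map preserves input order, so chunks are reassembled in order.
    for chunk_result in executor.map(process_chunk, chunked(items, chunk_size)):
        results.extend(chunk_result)
    return results


def analyze_image(pixels: Sequence[int]) -> float:
    """Hypothetical stand-in for the data scientists' algorithm: mean brightness."""
    return sum(pixels) / len(pixels)


if __name__ == "__main__":
    images = [[i, i + 1, i + 2] for i in range(1000)]  # made-up "images"
    with ThreadPoolExecutor(max_workers=4) as pool:
        brightness = run_batches(analyze_image, images, pool, chunk_size=50)
    print(len(brightness), brightness[:3])
```

The algorithm itself never changes here; only the code around it does. A thread pool mainly helps when the work is I/O-bound, so a real deployment would more likely hand the chunks to separate processes or machines, which would also require the per-chunk function to be defined at module level so it can be serialized.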
Another time, Anderson found a data scientist coding an experiment that took 15 minutes to run. This was an experiment that was run multiple times a day, and the data scientist ended up spending a lot of time exploring different iterations of the experiment. A data engineer was brought in to refactor the experiment to run in a few seconds, which made the data scientist more productive.
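The article does not say how that experiment was refactored. One common pattern, sketched below purely as an assumption, is to cache an expensive preparation step so that repeated iterations stop paying for it. The names `load_features` and `run_experiment`, the simulated delay and the toy data are all hypothetical.

```python
import time
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=None)
def load_features(dataset: str) -> Tuple[float, ...]:
    """Hypothetical slow step (loading and cleaning data), cached per dataset name."""
    time.sleep(0.1)  # simulated cost; stands in for minutes of real work
    return tuple(float(i % 7) for i in range(10_000))


def run_experiment(dataset: str, threshold: float) -> float:
    """Return the share of feature values strictly above `threshold`."""
    features = load_features(dataset)  # only the first call per dataset is slow
    return sum(1 for value in features if value > threshold) / len(features)


if __name__ == "__main__":
    for threshold in (1.0, 3.0, 5.0):
        start = time.perf_counter()
        share = run_experiment("demo", threshold)
        elapsed = time.perf_counter() - start
        print(f"threshold={threshold}: share={share:.4f} in {elapsed:.3f}s")
```

In this sketch, only the first iteration pays the simulated loading cost, and later ones reuse the cached result. Whether caching, vectorization or a different data layout is the right fix depends on where the time actually goes, which is the kind of question a data engineer is well placed to answer.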
It's far less common for companies to use data engineers to do data science.
Some of the problems to explore in this case are whether the models are accurate and whether they're the right models for the job. Managers need to determine whether data engineers have the necessary background in statistics to eliminate bias and avoid acting on results that aren't statistically significant.
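As a rough illustration of what checking for statistical significance can involve, the sketch below uses a permutation test to ask whether the gap between two sets of model scores could plausibly be due to chance. It is a minimal example under stated assumptions, not a method described by anyone quoted here: the function name `permutation_p_value`, the scores and the number of permutations are all made up.

```python
import random
from typing import Optional, Sequence


def permutation_p_value(
    a: Sequence[float],
    b: Sequence[float],
    n_permutations: int = 10_000,
    seed: Optional[int] = None,
) -> float:
    """Estimate a two-sided p-value for the difference in means of `a` and `b`.

    The labels are shuffled repeatedly; the p-value is the share of shuffles
    whose absolute mean difference is at least as large as the observed one.
    """
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    rng = random.Random(seed)
    n_a = len(a)
    pooled = list(a) + list(b)
    observed = abs(sum(a) / n_a - sum(b) / len(b))
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        left, right = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        # Small tolerance guards against floating-point rounding differences.
        if diff >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate from ever being exactly zero.
    return (extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Made-up accuracy scores from repeated runs of two hypothetical models.
    model_a = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.80]
    model_b = [0.80, 0.82, 0.79, 0.81, 0.83, 0.77, 0.82, 0.81]
    print(permutation_p_value(model_a, model_b, seed=0))
```

A large p-value suggests the apparent difference between the two models may be noise. A small one is evidence of a real difference, though it says nothing about bias in how the data was collected.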
Data science skills
Vivek Ravisankar, co-founder and CEO at HackerRank, a code skill testing and hiring service, said skilled data scientists have a deep knowledge of statistics and at least one area of machine learning or AI. They must be able to build highly specialized mathematical models and have a thorough understanding of machine learning algorithms.
Preferably, they have basic programming skills in R and/or Python and a good understanding of data computing tools like MapReduce, Hadoop, Hive, Spark, Gurobi and MySQL, among others.
"The best data scientists build effective models, use appropriate techniques for different kinds of problems and strategize on augmenting data sets," Ravisankar said.
Maintaining clean, extensive data sets is the biggest challenge in many data science projects. Data scientists must also be excellent communicators with business acumen, have a boardroom presence and be able to build strong teams to support them.
Data engineering skills
"Data engineers make data scientists' jobs possible," Ravisankar said.
Data engineers have a much heavier focus on the software development skills required for building and managing the architectures that capture the data that data scientists use. If they're not building or managing data pipelines, they're maintaining databases and large-scale processing systems. Since they're tasked with maintaining the environment that data scientists work in, they must be not only technically effective, but also team-oriented.
Strong data engineers tend to have a background in software development, with the ability to comfortably switch between and combine technologies to achieve an overarching goal.
"They're familiar with the needs of a data-driven team and the architectural groundwork necessary to allow data analysts and data scientists to thrive," Ravisankar said.
Data engineers should have an extensive background in one or more of the frameworks and languages used by the data engineering team, such as Hadoop, NoSQL databases, Spark, Java and Python. Finally, they should have proven experience promoting data accessibility, efficiency and quality within an organization.
"The main data engineering responsibility is to keep the data fast, accessible and safe," LaMora said.
Depending on the firm, this can involve everything from installing and managing data storage systems -- such as relational or NoSQL databases and/or streaming and storage engines like Spark -- to creating and managing useful and insightful extracts of the data into data warehouses and microservices.
LaMora finds that some firms make a distinction between the skills applied to creating and scaling massive data stores and those applied to developing data extracts for data scientists, but they belong to the same family of skills.
Job opportunities for data scientists and data engineers don't differ much.
"The employment outlook for both roles is superb," LaMora said.
Although data platforms and cloud services are getting better at automating many aspects of data engineering, new frontiers in using or mashing up data are appearing just as quickly. All this data needs to be moved or stored, and all of it needs to be analyzed.
"There may come a future where data scientists see weakened demand as more problems are converted to engineering, but it's not likely to happen in the next few years," LaMora said.
HackerRank has found that demand for data scientists has grown 256% since 2013, and its 2020 Developer Skills Report found that data scientists are the top hiring priority for nearly one in six hiring managers globally. Many of these candidates come from non-computer science backgrounds, including physics, math and biology.
"Companies need to ensure they define exactly what they're looking for when designing the hiring process -- this will lead them to candidates with the right skills much faster," Ravisankar said.
Despite the increased priority, data scientists earn only a bit more on average than data engineers. According to Glassdoor, the average U.S. salaries for a data scientist and a data engineer were $113,000 and $103,000, respectively.
Some data engineers ultimately end up developing an expertise in data science and vice versa. Anderson calls a person with these cross-functional skills a machine learning engineer. Gaining these skills can be a long process, driven by curiosity and suited to someone who is comfortable combining the uncertainty of data science with the rigor of data engineering.