Apache Spark

This definition is part of our Essential Guide: An enterprise guide to big data in cloud computing
Contributor(s): Ed Burns

Apache Spark is an open source parallel processing framework for running large-scale data analytics applications across clustered computers. 

Apache Spark can process data from a variety of data repositories, including the Hadoop Distributed File System (HDFS), NoSQL databases and SQL-based data stores such as Apache Hive. Spark supports in-memory processing to boost the performance of big data analytics applications, but it can also fall back on conventional disk-based processing when data sets are too large to fit into the available system memory.
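The trade-off between in-memory and disk-based processing can be illustrated with a toy sketch in plain Python. This is not Spark's API or its actual storage mechanism; the `ToyCache` class and the `memory_budget_bytes` parameter are hypothetical names invented for this example. The sketch keeps a data set in memory when it fits a stated budget and otherwise spills it to a temporary file on disk.

```python
import os
import pickle
import tempfile


class ToyCache:
    """Hypothetical helper (not part of Spark): keeps records in memory
    when they fit the budget, otherwise spills them to a temporary file."""

    def __init__(self, records, memory_budget_bytes):
        payload = pickle.dumps(records)
        self._records = None
        self._path = None
        if len(payload) <= memory_budget_bytes:
            # Small enough: serve later reads straight from memory.
            self.location = "memory"
            self._records = records
        else:
            # Too large for the budget: write once to disk, read back on demand.
            self.location = "disk"
            fd, self._path = tempfile.mkstemp(suffix=".pkl")
            with os.fdopen(fd, "wb") as handle:
                handle.write(payload)

    def read(self):
        """Return the records, from memory or from the spill file."""
        if self.location == "memory":
            return self._records
        with open(self._path, "rb") as handle:
            return pickle.load(handle)

    def close(self):
        """Delete the spill file, if one was created."""
        if self._path is not None and os.path.exists(self._path):
            os.remove(self._path)


small = ToyCache(list(range(10)), memory_budget_bytes=1_000_000)
large = ToyCache(list(range(10_000)), memory_budget_bytes=1_000)
```

In this sketch, `small.location` is "memory" and `large.location` is "disk", yet both return the same records from `read()`. The calling code does not change; only the speed of access does.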

Spark was initially designed in 2009 by researchers at the University of California, Berkeley, as a way to speed up processing jobs in Hadoop systems. It became a top-level project of the Apache Software Foundation in February 2014, and Version 1.0 of Apache Spark was released in May 2014. Spark provides programmers with a potentially faster and more flexible alternative to MapReduce, the software framework that early versions of Hadoop were tied to. Spark's developers say it can run jobs up to 100 times faster than MapReduce when the data is processed in memory and up to 10 times faster on disk.

In addition, Spark can handle more than the batch processing applications that MapReduce is limited to running. The core Spark engine functions partly as an application programming interface (API) layer and underpins a set of related tools for managing and analyzing data: a SQL query engine (Spark SQL), a library of machine learning algorithms (MLlib), a graph processing system (GraphX) and streaming data processing software (Spark Streaming).
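The idea of a core engine that exposes a small API, with higher-level tools built on top of it, can be sketched in plain Python. This is a toy illustration only and does not use Spark's classes or method signatures; `ToyDataset` and `count_by` are hypothetical names. The core class offers a few chainable operations that run only when `collect()` is called, and `count_by` stands in for a higher-level tool layered on those operations.

```python
class ToyDataset:
    """Hypothetical core abstraction: records plus a chain of pending steps."""

    def __init__(self, records, steps=()):
        self._records = records
        self._steps = tuple(steps)

    def _with(self, step):
        # Each operation returns a new dataset; nothing is computed yet.
        return ToyDataset(self._records, self._steps + (step,))

    def flat_map(self, fn):
        return self._with(lambda it: (out for rec in it for out in fn(rec)))

    def map(self, fn):
        return self._with(lambda it: (fn(rec) for rec in it))

    def filter(self, pred):
        return self._with(lambda it: (rec for rec in it if pred(rec)))

    def reduce_by_key(self, fn):
        def step(it):
            totals = {}
            for key, value in it:
                totals[key] = fn(totals[key], value) if key in totals else value
            return iter(totals.items())
        return self._with(step)

    def collect(self):
        # Run the whole chain of steps in order and return a list.
        it = iter(self._records)
        for step in self._steps:
            it = step(it)
        return list(it)


def count_by(dataset, key_fn):
    """Hypothetical higher-level helper built only on the core operations."""
    pairs = dataset.map(lambda rec: (key_fn(rec), 1))
    return dict(pairs.reduce_by_key(lambda a, b: a + b).collect())


lines = ["spark runs in memory", "spark runs on disk"]
words = ToyDataset(lines).flat_map(str.split)
word_counts = count_by(words, lambda word: word)   # count occurrences of each word
length_counts = count_by(words, len)               # count words by their length
```

The same `words` data set feeds two different analyses without any change to the core class, which is the sense in which a shared engine can underpin several specialized tools.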

Spark has been adopted by large enterprises that work with big data applications because of its speed and its ability to tie together multiple types of databases and run different kinds of analytics applications. As of this writing, Spark has the largest open source community in big data, with more than 1,000 contributors from more than 250 organizations.

This was last updated in June 2016



