Optimizing Spark for Big Data in the Cloud

We Don’t Make the Engine, We Make the Engine Run Faster


Algebraix Query Accelerator (AQUA) for Apache Spark

Powered by Data Algebra

The Algebraix Query Accelerator (AQUA) is a software component for Spark SQL that lets you automatically provision computations of Spark SQL’s directed acyclic graph. AQUA leverages patented inter-query reuse technology to improve performance and reduce cloud infrastructure costs.

By applying AQUA to the Spark framework, developers and data scientists can use less expensive resources, run fewer nodes, and shorten processing times, reducing total cost of ownership.

Whereas most SQL optimization techniques focus on establishing adjacent data stores, AQUA optimizes the actual query execution plans produced by Spark’s Catalyst optimizer. Our software uses Data Algebra to identify and cache a variety of equivalent computations, and subsequently removes that work from Spark’s SQL jobs while still producing the correct end results.
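To make the idea of reusing computations across queries concrete, here is a minimal Python sketch. It is not AQUA’s implementation and does not use Spark or Data Algebra; the plan representation (nested tuples), the `canonical` function, the `ReuseCache` class, and the `sales` table are all hypothetical names invented for illustration. It assumes only one equivalence rule: the conjuncts of an AND predicate can be written in any order.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, int]
Plan = Tuple[Any, ...]  # e.g. ("filter", predicate, ("scan", "sales"))

# Hypothetical in-memory table standing in for a real data source.
TABLES: Dict[str, List[Row]] = {
    "sales": [
        {"id": 1, "amount": 5},
        {"id": 2, "amount": 20},
        {"id": 3, "amount": 50},
    ],
}


def canonical(plan: Plan) -> Plan:
    """Rewrite a plan so that equivalent forms share one cache key.

    Only one rule is applied here: the children of an "and" node are sorted.
    """
    op = plan[0]
    if op == "and":
        children = (canonical(p) for p in plan[1:])
        return ("and",) + tuple(sorted(children, key=repr))
    if op == "filter":
        return ("filter", canonical(plan[1]), canonical(plan[2]))
    return plan  # "scan" nodes and simple predicates are left unchanged


def matches(pred: Plan, row: Row) -> bool:
    """Evaluate a predicate tuple against one row."""
    op = pred[0]
    if op == "and":
        return all(matches(p, row) for p in pred[1:])
    _, column, value = pred
    if op == "gt":
        return row[column] > value
    if op == "lt":
        return row[column] < value
    raise ValueError(f"unknown predicate: {op}")


class ReuseCache:
    """Caches the result of every (sub-)plan, keyed by its canonical form."""

    def __init__(self) -> None:
        self.store: Dict[Plan, List[Row]] = {}
        self.hits = 0
        self.misses = 0

    def run(self, plan: Plan) -> List[Row]:
        key = canonical(plan)
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        result = self._execute(key)
        self.store[key] = result
        return result

    def _execute(self, plan: Plan) -> List[Row]:
        op = plan[0]
        if op == "scan":
            return list(TABLES[plan[1]])
        if op == "filter":
            rows = self.run(plan[2])  # child plans go through the cache too
            return [r for r in rows if matches(plan[1], r)]
        raise ValueError(f"unknown operator: {op}")


# Two equivalent queries (conjuncts swapped) and a third sharing only the scan.
q1 = ("filter", ("and", ("gt", "amount", 10), ("lt", "amount", 40)), ("scan", "sales"))
q2 = ("filter", ("and", ("lt", "amount", 40), ("gt", "amount", 10)), ("scan", "sales"))
q3 = ("filter", ("gt", "amount", 30), ("scan", "sales"))

cache = ReuseCache()
for query in (q1, q2, q3):
    print(cache.run(query), "hits:", cache.hits, "misses:", cache.misses)
```

In this toy model the second query is answered entirely from the cache, and the third reuses the cached scan even though its filter is new. A production system would also need cache invalidation and eviction, which the sketch omits.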

AQUA is a simple-to-install software package that works in conjunction with Amazon Web Services, Amazon Elastic MapReduce (EMR), and Amazon S3. Using our product requires no change to your current Spark scripts or queries.

The initial version of AQUA runs alongside Apache Spark to improve SQL performance and user concurrency in that environment; however, AQUA is being developed for other database and big data cloud environments, including Microsoft Azure and IBM Bluemix.




The Benefits of AQUA

Improved SQL Query Performance 

AQUA’s inter-query optimization approach speeds up SQL queries by 10x to as much as 1,000x. Even more importantly, SQL query performance improves exponentially over time: as more users submit more SQL queries, the amount of query repetition, in whole or in part, increases. AQUA therefore effectively gets smarter over time and delivers ever-increasing SQL performance.
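The effect of growing query repetition can be illustrated with a small Python sketch. This is a toy model, not a measurement of AQUA: the workload, the sub-computation labels, and the `hit_rates` function are hypothetical, and the sketch only tracks what fraction of each query’s parts were already computed by earlier queries.

```python
from typing import List, Set


def hit_rates(workload: List[List[str]]) -> List[float]:
    """For each query, return the fraction of its sub-computations
    that an earlier query in the workload has already produced."""
    seen: Set[str] = set()
    rates: List[float] = []
    for query in workload:
        reused = sum(1 for part in query if part in seen)
        rates.append(reused / len(query))
        seen.update(query)  # everything computed now becomes reusable
    return rates


# Hypothetical workload: each query is a list of sub-computation labels.
workload = [
    ["scan:sales", "filter:2016"],
    ["scan:sales", "filter:2016", "agg:region"],
    ["scan:sales", "filter:2016", "agg:region", "sort:total"],
    ["scan:sales", "filter:2016", "agg:region"],
]

for number, rate in enumerate(hit_rates(workload), start=1):
    print(f"query {number}: {rate:.0%} of sub-computations reused")
```

In this made-up workload the reused fraction rises with each query because later queries overlap with earlier ones. How quickly it rises in practice depends entirely on how much real workloads repeat themselves.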

Improved Multi-User Concurrency

As the number of users submitting SQL queries increases, performance traditionally drops because users compete for computational resources. Concurrency in big data environments is a compounding issue and will become a much larger one as the big data landscape evolves. AQUA’s approach helps to solve this problem by shortening SQL query times. In fact, each user’s queries actually benefit from the queries submitted by other users.

Reduced Operational Costs

Instead of caching data to deliver the required SQL response times, AQUA caches computations. As a result, the software effectively substitutes storage costs for compute costs, driving down net operational costs.

Our Core Technology: Data Algebra™

Our core technology, Data Algebra™, is a mathematical approach to manipulating and representing data. Whereas other technologies leverage metadata or adjacent data stores to process data, Data Algebra translates queries into simple algebraic lookups.

Our book, The Algebra of Data: A Foundation for the Data Economy, is an introduction to this genuinely game-changing technology concept. The book was co-written by Gary J. Sherman, PhD, the inventor of the Algebra of Data™ and founding mathematician of Algebraix Data, and Robin Bloor, PhD, a mathematician as well as an influential researcher, analyst, and well-known author.

InformationWeek and several other publications have noted, “Data algebra is a new approach for managing, integrating, and searching data faster and more efficiently.” Download our e-book to learn more.


Our Technology Patents

The Algebraix Technology Platform is based on our fundamental innovation in the field of applied mathematics: the algebra of data. The company is building a portfolio of patents around its technology platform. We currently hold nine U.S. patents and expect to receive dozens more.

U.S. Patents Granted to Date
  • 7613734 Systems and Methods for Providing Data Sets using a Store of Algebraic Relations
  • 7720806 Systems and Methods for Data Manipulation using Multiple Storage Formats
  • 7769754 Systems and Methods for Data Storage and Retrieval using Algebraic Optimization
  • 7797319 Systems and Methods for Data Model Mapping
  • 7865503 Systems and Methods for Data Storage and Retrieval using Virtual Data Sets
  • 7877370 Systems and Methods for Data Storage and Retrieval using Algebraic Relations
  • 8032509 Systems and Methods for Data Storage and Retrieval using Algebraic Relations Composed from Query Language Statements
  • 8380695 Systems and Methods for Data Storage and Retrieval Using Algebraix Relations to Optimize Calculations
  • 8583687 Systems and Methods for Indirect Algebraic Partitioning