Sphinx Architecture

Ammar Bakeer edited this page Aug 31, 2016 · 1 revision

Sphinx extends Cloudera's Impala with support for spatial data. It modifies each Impala layer, from the SQL scanner/parser layer down to the execution layer, so that Sphinx can run both spatial and non-spatial queries on the same data.

Query parser:

Sphinx modifies the query parser layer by adding spatial data types (e.g., Point and Polygon), spatial predicates (e.g., Overlap and Touch), and spatial functions (e.g., Intersect and Union).
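To make the three kinds of additions concrete, here is a minimal Python sketch of a spatial type, two predicates, and one function. It is not Sphinx's SQL syntax or its implementation: the names `Point`, `Rect`, `overlaps`, `touches`, and `intersection` are hypothetical, and an axis-aligned rectangle stands in for a general polygon to keep the geometry short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Point:
    x: float
    y: float


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle, used here as a simplified stand-in for Polygon."""
    x1: float
    y1: float
    x2: float
    y2: float


def overlaps(a: Rect, b: Rect) -> bool:
    """Predicate: True when the interiors of a and b share some area."""
    return a.x1 < b.x2 and b.x1 < a.x2 and a.y1 < b.y2 and b.y1 < a.y2


def touches(a: Rect, b: Rect) -> bool:
    """Predicate: True when a and b share boundary points but no interior area."""
    closed = a.x1 <= b.x2 and b.x1 <= a.x2 and a.y1 <= b.y2 and b.y1 <= a.y2
    return closed and not overlaps(a, b)


def intersection(a: Rect, b: Rect) -> Optional[Rect]:
    """Function: the common region of a and b, or None if they are disjoint."""
    x1, y1 = max(a.x1, b.x1), max(a.y1, b.y1)
    x2, y2 = min(a.x2, b.x2), min(a.y2, b.y2)
    if x1 > x2 or y1 > y2:
        return None
    return Rect(x1, y1, x2, y2)


a = Rect(0, 0, 2, 2)
b = Rect(1, 1, 3, 3)   # shares area with a
c = Rect(2, 0, 4, 2)   # shares only an edge with a
print(overlaps(a, b), touches(a, c), intersection(a, b))
```

The distinction between a predicate (returns a boolean, usable in a WHERE clause) and a function (returns a new geometry) is the point of the sketch; real polygon geometry needs considerably more care.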

Storage:

In the storage/indexing layer, Sphinx constructs two-layered spatial indexes that can be based on either a grid or an R-tree. Like Impala, Sphinx adopts HDFS as its distributed storage engine. Unlike Impala, however, Sphinx arranges the records inside HDFS in a spatially aware manner, placing nearby records in the same HDFS block.
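The idea of placing nearby records in the same block can be sketched with a uniform grid. The code below is an illustration only, not Sphinx's storage code. It assumes that the two layers are a global index over blocks plus a local organization inside each block. Here the local layer is simply a sort by x, and the names `grid_partition` and `build_global_index` are hypothetical.

```python
from collections import defaultdict


def grid_partition(points, cell_size):
    """Group (x, y) points by grid cell so that nearby points share a block.

    Each cell plays the role of one storage block in this sketch.
    """
    blocks = defaultdict(list)
    for x, y in points:
        cell = (int(x // cell_size), int(y // cell_size))
        blocks[cell].append((x, y))
    # Assumed local layer: keep the records of each block ordered by x.
    return {cell: sorted(pts) for cell, pts in blocks.items()}


def build_global_index(blocks):
    """Assumed global layer: map each block id to the bounding box of its records."""
    index = {}
    for cell, pts in blocks.items():
        xs = [p[0] for p in pts]
        ys = [p[1] for p in pts]
        index[cell] = (min(xs), min(ys), max(xs), max(ys))
    return index


points = [(0.7, 0.2), (0.5, 0.5), (3.1, 3.9)]
blocks = grid_partition(points, cell_size=1.0)
global_index = build_global_index(blocks)
print(blocks)
print(global_index)
```

A query can consult the global layer first to decide which blocks are worth reading at all, which is what makes the spatially aware layout pay off.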

Query planner:

In the query planner layer, Sphinx adds new query plans for spatial range queries and spatial join queries. Depending on whether indexes exist on the input files, the query planner offers two plans for the range query and three plans for the spatial join. Sphinx automatically selects the plan that matches the available indexes.
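A rough sketch of index-driven plan selection follows. The page does not name the individual plans, so every plan name below is a hypothetical placeholder. The mapping of the three join plans to zero, one, or two indexed inputs is likewise an assumption made for illustration.

```python
def choose_range_plan(input_indexed: bool) -> str:
    """Pick one of two range-query plans (plan names are placeholders)."""
    return "range_with_index" if input_indexed else "range_full_scan"


def choose_join_plan(left_indexed: bool, right_indexed: bool) -> str:
    """Pick one of three spatial-join plans (plan names are placeholders).

    Assumption: the plan depends only on how many inputs carry an index.
    """
    indexed_inputs = int(left_indexed) + int(right_indexed)
    plans = {
        0: "join_no_index",
        1: "join_one_index",
        2: "join_two_indexes",
    }
    return plans[indexed_inputs]


print(choose_range_plan(True))
print(choose_join_plan(True, False))
```

The user never states which plan to run; the decision is derived entirely from metadata about the input files.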

Query executor:

In the query executor layer, the query plan created by the planner is physically executed on the worker nodes of the cluster. Sphinx introduces two new components, an R-tree scanner and a spatial join operator, both implemented in C++ for efficiency. These components use runtime code generation to optimize the generated machine code based on query selectivity and the types of the constructed indexes, if any.
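The following sketch shows the general shape of an index-assisted range scan: skip blocks whose bounding box misses the query window, then filter the records of the remaining blocks. It is a plain Python stand-in and not Sphinx's C++ R-tree scanner. It uses a flat list of bounding boxes instead of an R-tree and involves no runtime code generation. The names `rects_intersect` and `range_scan` are hypothetical.

```python
def rects_intersect(a, b):
    """True if rectangles a and b, given as (x1, y1, x2, y2), share any point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def range_scan(blocks, query):
    """Return the points inside `query` and the number of blocks actually read.

    blocks: list of (bounding_box, records) pairs, records being (x, y) tuples.
    query:  rectangle (x1, y1, x2, y2).
    """
    results = []
    blocks_read = 0
    for bbox, records in blocks:
        if not rects_intersect(bbox, query):
            continue  # pruned: no record in this block can match
        blocks_read += 1
        for x, y in records:
            if query[0] <= x <= query[2] and query[1] <= y <= query[3]:
                results.append((x, y))
    return results, blocks_read


blocks = [
    ((0, 0, 1, 1), [(0.5, 0.5), (0.9, 0.1)]),
    ((5, 5, 6, 6), [(5.5, 5.5)]),
]
print(range_scan(blocks, (0, 0, 0.6, 0.6)))
```

The fraction of blocks pruned depends on query selectivity, which gives some intuition for why an executor might specialize its code for selectivity and index type.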
