Translating SQL to MapReduce
This section provides information on how to translate SQL to a MapReduce environment.
If you have existing SQL queries and need materialized reductions, you can follow the guidelines on this page for translating SQL queries to MapReduce views.
If you don’t need materialized reductions, we recommend that you reformulate your SQL queries to use SQL++ rather than MapReduce views. |
Here’s an example of a SQL statement that you might want to translate:
SELECT fieldlist FROM table WHERE condition GROUP BY groupfield ORDER BY orderfield LIMIT limitcount OFFSET offsetcount
The different elements within the SQL statement affect how a view is written. The following table describes how each clause is handled in a MapReduce view:
SQL clause | MapReduce |
---|---|
|
The field list within the SQL statement affects either the corresponding key or value within the |
|
There are no table compartments within Couchbase Server and you cannot perform views across more than one bucket boundary.
However, if you are using a |
|
The |
|
Grouping within SQL is handled within views through the use of the |
|
The order of record output within a view is directly controlled by the key specified during the |
|
There are a number of different paging strategies available within the MapReduce and views mechanism. |
View map()
function, reduce()
function, selection parameters and other miscellaneous parameters have the following interactions:
SQL clause | View key | View value | map() function |
reduce() function |
Selection parameters | Other parameters |
---|---|---|---|---|---|---|
SELECT fields |
Yes |
Yes |
Yes |
No: with |
No |
No |
FROM table |
No |
No |
Yes |
No |
No |
No |
WHERE clause |
Yes |
No |
Yes |
No |
Yes |
No |
ORDER BY field |
Yes |
No |
Yes |
No |
No |
|
LIMIT x OFFSET y |
No |
No |
No |
No |
No |
|
GROUP BY field |
Yes |
Yes |
Yes |
Yes |
No |
No |
Within SQL, the basic query structure can be used for a multitude of different queries.
For example, the structure SELECT fieldlist FROM table WHERE xxxx
can be used with a number of different clauses.
Within MapReduce and Couchbase Server, you might need to create multiple views to handle different types of queries. For example, performing a query on all the blog posts on a specific date needs a very different view definition than one needed to support selection by the author.
Translating SQL SELECT to MapReduce
The field selection within an SQL query can be translated into a corresponding view definition by:
-
Adding the fields to the emitted key if the value is also used for selection in a
WHERE
clause. -
Into the emitted value if the data is separate from the required query parameters.
For example, to get the sales data by country from each stored document using the following map()
function:
function(doc, meta) { emit([doc.city, doc.sales], null); }
If you want to output information that can be used within a reduce function, this should be specified in the value generated by each emit()
call.
For example, to reduce the sales figures the above map()
function could be rewritten as:
function(doc, meta) { emit(doc.city, doc.sales); }
In essence this does not produce significantly different output (albeit with a simplified key), but the information can now be reduced using the numerical value.
If you want to output data or field values completely separate to the query values, then these fields can be explicitly output within the value portion of the view. For example:
function(doc, meta) { emit(doc.city, [doc.name, doc.sales]); }
If the entire document for each item is required, load the document data after the view has been requested through the client library. For more information on this parameter and the performance impact.
Within a SELECT
statement it is common practice to include the primary key for a given record in the output.
Within a view this is not normally required, since the document ID that generated each row is always included within the view output.
Translating SQL WHERE to MapReduce
The WHERE
clause within an SQL statement forms the selection criteria for choosing individual records.
Within a view, the ability to query the data is controlled by the content and structure of the key
generated by the map()
function.
In general, for each WHERE
clause you need to include the corresponding field in the key of the generated view, and then use the key
, keys
or startkey
/ endkey
combinations to indicate the data you want to select.
The complexity occurs when you need to perform queries on multiple fields.
There are a number of different strategies that you can use for this.
The simplest way is to decide whether you want to be able to select a specific combination, or whether you want to perform range or multiple selections.
For example, using our recipe database, if you want to select recipes that use the ingredient ‘carrot’ and have a cooking time of exactly 20 minutes, then you can specify these two fields in the map()
function:
function(doc, meta) { if (doc.ingredients) { for(i=0; i < doc.ingredients.length; i++) { emit([doc.ingredients[i].ingredient, doc.totaltime], null); } } }
Then the query is an array of the two selection values:
?key=["carrot",20]
This is equivalent to the SQL query:
SELECT recipeid FROM recipe JOIN ingredients on ingredients.recipeid = recipe.recipeid WHERE ingredient = 'carrot' AND totaltime = 20
If, however, you want to perform a query that selects recipes containing carrots that can be prepared in less than 20 minutes, a range query is possible with the same map()
function:
?startkey=["carrot",0]&endkey=["carrot",20]
This works because of the sorting mechanism in a view, which outputs the information sequentially with carrots first and a sequential number.
More complex queries though are more difficult. What if you want to select recipes with carrots and rice, still preparable in under 20 minutes?
A standard map()
function like that above wont work.
A range query on both ingredients will list all the ingredients between the two.
There are a number of solutions available to you.
First, the easiest way to handle the timing selection is to create a view that explicitly selects recipes prepared within the specified time.
Here’s an example:
function(doc, meta) { if (doc.totaltime <= 20) { ... } }
Although this approach seems to severely limit your queries, remember you can create multiple views, so you could create one for 10 minutes, one for 20 minutes, one for 30 minutes, or whatever intervals you select. It’s unlikely that anyone will really want to select recipes that can be prepared in 17 minutes, so such granular selection is overkill.
The multiple ingredients is more difficult to solve.
One way is to use the client to perform two queries and merge the data.
For example, the map()
function:
function(doc, meta) { if (doc.totaltime && doc.totaltime <= 20) { if (doc.ingredients) { for(i=0; i < doc.ingredients.length; i++) { emit(doc.ingredients[i].ingredient, null); } } } }
Two queries, one for each ingredient can easily be merged by performing a comparison and count on the document ID output by each view.
The alternative is to output the ingredients twice within a nested loop, like this:
function(doc, meta) { if (doc.totaltime && doc.totaltime <= 20) { if (doc.ingredients) { for (i=0; i < doc.ingredients.length; i++) { for (j=0; j < doc.ingredients.length; j++) { emit([doc.ingredients[i].ingredient, doc.ingredients[j].ingredient], null); } } } } }
Now you can perform an explicit query on both ingredients:
?key=["carrot","rice"]
If you really want to support flexible cooking times, then you can also add the cooking time:
function(doc, meta) { if (doc.ingredients) { for (i=0; i < doc.ingredients.length; i++) { for (j=0; j < doc.ingredients.length; j++) { emit([doc.ingredients[i].ingredient, doc.ingredients[j].ingredient, recipe.totaltime], null); } } } }
And now you can support a ranged query on the cooking time with the two ingredient selection:
?startkey=["carrot","rice",0]&key=["carrot","rice",20]
This would be equivalent to:
SELECT recipeid FROM recipe JOIN ingredients i1 ON ingredients.recipeid = recipe.recipeid JOIN ingredients i2 ON ingredients.recipeid = recipe.recipeid WHERE (i1.ingredient IN ('carrot',rice')) AND (i2.ingredient IN ('carrot',rice')) AND (totaltime < 20 AND totaltime > 0)
Translating SQL ORDER BY to MapReduce
The ORDER BY
clause within SQL controls the order of the records that are output.
Ordering within a view is controlled by the value of the key.
However, the key also controls and supports the querying mechanism.
In SELECT
statements where there is no explicit WHERE
clause, the emitted key can entirely support the sorting you want.
For example, to sort by the city and salesman name, the following map()
will achieve the required sorting:
function(doc, meta) { emit([doc.city, doc.name], null) }
If you need to query on a value, and that query specification is part of the order sequence then you can use the format above. For example, if the query basis is city, then you can extract all the records for ‘London’ using the above view and a suitable range query:
?endkey=["London\u0fff"]&startkey=["London"]
However, if you want to query the view by the salesman name, you need to reverse the field order in the emit()
statement:
function(doc, meta) { emit([doc.name,doc.city],null) }
Now you can search for a name while still getting the information in city order.
The order the output can be reversed (equivalent to ORDER BY field DESC
) by using the descending
query parameter.
Translating SQL GROUP BY to MapReduce
The GROUP BY
parameter within SQL provides summary information for a group of matching records according to the specified fields, often for use with a numeric field for a sum or total value, or count operation.
For example:
SELECT name,city,SUM(sales) FROM sales GROUP BY name,city
This query groups the information by the two fields ‘name’ and ‘city’ and produces a sum total of these values. To translate this into a MapReduce function within Couchbase Server:
-
From the list of selected fields, identify the field used for the calculation. These will need to be exposed within the value emitted by the
map()
function. -
Identify the list of fields in the
GROUP BY
clause. These will need to be output within the key of themap()
function. -
Identify the grouping function, for example
SUM()
orCOUNT()
. You will need to use the equivalent built-in function, or a custom function, within thereduce()
function of the view.
For example, in the above case, the corresponding map function can be written as map()
:
function(doc, meta) { emit([doc.name,doc.city],doc.sales); }
This outputs the name and city as the key, and the sales as the value.
Because the SUM()
function is used, the built-in reduce()
function _sum
can be used.
An example of this MapReduce combination can be seen _sum
.
More complex grouping operations may require a custom reduce function.
Translating SQL LIMIT and OFFSET
Within SQL, the LIMIT
and OFFSET
clauses to a given query are used as a paging mechanism.
For example, you might use:
SELECT recipeid,title FROM recipes LIMIT 100
To get the first 100 rows from the database, and then use the OFFSET
to get the subsequent groups of records:
SELECT recipeid,title FROM recipes LIMIT 100 OFFSET 100
With Couchbase Server, the limit
and skip
parameters when supplied to the query provide the same basic functionality:
?limit=100&skip=100
Performance for high values of skip can be affected.