Data Enrichment
Goal: Given a legacy document set containing attributes whose format makes them difficult to search on. In order to correct this search deficiency new searchable attributes will be added to the document. These new attributes related to and can be calculated from the original attributes. On any mutation (a document creation or modification) the new attributes should also be created (or updated)
Implementation: Implementation: Create JavaScript function that contains an OnUpdate handler. The handler listens for mutations or data-changes within a specified "Listen To Location" (or source collection). When any document within the collection is created or modified, the Eventing Function executes a user-defined routine. In this example, if the created or altered document contains two specifically named fields containing IP addresses (these respectively corresponding to the beginning and end of an address-range), the Eventing Function routine, get_numip_first_3_octets(ip), converts each of the IP addresses to an integer and upserts them as new fields in the document.
-
Case 1: A new document is created in a specified, target collection: this new document is identical to the old, except that it has two new additional fields, which contain integers that correspond to the IP addresses. The original document, in the source collection, is not changed.
-
Case 2: The original document, in the source collection, is mutated (or changed) to have two additional fields, which contain integers that correspond to the IP addresses. In this case the Eventing Service automatically suppresses the recursive mutation.
Preparations (Common):
For this example, two (2) buckets 'bulk' and 'rr100' are required where the latter is intended to be 100% resident. Create the buckets with a minimum size of 100MB. For information on buckets, see Create a Bucket. Within the buckets we need three (3) keyspaces 'bulk.data.source', 'bulk.data.target', and 'rr100.eventing.metadata' (we loosely follow this organization).
For the Function Scope or RBAC grouping, we will use the 'bulk.data', assuming you have the role of either "Full Admin" or "Eventing Full Admin". For standard or non-privileged users refer to Eventing Role-Based Access Control.
If you run a version of Couchbase prior to 7.0 you can just create the buckets 'source', 'target', and 'metadata' and run this example. Furthermore if your cluster was subsequently upgraded from say 6.6.2 to 7.0 your data would be moved to 'source._default._default', 'target._default._default', and 'metadata._default._default' and your Eventing Function would be seamlessly upgraded to use the new keyspaces and continue to run correctly.
For complete details on how to set up your keyspaces refer to creating buckets and creating scopes and collections.
The Eventing Storage keyspace, in this case 'rr100.eventing.metadata', is for the sole use of the Eventing system, do not add, modify, or delete documents from it. In addition do not drop or flush or delete the containing bucket (or delete this collection) while you have any deployed Eventing functions. In a single tenancy deployment this collection can be shared with other Eventing functions. |
Procedure (Case 1):
-
Access the Couchbase Web Console > Buckets page and click the Scopes and Collections link of the bulk bucket.
-
Click Documents in the upper right banner for the data scope.
-
Select the keyspace bulk, data, source
-
You should see no user records.
-
Click Add Document in the upper right banner
-
For the ID in the Create New Document dialog specify SampleDocument
ID [ SampleDocument ]
-
For the document body in the Create New Document dialog, the following text is displayed:
{ "click": "to edit", "with JSON": "there are no reserved field names" }
-
replace the above text with the following JSON document via a cut-n-paste
{ "country": "AD", "ip_start": "5.62.60.1", "ip_end": "5.62.60.9" }
-
Click Save.
-
-
From the Couchbase Web Console > Eventing page, click ADD FUNCTION, to add a new Function. The ADD FUNCTION dialog appears.
-
In the ADD FUNCTION dialog, for individual Function elements provide the below information:
-
For the Function Scope drop-down, select 'bulk.data' as the RBAC grouping.
-
For the Listen To Location drop-down, select bulk, data, source as the keyspace.
-
For the Eventing Storage drop-down, select rr100, eventing, metadata as the keyspace.
-
Enter case_1_enrich_ips as the name of the Function you are creating in the Function Name text-box.
-
Leave the "Deployment Feed Boundary" as Everything.
-
[Optional Step] Enter text On mutation create a new document in a different collection with additional fields, in the Description text-box.
-
For the Settings option, use the default values.
-
For the Bindings option, add two bindings.
-
For the first binding, select "bucket alias", specify src as the "alias name" of the collection, select bulk, data, source as the associated keyspace, and select "read only" for the access mode.
-
For the second binding, select "bucket alias", specify tgt as the "alias name" of the collection, select bulk, data, and target as the associated keyspace, and select "read and write" for the access mode.
-
-
After configuring your settings the ADD FUNCTION dialog should look like this:
-
-
After providing all the required information in the ADD FUNCTION dialog, click Next: Add Code. The case_1_enrich_ips dialog appears.
-
The case_1_enrich_ips dialog initially contains a placeholder code block. You will substitute your actual case_1_enrich_ips code in this block.
-
Copy the following Function, and paste it in the placeholder code block of case_1_enrich_ips dialog.
function OnUpdate(doc, meta) { log('document', doc); doc["ip_num_start"] = get_numip_first_3_octets(doc["ip_start"]); doc["ip_num_end"] = get_numip_first_3_octets(doc["ip_end"]); tgt[meta.id]=doc; } function get_numip_first_3_octets(ip) { var return_val = 0; if (ip) { var parts = ip.split('.'); //IP Number = A x (256*256*256) + B x (256*256) + C x 256 + D return_val = (parts[0]*(256*256*256)) + (parts[1]*(256*256)) + (parts[2]*256) + parseInt(parts[3]); return return_val; } }
After pasting, the screen appears as displayed below:
-
Click Save and Return.
-
-
The OnUpdate routine specifies that when a change occurs to data within the bucket, the routine get_numip_first_3_octets is run on each document that contains ip_start and ip_end. A new document is created whose data and metadata are based on those of the document on which get_numip_first_3_octets is run; but with the addition of ip_num_start and ip_num_end data-fields, which contain the numeric values returned by get_numip_first_3_octets. The get_numip_first_3_octets routine splits the IP address, converts each fragment to a numeral, and adds the numerals together, to form a single value; which it returns.
-
From the Eventing screen, click the case_1_enrich_ips function to select it, then click Deploy.
-
In the Confirm Deploy Function Click Deploy Function.
-
-
The Eventing function is deployed and starts running within a few seconds. From this point, the defined Function is executed on all existing documents and will also more importantly it will also run on subsequent mutations.
-
To check the results of the deployed Eventing Function:
-
Access the Couchbase Web Console > Buckets page and click the Scopes and Collections link of the bulk bucket.
-
Click Documents in the upper right banner for the data scope.
-
Select the keyspace bulk, data, target
-
Edit the document and you will see a duplicate of the source bucket but without two new calculated fields as follows:
{ "country": "AD", "ip_end": "5.62.60.9", "ip_start": "5.62.60.1", "ip_num_start": 87964673, "ip_num_end": 87964681 }
-
Click Cancel to close the editor.
-
-
Because our Eventing Function is deployed it will continue to process all new mutations, let’s test this out.
-
Access the Couchbase Web Console > Buckets page and click the Scopes and Collections link of the bulk bucket.
-
Click Documents in the upper right banner for the data scope.
-
Select the keyspace bulk, data, source
-
You should see one user record (the one we entered at the beginning of this procedure).
-
Click Add Document in the upper right banner
-
For the ID in the Create New Document dialog specify AnotherSampleDocument
ID [ AnotherSampleDocument ]
-
For the document body in the Create New Document dialog, the following text is displayed:
{ "click": "to edit", "with JSON": "there are no reserved field names" }
-
replace the above text with the following JSON document via a cut-n-paste
{ "country": "RU", "ip_start": "7.12.60.1", "ip_end": "7.62.60.9" }
-
Click Save.
-
-
To check results (which were updated in real time) by the deployed Eventing Function:
-
Access the Couchbase Web Console > Buckets page and click the Scopes and Collections link of the bulk bucket.
-
Click Documents in the upper right banner for the data scope.
-
Select the keyspace bulk, data, target
-
Edit the newly created document and you will see a duplicate of the source bucket but without two new calculated fields as follows:
{ "country": "RU", "ip_end": "7.62.60.9", "ip_start": "7.12.60.1", "ip_num_start": 118242305, "ip_num_end": 121519113 }
-
Click Cancel to close the editor.
-
Procedure (Case 2):
-
IMPORTANT undeploy the Eventing Function (if running) case_1_enrich_ips. Access the Couchbase Web Console > Eventing page and click the function name case_1_enrich_ips link of the source bucket.
-
Click Undeploy
-
Click Undeploy Function to confirm.
-
-
We assume that the two documents from Case 1 above exist in the 'source' collection. If they don’t please create them in the 'source' collection.
-
Access the Couchbase Web Console > Buckets page and click the Scopes and Collections link of the bulk bucket.
-
Click Documents in the upper right banner for the data scope.
-
Select the keyspace bulk, data, source
-
You should see two user records (as previously created above).
{ "country": "AD", "ip_start": "5.62.60.1", "ip_end": "5.62.60.9" } { "country": "RU", "ip_start": "7.12.60.1", "ip_end": "7.62.60.9" }
-
-
From the Couchbase Web Console > Eventing page, click ADD FUNCTION, to add a new Function. The ADD FUNCTION dialog appears.
-
In the ADD FUNCTION dialog, for individual Function elements provide the below information:
-
For the Function Scope drop-down, select 'bulk.data' as the RBAC grouping.
-
For the Listen To Location drop-down, select bulk, data, source as the keyspace.
-
For the Eventing Storage drop-down, select rr100, eventing, metadata as the keyspace.
-
Enter case_2_enrich_ips as the name of the Function you are creating in the Function Name text-box.
-
Leave the "Deployment Feed Boundary" as Everything.
-
[Optional Step] Enter text On mutation create a new document in the same collection with additional fields, in the Description text-box.
-
For the Settings option, use the default values.
-
For the Bindings option, add two bindings.
-
For the only binding, select "bucket alias", specify src as the "alias name" of the collection, select bulk, data, source as the associated keyspace, and select "read and write" for the access mode.
-
-
After configuring your settings the ADD FUNCTION dialog should look like this:
-
-
After providing all the required information in the ADD FUNCTION dialog, click Next: Add Code. The case_2_enrich_ips dialog appears.
-
The case_2_enrich_ips dialog initially contains a placeholder code block. You will substitute your actual case_2_enrich_ips code in this block.
-
Copy the following Function, and paste it in the placeholder code block of case_2_enrich_ips dialog.
function OnUpdate(doc, meta) { log('document', doc); doc["ip_num_start"] = get_numip_first_3_octets(doc["ip_start"]); doc["ip_num_end"] = get_numip_first_3_octets(doc["ip_end"]); // !!! write back to the source bucket !!! src[meta.id]=doc; } function get_numip_first_3_octets(ip) { var return_val = 0; if (ip) { var parts = ip.split('.'); //IP Number = A x (256*256*256) + B x (256*256) + C x 256 + D return_val = (parts[0]*(256*256*256)) + (parts[1]*(256*256)) + (parts[2]*256) + parseInt(parts[3]); return return_val; } }
After pasting, the screen appears as displayed below:
-
Click Save and Return.
-
-
The OnUpdate routine specifies that when a change occurs to data within the bucket, the routine get_numip_first_3_octets is run on each document that contains ip_start and ip_end. A new document is created whose data and metadata are based on those of the document on which get_numip_first_3_octets is run; but with the addition of ip_num_start and ip_num_end data-fields, which contain the numeric values returned by get_numip_first_3_octets. The get_numip_first_3_octets routine splits the IP address, converts each fragment to a numeral, and adds the numerals together, to form a single value; which it returns.
-
From the Eventing screen, click the case_2_enrich_ips function to select it, then click Deploy.
-
In the Confirm Deploy Function Click Deploy Function.
-
-
The Eventing function is deployed and starts running within a few seconds. From this point, the defined Function is executed on all existing documents and will also more importantly it will also run on subsequent mutations. Unlike our fist example the documents that are the source of the mutations will be updated.
-
To check results (which were updated in real time) by the deployed Eventing Function:
-
Access the Couchbase Web Console > Buckets page and click the Scopes and Collections link of the bulk bucket.
-
Click Documents in the upper right banner for the data scope.
-
Select the keyspace bulk, data, source
-
Edit the "SampleDocument" it will have been enriched or modified with two new calculated fields:
{ "country": "AD", "ip_end": "5.62.60.9", "ip_start": "5.62.60.1", "ip_num_start": 87964673, "ip_num_end": 87964681 }
-
Edit the "AnotherSampleDocument" it will also have been enriched or modified with two new calculated fields:
{ "country": "RU", "ip_end": "7.62.60.9", "ip_start": "7.12.60.1", "ip_num_start": 118242305, "ip_num_end": 121519113 }
-
Click Cancel to close the editor.
-
-
Because our Eventing Function is deployed it will continue to process all new mutations, let’s test this out.
-
Access the Couchbase Web Console > Buckets page and click the Scopes and Collections link of the bulk bucket.
-
Click Documents in the upper right banner for the data scope.
-
Select the keyspace bulk, data, source
-
Edit at "AnotherSampleDocument" again BUT change "ip_start" to "6.12.60.1"
{ "country": "RU", "ip_end": "7.62.60.9", "ip_start": "6.12.60.1", "ip_num_start": 118242305, "ip_num_end": 121519113 }
-
Click Save to update the document and close the editor.
-
Edit at "AnotherSampleDocument" again and see the recalculation of "ip_num_start": 118242305 to "ip_num_start": 101465089 happened in real-time.
{ "country": "RU", "ip_end": "7.62.60.9", "ip_start": "6.12.60.1", "ip_num_start": 101465089, "ip_num_end": 121519113 }
-
Click Cancel to close the editor.
-
Cleanup (both Case 1 and Case 2):
Go to the Eventing portion of the UI and undeploy the Function(s) case_1_enrich_ips and case_2_enrich_ips, this will remove the 1024 documents for each function from the 'rr100.eventing.metadata' collection (in the Bucket view of the UI). Remember you may only delete the 'rr100.eventing.metadata' keyspace if there are no deployed Eventing Functions.
Now flush the 'bulk' bucket if you plan to run other examples (you may need to Edit the bucket 'bulk' and enable the flush capability).