Configuration

    +
    A description of the sections and properties defined in the sample connector config file $CBES_HOME/config/example-connector.toml (view on GitHub), followed by a description of the Consul configuration options specific to Autonomous Operations mode.
    If this is your first time working with the TOML config file format, check out Nate Finch’s excellent Intro to TOML, or the official specification.
    Some configuration properties are filesystem paths. The base for a relative path is the connector installation directory.

    Group Membership

    First, let’s define some terminology:

    connector instance

    A single connector process. When you run the cbes command, you are starting an instance of the connector. Sometimes referred to as a "connector worker", or simply an "instance."

    connector group

    A set of one or more connector instances configured to share the task of replicating from the same bucket.

    The first section of the config file tells the connector instance which group it belongs to. A connector group is defined by configuring one or more connector instances to use the same group name.

    [group]
      name = 'example-group' (1)
    1 Each connector group must be assigned a unique name (in order to keep its replication checkpoints separate). The group name is required even if there is only one connector instance in the group.
    Renaming a group invalidates its replication status checkpoint, causing the connector to start replicating from the beginning again. To preserve the checkpoint (i.e. the replication progress), run the cbes-checkpoint-backup command before renaming, and cbes-checkpoint-restore afterwards.

    Sharing the load

    A group with more than once instance is called "distributed". Each instance in a distributed group must be told which part of the workload it is responsible for.

    The default configuration is not distributed, since it specifies a group with only one member.

    [group.static] (1)
      memberNumber = 1 (2)
      totalMembers = 1 (3)
    1 This config section is named "static" because the group membership is predetermined and does not change while the connector is running. (See Autonomous Operations for an alternative mode where group membership is dynamic.)
    2 A value from 1 to 'totalMembers', inclusive. Each member in the group must be assigned a unique member number.
    3 The total number of workers in the group.
    To safely add or remove workers from a static group, first stop all running workers, then reconfigure them with the new 'totalMembers' value, and finally start all the workers again.

    Logging

    The connector’s log output is controlled by the log4j2.xml file in the config directory. Please see the Log4j 2 configuration reference for more information, including how to activate Automatic Reconfiguration.

    The connector config file has additional options for controlling higher-level logging features:

    [logging]
      logDocumentLifecycle = false (1)
      redactionLevel = 'NONE' (2)
    1 If true, document lifecycle milestones will be logged at INFO level instead of DEBUG. Enabling this feature lets you watch documents flow through the connector without having to edit the Log4j config file. Disabled by default because it generates many log messages.
    2 Determines which kinds of sensitive log messages will be tagged for later redaction by the Couchbase log redaction tool. NONE = no tagging; PARTIAL = user data is tagged; FULL = user, meta, and system data is tagged. Defaults to NONE.

    Metrics

    Performance metrics may be written to a log file and/or exposed via HTTP to enable external monitoring.

    [metrics]
      httpPort = 31415 (1)
      logInterval = '1m' (2)
    1 Set the port number to -1 to disable the embedded HTTP server.
    2 "One minute." A value of '0m' disables metrics logging.

    The metrics log file location defaults to CBES_HOME/logs/cbes-metrics.log. The location and retention policy can be changed by editing $CBES_HOME/config/log4j2.xml (consult the Log4j2 Configuration Manual for details).

    Trust Store for Secure Connections

    Here you can specify the location of a Java keystore containing the CA certificates for the Couchbase and/or Elasticsearch clusters.

    This config section is DEPRECATED and will be removed in a future version of the connector. Please use the pathToCaCertificate properties in the [couchbase] and/or [elasticsearch] sections instead.
    [truststore]
      path = 'path/to/truststore' (1)
      pathToPassword = 'secrets/truststore-password.toml' (2)
    1 The filesystem path to the Java keystore containing the CA certificates for the Couchbase and/or Elasticsearch clusters.
    2 Path to a separate TOML file with a single 'password' key.

    Client Certificates

    If secure connections are enabled, you can optionally authenticate using a client certificate instead of a username and password.

    [couchbase.clientCertificate]
      use = false (1)
      path = 'path/to/couchbase-client-cert.p12' (2)
      pathToPassword = 'secrets/couchbase-client-cert-password.toml' (3)
    
    [elasticsearch.clientCertificate] (4)
      use = false
      path = 'path/to/elasticsearch-client-cert.p12'
      pathToPassword = 'secrets/elasticsearch-client-cert-password.toml'
    1 Use a client certificate for authentication? If true, the username and password configured in the [couchbase] / [elasticsearch] section are ignored.
    2 Filesystem path to a Java keystore or PKCS12 bundle holding the private key and certificate chain.
    3 Filesystem path to a separate TOML file containing the password for the PKCS12 bundle.
    4 The client certificates for Couchbase and Elasticsearch are configured separately.

    Couchbase

    Here’s where the Couchbase connection parameters and credentials are specified.

    [couchbase]
      hosts = ['127.0.0.1'] (1)
      network = 'auto' (2)
      bucket = 'travel-sample' (3)
      metadataBucket = '' (4)
      metadataCollection = '_default._default' (5)
      username = 'Administrator' (6)
      pathToPassword = 'secrets/couchbase-password.toml' (7)
      secureConnection = false (8)
      pathToCaCertificate = '' (9)
      hostnameVerification = true (10)
      scope = '' (11)
      collections = ['myScope.widgets','myScope.invoices'] (12)
    1 A list of bootstrap nodes for the Couchbase cluster. Multiple nodes can be specified like ['host1','host2'].
    2 The network selection strategy for connecting to a Couchbase Server cluster that advertises alternate addresses. A Couchbase node running inside a container environment (like Docker or Kubernetes) might be configured to advertise both its address within the container environment (known as its "default" address) as well as an "external" address for use by clients connecting from outside the environment. Setting the network config property to default or external forces the selection of the respective addresses. Setting the value to auto tells the connector to select whichever network contains the addresses specified in the hosts config property; this heuristic works well in most environments and is the recommended mode.
    3 The source bucket to replicate from.
    4 The bucket for storing metadata like replication checkpoint documents. Empty string means store metadata in the source bucket.
    5 The name of the collection for storing metadata like replication checkpoint documents. Must be qualified by a scope name. For example: myScope.cbesCheckpoints. This collection must already exist; it will not be created by the connector. Empty string means store metadata in the default collection: _default._default.
    6 At a minimum, the Couchbase user must have the "Data DCP Reader" role for the source bucket, and "Data Reader" & "Data Writer" roles for the metadata bucket.
    7 Path to a separate TOML file with a single 'password' key.
    8 To encrypt the Couchbase connections, set secureConnection to true and also configure the pathToCaCertificate property.
    9 Path to a separate file containing the trusted Couchbase CA certificate(s) in PEM format. For more details, see Setting Up Secure Connections. If you decide not to enable secure connections, you can ignore this property.
    10 Set this to false if you are using a secure connection to Couchbase but for whatever reason need to disable TLS hostname verification. Note that disabling hostname verification will cause the TLS connection to not verify that the hostname/ip is actually associated with the certificate and as a result not detect certain kinds of attacks. Only disable if you understand the impact and can accept the risks.
    11 If you want to replicate from only one scope, name it here.
    12 If you want to replicate from a subset of collections within a scope, or collections in different scopes, name the collections here. Qualify each collection name with its parent scope, like 'scope.collection'.
    The scope and collections properties are mutually exclusive. You can specify one or the other, but not both. If you specify neither, the connector will examine every document in the bucket.
    If you want to replicate from multiple buckets, you can run a separate connector group for each bucket.

    Custom Couchbase settings

    The connector uses the Couchbase Java SDK to save checkpoint documents in Couchbase. The SDK’s "cluster environment" settings may be specified in the connector config file.

    Here’s an example that tunes the checkpoint I/O timeout settings and disables native libraries:

    [couchbase.env] (1)
      timeout.kvTimeout = '2.5s'
      security.enableNativeTls = false
      ioEnvironment.enableNativeIo = false
    1 Any property name recognized by the Couchbase Java SDK may be specified in this section.

    Each property name in this section must be one of the system properties recognized by the Couchbase Java SDK, but without the com.couchbase.env. prefix. For a list of system properties, see Java SDK Client Settings.

    DCP

    Options for the Couchbase Database Change Protocol (DCP).

    Couchbase Server notifies the connector about database changes as soon as they are stored in memory, even before they are written to disk. You can tell the connector to immediately propagate the changes to Elasticsearch, or you can defer writing to Elasticsearch until the changes have been saved to disk on all Couchbase replicas.

    Immediate propagation gives the lowest possible latency, but increases the likelihood that a Couchbase node failure will result in document changes being present in Elasticsearch but not in Couchbase.

    [couchbase.dcp]
      compression = true (1)
      flowControlBuffer = '16mb' (2)
      persistencePollingInterval = '100ms' (3)
    1 Disabling compression uses more network bandwidth and increases Couchbase Server’s CPU usage. Enabling compression increases the connector’s CPU usage.
    2 The flow control buffer limits how much data Couchbase will send before waiting for the connector to acknowledge the data has been processed. The recommended size is between 10 MiB ("10m") and 50 MiB ("50m").
    3 To propagate changes immediately, disable persistence polling by setting this to '0ms'. A non-zero duration tells the connector to defer propagation until the change is persisted on all Couchbase replicas. Longer intervals reduce network traffic at the cost of increased end-to-end latency.
    When replicating from an ephemeral bucket, always set persistencePollingInterval = '0s' to disable persistence polling, since documents are never persisted.
    Make sure to allocate enough memory to the Elasticsearch connector process to accommodate the flow control buffer, otherwise the connector might run out of memory under heavy load. Read on for details.

    There’s a separate flow control buffer for each node in the Couchbase cluster. When calculating how much memory to allocate to the Elasticsearch connector, multiply the flow control buffer size by the number of Couchbase nodes, then multiply by 2. This is the amount of memory required for the flow control buffer (not counting the connector’s baseline memory usage).

    Elasticsearch

    [elasticsearch]
      hosts = ['127.0.0.1'] (1)
      username = 'elastic' (2)
      pathToPassword = 'secrets/elasticsearch-password.toml' (3)
      secureConnection = false (4)
      pathToCaCertificate = '' (5)
    1 A list of bootstrap nodes for the Elasticsearch cluster. Multiple nodes can be specified like ['host1','host2']. Specify a custom port like ['host:19002'].
    2 Elasticsearch user to authenticate as. Username and password are only required if Elasticsearch is set up to require authentication.
    3 Path to a separate TOML file with a single 'password' key.
    4 If your Elasticsearch cluster requires secure connections, set secureConnection to true and also configure the pathToCaCertificate property.
    5 Path to a separate file containing the trusted Elasticsearch CA certificate(s) in PEM format. For more details, see Setting Up Secure Connections. If you decide not to enable secure connections, you can ignore this property.

    Bulk Request Limits

    The Elasticsearch documentation offers these guidelines for sizing bulk requests. Experiment with these parameters to see what yields the best performance with your data and cluster configuration.

    [elasticsearch.bulkRequestLimits]
      bytes = '10mb' (1)
      actions = 1000 (2)
      timeout = '1m' (3)
      concurrentRequests = 2 (4)
    1 Limits the size in bytes of a single bulk request.
    2 Limits the number of actions (index or delete) in single bulk request.
    3 A bulk request will be retried if it takes longer than this duration.
    4 Limits the number of simultaneous bulk requests the connector will make. Setting this to 1 will reduce the load on your Elasticsearch cluster.
    A bulk request is full when either the bytes limit or the actions limit is reached, whichever comes first.
    Actual bulk request size may exceed the bytes limit by approximately the size of a single document. Make sure the limit configured here is well under the Elasticsearch cluster’s http.max_content_length setting.

    Document Structure

    You control whether the Couchbase document is indexed verbatim, or whether it is transformed to include Couchbase metadata. If you decide to include metadata, it will be in a top-level field of the Elasticsearch document, with a field name of your choice. You also control whether the Couchbase document content is at the top level of the Elasticsearch document, or nested inside field named doc.

    The connector does not replicate a document’s extended attributes (xattrs).
    [elasticsearch.docStructure]
      metadataFieldName = 'meta' (1)
      documentContentAtTopLevel = false (2)
      wrapCounters = false (3)
    1 Name to assign to the metadata field, or empty string ('') to omit metadata.
    2 If false, the Elasticsearch document root will have a doc field whose value is the Couchbase document. If true, the Elasticsearch document will be identical to the Couchbase document with the possible addition of the metadata field.
    3 If false, ignore Couchbase counter documents. If true, replicate them as Object nodes like {"value":<counter>}
    The defaults mimic the behavior of version 3.x of the connector. If you don’t care about metadata, you can make the Elasticsearch document identical to the Couchbase document by setting documentContentAtTopLevel = true and metadataFieldName = ''.
    If you set documentContentAtTopLevel = true, be sure to omit metadata or select a metadata field name that does not conflict with any document fields.

    Type Definitions

    Not to be confused with Elasticsearch types — these are not those.

    A type definition is a rule for matching a document ID, and instructions for what to do with matched documents.

    The order of type definitions is significant. If a document matches more than one type, the definition that appears first in the config file takes precedence.

    Type Definition Defaults

    Here’s where you can specify the default values for all type definitions. This may be useful, for example, if you want to write all documents to the same index, or send them all through the same pipeline. The default values can be overridden by specific type definitions, which we’ll look at in just a moment.

    [elasticsearch.typeDefaults]
      index = '' (1)
      pipeline = '' (2)
      ignore = false (3)
      ignoreDeletes = false (4)
      matchOnQualifiedKey = false (5)
    1 Write matching documents to this index. Empty string ('') means "no default".
    2 Send matching documents though this pipeline. Empty string ('') means "no pipeline".
    3 If true, ignore matching documents entirely.
    4 If true, never delete matching documents from Elasticsearch.
    5 If true, the prefix and regex properties described in the next section match against the qualified document ID, which includes the document’s scope and collection. Otherwise, they match against just the document ID. For example, "scope.collection.documentId" versus just "documentId".

    Document matching rules

    Every type definition must have a rule for matching document IDs. The matching rule is specified by exactly one of the following fields:

    prefix

    A type definition with a prefix field matches any document whose ID starts with the given case-sensitive string.

    regex

    A type definition with a regex field matches any document whose ID fully matches the given Java regular expression.

    If the regular expression contains a capturing group named "index", the captured value will be used as the destination index. We’ll see an example of this shortly.

    Sample Type Definitions

    The first sample definition we’ll look at is one you should include whenever the Couchbase Sync Gateway is present. It ignores any Sync Gateway metadata documents based on their ID prefix.

    Ignore Sync Gateway Metadata

    [[elasticsearch.type]]
      prefix = '_sync:' (1)
      ignore = true (2)
    1 This type definition matches any document whose ID begins with the specified case-sensitive string.
    2 Any matched documents will be ignored completely.
    Did you notice that unlike the config sections we’ve looked at so for, the [[elasticsearch.type]] section name is enclosed in double brackets? This indicates it’s a repeated element. You can declare any number of these sections, and each one will define an additional type.

    Prefix Match

    Here’s another type definition that uses prefix matching. This time, instead of ignoring the matched documents, the connector will write them to the "airlines" index using the ingestion pipeline named "audit".

    [[elasticsearch.type]]
      prefix = 'airline_'
      index = 'airlines' (1)
      pipeline = 'audit' (2)
    1 Matching documents will be written to this Elasticsearch index.
    2 A pipeline lets you apply additional processing to a document before it is indexed.
    Specifying the empty string ('') as the prefix will match any document.

    Regular Expression Match

    Now let’s look at a type definition that matches document IDs using a Java regular expression instead of a literal prefix.

    [[elasticsearch.type]]
      regex = '.*port_.*' (1)
      index = 'ports'
    1 Matches "airport_sfo", "seaport_oakland", etc.

    Index Inference

    Finally, here’s the promised example of using a regular expression with a capturing group named "index" to set the index based on document ID.

    [[elasticsearch.type]]
      regex = '(?<index>.+?)::.*' (1)
    1 Matches IDs that start with one or more characters followed by "::". It directs "user::alice" to index "user", and "foo::bar::123" to index "foo".

    Collection and Scope match

    When the matchOnQualifiedKey property is true, the regex and prefix values match against the qualified key (for example: "scope.collection.documentId").

    You can take advantage of this to create a type rule specific to a collection, or a rule that derives the Elasticsearch index name from the Couchbase collection name.

    Match anything in a specific collection
    [[elasticsearch.type]]
      matchOnQualifiedKey = true
      prefix = 'scope.collection.' (1)
      index = 'foo'
    1 Matches any document in collection "scope.collection", and writes it to Elasticsearch index "foo".
    Couchbase collection → Elasticsearch index
    [[elasticsearch.type]]
      matchOnQualifiedKey = true
      regex = '[^.]+.(?<index>[^.]+).*' (1)
    1 Matches any document. Writes a document with qualified key "scope.collection.foo" to Elasticsearch index "collection".
    Couchbase scope.collection → Elasticsearch index
    [[elasticsearch.type]]
      matchOnQualifiedKey = true
      regex = '(?<index>[^.]+.[^.]+).*' (1)
    1 Matches any document. Writes a document with qualified key "scope.collection.foo" to Elasticsearch index "scope.collection".

    Custom Routing and Parent/Child Relationships

    In the travel-sample data model, a route is the child of an airline. Each sample route document has an airlineid field whose value is the ID of its parent airline document.

    If you tell Elasticsearch that airlineid is a join field, you can take advantage of this relationship when searching. For this to work, the connector must save each child document to the same Elasticsearch index and shard as its parent.

    [[elasticsearch.type]]
      prefix = 'route_' (1)
      index = 'airlines' (2)
      routing = '/airlineid' (3)
      ignoreDeletes = true (4)
    1 It’s just a coincidence that airline route documents are being used to demonstrate custom routing.
    2 Must specify the same index as parent document.
    3 A JSON pointer to the field to use for Elasticsearch routing. This is how the child document gets routed to the same shard as its parent.
    4 The connector is unable to delete documents that use custom routing, so ignoreDeletes must always be true for child documents.

    Rejection Log

    When Elasticsearch rejects a document (usually due to a type mapping error) the connector writes a rejection log entry document to Elasticsearch. The log entry’s document ID is the ID of the rejected Couchbase document.

    Table 1. Rejection Log Entry Fields
    Field Name Type Description

    index

    string

    Name of the index the connector failed to write to

    type

    string

    Document type name used for the write attempt

    action

    string

    Failed action type ("INDEX" or "DELETE")

    error

    string

    Error message received from Elasticsearch

    Related configuration properties:

    [elasticsearch.rejectionLog]
      index = 'cbes-rejects' (1)
    1 Rejection log entries are written to this index.
    If you’re running multiple connector groups, you may wish to use a separate rejection log index for each group.

    Environment Variables

    The connector configuration may reference environment variables.

    Variable substitution happens in a separate step before the TOML is parsed. Values are replaced verbatim, without regard to context. When using variables in string values, make sure the result conforms to the the TOML string syntax.

    Here’s an example with completely made up config properties, and placeholders that reference environment variables NAME, AGE, and FAVORITE_COLOR:

    name = '${NAME}' (1)
    age = ${AGE}
    favoriteColor = '${FAVORITE_COLOR:blue}' (2)
    1 Alternatively, you could write name = ${NAME} and include the quotes in value of the environment variable.
    2 A colon in the placeholder separates the environment variable name from the default value to use if the variable is not set. If the FAVORITE_COLOR environment variable is set, its value will be used. Otherwise, the specified default value blue will be used.
    As you may have noticed, age = ${AGE} is not valid TOML syntax. This is fine if the connector is the only program that reads the config file, but other TOML processing tools might complain. To accommodate other tools, integer and boolean values may be specified as strings. For example: age = '${AGE}'. This lets you use placeholders for integers and booleans without invalidating the TOML syntax.

    Consul

    When running the connector in Autonomous Operations mode, you can configure how the connector communicates with Consul. Configuration options specific to Consul are defined in a separate file, which may be specified on the cbes-consul command line with the --consul-config option. If this command line option is not specified, default values are used for the configuration keys in this section.

    Why are the Consul options in a separate file?

    In Autonomous Operations mode, the connector configuration is stored in Consul. The connector needs to know how to talk to Consul before it can read the connector configuration.

    The example file included in the connector distribution is called $CBES_HOME/config/consul.toml. Here’s what it looks like:

    [consul]
      aclToken = '${CBES_CONSUL_ACL_TOKEN:}' (1)
      deregisterServiceOnGracefulShutdown = true (2)
      deregisterCriticalServiceAfter = '168h' (3)
    1 Access Control List Token to include in all Consul requests.
    You should not typically need to set this value. Instead, configure the local Consul agent to use a token when talking to the Consul cluster.

    The default value is an empty string, in which case the token sent to the Consul server will be determined by the Consul agent. This example configuration shows how to read the token from the CBES_CONSUL_ACL_TOKEN environment variable if present, falling back to empty string if the variable is not set. If you prefer not to use an environment variable, you can specify the ACL token directly instead.

    2 (Since 4.4.1) Whether to automatically remove the service registration from Consul when the connector exits cleanly. If not specified, defaults to true.
    3 (Since 4.4.1) How long Consul should wait before automatically removing the service definition for a connector that fails its health check or shuts down due to an error. Valid time units are "m" for minutes or "h" for hours. If not specified, defaults to 168 hours (7 days). According to the Consul documentation:

    The minimum timeout is 1 minute, and the process that reaps critical services runs every 30 seconds, so it may take slightly longer than the configured timeout to trigger the deregistration. This should generally be configured with a timeout that’s much, much longer than any expected recoverable outage for the given service.