mirror of https://github.com/netdata/netdata.git synced 2025-03-16 05:53:38 +00:00

Database docs edits ()

Co-authored-by: ilyam8 <ilya@netdata.cloud>
This commit is contained in:
Fotis Voutsas 2024-11-25 13:34:33 +02:00 committed by GitHub
parent d4c77d7e12
commit 88defc763e
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
15 changed files with 160 additions and 311 deletions

View file

@@ -16,6 +16,7 @@ When the context is clear, we can omit the "Netdata" prefix for brevity.
| Term | Definition |
|----------------------|----------------------------------------------------|
| **Tier** (**Tiers**) | Database storage layers with different granularity |
+| **Mode(s)** | The different Modes of the Database |
## Streaming

View file

@@ -81,7 +81,7 @@ beyond stored historical metrics, you'll see this message:
![image](https://user-images.githubusercontent.com/70198089/225851033-43b95164-a651-48f2-8915-6aac9739ed93.png)
-At any time, [configure the internal TSDB's storage capacity](/docs/netdata-agent/configuration/optimizing-metrics-database/change-metrics-storage.md) to expand your
+At any time, [configure the internal TSDB's storage capacity](/src/database/README.md) to expand your
depth of historical metrics.
### Timezone selector

View file

@@ -2,7 +2,7 @@
## Deployment Options Overview
-This section provides a quick overview for a few common deployment options for Netdata.
+This section provides a quick overview of a few common deployment options for Netdata.
You can read about [Standalone Deployment](/docs/deployment-guides/standalone-deployment.md) and [Deployment with Centralization Points](/docs/deployment-guides/deployment-with-centralization-points.md) in the documentation inside this section.
@@ -24,9 +24,9 @@ An API key is a key created with `uuidgen` and is used for authentication and/or
#### Child config
-As mentioned above, we do not recommend to connect the Child to Cloud directly during your setup.
+As mentioned above, we do not recommend connecting the Child to Cloud directly during your setup.
-This is done in order to reduce the footprint of the Netdata Agent on your production system, as some capabilities can be switched OFF for the Child and kept ON for the Parent.
+This is done to reduce the footprint of the Netdata Agent on your production system, as some capabilities can be switched OFF for the Child and kept ON for the Parent.
In this example, Machine Learning and Alerting are disabled for the Child, so that the Parent can take the load. We also use RAM instead of disk to store metrics with limited retention, covering temporary network issues.
@@ -34,14 +34,14 @@ In this example, Machine Learning and Alerting are disabled for the Child, so th
On the child node, edit `netdata.conf` by using the [edit-config](/docs/netdata-agent/configuration/README.md#edit-a-configuration-file-using-edit-config) script and set the following parameters:
-```yaml
+```text
[db]
# https://github.com/netdata/netdata/blob/master/src/database/README.md
# none = no retention, ram = some retention in ram
mode = ram
# The retention in seconds.
-# This provides some tolerance to the time the child has to find a parent in
-# order to transfer the data. For IoT this can be lowered to 120.
+# This provides some tolerance to the time the child has to find a parent
+# to transfer the data. For IoT, this can be lowered to 120.
retention = 1200
# The granularity of metrics, in seconds.
# You may increase this to lower CPU resources.
@@ -56,8 +56,7 @@ On the child node, edit `netdata.conf` by using the [edit-config](/docs/netdata-
# Disable remote access to the local dashboard
bind to = lo
[plugins]
-# Uncomment the following line to disable all external plugins on extreme
-# IoT cases by default.
+# Uncomment the following line to disable all external plugins on extreme IoT cases by default.
# enable running new plugins = no
```
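As a rough check on what `mode = ram` with `retention = 1200` costs, here is a back-of-the-envelope sketch. This is illustrative only: the 4-bytes-per-sample figure and the 2,000-metric count are assumptions for the example, not values from this commit.

```python
def ram_mode_memory_mib(metrics: int, retention_s: int, update_every_s: int,
                        bytes_per_sample: int = 4) -> float:
    """Rough in-memory footprint for `mode = ram`: one slot per stored sample."""
    samples_per_metric = retention_s // update_every_s
    return metrics * samples_per_metric * bytes_per_sample / (1024 ** 2)

# e.g. 2,000 metrics, the 1200 s retention above, 1 s collection frequency
print(round(ram_mode_memory_mib(2000, 1200, 1), 1))  # → 9.2
```

Doubling `update every` halves both the sample count and this estimate, which is the same trade-off the IoT note above describes.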
@@ -65,7 +64,7 @@ On the child node, edit `netdata.conf` by using the [edit-config](/docs/netdata-
To edit `stream.conf`, use again the [edit-config](/docs/netdata-agent/configuration/README.md#edit-a-configuration-file-using-edit-config) script and set the following parameters:
-```yaml
+```text
[stream]
# Stream metrics to another Netdata
enabled = yes
@@ -77,7 +76,7 @@ To edit `stream.conf`, use again the [edit-config](/docs/netdata-agent/configura
#### Parent config
-For the Parent, besides setting up streaming, this example also provides configuration for multiple [tiers of metrics storage](/docs/netdata-agent/configuration/optimizing-metrics-database/change-metrics-storage.md), for 10 Children, with about 2k metrics each. This allows for:
+For the Parent, besides setting up streaming, this example also provides configuration for multiple [tiers of metrics storage](/src/database/README.md#tiers), for 10 Children, with about 2k metrics each. This allows for:
- 1s granularity at tier 0 for 1 week
- 1m granularity at tier 1 for 1 month
@@ -92,7 +91,7 @@ Requiring:
On the Parent, edit `netdata.conf` by using the [edit-config](/docs/netdata-agent/configuration/README.md#edit-a-configuration-file-using-edit-config) script and set the following parameters:
-```yaml
+```text
[db]
mode = dbengine
dbengine tier backfill = new
@@ -124,19 +123,19 @@ On the Parent node, edit `stream.conf` by using the [edit-config](/docs/netdata-
```yaml
[API_KEY]
# Accept metrics streaming from other Agents with the specified API key
enabled = yes
# Accept metrics streaming from other Agents with the specified API key
enabled = yes
```
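The API key above is simply a UUID (the docs suggest generating it with `uuidgen`). Where `uuidgen` is unavailable, an equivalent key can be produced with Python's standard library; this is an illustrative alternative, not part of the Netdata tooling:

```python
import uuid

# A random UUIDv4, equivalent to the output of `uuidgen`,
# usable as the [API_KEY] section name in stream.conf
api_key = str(uuid.uuid4())
print(api_key)
```

The same key must then appear in the Child's `[stream].api key` and the Parent's `[API_KEY]` section.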
### Active–Active Parents
-In order to setup active–active streaming between Parent 1 and Parent 2, Parent 1 needs to be instructed to stream data to Parent 2 and Parent 2 to stream data to Parent 1. The Child Agents need to be configured with the addresses of both Parent Agents. An Agent will only connect to one Parent at a time, falling back to the next upon failure. These examples use the same API key between Parent Agents and for connections for Child Agents.
+To set up active–active streaming between Parent 1 and Parent 2, Parent 1 needs to be instructed to stream data to Parent 2 and Parent 2 to stream data to Parent 1. The Child Agents need to be configured with the addresses of both Parent Agents. An Agent will only connect to one Parent at a time, falling back to the next upon failure. These examples use the same API key between Parent Agents and for connections for Child Agents.
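The fallback behavior described above (one Parent at a time, next on failure) can be sketched as a small selection loop. This is an illustration of the ordering semantics, not Netdata code; `is_reachable` stands in for a TCP connection attempt:

```python
def pick_parent(parents, is_reachable):
    """Return the first reachable destination, mirroring the failover order."""
    for destination in parents:
        if is_reachable(destination):
            return destination
    return None  # no parent available; the Agent keeps retrying

# Child configured with both Parents; Parent 1 is down, so Parent 2 is used
parents = ["parent-1:19999", "parent-2:19999"]
print(pick_parent(parents, lambda p: p != "parent-1:19999"))
```

Because both Parents accept the same API key, the Child does not care which one it lands on.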
On both Netdata Parent and all Child Agents, edit `stream.conf` by using the [edit-config](/docs/netdata-agent/configuration/README.md#edit-a-configuration-file-using-edit-config) script:
#### stream.conf on Parent 1
```yaml
```text
[stream]
# Stream metrics to another Netdata
enabled = yes
@@ -147,11 +146,12 @@ On both Netdata Parent and all Child Agents, edit `stream.conf` by using the [ed
[API_KEY]
# Accept metrics streams from Parent 2 and Child Agents
enabled = yes
```
#### stream.conf on Parent 2
-```yaml
+```text
[stream]
# Stream metrics to another Netdata
enabled = yes
@@ -165,7 +165,7 @@ On both Netdata Parent and all Child Agents, edit `stream.conf` by using the [ed
#### stream.conf on Child Agents
-```yaml
+```text
[stream]
# Stream metrics to another Netdata
enabled = yes
@@ -193,7 +193,7 @@ We also suggest that you:
For increased security, user management and access to our latest features, tools and troubleshooting solutions.
-2. [Change how long Netdata stores metrics](/docs/netdata-agent/configuration/optimizing-metrics-database/change-metrics-storage.md)
+2. [Change how long Netdata stores metrics](/src/database/README.md#modes)
To control Netdata's memory use, when you have a lot of ephemeral metrics.

View file

@@ -9,7 +9,7 @@ the OpenTSDB and Graphite connectors.
>
> When you enable the exporting engine and a connector, the Netdata Agent exports metrics _beginning from the time you
> restart its process_, not the entire
-> [database of long-term metrics](/docs/netdata-agent/configuration/optimizing-metrics-database/change-metrics-storage.md).
+> [database of long-term metrics](/src/database/README.md).
Once you understand how to enable a connector, you can apply that knowledge to any other connector.

View file

@@ -35,7 +35,7 @@ Use the alphabetized list below to find the answer to your single-term questions
- [**Cloud** or **Netdata Cloud**](/docs/netdata-cloud/README.md): Netdata Cloud is a web application that gives you real-time visibility for your entire infrastructure. With Netdata Cloud, you can view key metrics, insightful charts, and active alerts from all your nodes in a single web interface.
-- [**Collector**](/src/collectors/README.md#collector-architecture-and-terminology): A catch-all term for any Netdata process that gathers metrics from an endpoint.
+- [**Collector**](/src/collectors/README.md): A catch-all term for any Netdata process that gathers metrics from an endpoint.
- [**Community**](https://community.netdata.cloud/): As a company with a passion and genesis in open-source, we are not just very proud of our community, but we consider our users, fans, and chatters to be an imperative part of the Netdata experience and culture.
@@ -75,7 +75,7 @@ Use the alphabetized list below to find the answer to your single-term questions
## I
-- [**Internal plugins**](/src/collectors/README.md#collector-architecture-and-terminology): These gather metrics from `/proc`, `/sys`, and other Linux kernel sources. They are written in `C` and run as threads within the Netdata daemon.
+- [**Internal plugins**](/src/collectors/README.md): These gather metrics from `/proc`, `/sys`, and other Linux kernel sources. They are written in `C` and run as threads within the Netdata daemon.
## K
@@ -91,7 +91,7 @@ Use the alphabetized list below to find the answer to your single-term questions
- [**Metrics Exporting**](/docs/exporting-metrics/README.md): Netdata allows you to export metrics to external time-series databases with the exporting engine. This system uses a number of connectors to initiate connections to more than thirty supported databases, including InfluxDB, Prometheus, Graphite, ElasticSearch, and much more.
-- [**Metrics Storage**](/docs/netdata-agent/configuration/optimizing-metrics-database/change-metrics-storage.md): Upon collection the collected metrics need to be either forwarded, exported or just stored for further treatment. The Agent is capable to store metrics both short and long-term, with or without the usage of non-volatile storage.
+- [**Metrics Storage**](/src/database/README.md#modes): Upon collection the collected metrics need to be either forwarded, exported or just stored for further treatment. The Agent is capable to store metrics both short and long-term, with or without the usage of non-volatile storage.
- [**Metrics Streaming Replication**](/docs/observability-centralization-points/README.md): Each node running Netdata can stream the metrics it collects, in real time, to another node. Metric streaming allows you to replicate metrics data across multiple nodes, or centralize all your metrics data into a single time-series database (TSDB).
@@ -111,7 +111,7 @@ metrics, troubleshoot complex performance problems, and make data interoperable
- [**Obsoletion**(of nodes)](/docs/dashboards-and-charts/nodes-tab.md): Removing nodes from a space.
-- [**Orchestrators**](/src/collectors/README.md#collector-architecture-and-terminology): External plugins that run and manage one or more modules. They run as independent processes.
+- [**Orchestrators**](/src/collectors/README.md): External plugins that run and manage one or more modules. They run as independent processes.
## P

View file

@@ -19,7 +19,7 @@ changes reflected in those visualizations due to the way Netdata Cloud proxies m
### Increase the long-term metrics retention period
-Read our doc on [increasing long-term metrics storage](/docs/netdata-agent/configuration/optimizing-metrics-database/change-metrics-storage.md) for details.
+Read our doc on [increasing long-term metrics storage](/src/database/README.md#tiers) for details.
## Modify alerts and notifications

View file

@@ -17,16 +17,16 @@ Netdata for production use.
The following table summarizes the effect of each optimization on the CPU, RAM and Disk IO utilization in production.
-| Optimization | CPU | RAM | Disk IO |
-|-----------------------------------------------------------------------------------------------------------------------------------|--------------------|--------------------|--------------------|
-| [Use streaming and replication](#use-streaming-and-replication) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
-| [Disable unneeded plugins or collectors](#disable-unneeded-plugins-or-collectors) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
-| [Reduce data collection frequency](#reduce-collection-frequency) | :heavy_check_mark: | | :heavy_check_mark: |
-| [Change how long Netdata stores metrics](/docs/netdata-agent/configuration/optimizing-metrics-database/change-metrics-storage.md) | | :heavy_check_mark: | :heavy_check_mark: |
-| [Use a different metric storage database](/src/database/README.md) | | :heavy_check_mark: | :heavy_check_mark: |
-| [Disable machine learning](#disable-machine-learning) | :heavy_check_mark: | | |
-| [Use a reverse proxy](#run-netdata-behind-a-proxy) | :heavy_check_mark: | | |
-| [Disable/lower gzip compression for the Agent dashboard](#disablelower-gzip-compression-for-the-dashboard) | :heavy_check_mark: | | |
+| Optimization | CPU | RAM | Disk IO |
+|------------------------------------------------------------------------------------------------------------|--------------------|--------------------|--------------------|
+| [Use streaming and replication](#use-streaming-and-replication) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
+| [Disable unneeded plugins or collectors](#disable-unneeded-plugins-or-collectors) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
+| [Reduce data collection frequency](#reduce-collection-frequency) | :heavy_check_mark: | | :heavy_check_mark: |
+| [Change how long Netdata stores metrics](/src/database/README.md#tiers) | | :heavy_check_mark: | :heavy_check_mark: |
+| [Use a different metric storage database](/src/database/README.md) | | :heavy_check_mark: | :heavy_check_mark: |
+| [Disable machine learning](#disable-machine-learning) | :heavy_check_mark: | | |
+| [Use a reverse proxy](#run-netdata-behind-a-proxy) | :heavy_check_mark: | | |
+| [Disable/lower gzip compression for the Agent dashboard](#disablelower-gzip-compression-for-the-dashboard) | :heavy_check_mark: | | |
## Resources required by a default Netdata installation
@@ -72,7 +72,7 @@ The memory footprint of Netdata is mainly influenced by the number of metrics co
To estimate and control memory consumption, you can (either one or a combination of the following actions):
1. [Disable unneeded plugins or collectors](#disable-unneeded-plugins-or-collectors)
-2. [Change how long Netdata stores metrics](/docs/netdata-agent/configuration/optimizing-metrics-database/change-metrics-storage.md)
+2. [Change how long Netdata stores metrics](/src/database/README.md#tiers)
3. [Use a different metric storage database](/src/database/README.md).
### Disk footprint and I/O
@@ -90,7 +90,7 @@ To optimize your disk footprint in any aspect described below, you can:
To configure retention, you can:
-1. [Change how long Netdata stores metrics](/docs/netdata-agent/configuration/optimizing-metrics-database/change-metrics-storage.md).
+1. [Change how long Netdata stores metrics](/src/database/README.md#tiers).
To control disk I/O:
@@ -127,8 +127,7 @@ See [using a different metric storage database](/src/database/README.md).
## Disable unneeded plugins or collectors
-If you know that you don't need an [entire plugin or a specific
-collector](/src/collectors/README.md#collector-architecture-and-terminology),
+If you know that you don't need an [entire plugin or a specific collector](/src/collectors/README.md),
you can disable any of them. Keep in mind that if a plugin/collector has nothing to do, it simply shuts down and doesn't consume system resources. You will only improve the Agent's performance by disabling plugins/collectors that are
actively collecting metrics.
@@ -191,8 +190,7 @@ every` for an individual collector is less than the global, the Netdata Agent us
the [collectors configuration reference](/src/collectors/REFERENCE.md) for
details.
-To reduce the frequency of
-an [internal_plugin/collector](/src/collectors/README.md#collector-architecture-and-terminology),
+To reduce the frequency of an [internal_plugin/collector](/src/collectors/README.md),
open `netdata.conf` and find the appropriate section. For example, to reduce the frequency of the `apps` plugin, which
collects and visualizes metrics on application resource utilization:
@@ -213,7 +211,7 @@ update_every: 10
## Lower memory usage for metrics retention
See how
-to [change how long Netdata stores metrics](/docs/netdata-agent/configuration/optimizing-metrics-database/change-metrics-storage.md).
+to [change how long Netdata stores metrics](/src/database/README.md#tiers).
## Use a different metric storage database

View file

@@ -1,138 +0,0 @@
# Change how long Netdata stores metrics
Netdata offers a granular approach to data retention, allowing you to manage storage based on both **time** and **disk
space**. This provides greater control and helps you optimize storage usage for your specific needs.
**Default Retention Limits**:
| Tier | Resolution | Time Limit | Size Limit (min 256 MB) |
|:----:|:-------------------:|:----------:|:-----------------------:|
| 0 | high (per second) | 14d | 1 GiB |
| 1 | middle (per minute) | 3mo | 1 GiB |
| 2 | low (per hour) | 2y | 1 GiB |
> **Note**: If a user sets a disk space size less than 256 MB for a tier, Netdata will automatically adjust it to 256 MB.
With these defaults, Netdata requires approximately 4 GiB of storage space (including metadata).
## Retention Settings
> **In a parent-child setup**, these settings manage the shared storage space used by the Netdata parent Agent for storing metrics collected by both the parent and its child nodes.
You can fine-tune retention for each tier by setting a time limit or size limit. Setting a limit to 0 disables it,
allowing for no time-based deletion for that tier or using all available space, respectively. This enables various
retention strategies as shown in the table below:
| Setting | Retention Behavior |
|--------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|
| Size Limit = 0, Time Limit > 0 | **Time-based only:** data is stored for a specific duration regardless of disk usage. |
| Time Limit = 0, Size Limit > 0 | **Space-based only:** data is stored until it reaches a certain amount of disk space, regardless of time. |
| Time Limit > 0, Size Limit > 0 | **Combined time and space limits:** data is deleted once it reaches either the time limit or the disk space limit, whichever comes first. |
You can change these limits in `netdata.conf`:
```text
[db]
mode = dbengine
storage tiers = 3
# Tier 0, per second data. Set to 0 for no limit.
dbengine tier 0 retention size = 1GiB
dbengine tier 0 retention time = 14d
# Tier 1, per minute data. Set to 0 for no limit.
dbengine tier 1 retention size = 1GiB
dbengine tier 1 retention time = 3mo
# Tier 2, per hour data. Set to 0 for no limit.
dbengine tier 2 retention size = 1GiB
dbengine tier 2 retention time = 2y
```
## Monitoring Retention Utilization
Netdata provides a visual representation of storage utilization for both time and space limits across all tiers within
the 'dbengine retention' subsection of the 'Netdata Monitoring' section on the dashboard. This chart shows exactly how
your storage space (disk space limits) and time (time limits) are used for metric retention.
## Legacy configuration
### v1.99.0 and prior
Netdata prior to v2 supports the following configuration options in `netdata.conf`.
They have the same defaults as the latest v2, but the unit of each value is given in the option name, not at the value.
```text
storage tiers = 3
# Tier 0, per second data. Set to 0 for no limit.
dbengine tier 0 disk space MB = 1024
dbengine tier 0 retention days = 14
# Tier 1, per minute data. Set to 0 for no limit.
dbengine tier 1 disk space MB = 1024
dbengine tier 1 retention days = 90
# Tier 2, per hour data. Set to 0 for no limit.
dbengine tier 2 disk space MB = 1024
dbengine tier 2 retention days = 730
```
### v1.45.6 and prior
Netdata versions prior to v1.46.0 relied on a disk space-based retention.
**Default Retention Limits**:
| Tier | Resolution | Size Limit |
|:----:|:-------------------:|:----------:|
| 0 | high (per second) | 256 MB |
| 1 | middle (per minute) | 128 MB |
| 2 | low (per hour) | 64 GiB |
You can change these limits in `netdata.conf`:
```text
[db]
mode = dbengine
storage tiers = 3
# Tier 0, per second data
dbengine multihost disk space MB = 256
# Tier 1, per minute data
dbengine tier 1 multihost disk space MB = 1024
# Tier 2, per hour data
dbengine tier 2 multihost disk space MB = 1024
```
### v1.35.1 and prior
These versions of the Agent do not support tiers. You could change the metric retention for the parent and
all of its children only with the `dbengine multihost disk space MB` setting. This setting accounts the space allocation
for the parent node and all of its children.
To configure the database engine, look for the `page cache size MB` and `dbengine multihost disk space MB` settings in
the `[db]` section of your `netdata.conf`.
```text
[db]
dbengine page cache size MB = 32
dbengine multihost disk space MB = 256
```
### v1.23.2 and prior
_For Netdata Agents earlier than v1.23.2_, the Agent on the parent node uses one dbengine instance for itself, and
another instance for every child node it receives metrics from. If you had four streaming nodes, you would have five
instances in total (`1 parent + 4 child nodes = 5 instances`).
The Agent allocates resources for each instance separately using the `dbengine disk space MB` (**deprecated**) setting.
If `dbengine disk space MB`(**deprecated**) is set to the default `256`, each instance is given 256 MiB in disk space,
which means the total disk space required to store all instances is,
roughly, `256 MiB * 1 parent * 4 child nodes = 1280 MiB`.
#### Backward compatibility
All existing metrics belonging to child nodes are automatically converted to legacy dbengine instances and the localhost
metrics are transferred to the multihost dbengine instance.
All new child nodes are automatically transferred to the multihost dbengine instance and share its page cache and disk
space. If you want to migrate a child node from its legacy dbengine instance to the multihost dbengine instance, you
must delete the instance's directory, which is located in `/var/cache/netdata/MACHINE_GUID/dbengine`, after stopping the
Agent.

View file

@@ -36,7 +36,7 @@ gantt
**Configuring dbengine mode and retention**:
- Enable dbengine mode: The dbengine mode is already the default, so no configuration change is necessary. For reference, the dbengine mode can be configured by setting `[db].mode` to `dbengine` in `netdata.conf`.
-- Adjust retention (optional): see [Change how long Netdata stores metrics](/docs/netdata-agent/configuration/optimizing-metrics-database/change-metrics-storage.md).
+- Adjust retention (optional): see [Change how long Netdata stores metrics](/src/database/README.md#tiers).
## `ram`

View file

@@ -1,6 +1,6 @@
# RAM Utilization
-Using the default [Database Tier configuration](/docs/netdata-agent/configuration/optimizing-metrics-database/change-metrics-storage.md), Netdata needs about 16KiB per unique metric collected, independently of the data collection frequency.
+Using the default [Database Tier configuration](/src/database/README.md#tiers), Netdata needs about 16KiB per unique metric collected, independently of the data collection frequency.
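The 16 KiB-per-unique-metric rule of thumb above makes RAM sizing a one-line calculation. A quick sketch (the 100,000-metric example is an assumption for illustration, not a figure from this document):

```python
def dbengine_ram_gib(unique_metrics: int, kib_per_metric: float = 16) -> float:
    """Apply the ~16 KiB-per-unique-metric rule of thumb for dbengine RAM."""
    return unique_metrics * kib_per_metric / (1024 ** 2)

# e.g. a Parent holding 100,000 unique metrics across all its nodes
print(round(dbengine_ram_gib(100_000), 2))  # → 1.53
```

Because the estimate is independent of collection frequency, only the number of unique metrics matters here.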
## Children

View file

@@ -337,7 +337,7 @@ Using the above configuration `myapp` should get its own section on the dashboar
- `gaps when not collected = yes|no`, enables or disables gaps on the charts of the application in case that no metrics are collected.
- `memory mode` sets the memory mode for all charts of the application. The default is the global default for Netdata (not the global default for StatsD private charts). We suggest not to use this (we have commented it out in the example) and let your app use the global default for Netdata, which is our dbengine.
-- `history` sets the size of the round-robin database for this application. The default is the global default for Netdata (not the global default for StatsD private charts). This is only relevant if you use `memory mode = save`. Read more on our [metrics storage(]/docs/netdata-agent/configuration/optimizing-metrics-database/change-metrics-storage.md) doc.
+- `history` sets the size of the round-robin database for this application. The default is the global default for Netdata (not the global default for StatsD private charts). This is only relevant if you use `memory mode = save`. Read more on our documentation for the Agent's [Database](/src/database/README.md).
`[dictionary]` defines name-value associations. These are used to renaming metrics, when added to synthetic charts. Metric names are also defined at each `dimension` line. However, using the dictionary dimension names can be declared globally, for each app and is the only way to rename dimensions when using patterns. Of course the dictionary can be empty or missing.

View file

@@ -1,153 +1,141 @@
# Database
-Netdata is fully capable of long-term metrics storage, at per-second granularity, via its default database engine
-(`dbengine`). But to remain as flexible as possible, Netdata supports several storage options:
+Netdata stores detailed metrics at one-second granularity using its Database engine.
-1. `dbengine`, (the default) data are in database files. The [Database Engine](/src/database/engine/README.md) works like a
-   traditional database. There is some amount of RAM dedicated to data caching and indexing and the rest of the data
-   reside compressed on disk. The number of history entries is not fixed in this case, but depends on the configured
-   disk space and the effective compression ratio of the data stored. This is the **only mode** that supports changing
-   the data collection update frequency (`update every`) **without losing** the previously stored metrics. For more
-   details see [here](/src/database/engine/README.md).
+## Modes
-2. `ram`, data are purely in memory. Data are never saved on disk. This mode uses `mmap()` and supports [KSM](#ksm).
+| Mode | Description |
+|------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `dbengine` | Stores data in a database with RAM for caching and indexing, while keeping compressed data on disk. Storage capacity depends on available disk space and data compression ratio. For details, see [Database Engine](/src/database/engine/README.md). |
+| `ram` | Stores data entirely in memory without disk persistence. |
+| `none` | Operates without storage (metrics can only be streamed to another Agent). |
-3. `alloc`, like `ram` but it uses `calloc()` and does not support [KSM](#ksm). This mode is the fallback for all others
-   except `none`.
+The default `dbengine` mode is optimized for:
-4. `none`, without a database (collected metrics can only be streamed to another Netdata).
+- Long-term data retention
+- Parent nodes in [Centralization](/docs/observability-centralization-points/README.md) setups
-## Which database mode to use
+For resource-constrained environments, particularly Child nodes in Centralization setups, consider using `ram`.
The default mode `[db].mode = dbengine` has been designed to scale for longer retentions and is the only mode suitable
for parent Agents in the _Parent - Child_ setups
The other available database modes are designed to minimize resource utilization and should only be considered on
[Parent - Child](/docs/observability-centralization-points/README.md) setups at the children side and only when the
resource constraints are very strict.
So,
- On a single node setup, use `[db].mode = dbengine`.
- On a [Parent - Child](/docs/observability-centralization-points/README.md) setup, use `[db].mode = dbengine` on the
parent to increase retention, and a more resource-efficient mode like, `dbengine` with light retention settings, `ram`, or `none` for the children to minimize resource utilization.
## Choose your database mode
-You can select the database mode by editing `netdata.conf` and setting:
+Use [`edit-config`](/docs/netdata-agent/configuration/README.md#edit-a-configuration-file-using-edit-config) to open `netdata.conf` and set your preferred mode:
```text
[db]
-# dbengine (default), ram (the default if dbengine not available), alloc, none
+# dbengine, ram, none
mode = dbengine
```
-## Netdata Longer Metrics Retention
+## Tiers
-Metrics retention is controlled only by the disk space allocated to storing metrics. But it also affects the memory and
-CPU required by the Agent to query longer timeframes.
+Netdata offers a granular approach to data retention, allowing you to manage storage based on both **time** and **disk space**. This provides greater control and helps you optimize storage usage for your specific needs.
-Since Netdata Agents usually run on the edge, on production systems, Netdata Agent **parents** should be considered.
-When having a [**parent - child**](/docs/observability-centralization-points/README.md) setup, the child (the
-Netdata Agent running on a production system) delegates all of its functions, including longer metrics retention and
-querying, to the parent node that can dedicate more resources to this task. A single Netdata Agent parent can centralize
-multiple children Netdata Agents (dozens, hundreds, or even thousands depending on its available resources).
+**Default Retention Limits**:
-## Running Netdata on embedded devices
+| Tier | Resolution | Time Limit | Size Limit (min 256 MB) |
+|:----:|:-------------------:|:----------:|:-----------------------:|
+| 0 | high (per second) | 14d | 1 GiB |
+| 1 | middle (per minute) | 3mo | 1 GiB |
+| 2 | low (per hour) | 2y | 1 GiB |
-Embedded devices typically have very limited RAM resources available.
+> **Note**
+>
+> If a user sets a disk space size less than 256 MB for a tier, Netdata will automatically adjust it to 256 MB.
-There are two settings for you to configure:
+With these defaults, Netdata requires approximately 4 GiB of storage space (including metadata).
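The ~4 GiB total follows from the per-tier defaults plus metadata. A quick sanity check; the 1 GiB metadata allowance is an assumption chosen to match the stated total, not a documented value:

```python
GiB = 1024 ** 3
tier_size_limits = [1 * GiB, 1 * GiB, 1 * GiB]  # default size limits, tiers 0-2
metadata_allowance = 1 * GiB                    # rough allowance, an assumption
total = sum(tier_size_limits) + metadata_allowance
print(total / GiB)  # → 4.0
```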
1. `[db].update every`, which controls the data collection frequency
2. `[db].retention`, which controls the size of the database in memory (except for `[db].mode = dbengine`)
### Retention Settings
By default `[db].update every = 1` and `[db].retention = 3600`. This gives you an hour of data with per second updates.
> **Important**
>
> In a Parent-Child setup, these settings manage the entire storage space used by the Parent for storing metrics collected both by itself and its Children.
If you set `[db].update every = 2` and `[db].retention = 1800`, you will still have an hour of data, but collected once
every 2 seconds. This will **cut in half** both CPU and RAM resources consumed by Netdata. Of course experiment a bit to find the right setting.
On very weak devices you might have to use `[db].update every = 5` and `[db].retention = 720` (still 1 hour of data, but
1/5 of the CPU and RAM resources).
You can fine-tune retention for each tier by setting a time limit or size limit. Setting a limit to 0 disables it. This enables the following retention strategies:
You can also disable [data collection plugins](/src/collectors/README.md) that you don't need. Disabling such plugins will also
free both CPU and RAM resources.
| Setting | Retention Behavior |
|--------------------------------|------------------------------------------------------------------------------------------------------------------------------------------|
| Size Limit = 0, Time Limit > 0 | **Time based:** data is stored for a specific duration regardless of disk usage |
| Time Limit = 0, Size Limit > 0 | **Space based:** data is stored with a disk space limit, regardless of time |
| Time Limit > 0, Size Limit > 0 | **Combined time and space limits:** data is deleted once it reaches either the time limit or the disk space limit, whichever comes first |
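The decision logic of the table above can be sketched as a small helper (hypothetical function, a limit of 0 means disabled):

```python
def retention_strategy(time_limit: int, size_limit: int) -> str:
    """Return which retention strategy the given tier limits select (sketch)."""
    if time_limit > 0 and size_limit > 0:
        return "combined: data is deleted when either limit is reached"
    if time_limit > 0:
        return "time based"
    if size_limit > 0:
        return "space based"
    return "unlimited (both limits disabled)"
```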
## Memory optimizations
You can change these limits using [`edit-config`](/docs/netdata-agent/configuration/README.md#edit-a-configuration-file-using-edit-config) to open `netdata.conf`:
### KSM
```text
[db]
mode = dbengine
storage tiers = 3
KSM performs memory deduplication by scanning through main memory for physical pages that have identical content, and
identifies the virtual pages that are mapped to those physical pages. It leaves one page unchanged, and re-maps each
duplicate page to point to the same physical page. Netdata offers all of its in-memory database to the kernel for
deduplication.
# Tier 0, per second data. Set to 0 for no limit.
dbengine tier 0 retention size = 1GiB
dbengine tier 0 retention time = 14d
In the past, KSM has been criticized for consuming a lot of CPU resources. This is true when KSM is used to
deduplicate certain applications, but it is not true for Netdata. The Agent's memory is written very infrequently
(if you have 24 hours of metrics in Netdata, each byte in the in-memory database will be updated just once per day). KSM
can provide 60+% memory savings to Netdata.
# Tier 1, per minute data. Set to 0 for no limit.
dbengine tier 1 retention size = 1GiB
dbengine tier 1 retention time = 3mo
### Enable KSM in the kernel
To enable KSM, you need to run a kernel compiled with the following option:
```sh
CONFIG_KSM=y
# Tier 2, per hour data. Set to 0 for no limit.
dbengine tier 2 retention size = 1GiB
dbengine tier 2 retention time = 2y
```
When KSM is compiled into the kernel, it is only made available; the user still has to enable it at runtime.
### Monitoring Retention Utilization
If you build a kernel with `CONFIG_KSM=y`, you will just get a few files in `/sys/kernel/mm/ksm`. Nothing else
happens. There is no performance penalty (apart from the memory this code occupies in the kernel).
Netdata provides a visual representation of storage utilization for both the time and space limits across all Tiers under "Netdata" -> "dbengine retention" on the dashboard. This chart shows exactly how your storage space (disk space limits) and time (time limits) are used for metric retention.
The files that `CONFIG_KSM=y` offers include:
### Legacy configuration
- `/sys/kernel/mm/ksm/run`, by default `0`. You have to set this to `1` for the kernel to spawn `ksmd`.
- `/sys/kernel/mm/ksm/sleep_millisecs`, by default `20`. How frequently `ksmd` should evaluate memory for deduplication.
- `/sys/kernel/mm/ksm/pages_to_scan`, by default `100`. The number of pages `ksmd` will evaluate on each run.
<details><summary>v1.99.0 and prior</summary>
So, by default `ksmd` is disabled. It will not harm performance, and the user/admin can control the CPU resources
they are willing to dedicate to `ksmd`.
Netdata prior to v2 supports the following configuration options in `netdata.conf`.
They have the same defaults as the latest v2, but the unit of each value is given in the option name, not in the value.
### Run `ksmd` kernel daemon
To activate and run `ksmd`, you need to run the following:

```sh
echo 1 >/sys/kernel/mm/ksm/run
echo 1000 >/sys/kernel/mm/ksm/sleep_millisecs
```
```text
storage tiers = 3
# Tier 0, per second data. Set to 0 for no limit.
dbengine tier 0 disk space MB = 1024
dbengine tier 0 retention days = 14
# Tier 1, per minute data. Set to 0 for no limit.
dbengine tier 1 disk space MB = 1024
dbengine tier 1 retention days = 90
# Tier 2, per hour data. Set to 0 for no limit.
dbengine tier 2 disk space MB = 1024
dbengine tier 2 retention days = 730
```
With these settings, `ksmd` does not even appear in the running process list (it will run once per second and evaluate 100
pages for de-duplication).
</details>
Put the above lines in your boot sequence (`/etc/rc.local` or equivalent) to have `ksmd` run at boot.
<details><summary>v1.45.6 and prior</summary>
### Monitoring Kernel Memory de-duplication performance
Netdata versions prior to v1.46.0 relied on disk space-based retention.
Netdata will create charts for kernel memory de-duplication performance; the **deduper (ksm)** charts can be seen under the **Memory** section in the Netdata UI.
**Default Retention Limits**:
#### KSM summary
| Tier | Resolution | Size Limit |
|:----:|:-------------------:|:----------:|
| 0 | high (per second) | 256 MB |
| 1 | middle (per minute) | 128 MB |
| 2 | low (per hour) | 64 MB |
The summary gives you a quick idea of how much memory KSM saves, both in bytes and as a percentage.
You can change these limits in `netdata.conf`:
![image](https://user-images.githubusercontent.com/24860547/199454880-123ae7c4-071a-4811-95b8-18cf4e4f60a2.png)
```text
[db]
mode = dbengine
storage tiers = 3
# Tier 0, per second data
dbengine multihost disk space MB = 256
# Tier 1, per minute data
dbengine tier 1 multihost disk space MB = 1024
# Tier 2, per hour data
dbengine tier 2 multihost disk space MB = 1024
```
#### KSM pages merge performance
</details>
This chart indicates the performance of page merging. **Shared** indicates used shared pages, **Unshared** indicates memory no longer shared (pages are unique but repeatedly checked for merging), **Sharing** indicates memory currently shared (how many more sites are sharing the pages, i.e., how much is saved), and **Volatile** indicates volatile pages (changing too fast to be placed in a tree).
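The two ratios mentioned above can be computed from the KSM counters exposed in `/sys/kernel/mm/ksm`; a rough sketch:

```python
def ksm_ratios(pages_shared: int, pages_sharing: int, pages_unshared: int):
    """Derive the health indicators described above from KSM counters (sketch)."""
    # sharing/shared: how many duplicates each kept page absorbs (higher is better)
    benefit = pages_sharing / pages_shared if pages_shared else 0.0
    # unshared/sharing: pages checked repeatedly without merging (higher means wasted effort)
    waste = pages_unshared / pages_sharing if pages_sharing else float("inf")
    return benefit, waste
```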
## Cache sizes
A high ratio of Sharing to Shared indicates good sharing, but a high ratio of Unshared to Sharing indicates wasted effort.
There are two cache sizes that can be configured in `netdata.conf` to better optimize the Database:
![image](https://user-images.githubusercontent.com/24860547/199455374-d63fd2c2-e12b-4ddf-947b-35371215eb05.png)
1. `[db].dbengine page cache size`: this is the main cache that keeps metrics data in memory. When data is not found in it, the extent cache is consulted, and if the data is not found there either, it is loaded from disk.
2. `[db].dbengine extent cache size`: this is the compressed extent cache. It keeps compressed data blocks in memory, as they appear on disk, to avoid reading them again. Data found in the extent cache but not in the main cache has to be uncompressed to be queried.
#### KSM savings
This chart shows the amount of memory saved by KSM. **Savings** indicates saved memory. **Offered** indicates memory marked as mergeable.
![image](https://user-images.githubusercontent.com/24860547/199455604-43cd9248-1f6e-4c31-be56-e0b9e432f48a.png)
#### KSM effectiveness
This chart tells you how effective KSM is, by charting the percentage of mergeable pages that are currently merged.
![image](https://user-images.githubusercontent.com/24860547/199455770-4d7991ff-6b7e-4d96-9d23-33ffc572b370.png)
Both of them are dynamically adjusted to use some of the total memory computed above. The configuration in `netdata.conf` allows providing additional memory to them, increasing their caching efficiency.
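For example, on a memory-constrained system you could pin both caches in `netdata.conf` (illustrative values, not the dynamically computed defaults):

```text
[db]
    dbengine page cache size = 64MiB
    dbengine extent cache size = 16MiB
```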
View file
@@ -12,29 +12,30 @@ DBENGINE is the time-series database of Netdata.
A **data point** has:
1. A **value**, the data collected for a metric. There is a special **value** to indicate that the collector failed to collect a valid value, and thus the data point is a **gap**.
2. A **timestamp**, the time it has been collected.
3. A **duration**, the time between this and the previous data collection.
4. A flag which is set when machine-learning categorized the collected value as **anomalous** (an outlier based on the trained models).
Using the **timestamp** and **duration**, Netdata calculates for each point its **start time**, **end time** and **update every**.
For incremental metrics (counters), Netdata interpolates the collected values to align them to the expected **end time** at the microsecond level, absorbing data collection micro-latencies.
When data points are stored in higher tiers (time aggregations - see [Tiers](#Tiers) below), each data point has:
1. The **sum** of the original values that have been aggregated,
2. The **count** of all the original values aggregated,
3. The **minimum** value among them,
4. The **maximum** value among them,
5. Their **anomaly rate**, i.e., the count of values that were detected as outliers based on the currently trained models for the metric,
6. A **timestamp**, which is equal to the **end time** of the last point aggregated,
7. A **duration**, which is the duration from the **start time** of the first point aggregated to the **end time** of the last point aggregated.
This design allows Netdata to accurately know the **average**, **minimum**, **maximum** and **anomaly rate** values even when using higher tiers to satisfy a query.
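As a simplified sketch (ignoring timestamps, durations, and gaps, and using hypothetical helper names), a higher-tier point can be derived from raw values like this:

```python
def aggregate_tier_point(values, anomalous_flags):
    """Summarize raw tier-0 values into one higher-tier data point (sketch)."""
    return {
        "sum": sum(values),
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "anomaly_count": sum(anomalous_flags),  # how many values were outliers
    }

point = aggregate_tier_point([1.0, 3.0, 2.0], [False, True, False])
average = point["sum"] / point["count"]  # accurate average recovered from the aggregation
```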
### Pages
Data points are organized into **pages**, i.e., segments of contiguous data collections of the same metric.
Each page:
@@ -46,7 +47,7 @@ Each page:
A **page** is a simple array of values. Each slot in the array has a **timestamp** implied by its position in the array, and each value stored represents the **data point** for that time, for the metric the page belongs to.
This fixed step page design allows Netdata to collect several million points per second and pack all the values in a compact form with minimal metadata overhead.
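The fixed-step layout means a timestamp maps directly to an array slot; a sketch with hypothetical names:

```python
def slot_for(timestamp: int, page_start: int, update_every: int) -> int:
    """Map a timestamp to its slot in a fixed-step page (hypothetical helper)."""
    return (timestamp - page_start) // update_every
```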
#### Hot Pages
@@ -58,7 +59,7 @@ Once a **hot page** is full, it becomes a **dirty page**, and it is scheduled fo
#### Clean Pages
Flushed (saved) pages are **clean pages**, i.e., read-only pages that reside primarily on disk, and are loaded on demand to satisfy data queries.
#### Pages Configuration
@@ -96,7 +97,7 @@ Each **datafile** has two **journal files** with metadata related to the stored
Database rotation is achieved by deleting the oldest **datafile** (and its journals) and creating a new one (with its journals).
Data on disk are append-only. There is no way to delete, add, or update data in the middle of the database. If data are not useful for whatever reason, Netdata can be instructed to ignore these data. They will eventually be deleted from the disk when the database is rotated. New data are always appended.
#### Tiers
@@ -110,15 +111,15 @@ Tiers are supported in Netdata Agents with version `netdata-1.35.0.138.nightly`
Updating the higher **tiers** is automated, and it happens in real-time while data are being collected for **tier 0**.
When the Netdata Agent starts, during the first data collection of each metric, higher tiers are automatically **backfilled** with
data from lower tiers, so that the aggregation they provide will be accurate.
Configuring the number of tiers and the disk space allocated to each tier is how you can
[change how long netdata stores metrics](/src/database/README.md#tiers).
### Data loss
Until **hot pages** and **dirty pages** are **flushed** to disk, they are at risk (e.g., due to a crash, or
power failure), as they are stored only in memory.
The supported way of ensuring high data availability is the use of Netdata Parents to stream the data in real-time to
@@ -126,13 +127,13 @@ multiple other Netdata Agents.
## Memory requirements and retention
See [change how long netdata stores metrics](/src/database/README.md#tiers).
#### Exceptions
Netdata has several protection mechanisms to prevent the use of more memory than the above, by incrementally fetching data from disk and aggressively evicting old data to make room for new data, but memory may still grow beyond the above limit under the following conditions:

1. The number of pages concurrently used in queries does not fit in the above size. This can happen when multiple queries of unreasonably long time-frames run on lower, higher resolution tiers. The Netdata query planner attempts to avoid such situations by gradually loading pages, but still under extreme conditions, the system may use more memory to satisfy these queries.
2. The disks that host Netdata files are extremely slow for the workload required by the database, so data cannot be flushed to disk quickly enough to free memory. Netdata will automatically spawn more flushing workers in an attempt to parallelize and speed up flushing, but if the disks cannot write the data quickly enough, the data will remain in memory until it is written to disk.
@@ -154,11 +155,11 @@ memory in KiB = METRICS x (TIERS - 1) x 4KiB x 2 + 32768 KiB
The part `METRICS x (TIERS - 1) x 4KiB` is an estimate for the maximum hot size of the main cache. Tier 0 pages are 4 KiB, but tier 1 pages are 2 KiB and tier 2 pages are 384 bytes. So a single metric in 3 tiers uses 4096 + 2048 + 384 = 6528 bytes. The equation estimates 8192 bytes per metric, which includes cache internal structures and leaves some spare.
Then `x 2` is the worst case estimate for the dirty queue. If all collected metrics (hot) become available for saving at once, to avoid stopping data collection, all their pages will become dirty and new hot pages will be created instantly. To save memory, when Netdata starts, DBENGINE allocates randomly smaller pages for metrics, to spread their completion evenly across time.
The memory we saved with the above is used to improve the LRU cache. So, although we reserved 32MiB for the LRU, in bigger setups (Netdata Parents) the LRU grows a lot more, within the limits of the equation.
In practice, the main cache sizes itself with `hot x 1.5` instead of `hot x 2`. The reason is that 5% of the main cache is reserved for expanding the open cache, 5% for expanding the extent cache, and we need room for the extensive buffers that are allocated in these setups. When the main cache exceeds `hot x 1.5`, it enters a mode of critical evictions and aggressively frees pages from the LRU to maintain a healthy memory footprint within its design limits.
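The equation above can be evaluated directly; a quick sketch (an estimate, in KiB):

```python
def main_cache_estimate_kib(metrics: int, tiers: int = 3) -> int:
    """memory in KiB = METRICS x (TIERS - 1) x 4KiB x 2 + 32768 KiB (estimate)."""
    return metrics * (tiers - 1) * 4 * 2 + 32768
```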
#### Open Cache
@@ -166,9 +167,9 @@ Stores metadata about on disk pages. Not the data itself. Only metadata about th
Its primary use is to index information about the open datafile, the one that still accepts new pages. Once that datafile becomes full, all the hot pages of the open cache are indexed in journal v2 files.
The clean queue is an LRU for reducing the journal v2 scans during querying.
Open cache uses memory ballooning too, like the main cache, based on its own hot pages. Open cache hot size is mainly controlled by the size of the open datafile. This is why on netdata versions with journal files v2, we decreased the maximum datafile size from 1GB to 512MB, and we increased the target number of datafiles from 20 to 50.
On bigger setups, the open cache will get a bigger LRU by automatically sizing it (the whole open cache) to 5% of the size of (the whole) main cache.
@@ -176,7 +177,6 @@ On bigger setups open cache will get a bigger LRU by automatically sizing it (th
Caches compressed **extent** data, to avoid repeatedly reading the same data from disk.
### Shared Memory
Journal v2 indexes are mapped into memory. Netdata attempts to minimize shared memory use by instructing the kernel about the use of these files, or even unmapping them when they are not needed.
View file
@@ -11,7 +11,7 @@ The exporting engine has a modular structure and supports metric exporting via m
the same time. You can have different update intervals and filters configured for every exporting connector instance.
When you enable the exporting engine and a connector, the Netdata Agent exports metrics _beginning from the time you
restart its process_, not the entire [database of long-term metrics](/src/database/README.md).
Since Netdata collects thousands of metrics per server per second, which would easily congest any database server when
several Netdata servers are sending data to it, Netdata allows sending metrics at a lower frequency, by resampling them.
View file
@@ -516,8 +516,8 @@ You can monitor the replication process in two ways:
### Replication history
Replication history in [dbengine](/src/database/README.md#modes) mode is limited
by [Tier 0 retention](/src/database/README.md#tiers):
- Child instances replicate only Tier 0 data.
- Parent instance calculates higher-level tiers using Tier 0 as the basis.