如何一步一步通过Prometheus和grafana来监控Oracle 18c数据库

Contents

1 一背景说明
2 二配置流程
3 三小结和参考

一背景说明

本文档记录通过prometheus+grafana来监控Oracle 18C指标的操作指南。该文档是一个初步demo，后续会不断改善和迭代更新。

需求，一套运行在CentOS 7.6 X64平台的Oracle 18C数据库，想对其做一个基本的监控。于是，选择通过prometheus来采集和存储Oracle数据库的指标，然后在grafana上通过图形界面展示和查看。prometheus对于已经成熟的软件产品，比如操作系统本身，数据库软件等，它是无法直接去采集它们的指标数据的。需要通过一个exporter去拉取监控目标对象的指标，然后prometheus把这些指标存取起来，以时序数据库的格式存放。最后，通过grafana图形界面展示这些数据。所以，我这里通过运行一个oracledb-exporter,让它去抓取Oracle数据库的指标数据，然后prometheus来定期拉取这些数据。文中使用到的oracledb-exporter是直接运行在已有的一套3节点的Kubernetes集群中，并设置其中的pod指定为hostPort，让这个端口同时暴露在pod所在host机器上的端口，这样prometheus才可以去采集到这些metrics。

其中，各软件配置信息如下：

	软件运行平台	软件版本	所在IP
prometheus	macOS Monterey version 12.2.1	2.40.1	localhost
grafana	macOS Monterey version 12.2.1	9.2.4	localhost
Oracle	CentOS Linux release 7.6.1810 (Core)	18.3.0.0.0	172.16.17.43
kubernetes	CentOS Linux release 7.4.1708 (Core)	1.23	172.16.11.168「master」172.16.11.148「node1」172.16.11.168「node2」

二配置流程

0 启动运行oracledb-exporter

0.1 准备连接目标Oracle数据库的datasource信息

这里通过Kubernetes的secret来访问目标数据库，需要生成1个连接字符串

[root@master-node oracle_exporter]# echo -n system/system_user_password@//172.16.17.41:1521/devpdb|base64
c3lzdGVtL29yYWNsZUAvLzE3Mi4xNi4xNy40MToxNTIxL2RldnBkYg==
[root@master-node oracle_exporter]#

把连接目标库的用户、密码，IP，端口，数据库服务名，通过base64生成一个加密字符串。准备好的Kubernetes secret资源对象如下：

apiVersion: v1
kind: Secret
metadata:
  name:  oracledb-exporter-secret
  namespace: database-namespace
data:
    # Add here the result of:
    # echo -n YOUR_CONN_STRING | base64
    # YOUR_CONN_STRING be like: system/YOUR-PASS-FOR-SYSTEM@//database:1521/DB_SID.DB_DOMAIN
    datasource: c3lzdGVtL29yYWNsZUAvLzE3Mi4xNi4xNy40MToxNTIxL2RldnBkYg==       
type: Opaque

0.2 准备连接目标Oracle数据库的Deployment对象信息

我们要把这个oracledb-exporter运行在Kubernetes里，创建1个Deployment资源对象，Deployment对象挂载ConfigMap，而ConfigMap才是抓取Oracle数据库的metrics，访问目标数据库的连接信息，通过前面创建的oracledb-exporter-secret资源对象来完成。

[root@master-node oracle_exporter]# pwd
/root/oracle_exporter
[root@master-node oracle_exporter]# ll
总用量 8
-rw-r--r-- 1 root root 5459 11月 18 14:25 oracle-db-deployment.yaml
[root@master-node oracle_exporter]# cat oracle-db-deployment.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: oracledb-exporter
  namespace: database-namespace
spec:
  selector:
    matchLabels:
      app: oracledb-exporter
  replicas: 1
  template:
    metadata:
      labels:
        app: oracledb-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9161"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: oracledb-exporter
        ports:
        - containerPort: 9161
          hostPort: 9161
          protocol: TCP
        image: iamseth/oracledb_exporter
        env:
        - name: DATA_SOURCE_NAME
          valueFrom:
            secretKeyRef:
              name: oracledb-exporter-secret
              key: datasource
        - name: CUSTOM_METRICS
          value: /tmp/custom-metrics.toml
        volumeMounts:
          - name:  custom-metrics
            mountPath:  /tmp/custom-metrics.toml
            subPath: custom-metrics.toml
      volumes:
        - name: custom-metrics
          configMap:
            defaultMode: 420
            name: custom-metrics               
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-metrics
  namespace: database-namespace
data:
  custom-metrics.toml: |
    [[metric]]
    context = "slow_queries"
    metricsdesc = { p95_time_usecs= "Gauge metric with percentile 95 of elapsed time.", p99_time_usecs= "Gauge metric with percentile 99 of elapsed time." }
    request = "select  percentile_disc(0.95)  within group (order by elapsed_time) as p95_time_usecs, percentile_disc(0.99)  within group (order by elapsed_time) as p99_time_usecs from v$sql where last_active_time >= sysdate - 5/(24*60)"

    [[metric]]
    context = "big_queries"
    metricsdesc = { p95_rows= "Gauge metric with percentile 95 of returned rows.", p99_rows= "Gauge metric with percentile 99 of returned rows." }
    request = "select  percentile_disc(0.95)  within group (order by rownum) as p95_rows, percentile_disc(0.99)  within group (order by rownum) as p99_rows from v$sql where last_active_time >= sysdate - 5/(24*60)"

    [[metric]]
    context = "size_user_segments_top100"
    metricsdesc = {table_bytes="Gauge metric with the size of the tables in user segments."}
    labels = ["segment_name"]
    request = "select * from (select segment_name,sum(bytes) as table_bytes from user_segments where segment_type='TABLE' group by segment_name) order by table_bytes DESC FETCH NEXT 100 ROWS ONLY"

    [[metric]]
    context = "size_user_segments_top100"
    metricsdesc = {table_partition_bytes="Gauge metric with the size of the table partition in user segments."}
    labels = ["segment_name"]
    request = "select * from (select segment_name,sum(bytes) as table_partition_bytes from user_segments where segment_type='TABLE PARTITION' group by segment_name) order by table_partition_bytes DESC FETCH NEXT 100 ROWS ONLY"

    [[metric]]
    context = "size_user_segments_top100"
    metricsdesc = {cluster_bytes="Gauge metric with the size of the cluster in user segments."}
    labels = ["segment_name"]
    request = "select * from (select segment_name,sum(bytes) as cluster_bytes from user_segments where segment_type='CLUSTER' group by segment_name) order by cluster_bytes DESC FETCH NEXT 100 ROWS ONLY"

    [[metric]]
    context = "size_dba_segments_top100"
    metricsdesc = {table_bytes="Gauge metric with the size of the tables in user segments."}
    labels = ["segment_name"]
    request = "select * from (select segment_name,sum(bytes) as table_bytes from dba_segments where segment_type='TABLE' group by segment_name) order by table_bytes DESC FETCH NEXT 100 ROWS ONLY"

    [[metric]]
    context = "size_dba_segments_top100"
    metricsdesc = {table_partition_bytes="Gauge metric with the size of the table partition in user segments."}
    labels = ["segment_name"]
    request = "select * from (select segment_name,sum(bytes) as table_partition_bytes from dba_segments where segment_type='TABLE PARTITION' group by segment_name) order by table_partition_bytes DESC FETCH NEXT 100 ROWS ONLY"

    [[metric]]
    context = "size_dba_segments_top100"
    metricsdesc = {cluster_bytes="Gauge metric with the size of the cluster in user segments."}
    labels = ["segment_name"]
    request = "select * from (select segment_name,sum(bytes) as cluster_bytes from dba_segments where segment_type='CLUSTER' group by segment_name) order by cluster_bytes DESC FETCH NEXT 100 ROWS ONLY"

    [[metric]]
    context = "resource"
    labels = [ "resource_name" ]
    metricsdesc = { current_utilization= "Generic counter metric from v$resource_limit view in Oracle (current value).", limit_value="Generic counter metric from v$resource_limit view in Oracle (UNLIMITED: -1)." }
    request="SELECT resource_name,current_utilization,CASE WHEN TRIM(limit_value) LIKE 'UNLIMITED' THEN '-1' ELSE TRIM(limit_value) END as limit_value FROM v$resource_limit"
    
    [[metric]]
    context = "process"
    metricsdesc = { count="Gauge metric with count of processes." }
    request = "SELECT COUNT(*) as count FROM v$process"

---
apiVersion: v1
kind: Secret
metadata:
  name:  oracledb-exporter-secret
  namespace: database-namespace
data:
    # Add here the result of: 
    # echo -n YOUR_CONN_STRING | base64 
    # YOUR_CONN_STRING be like: system/YOUR-PASS-FOR-SYSTEM@//database:1521/DB_SID.DB_DOMAIN
    datasource: c3lzdGVtL29yYWNsZUAvLzE3Mi4xNi4xNy40MToxNTIxL2RldnBkYg==
type: Opaque
[root@master-node oracle_exporter]#

注意📢📢📢，Deployment管控的pod，我们指定了hostPort: 9161字段，这样将来这个pod不管调度到Kubernetes cluster的哪个worker node上，就会在那个node上暴露9161端口给外部，prometheus才可以通过worker node IP+9161端口，抓取pull到采集的目标Oracle数据库的指标数据。

0.3 创建namespace资源对象database-namespace

[root@master-node oracle_exporter]# pwd
/root/oracle_exporter
[root@master-node oracle_exporter]# ll
总用量 8
-rw-r--r-- 1 root root 5459 11月 18 14:25 oracle-db-deployment.yaml
[root@master-node oracle_exporter]# kubectl create ns database-namespace
namespace/database-namespace created
[root@master-node oracle_exporter]# kubectl get ns
NAME                 STATUS   AGE
database-namespace   Active   3s
default              Active   111d
kube-node-lease      Active   111d
kube-public          Active   111d
kube-system          Active   111d
mysql                Active   45d
ns1                  Active   74d
test                 Active   82d
[root@master-node oracle_exporter]#

0.4 启动运行oracledb-exporter Deployment

[root@master-node oracle_exporter]# kubectl apply -f oracle-db-deployment.yaml 
deployment.apps/oracledb-exporter created
configmap/custom-metrics created
secret/oracledb-exporter-secret created
[root@master-node oracle_exporter]# kubectl -n database-namespace get all -owide
NAME                                     READY   STATUS    RESTARTS   AGE   IP           NODE     NOMINATED NODE   READINESS GATES
pod/oracledb-exporter-7d6c8cc59c-slpbp   1/1     Running   0          11s   10.244.2.8   node-2   <none>           <none>

NAME                                READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS          IMAGES                      SELECTOR
deployment.apps/oracledb-exporter   1/1     1            1           11s   oracledb-exporter   iamseth/oracledb_exporter   app=oracledb-exporter

NAME                                           DESIRED   CURRENT   READY   AGE   CONTAINERS          IMAGES                      SELECTOR
replicaset.apps/oracledb-exporter-7d6c8cc59c   1         1         1       11s   oracledb-exporter   iamseth/oracledb_exporter   app=oracledb-exporter,pod-template-hash=7d6c8cc59c
[root@master-node oracle_exporter]# kubectl get nodes -owide
NAME          STATUS   ROLES                  AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION          CONTAINER-RUNTIME
master-node   Ready    control-plane,master   111d   v1.23.1   172.16.11.168   <none>        CentOS Linux 7 (Core)   3.10.0-693.el7.x86_64   docker://1.13.1
node-1        Ready    <none>                 111d   v1.23.1   172.16.11.148   <none>        CentOS Linux 7 (Core)   3.10.0-862.el7.x86_64   docker://1.13.1
node-2        Ready    <none>                 111d   v1.23.1   172.16.11.161   <none>        CentOS Linux 7 (Core)   3.10.0-862.el7.x86_64   docker://1.13.1
[root@master-node oracle_exporter]#

至此，看到Deployment管控的pod运行在node2上，node2的IP为172.16.11.161。

0.5 校验metris正常抓取

[root@master-node oracle_exporter]# curl http://172.16.11.161:9161/metrics
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 1.3341e-05
....
process_virtual_memory_bytes 2.190381056e+09
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes -1
[root@master-node oracle_exporter]#

这里，可以看到非常多的metrics，说明抓取正常。

1 下载安装运行prometheus

1.1 下载prometheus

到prometheus官网下载对应平台的软件，我这里下载的MacOS的prometheus:

wget https://github.com/prometheus/prometheus/releases/download/v2.40.1/prometheus-2.40.1.darwin-amd64.tar.gz
unzip prometheus-2.40.1.darwin-amd64.tar.gz 
cd prometheus-2.40.1.darwin-amd64/

1.2 修改prometheus.yml:

asher at MacBook-Air-3 in ~/prometheus-2.40.1.darwin-amd64
$ cat prometheus.yml 
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
    - "simulator.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "http-simulator"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    metrics_path: /prometheus
    static_configs:
      - targets: ["localhost:8080"]

  - job_name: "oracle"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    metrics_path: /metrics
    static_configs:
      - targets: ["172.16.11.161:9161"]

asher at MacBook-Air-3 in ~/prometheus-2.40.1.darwin-amd64
$

其中，添加最后一部分是对Oracle的抓取配置信息。其它部分，是我本地对其它信息的抓取配置。

1.3 启动prometheus:

asher at MacBook-Air-3 in ~/prometheus-2.40.1.darwin-amd64
$ ./prometheus --config.file prometheus.yml --web.enable-lifecycle
...
ts=2022-11-18T08:38:58.260Z caller=main.go:512 level=info msg="No time or size retention was set so using the default time retention" duration=15d
ts=2022-11-18T08:38:58.261Z caller=main.go:556 level=info msg="Starting Prometheus Server" mode=server version="(version=2.40.1, branch=HEAD, revision=965274259f993801d63f553639d248588287120b)"
ts=2022-11-18T08:38:58.261Z caller=main.go:561 level=info build_context="(go=go1.19.3, user=root@09c091b0e988, date=20221109-14:30:41)"
...

1.4 访问prometheus，验证目标对象抓取正常：

http://localhost:9090/targets?search=

2 下载安装运行grafana

2.1 下载解压grafana

$ pwd
/Users/asher

asher at MacBook-Air-3 in ~
$ curl -O https://dl.grafana.com/enterprise/release/grafana-enterprise-9.2.4.darwin-amd64.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 95.5M  100 95.5M    0     0  8933k      0  0:00:10  0:00:10 --:--:-- 9893k

asher at MacBook-Air-3 in ~
$ tar -zxvf grafana-enterprise-9.2.4.darwin-amd64.tar.gz 
x grafana-9.2.4/LICENSE
x grafana-9.2.4/README.md
...

2.2 运行grafana

$ cd grafana-9.2.4/
asher at MacBook-Air-3 in ~/grafana-9.2.4
$ ./bin/grafana-server 
INFO [11-14|08:44:28] Starting Grafana                         logger=settings version=9.2.4 commit=64017e8ca6 branch=HEAD compiled=2022-11-08T18:38:42+08:00
INFO [11-14|08:44:28] Config loaded from                       logger=settings file=/Users/asher/grafana-9.2.4/conf/defaults.ini
INFO [11-14|08:44:28] Path Home                                logger=settings path=/Users/asher/grafana-9.2.4
INFO [11-14|08:44:28] Path Data                                logger=settings path=/Users/asher/grafana-9.2.4/data
INFO [11-14|08:44:28] Path Logs                                logger=settings path=/Users/asher/grafana-9.2.4/data/log
...

2.3 访问grafana

http://localhost:3000/login

用户名和密码默认都是admin登录。

2.4 下载并导入dashboard

从 https://promcat.io/apps/oracle-database/19.3.0.0 下载dashboard。

然后，到grafana web界面菜单，选择导入：

2.5 查看监控信息

http://localhost:3000/d/kIVyrQvGk/oracle-db?orgId=1&from=now-1h&to=now

三小结和参考

通过prometheus+grafana开源监控系统，完成了对Oracle数据库的初步监控。

其中，exporter运行在Kubernetes里，引入了ConfigMap资源对象，secret对象。此步骤非必须，也可以把exporter直接运行在被监控的Oracle数据库服务器上「或者其它有Oracle运行环境的机器也可以，才可以抓取到目标数据库」，再或者直接把exporter运行在docker中也可以。

核心是ConfigMap里写的metrics，我们可以逐步增加和完善这里的metrics。

参考：

https://github.com/iamseth/oracledb_exporter

How to monitor an Oracle database with Prometheus

https://promcat.io/apps/oracle-database/19.3.0.0

https://www.modb.pro/db/103903

https://www.modb.pro/db/28083

Post Views: 7,515