Skip to content
GitHub

Prometheus

  • Prometheus : 系统监控和报警系统
  • node_exporter : 节点信息采集工具(读取 linux /proc 系统文件数据)
  • process-exporter: 进程信息采集工具(读取 linux /proc 进程文件数据)
  • loki : log 聚合系统, 类似于 prometheus
  • promtail : 节点 log 采集工具
  • alertmanager : 告警整合和发送系统
  • grafana : 数据可视化面板

数据监控, 采集, 告警系统; 主动采集被测节点信息
Prometheus Download

 # 解压安装包, 创建数据目录, 创建配置文件
 $ tar -zxvf prometheus-2.45.1.linux-amd64.tar.gz
 $ cd prometheus-2.45.1.linux-amd64 && mkdir -p data
 $ vi prometheus.yml

官方默认配置文件

global:                                          # 全局配置
  scrape_interval: 15s                           # 每 15s 收集一次 xx-exporter 的数据
  evaluation_interval: 15s                       # 每隔 15s 计算一次所有规则, 用于告警判断(默认 1m)
  # scrape_timeout: 10s                          # 获取数据的超时时间, 默认 10s

alerting:                                        # Alertmanager 配置
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093                  # 配置 alertmanager 服务

rule_files:                                      # 外部告警规则文件
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:

  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  
  # 添加外部采集服务, 如 node_exporter process-exporter
  # - job_name: "node_exporter"
  #   static_configs:
  #     - targets: ["localhost:9100"]

默认端口: 9090

 # 检查配置文件
 $ ./promtool check config prometheus.yml
 > Checking prometheus.yml
 > SUCCESS: prometheus.yml is valid prometheus config file syntax

 # 启动 prometheus
 $ ./prometheus --config.file=./prometheus.yml --web.enable-lifecycle
 > ts=2023-12-07T09:13:17.909Z caller=main.go:1043 level=info msg="TSDB started"
 > ts=2023-12-07T09:13:17.909Z caller=main.go:1224 level=info msg="Loading configuration file" filename=prometheus.yml
 > ts=2023-12-07T09:13:17.909Z caller=main.go:1261 level=info msg="Completed loading of configuration file" filename=prometheus.yml
 > ts=2023-12-07T09:13:17.910Z caller=main.go:1004 level=info msg="Server is ready to receive web requests."

注意: 添加 --web.enable-lifecycle 启用热重载, 执行 curl -X POST http://localhost:9090/-/reload 重载
浏览器打开 http://localhost:9090 打开 prometheus 控制台

在监控节点运行 node_exporter 采集节点信息, 待 peometheus 收集
node_exporter Download

 $ tar -zxvf node_exporter-1.7.0.linux-arm64.tar.gz
 $ cd node_exporter-1.7.0.linux-arm64
 $ cp node_exporter /usr/local/bin

 $ ./node_exporter
 > ts=1970-02-06T18:49:59.679Z caller=tls_config.go:274 level=info msg="Listening on" address=0.0.0.0:9100
 > ts=1970-02-06T18:49:59.679Z caller=tls_config.go:277 level=info msg="TLS is disabled." http2=false address=0.0.0.0:9100

 # 自定义 HOST 和 端口
 $ ./node_exporter --web.listen-address 127.0.0.1:8080
 
 # 指定配置文件
 $ ./node_exporter --web.config.file="config.yaml"

浏览器打开 http://localhost:9100 进入 node_exporter 控制台
Prometheus 配置文件添加 node_export 监控, 重载 Prometheus
Prometheus 根据节点信息获取 node_exporter 采集的监控数据

scrape_configs:

  - job_name: "node_export"
    static_configs:
      - targets: ["<节点 HOST>:9100"]
  
  # 自定义 job 名称
  - job_name: "Node"
    static_configs:
      - targets: ["192.168.1.100:9100"] 
        # 自定义 labels 属性 host, tag(labels 默认带有 instance 属性对应 targets)
        # 属性可以在设定规则, 邮件中使用 {{ $labels.instance }}
        labels:
          host: "192.168.1.100"
          tag: "python"

      - targets: ["192.168.1.101:9100"] 
        labels:
          host: "192.168.1.101"
          tag: "golang"

process_exporter 用于监控进程和线程等更细致的信息
process-exporter

 # 解压包, 创建配置文件
 $ tar -zxvf process-exporter-0.7.9.linux-arm64.tar.gz
 $ cd process-exporter-0.7.9.linux-arm64
 $ vi config.yaml

配置需要监控的进程

process_names:
  - name: "{{.Matches}}"
    cmdline:
      - "sshd"

  - name: "{{.Matches}}"
    cmdline:
      - "python"

  - name: "{{.Matches}}"
    cmdline:
      - "docker"
 $ nohup ./process-exporter -config.path config.yaml &

打开浏览器 http://localhost:9256/metrics
若系统种存在监控的进程, log 必定出现 cpu_seconds_total 字段

namedprocess_namegroup_cpu_seconds_total{groupname="map[:sshd]",mode="system"} 0.19000000000000128
namedprocess_namegroup_cpu_seconds_total{groupname="map[:sshd]",mode="user"} 0.009999999999999787
namedprocess_namegroup_cpu_seconds_total{groupname="map[:python]",mode="system"} 0.010000000000019327
namedprocess_namegroup_cpu_seconds_total{groupname="map[:python]",mode="user"} 0.009999999999990905

Prometheus 配置文件添加 node_export 监控, 重启 Prometheus

scrape_configs:
   ...
   ...

  - job_name: "process_exporter"
    static_configs:
      - targets: ["<节点 HOST>:9256"]

Prometheus http://localhost:9090/service-discovery?search= 查询所有监控的服务

Loki 是一个模仿 Prometheus 的日志聚合系统, 也可以使用 Grafana 作为展示界面 Github Loki

官方配置

auth_enabled: false

server:
  http_listen_port: 3100 # 对外接口
  grpc_listen_port: 9096

common:
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093        # 添加告警服务路由
# By default, Loki will send anonymous, but uniquely-identifiable usage and configuration
# analytics to Grafana Labs. These statistics are sent to https://stats.grafana.org/
#
# Statistics help us better understand how Loki is used, and they show us performance
# levels for most users. This helps us prioritize features and documentation.
# For more information on what's sent, look at
# https://github.com/grafana/loki/blob/main/pkg/usagestats/stats.go
# Refer to the buildReport method to see what goes into a report.
#
# If you would like to disable reporting, uncomment the following lines:
#analytics:
#  reporting_enabled: false

创建配置文件, 拉起 loki 服务, 默认端口 3100

 $ ./loki-linux-amd64 -config.file=$PWD/loki-config.yaml

Github Promtail

server:
  http_listen_port: 9080 # 对外暴露端口
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:                                            # 机器需要与 loki 服务通信
  - url: http://193.168.1.123:3100/loki/api/v1/push # url 需与 loki 路由和端口一致

scrape_configs:
  - job_name: panic
    static_configs:
      - targets:
          - localhost
        labels:
          job: ald_panic # 在 grafana 显示的 job
          __path__: /proc/kbox/regions/panic # 检测文件

  - job_name: messages
    static_configs:
      - targets:
          - localhost
        labels:
          job: ald_msg # 在 grafana 显示的 job
          __path__: /var/log/messages # 检测文件

创建配置文件 promtail-config.yaml, 填入以上内容, 拉起 promtail 服务, 默认端口 9080

 $ ./promtail-linux-amd64 -config.file=$PWD/config.yaml
 >

http://localhost:9080 promtail 界面查看

日常重载, 修改配置文件, 关闭或重启服务有些麻烦, 使用脚本完成

CURRENT_DIR=$(pwd)
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
START_COMMAND="./prometheus --config.file=config.yml --web.enable-lifecycle"
RELOAD_COMMAND="curl -X POST http://localhost:9090/-/reload"
CHECK_CONFIG="./promtool check config config.yml"

function kill_process() {
    cd $SCRIPT_DIR
    if [[ ! -f pid ]]; then
        echo -e "\033[31mpid not exists \033[0m"
        return 1
    fi
    local pid=$(cat pid)
    if [[ -z $pid ]]; then
        echo -e "\033[31mpid is empty \033[0m"
        return 1
    fi
    
    kill -15 $pid >/dev/null 2>&1
    rm pid
    return 0
}

function check_config() {
    echo $CHECK_CONFIG
    $CHECK_CONFIG
    if [[ $? -ne 0 ]]; then
        echo -e "\033[31mCheck config failed \033[0m"
        return 1
    fi
    echo -e "\033[32mCheck config success\033[0m" 
    return 0
}

function reload_process() {
    check_process
    [[ $? -ne 0 ]] && return 1
    echo $RELOAD_COMMAND
    $RELOAD_COMMAND
    if [[ $? -ne 0 ]]; then
        echo -e "\033[31mReload service failed \033[0m"
        return 1
    fi
    echo -e "\033[32mReload service success \033[0m"
    return 0
}

function start_process() {
    cd $SCRIPT_DIR
    nohup $START_COMMAND >/dev/null 2>&1 &
    local pid=$!
    ps -aux | grep -v grep | grep $pid >/dev/null 2>&1
    echo $pid > pid
    return 0
}

function check_process() {
    cd $SCRIPT_DIR
    if [[ ! -f pid ]]; then
        echo -e "\033[31mpid not exists \033[0m"
        return 1
    fi
    local pid=$(cat pid)
    if [[ -z $pid ]]; then
        echo -e "\033[31mpid is empty \033[0m"
        return 1
    fi

    ps -aux | grep -v grep | grep $pid
    if [[ $? -ne 0 ]]; then 
        echo -e "\033[31mProcess is not running\033[0m"
        return 1
    fi
    echo -e "\033[32mProcess is running\033[0m" 
    return 0
}

usage=$(cat <<EOF
  -h/--help      show help       \n
  --restart      restart service \n
  --reload       reload service  \n
  --kill         close service   \n
  --check        check service   \n
  --config       check config    \n
EOF
)

declare -a params
while [[ $# -gt 0 ]]; do
    case $1 in
      -h|--help)
        echo -e $usage
        exit 0
        ;;
      --reload)
        cd $SCRIPT_DIR
        reload_process
        [[ $? -eq 0 ]] && cd $CURRENT_DIR || exit 1
        shift
        ;;
      --restart)
        cd $SCRIPT_DIR
        kill_process
        start_process
        check_process
        cd $CURRENT_DIR
        shift
        ;;
      --kill)
        cd $SCRIPT_DIR
        kill_process
        check_process
        cd $CURRENT_DIR
        shift
        ;;
      --check)
        cd $SCRIPT_DIR
        check_process
        cd $CURRENT_DIR
        shift
        ;;
      --config)
        cd $SCRIPT_DIR
        check_config
        [[ $? -eq 0 ]] && cd $CURRENT_DIR || exit 1
        shift
        ;;
      *)
        echo "param $1"
        params+=($1)
        shift
        ;;
    esac
done