You can gather metrics for hosted control planes by configuring a metrics set. The HyperShift Operator can create or delete monitoring dashboards in the management cluster for each hosted cluster that it manages.
Hosted control planes for Red Hat OpenShift Container Platform create ServiceMonitor
resources in each control plane namespace, which allow a Prometheus stack to gather metrics from the control planes. The ServiceMonitor
resources use metrics relabelings to define which metrics are included or excluded from a particular component, such as etcd or the Kubernetes API server. The number of metrics that the control planes produce directly impacts the resource requirements of the monitoring stack that gathers them.
Instead of producing a fixed number of metrics that apply to all situations, you can configure a metrics set that identifies the set of metrics to produce for each control plane. The following metrics sets are supported:
Telemetry
: These metrics are needed for telemetry. This set is the default set and is the smallest set of metrics.
SRE
: This set includes the necessary metrics to produce alerts and allow the troubleshooting of control plane components.
All
: This set includes all of the metrics that are produced by standalone OpenShift Container Platform control plane components.
To configure a metrics set, set the METRICS_SET
environment variable in the HyperShift Operator deployment by entering the following command:
$ oc set env -n hypershift deployment/operator METRICS_SET=All
When you specify the SRE
metrics set, the HyperShift Operator looks for a config map named sre-metric-set
with a single key: config
. The value of the config
key must contain a set of RelabelConfigs
that are organized by control plane component.
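The drop action in these relabel configurations follows Prometheus metric-relabeling semantics: the values of the sourceLabels are joined with a separator and the series is discarded when the regex matches the whole joined value. A minimal Python sketch of that behavior, assuming default Prometheus semantics (apply_drop_rule is an illustrative helper, not HyperShift code):

```python
import re

def apply_drop_rule(labels, rule):
    """Return None if the series is dropped by the rule, else the labels unchanged."""
    # Prometheus joins the sourceLabels values with ';' before matching the regex.
    value = ";".join(labels.get(name, "") for name in rule["sourceLabels"])
    # 'drop' discards the series when the regex matches the entire joined value.
    if rule["action"] == "drop" and re.fullmatch(rule["regex"], value):
        return None
    return labels

rule = {
    "action": "drop",
    "regex": "etcd_(debugging|disk|server).*",
    "sourceLabels": ["__name__"],
}

print(apply_drop_rule({"__name__": "etcd_disk_backend_commit_duration"}, rule))  # None
print(apply_drop_rule({"__name__": "apiserver_request_total"}, rule))
```

The first series matches the regex on its __name__ label and is dropped; the second is kept unchanged.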
You can specify the following components:
etcd
kubeAPIServer
kubeControllerManager
openshiftAPIServer
openshiftControllerManager
openshiftRouteControllerManager
cvo
olm
catalogOperator
registryOperator
nodeTuningOperator
controlPlaneOperator
hostedClusterConfigOperator
The following example is a configuration of the SRE
metrics set:
kubeAPIServer:
- action: "drop"
  regex: "etcd_(debugging|disk|server).*"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "apiserver_admission_controller_admission_latencies_seconds_.*"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "apiserver_admission_step_admission_latencies_seconds_.*"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "scheduler_(e2e_scheduling_latency_microseconds|scheduling_algorithm_predicate_evaluation|scheduling_algorithm_priority_evaluation|scheduling_algorithm_preemption_evaluation|scheduling_algorithm_latency_microseconds|binding_latency_microseconds|scheduling_latency_seconds)"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "apiserver_(request_count|request_latencies|request_latencies_summary|dropped_requests|storage_data_key_generation_latencies_microseconds|storage_transformation_failures_total|storage_transformation_latencies_microseconds|proxy_tunnel_sync_latency_secs)"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "docker_(operations|operations_latency_microseconds|operations_errors|operations_timeout)"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "reflector_(items_per_list|items_per_watch|list_duration_seconds|lists_total|short_watches_total|watch_duration_seconds|watches_total)"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "etcd_(helper_cache_hit_count|helper_cache_miss_count|helper_cache_entry_count|request_cache_get_latencies_summary|request_cache_add_latencies_summary|request_latencies_summary)"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "transformation_(transformation_latencies_microseconds|failures_total)"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "network_plugin_operations_latency_microseconds|sync_proxy_rules_latency_microseconds|rest_client_request_latency_seconds"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "apiserver_request_duration_seconds_bucket;(0.15|0.25|0.3|0.35|0.4|0.45|0.6|0.7|0.8|0.9|1.25|1.5|1.75|2.5|3|3.5|4.5|6|7|8|9|15|25|30|50)"
  sourceLabels: ["__name__", "le"]
kubeControllerManager:
- action: "drop"
  regex: "etcd_(debugging|disk|request|server).*"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "rest_client_request_latency_seconds_(bucket|count|sum)"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "root_ca_cert_publisher_sync_duration_seconds_(bucket|count|sum)"
  sourceLabels: ["__name__"]
openshiftAPIServer:
- action: "drop"
  regex: "etcd_(debugging|disk|server).*"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "apiserver_admission_controller_admission_latencies_seconds_.*"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "apiserver_admission_step_admission_latencies_seconds_.*"
  sourceLabels: ["__name__"]
- action: "drop"
  regex: "apiserver_request_duration_seconds_bucket;(0.15|0.25|0.3|0.35|0.4|0.45|0.6|0.7|0.8|0.9|1.25|1.5|1.75|2.5|3|3.5|4.5|6|7|8|9|15|25|30|50)"
  sourceLabels: ["__name__", "le"]
openshiftControllerManager:
- action: "drop"
  regex: "etcd_(debugging|disk|request|server).*"
  sourceLabels: ["__name__"]
openshiftRouteControllerManager:
- action: "drop"
  regex: "etcd_(debugging|disk|request|server).*"
  sourceLabels: ["__name__"]
olm:
- action: "drop"
  regex: "etcd_(debugging|disk|server).*"
  sourceLabels: ["__name__"]
catalogOperator:
- action: "drop"
  regex: "etcd_(debugging|disk|server).*"
  sourceLabels: ["__name__"]
cvo:
- action: "drop"
  regex: "etcd_(debugging|disk|server).*"
  sourceLabels: ["__name__"]
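The configuration above is the value of the config key only. To deliver it, wrap it in the sre-metric-set config map. A sketch of the wrapper, assuming the Operator namespace (hypershift) and showing only the first rule of the metric set for brevity:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: sre-metric-set
  namespace: hypershift
data:
  config: |
    kubeAPIServer:
    - action: "drop"
      regex: "etcd_(debugging|disk|server).*"
      sourceLabels: ["__name__"]
```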
To enable monitoring dashboards in a hosted cluster, complete the following steps:
Create the hypershift-operator-install-flags
config map in the local-cluster
namespace, being sure to specify the --monitoring-dashboards
flag in the data.installFlagsToAdd
section. For example:
kind: ConfigMap
apiVersion: v1
metadata:
  name: hypershift-operator-install-flags
  namespace: local-cluster
data:
  installFlagsToAdd: "--monitoring-dashboards"
  installFlagsToRemove: ""
Wait for a couple of minutes until the HyperShift Operator deployment in the hypershift
namespace is updated to include the following environment variable:
- name: MONITORING_DASHBOARDS
  value: "1"
When monitoring dashboards are enabled, for each hosted cluster that the HyperShift Operator manages, the Operator creates a config map named cp-<hosted_cluster_namespace>-<hosted_cluster_name>
in the openshift-config-managed
namespace, where <hosted_cluster_namespace>
is the namespace of the hosted cluster and <hosted_cluster_name>
is the name of the hosted cluster. As a result, a new dashboard is added in the administrative console of the management cluster.
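The naming scheme above can be sketched as simple string composition (dashboard_configmap_name and the sample cluster values are illustrative, not HyperShift code):

```python
def dashboard_configmap_name(hc_namespace: str, hc_name: str) -> str:
    # The Operator names the dashboard config map cp-<namespace>-<name>.
    return f"cp-{hc_namespace}-{hc_name}"

# For a hypothetical hosted cluster "example" in the "clusters" namespace:
print(dashboard_configmap_name("clusters", "example"))  # cp-clusters-example
```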
To view the dashboard, log in to the console of the management cluster and go to the dashboard for the hosted cluster by clicking **Observe → Dashboards**.
Optional: To disable a monitoring dashboard in a hosted cluster, remove the --monitoring-dashboards
flag from the hypershift-operator-install-flags
config map. When you delete a hosted cluster, its corresponding dashboard is also deleted.
To generate dashboards for each hosted cluster, the HyperShift Operator uses a template that is stored in the monitoring-dashboard-template
config map in the Operator namespace (hypershift
). This template contains a set of Grafana panels that contain the metrics for the dashboard. You can edit the content of the config map to customize the dashboards.
When a dashboard is generated, the following strings are replaced with values that correspond to a specific hosted cluster:
| Name | Description |
| --- | --- |
| __NAME__ | The name of the hosted cluster |
| __NAMESPACE__ | The namespace of the hosted cluster |
| __CONTROL_PLANE_NAMESPACE__ | The namespace where the control plane pods of the hosted cluster are placed |
| __CLUSTER_ID__ | The UUID of the hosted cluster, which matches the _id label of the hosted cluster metrics |
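The substitution step can be sketched as plain string replacement, assuming the placeholder names __NAME__ and __NAMESPACE__ used by the dashboard template (the render_dashboard helper and sample values are illustrative, not HyperShift code):

```python
def render_dashboard(template: str, values: dict) -> str:
    # Replace each placeholder string with the hosted-cluster-specific value.
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    return template

template = 'title: "Hosted cluster: __NAME__ (__NAMESPACE__)"'
values = {
    "__NAME__": "example",
    "__NAMESPACE__": "clusters",
}
print(render_dashboard(template, values))  # title: "Hosted cluster: example (clusters)"
```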