Troubleshooting log forwarding - Logging | Observability | Red Hat OpenShift Service on AWS

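Redeploying Fluentd pods

When you create a ClusterLogForwarder custom resource (CR), if the Red Hat OpenShift Logging Operator does not redeploy the Fluentd pods automatically, you can delete the Fluentd pods to force them to redeploy.

Prerequisites

You have created a ClusterLogForwarder custom resource (CR) object.

Procedure

Delete the Fluentd pods to force them to redeploy by running the following command: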
```
$ oc delete pod --selector logging-infra=collector
```
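
To confirm that the collector pods were recreated, you can list them with the same selector and check their status. The following is a minimal sketch, not part of the documented procedure. `all_pods_running` is a hypothetical helper name, and it assumes the default `oc get pods --no-headers` column layout, in which the third column is the pod status.

```shell
# Hypothetical helper: reads `oc get pods --no-headers` output on stdin and
# succeeds only if at least one pod is listed and every pod is Running.
all_pods_running() {
  awk '$3 != "Running" { bad = 1 } END { exit (NR == 0 || bad) ? 1 : 0 }'
}

# Example use against a cluster (assumes the collector pods are in the current project):
# oc get pods --selector logging-infra=collector --no-headers | all_pods_running \
#   && echo "collector pods are running"
```

Pods that are still starting report a status such as ContainerCreating, so the check might need to be repeated after a short wait.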

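Troubleshooting Loki rate limit errors

If the Log Forwarder API forwards a large block of messages that exceeds the rate limit to Loki, Loki generates rate limit (429) errors.

These errors can occur during normal operation. For example, when you add logging to a cluster that already has some logs, rate limit errors might occur while logging tries to ingest all of the existing log entries. In this case, if the rate at which new logs are added is lower than the total rate limit, the historical data is eventually ingested, and the rate limit errors are resolved without user intervention.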
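
As a rough illustration of why the backlog eventually drains, the spare capacity is the rate limit minus the rate of new logs, and the backlog shrinks at that spare rate. The sketch below is only a back-of-the-envelope estimate under assumed constant rates. `backlog_drain_seconds` is a hypothetical helper name, and the numbers are illustrative, not measurements.

```shell
# Hypothetical helper: estimate how many seconds a backlog takes to drain.
# Arguments (whole MB and MB/s): backlog size, ingestion rate limit, rate of new logs.
# Assumes constant rates and that retried requests eventually succeed.
backlog_drain_seconds() {
  backlog_mb=$1 limit=$2 new_rate=$3
  spare=$((limit - new_rate))
  if [ "$spare" -le 0 ]; then
    echo "never"   # new logs alone meet or exceed the limit
    return 1
  fi
  # Round up so that a partial second counts as a full one.
  echo $(( (backlog_mb + spare - 1) / spare ))
}

backlog_drain_seconds 1200 8 5   # prints 400
```

When the rate of new logs meets or exceeds the limit, the helper reports `never`, which corresponds to the case in which the errors persist.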

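If the rate limit errors continue to occur, you can fix the issue by modifying the LokiStack custom resource (CR).

The LokiStack CR is not available on Grafana-hosted Loki. This topic does not apply to Grafana-hosted Loki servers.

Conditions

The Log Forwarder API is configured to forward logs to Loki.

Your system sends a block of messages that is larger than 2 MB to Loki. For example: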

"values":[["1630410392689800468","{\"kind\":\"Event\",\"apiVersion\":\
.......
......
......
......
\"received_at\":\"2021-08-31T11:46:32.800278+00:00\",\"version\":\"1.7.4 1.6.0\"}},\"@timestamp\":\"2021-08-31T11:46:32.799692+00:00\",\"viaq_index_name\":\"audit-write\",\"viaq_msg_id\":\"MzFjYjJkZjItNjY0MC00YWU4LWIwMTEtNGNmM2E5ZmViMGU4\",\"log_type\":\"audit\"}"]]}]}

After you enter `oc logs -n openshift-logging -l component=collector`, the collector logs in your cluster show a line that contains one of the following error messages:

429 Too Many Requests Ingestion rate limit exceeded

Example Vector error message

2023-08-25T16:08:49.301780Z  WARN sink{component_kind="sink" component_id=default_loki_infra component_type=loki component_name=default_loki_infra}: vector::sinks::util::retries: Retrying after error. error=Server responded with an error: 429 Too Many Requests internal_log_rate_limit=true

Example Fluentd error message

2023-08-30 14:52:15 +0000 [warn]: [default_loki_infra] failed to flush the buffer. retry_times=2 next_retry_time=2023-08-30 14:52:19 +0000 chunk="604251225bf5378ed1567231a1c03b8b" error_class=Fluent::Plugin::LokiOutput::LogPostError error="429 Too Many Requests Ingestion rate limit exceeded for user infrastructure (limit: 4194304 bytes/sec) while attempting to ingest '4082' lines totaling '7820025' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased\n"

The error is also visible on the receiving end, for example in the LokiStack ingester pod:

Example Loki ingester error message

level=warn ts=2023-08-30T14:57:34.155592243Z caller=grpc_logging.go:43 duration=1.434942ms method=/logproto.Pusher/Push err="rpc error: code = Code(429) desc = entry with timestamp 2023-08-30 14:57:32.012778399 +0000 UTC ignored, reason: 'Per stream rate limit exceeded (limit: 3MB/sec) while attempting to ingest for stream
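
The Fluentd message above reports both the current limit and the size of the rejected batch, which can help when choosing new limit values. The following sketch extracts those two numbers from collector log lines. `summarize_fluentd_429` is a hypothetical helper name, and the pattern assumes the message wording shown in the Fluentd example. The limit in that message, 4194304, equals 4 × 1048576, so the sketch treats 1 MB as 1048576 bytes.

```shell
# Hypothetical helper: reads collector log lines on stdin and, for each Fluentd
# 429 message, prints the limit in bytes/sec, the rejected batch size in bytes,
# and the batch size rounded up to whole MB (1 MB = 1048576 bytes here).
summarize_fluentd_429() {
  sed -n "s/.*(limit: \([0-9][0-9]*\) bytes\/sec).*totaling '\([0-9][0-9]*\)' bytes.*/\1 \2/p" |
  while read -r limit bytes; do
    echo "limit=$limit batch=$bytes batch_mb=$(( (bytes + 1048575) / 1048576 ))"
  done
}

# Example use against a cluster:
# oc logs -n openshift-logging -l component=collector | summarize_fluentd_429
```

For the example Fluentd message, this prints `limit=4194304 batch=7820025 batch_mb=8`, which suggests that a push of roughly 8 MB was attempted against a limit of 4 MB per second.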

Procedure

Update the ingestionBurstSize and ingestionRate fields in the LokiStack CR:

apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  limits:
    global:
      ingestion:
        ingestionBurstSize: 16 # (1)
        ingestionRate: 8 # (2)
# ...

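(1) The `ingestionBurstSize` field defines the maximum local rate-limited sample size per distributor replica in MB. This value is a hard limit. Set this value to at least the maximum log size expected in a single push request. Single requests that are larger than the `ingestionBurstSize` value are not permitted.
(2) The `ingestionRate` field is a soft limit on the maximum amount of ingested samples per second in MB. Rate limit errors occur if the rate of logs exceeds the limit, but the collector retries sending the logs. As long as the total average is lower than the limit, the system recovers and the errors are resolved without user intervention.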
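
One possible way to apply the two fields without editing the whole CR is a JSON merge patch. The sketch below is an assumption about workflow, not part of the documented procedure. `lokistack_ingestion_patch` is a hypothetical helper name, and the LokiStack name and namespace are taken from the example CR.

```shell
# Hypothetical helper: prints a JSON merge patch that sets both ingestion limits (in MB).
# Arguments: ingestionBurstSize, ingestionRate.
lokistack_ingestion_patch() {
  printf '{"spec":{"limits":{"global":{"ingestion":{"ingestionBurstSize":%d,"ingestionRate":%d}}}}}\n' "$1" "$2"
}

# Example use against a cluster, with the name and namespace from the example CR:
# oc patch lokistack logging-loki -n openshift-logging --type merge \
#   -p "$(lokistack_ingestion_patch 16 8)"
```

A merge patch changes only the listed fields and leaves the rest of the LokiStack spec as it is.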
