
Most errors occur during Collector startup, when Collector configures itself and finds or downloads a kernel driver for the system.

The following diagram describes the main parts of the Collector startup process:

Figure 1. Collector pod startup process

If any part of the startup procedure fails, the logs display a diagnostic summary that shows which steps succeeded and which failed.

The following log file example shows a successful startup:

[INFO    2022/11/28 13:21:55] == Collector Startup Diagnostics: ==
[INFO    2022/11/28 13:21:55]  Connected to Sensor?       true
[INFO    2022/11/28 13:21:55]  Kernel driver available?   true
[INFO    2022/11/28 13:21:55]  Driver loaded into kernel? true
[INFO    2022/11/28 13:21:55] ====================================

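The log output confirms that Collector connected to Sensor and that it found and loaded a kernel driver. You can use this log to check whether Collector started successfully.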
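To check the summary without reading it line by line, you can pipe the log through a small shell function. The following is a minimal sketch, not part of RHACS: the function name collector_startup_ok is hypothetical, and its patterns assume the log format shown in the examples in this section (a "== Collector Startup Diagnostics: ==" header, check lines ending in true or false, a closing line of equals signs, and a [FATAL prefix when startup aborts).

```shell
# Hypothetical helper (not shipped with RHACS): reads Collector log text on
# stdin and exits 0 only if a "Collector Startup Diagnostics" summary is present,
# none of its checks reports "false", and no FATAL line appears.
# The patterns assume the log format shown in the examples in this section.
collector_startup_ok() {
  awk '
    # Start of the diagnostic summary block.
    /== Collector Startup Diagnostics: ==/ { in_summary = 1; seen = 1; next }
    # Closing line of the summary block.
    /====================================/ { in_summary = 0; next }
    # A check such as "Connected to Sensor?   false" marks a failed step.
    in_summary && /[?] +false/ { failed = 1 }
    # A FATAL line means Collector aborted its startup.
    /^\[FATAL/ { failed = 1 }
    END { if (seen && !failed) exit 0; exit 1 }
  '
}

# Example usage (assumes the Collector container is named "collector"):
#   kubectl -n stackrox logs <collector_pod_name> -c collector | collector_startup_ok \
#     && echo "Collector started successfully"
```

A missing summary counts as a failure, because Collector prints the summary only after its startup checks run.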

Unable to connect to Sensor

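On startup, Collector first checks whether it can connect to Sensor. Sensor is responsible for downloading kernel drivers and CIDR blocks for processing network events, which makes it an essential part of the startup process. The following log indicates that Collector cannot connect to Sensor: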

Collector Version: 3.15.0
OS: Ubuntu 20.04.4 LTS
Kernel Version: 5.4.0-126-generic
Starting StackRox Collector...
[INFO    2023/05/13 12:20:43] Hostname: 'hostname'
[...]
[INFO    2023/05/13 12:20:43] Sensor configured at address: sensor.stackrox.svc:9998
[INFO    2023/05/13 12:20:43] Attempting to connect to Sensor
[INFO    2023/05/13 12:21:13]
[INFO    2023/05/13 12:21:13] == Collector Startup Diagnostics: ==
[INFO    2023/05/13 12:21:13]  Connected to Sensor?       false
[INFO    2023/05/13 12:21:13]  Kernel driver candidates:
[INFO    2023/05/13 12:21:13] ====================================
[INFO    2023/05/13 12:21:13]
[FATAL   2023/05/13 12:21:13] Unable to connect to Sensor.

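This error might mean that Sensor did not start correctly or that Collector is configured incorrectly. To resolve the issue, verify the Collector configuration to make sure that the Sensor address is correct, and check that the Sensor pod is running correctly.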

Check the Collector logs for the configured Sensor address. Alternatively, you can run the following command:

$ kubectl -n stackrox get pod <collector_pod_name> -o jsonpath='{.spec.containers[0].env[?(@.name=="GRPC_SERVER")].value}' (1)
1 For <collector_pod_name>, specify the name of your Collector pod, for example, collector-vclg5.
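If you read the address from the logs instead, a sketch like the following extracts it from the "Sensor configured at address:" line and compares it with an expected value, for example the GRPC_SERVER value that the command above returns. Both function names are hypothetical, and the pattern assumes the log line format shown in the example log.

```shell
# Hypothetical helpers (not part of RHACS) for checking the Sensor address.

# Prints the address from the first "Sensor configured at address:" line on stdin.
collector_sensor_address() {
  sed -n 's/.*Sensor configured at address: *//p' | head -n 1
}

# Reads Collector logs on stdin and compares the logged Sensor address with the
# expected address passed as $1. Returns 0 on a match, 1 otherwise.
sensor_address_matches() {
  expected=$1
  actual=$(collector_sensor_address)
  if [ -n "$actual" ] && [ "$actual" = "$expected" ]; then
    echo "Sensor address matches: $actual"
    return 0
  fi
  echo "Sensor address mismatch: logs show '${actual:-<none>}', expected '$expected'" >&2
  return 1
}

# Example usage (assumes the Collector container is named "collector"):
#   expected=$(kubectl -n stackrox get pod <collector_pod_name> \
#     -o jsonpath='{.spec.containers[0].env[?(@.name=="GRPC_SERVER")].value}')
#   kubectl -n stackrox logs <collector_pod_name> -c collector | sensor_address_matches "$expected"
```

A match only shows that Collector uses the configured address; you still need to confirm that the Sensor pod behind that address is running.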

Kernel driver unavailable

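Collector determines whether it has a kernel driver for the node's kernel version. Collector first searches local storage for a driver of the correct version and type, and then attempts to download the driver from Sensor. The following log indicates that neither a local kernel driver nor a driver from Sensor is available: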

Collector Version: 3.15.0
OS: Alpine Linux v3.16
Kernel Version: 5.15.82-0-virt
Starting StackRox Collector...
[INFO    2023/05/30 12:00:33] Hostname: 'alpine'
[INFO    2023/05/30 12:00:33] User configured collection-method=ebpf
[INFO    2023/05/30 12:00:33] Afterglow is enabled
[INFO    2023/05/30 12:00:33] Sensor configured at address: sensor.stackrox.svc:443
[INFO    2023/05/30 12:00:33] Attempting to connect to Sensor
[INFO    2023/05/30 12:00:33] Successfully connected to Sensor.
[INFO    2023/05/30 12:00:33] Module version: 2.5.0-rc1
[INFO    2023/05/30 12:00:33] Config: collection_method:0, useChiselCache:1, scrape_interval:30, turn_off_scrape:0, hostname:alpine, processesListeningOnPorts:1, logLevel:INFO
[INFO    2023/05/30 12:00:33] Attempting to find eBPF probe - Candidate versions:
[INFO    2023/05/30 12:00:33] collector-ebpf-5.15.82-0-virt.o
[INFO    2023/05/30 12:00:33] Attempting to download collector-ebpf-5.15.82-0-virt.o
[INFO    2023/05/30 12:00:33] Attempting to download kernel object from https://sensor.stackrox.svc:443/kernel-objects/2.5.0/collector-ebpf-5.15.82-0-virt.o.gz (1)
[INFO    2023/05/30 12:00:33] HTTP Request failed with error code 404 (2)
[WARNING 2023/05/30 12:02:03] Attempted to download collector-ebpf-5.15.82-0-virt.o.gz 90 time(s)
[WARNING 2023/05/30 12:02:03] Failed to download from collector-ebpf-5.15.82-0-virt.o.gz
[WARNING 2023/05/30 12:02:03] Unable to download kernel object collector-ebpf-5.15.82-0-virt.o to /module/collector-ebpf.o.gz
[WARNING 2023/05/30 12:02:03] No suitable kernel object downloaded for collector-ebpf-5.15.82-0-virt.o
[ERROR   2023/05/30 12:02:03] Failed to initialize collector kernel components.
[INFO    2023/05/30 12:02:03]
[INFO    2023/05/30 12:02:03] == Collector Startup Diagnostics: ==
[INFO    2023/05/30 12:02:03]  Connected to Sensor?       true
[INFO    2023/05/30 12:02:03]  Kernel driver candidates:
[INFO    2023/05/30 12:02:03]    collector-ebpf-5.15.82-0-virt.o (unavailable)
[INFO    2023/05/30 12:02:03] ====================================
[INFO    2023/05/30 12:02:03]
[FATAL   2023/05/30 12:02:03] Failed to initialize collector kernel components. (3)
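1 The log shows that Collector first tries to find the module locally and then attempts to download the driver from Sensor.
2 The 404 error indicates that no kernel driver exists for the node's kernel.
3 Because the driver is missing, Collector enters the CrashLoopBackOff state.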

The kernel versions file contains a list of all supported kernel versions.
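To see which driver Collector looked for and whether it was found, and which kernel version to look up in the kernel versions file, you can extract both from the log. This is a minimal sketch: the function names are hypothetical, and the patterns assume the "Kernel Version:" header line and the "Kernel driver candidates:" list exactly as they appear in the example logs in this section.

```shell
# Hypothetical helpers (not part of RHACS).

# Prints "<candidate> <status>" for each kernel driver candidate listed in the
# "Kernel driver candidates:" part of the startup diagnostics read from stdin.
collector_driver_candidates() {
  awk '
    /Kernel driver candidates:/ { in_list = 1; next }
    /====================================/ { in_list = 0 }
    in_list && /[(](available|unavailable)[)]$/ {
      status = $NF
      gsub(/[()]/, "", status)
      print $(NF - 1), status
    }
  '
}

# Prints the node kernel version from the "Kernel Version:" line that Collector
# logs at startup, for comparison with the list of supported kernel versions.
collector_kernel_version() {
  sed -n 's/^Kernel Version: *//p' | head -n 1
}

# Example usage (assumes the Collector container is named "collector"):
#   kubectl -n stackrox logs <collector_pod_name> -c collector > collector.log
#   collector_driver_candidates < collector.log
#   collector_kernel_version < collector.log
```

A candidate marked unavailable, together with a kernel version that is missing from the kernel versions file, points to the situation described in this section.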

Failed to load the kernel driver

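Collector loads the kernel driver before it starts. In rare cases, however, Collector cannot load the kernel driver, which results in various error messages or exceptions. If this happens, check the logs to identify why loading the kernel driver failed.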

Consider the following Collector logs:

[INFO    2023/05/13 14:25:13] Hostname: 'hostname'
[...]
[INFO    2023/05/13 14:25:13] Successfully downloaded and decompressed /module/collector.o
[INFO    2023/05/13 14:25:13]
[INFO    2023/05/13 14:25:13] This product uses ebpf subcomponents licensed under the GNU
[INFO    2023/05/13 14:25:13] GENERAL PURPOSE LICENSE Version 2 outlined in the /kernel-modules/LICENSE file.
[INFO    2023/05/13 14:25:13] Source code for the ebpf subcomponents is available at
[INFO    2023/05/13 14:25:13] https://github.com/stackrox/falcosecurity-libs/
[INFO    2023/05/13 14:25:13]
-- BEGIN PROG LOAD LOG --
[...]
-- END PROG LOAD LOG --
[WARNING 2023/05/13 14:25:13] libscap: bpf_load_program() event=tracepoint/syscalls/sys_enter_chdir: Operation not permitted
[ERROR   2023/05/13 14:25:13] Failed to setup collector-ebpf-6.2.0-20-generic.o
[ERROR   2023/05/13 14:25:13] Failed to initialize collector kernel components.
[INFO    2023/05/13 14:25:13]
[INFO    2023/05/13 14:25:13] == Collector Startup Diagnostics: ==
[INFO    2023/05/13 14:25:13]  Connected to Sensor?       true
[INFO    2023/05/13 14:25:13]  Kernel driver candidates:
[INFO    2023/05/13 14:25:13]    collector-ebpf-6.2.0-20-generic.o (available)
[INFO    2023/05/13 14:25:13] ====================================
[INFO    2023/05/13 14:25:13]
[FATAL   2023/05/13 14:25:13] Failed to initialize collector kernel components.

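If you encounter this type of error, you are unlikely to be able to fix it yourself. Instead, report the issue to the Red Hat Advanced Cluster Security for Kubernetes (RHACS) support team or create a GitHub issue.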
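When you prepare such a report, a short excerpt can make the failing step easier to spot. The following sketch keeps the eBPF program load log between the "-- BEGIN PROG LOAD LOG --" and "-- END PROG LOAD LOG --" markers plus every WARNING, ERROR, and FATAL line. The function name is hypothetical, and this is not an official RHACS data-collection procedure, so attach the complete logs as well if support asks for them.

```shell
# Hypothetical helper (not an official RHACS data-collection tool): reads
# Collector log text on stdin and prints the eBPF program load log block and
# every WARNING, ERROR, or FATAL line, in their original order.
collector_failure_excerpt() {
  awk '
    # Enter the program load log block and print it verbatim.
    /^-- BEGIN PROG LOAD LOG --/ { in_prog = 1 }
    in_prog {
      print
      if ($0 ~ /^-- END PROG LOAD LOG --/) in_prog = 0
      next
    }
    # Outside that block, keep only warning and error severities.
    /^\[(WARNING|ERROR|FATAL)/ { print }
  '
}

# Example usage: --previous reads the logs of the last terminated container,
# which helps while the pod is in CrashLoopBackOff
# (assumes the Collector container is named "collector").
#   kubectl -n stackrox logs <collector_pod_name> -c collector --previous \
#     | collector_failure_excerpt > collector-failure-excerpt.txt
```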