Troubleshooting an Android HAL-Layer Crash

While debugging the system's HDMI CEC feature recently, I ran into a strange crash. This post records the investigation.

Initial analysis

First, the log:

--------- beginning of crash
03-06 10:48:25.503  1133  1133 F DEBUG   : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
03-06 10:48:25.503  1133  1133 F DEBUG   : Build fingerprint: ':13/TD1A.220804.031/3582:userdebug/release-keys'
03-06 10:48:25.503  1133  1133 F DEBUG   : Revision: '0'
03-06 10:48:25.503  1133  1133 F DEBUG   : ABI: 'arm64'
03-06 10:48:25.503  1133  1133 F DEBUG   : Timestamp: 2024-03-06 10:48:25.490260378-0500
03-06 10:48:25.503  1133  1133 F DEBUG   : Process uptime: 6s
03-06 10:48:25.503  1133  1133 F DEBUG   : Cmdline: /vendor/bin/hw/android.hardware.tv.cec@1.0-service
03-06 10:48:25.503  1133  1133 F DEBUG   : pid: 615, tid: 615, name: cec@1.0-service  >>> /vendor/bin/hw/android.hardware.tv.cec@1.0-service <<<
03-06 10:48:25.503  1133  1133 F DEBUG   : uid: 1000
03-06 10:48:25.503  1133  1133 F DEBUG   : tagged_addr_ctrl: 0000000000000001 (PR_TAGGED_ADDR_ENABLE)
03-06 10:48:25.503  1133  1133 F DEBUG   : signal 6 (SIGABRT), code -1 (SI_QUEUE), fault addr --------
03-06 10:48:25.503  1133  1133 F DEBUG   : Abort message: 'stack corruption detected (-fstack-protector)'
03-06 10:48:25.503  1133  1133 F DEBUG   :     x0  0000000000000000  x1  0000000000000267  x2  0000000000000006  x3  0000007fe8d61420
03-06 10:48:25.503  1133  1133 F DEBUG   :     x4  0000000000808080  x5  0000000000808080  x6  0000000000808080  x7  8080808080808080
03-06 10:48:25.503  1133  1133 F DEBUG   :     x8  00000000000000f0  x9  00000077cc5b4a00  x10 0000000000000001  x11 00000077cc5f2ce4
03-06 10:48:25.503  1133  1133 F DEBUG   :     x12 0101010101010101  x13 000000007fffffff  x14 0000000000001686  x15 0000000000000030
03-06 10:48:25.503  1133  1133 F DEBUG   :     x16 00000077cc657d60  x17 00000077cc634b70  x18 00000077d3ae2000  x19 0000000000000267
03-06 10:48:25.504  1133  1133 F DEBUG   :     x20 0000000000000267  x21 00000000ffffffff  x22 0000000000000030  x23 00000077d302a000
03-06 10:48:25.504  1133  1133 F DEBUG   :     x24 0000000000000004  x25 00000077d302a000  x26 00000077d302a000  x27 b40000763c5972c8
03-06 10:48:25.504  1133  1133 F DEBUG   :     x28 0000000000000000  x29 0000007fe8d614a0
03-06 10:48:25.504  1133  1133 F DEBUG   :     lr  00000077cc5e4868  sp  0000007fe8d61400  pc  00000077cc5e4894  pst 0000000000001000
03-06 10:48:25.504  1133  1133 F DEBUG   : backtrace:
03-06 10:48:25.504  1133  1133 F DEBUG   :       #00 pc 0000000000051894  /apex/com.android.runtime/lib64/bionic/libc.so (abort+164) (BuildId: 058e3ec96fa600fb840a6a6956c6b64e)
03-06 10:48:25.504  1133  1133 F DEBUG   :       #01 pc 00000000000664e8  /apex/com.android.runtime/lib64/bionic/libc.so (__stack_chk_fail+20) (BuildId: 058e3ec96fa600fb840a6a6956c6b64e)
03-06 10:48:25.504  1133  1133 F DEBUG   :       #02 pc 0000000000006954  /vendor/lib64/hw/android.hardware.tv.cec@1.0-impl.so (android::hardware::tv::cec::V1_0::implementation::HdmiCec::getPortInfo(std::__1::function<void (android::hardware::hidl_vec<android::hardware::tv::cec::V1_0::HdmiPortInfo> const&)>)+376) (BuildId: 647cc2659b38df33f681ae1d58a04c74)
03-06 10:48:25.504  1133  1133 F DEBUG   :       #03 pc 0000000000016540  /vendor/lib64/android.hardware.tv.cec@1.0.so (android::hardware::tv::cec::V1_0::BnHwHdmiCec::_hidl_getPortInfo(android::hidl::base::V1_0::BnHwBase*, android::hardware::Parcel const&, android::hardware::Parcel*, std::__1::function<void (android::hardware::Parcel&)>)+252) (BuildId: 8ca54579dc40d30a62824bb0a91d98f4)
03-06 10:48:25.504  1133  1133 F DEBUG   :       #04 pc 0000000000017668  /vendor/lib64/android.hardware.tv.cec@1.0.so (android::hardware::tv::cec::V1_0::BnHwHdmiCec::onTransact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+1132) (BuildId: 8ca54579dc40d30a62824bb0a91d98f4)
03-06 10:48:25.504  1133  1133 F DEBUG   :       #05 pc 000000000008ee40  /apex/com.android.vndk.v33/lib64/libhidlbase.so (android::hardware::BHwBinder::transact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+156) (BuildId: 3fafcf3a9734f0d41045c2b5f828b363)
03-06 10:48:25.504  1133  1133 F DEBUG   :       #06 pc 0000000000093dfc  /apex/com.android.vndk.v33/lib64/libhidlbase.so (android::hardware::IPCThreadState::executeCommand(int)+2784) (BuildId: 3fafcf3a9734f0d41045c2b5f828b363)
03-06 10:48:25.504  1133  1133 F DEBUG   :       #07 pc 00000000000931bc  /apex/com.android.vndk.v33/lib64/libhidlbase.so (android::hardware::IPCThreadState::getAndExecuteCommand()+224) (BuildId: 3fafcf3a9734f0d41045c2b5f828b363)
03-06 10:48:25.504  1133  1133 F DEBUG   :       #08 pc 0000000000094388  /apex/com.android.vndk.v33/lib64/libhidlbase.so (android::hardware::IPCThreadState::joinThreadPool(bool)+172) (BuildId: 3fafcf3a9734f0d41045c2b5f828b363)
03-06 10:48:25.504  1133  1133 F DEBUG   :       #09 pc 00000000000010e4  /vendor/bin/hw/android.hardware.tv.cec@1.0-service (main+144) (BuildId: f6a65dc725b06643501c269fa219b717)
03-06 10:48:25.504  1133  1133 F DEBUG   :       #10 pc 000000000004a0f4  /apex/com.android.runtime/lib64/bionic/libc.so (__libc_init+96) (BuildId: 058e3ec96fa600fb840a6a6956c6b64e)
03-06 10:48:26.344  1267  1267 F DEBUG   : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

At first glance, the crash is in the android.hardware.tv.cec@1.0-service process. Simple enough: reach for addr2line.

addr2line

addr2line --help
Usage: addr2line [option(s)] [addr(s)]
 Convert addresses into line number/file name pairs.
 If no addresses are specified on the command line, they will be read from stdin
 The options are:
  @<file>                Read options from <file>
  -a --addresses         Show addresses
  -b --target=<bfdname>  Set the binary file format
  -e --exe=<executable>  Set the input file name (default is a.out)
  -i --inlines           Unwind inlined functions
  -j --section=<name>    Read section-relative offsets instead of addresses
  -p --pretty-print      Make the output easier to read for humans
  -s --basenames         Strip directory names
  -f --functions         Show function names
  -C --demangle[=style]  Demangle function names
  -R --recurse-limit     Enable a limit on recursion whilst demangling.  [Default]
  -r --no-recurse-limit  Disable a limit on recursion whilst demangling
  -h --help              Display this information
  -v --version           Display the program's version

addr2line: supported targets: elf64-x86-64 elf32-i386 elf32-iamcu elf32-x86-64 pei-i386 pe-x86-64 pei-x86-64 elf64-l1om elf64-k1om elf64-little elf64-big elf32-little elf32-big pe-bigobj-x86-64 pe-i386 srec symbolsrec verilog tekhex binary ihex plugin
Report bugs to <https://sourceware.org/bugzilla/>

addr2line -ife out/target/product/aosp/symbols/vendor/lib64/hw/android.hardware.tv.cec@1.0-impl.so 0000000000006954

llvm-addr2line

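Then I remembered that Android's compile and link toolchain has been updated, and this version of the tool indeed cannot be used. So I switched to llvm-addr2line.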

prebuilts/clang/host/linux-x86/llvm-binutils-stable/llvm-addr2line --help
OVERVIEW: llvm-addr2line

USAGE: llvm-addr2line [options] addresses...

OPTIONS:
  --addresses           Show address before line information
  --adjust-vma=<offset> Add specified offset to object file addresses
  -a                    Alias for --addresses
  --basenames           Strip directory names from paths
  -C                    Alias for --demangle
  --debug-file-directory=<dir>
                        Path to directory where to look for debug files
  -demangle=false       Alias for --no-demangle
  -demangle=true        Alias for --demangle
  --demangle            Demangle function names
  --dia                 Use the DIA library to access symbols (Windows only)
  --dwp=<file>          Path to DWP file to be use for any split CUs
  -e=<file>             Alias for --obj
  --exe=<file>          Alias for --obj
  --exe <file>          Alias for --obj
  -e <file>             Alias for --obj
  -f=<value>            Alias for --functions=
  --fallback-debug-path=<dir>
                        Fallback path for debug binaries
  --functions=<value>   Print function name for a given address
  --functions           Print function name for a given address
  -f                    Alias for --functions
  --help                Display this help
  --inlines             Print all inlined frames for a given address
  --inlining=false      Alias for --no-inlines
  --inlining=true       Alias for --inlines
  --inlining            Alias for --inlines
  -i                    Alias for --inlines
  --no-demangle         Don't demangle function names
  --no-inlines          Do not print inlined frames
  --no-untag-addresses  Remove memory tags from addresses before symbolization
  --obj=<file>          Path to object file to be symbolized (if not provided, object file should be specified for each input line)
  --output-style=style  Specify print style. Supported styles: LLVM, GNU, JSON
  --pretty-print        Make the output more human friendly
  --print-address       Alias for --addresses
  --print-source-context-lines=<value>
                        Print N lines of source file context
  -p                    Alias for --pretty-print
  --relative-address    Interpret addresses as addresses relative to the image base
  --relativenames       Strip the compilation directory from paths
  -s                    Alias for --basenames
  --verbose             Print verbose line info
  --version             Display the version
  -v                    Alias for --version

llvm-symbolizer Mach-O Specific Options:
  --default-arch=<value> Default architecture (for multi-arch objects)
  --dsym-hint=<dir>      Path to .dSYM bundles to search for debug info for the object files

Pass @FILE as argument to read options from FILE.

So the lookup command becomes:

prebuilts/clang/host/linux-x86/llvm-binutils-stable/llvm-addr2line -ife out/target/product/aosp/symbols/vendor/lib64/hw/android.hardware.tv.cec@1.0-impl.so 0000000000006954
_ZN7android8hardware2tv3cec4V1_014implementation7HdmiCec11getPortInfoENSt3__18functionIFvRKNS0_8hidl_vecINS3_12HdmiPortInfoEEEEEE
hardware/interfaces/tv/cec/1.0/default/HdmiCec.cpp:0

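Line 0 of the source file? How is that possible? Now it was my turn to crash. (Line 0 usually means the address belongs to compiler-generated code that has no source line, which would fit a stack-protector check.)

Background

So the usual approach above cannot pinpoint the crashing line directly.

Instead, let's start from the process name, the reported function name getPortInfo, and the corresponding abort message: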

Abort message: 'stack corruption detected (-fstack-protector)'

and see what they reveal.

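Stack corruption detected by -fstack-protector

The compiler's -fstack-protector option inserts checks into functions that have on-stack buffers, to guard against buffer overruns. It is enabled by default for platform code (but not for apps). With the option enabled, the compiler adds instructions to the function prologue that write a random value on the stack just past the last local variable, and adds instructions to the function epilogue that read the value back and confirm it has not changed. If the value has changed, it was overwritten by a buffer overrun, so the epilogue calls __stack_chk_fail to log a message and abort.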

pid: 26717, tid: 26717, name: crasher  >>> crasher <<<
signal 6 (SIGABRT), code -6 (SI_TKILL), fault addr --------
Abort message: 'stack corruption detected'
    r0 00000000  r1 0000685d  r2 00000006  r3 00000008
    r4 ffd516d8  r5 0000685d  r6 0000685d  r7 0000010c
    r8 00000000  r9 00000000  sl 00000000  fp ffd518bc
    ip 00000000  sp ffd516c8  lr ee63ece3  pc ee66ef0c  cpsr 000e0010

backtrace:
    #00 pc 00049f0c  /system/lib/libc.so (tgkill+12)
    #01 pc 00019cdf  /system/lib/libc.so (abort+50)
    #02 pc 0001e07d  /system/lib/libc.so (__libc_fatal+24)
    #03 pc 0004863f  /system/lib/libc.so (__stack_chk_fail+6)
    #04 pc 000013ed  /system/xbin/crasher (smash_stack+76)
    #05 pc 00001591  /system/xbin/crasher (do_action+280)
    #06 pc 00002219  /system/xbin/crasher (main+100)
    #07 pc 000177a1  /system/lib/libc.so (__libc_init+48)
    #08 pc 00001144  /system/xbin/crasher (_start+96)

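0x00 Overview

Stack overflow protection is a mitigation against buffer overflow attacks. When a function has a buffer overflow vulnerability, an attacker can overwrite the return address on the stack so that shellcode gets executed. With stack protection enabled, the function places a cookie on the stack when it starts executing and verifies the cookie just before it returns; if the cookie is no longer valid, the program is aborted. An attacker who overwrites the return address usually overwrites the cookie as well, so the check fails and the shellcode never runs. On Linux this cookie is called a canary (the term used from here on).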
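To make the canary idea concrete, the sketch below imitates by hand what the compiler-generated prologue and epilogue do. It is only a model: the names `fake_frame`, `CANARY`, and `canary_survives` are invented for this illustration, the guard is a fixed constant rather than the random value a real build uses, and a real failure calls `__stack_chk_fail` instead of returning a flag.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical guard value; real builds use a random per-process value. */
#define CANARY UINT64_C(0x00C0FFEE0DDF00D5)

/* Model of a protected frame: the guard sits right after the local buffer. */
struct fake_frame {
    char     buf[8];
    uint64_t canary;
};

/* Writes n bytes starting at buf and reports whether the guard survived.
 * Returns 1 if the canary is intact, 0 if the write trampled it. */
int canary_survives(size_t n)
{
    struct fake_frame frame;

    assert(n <= sizeof frame);      /* keep the model itself in bounds */
    frame.canary = CANARY;          /* "prologue": plant the guard */
    memset(&frame, 'A', n);         /* the (possibly too long) write */
    return frame.canary == CANARY;  /* "epilogue": verify the guard */
}
```

A write of 8 bytes fits the buffer and leaves the guard alone, while a write of 9 bytes or more changes it, which is the condition the real epilogue treats as stack corruption.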

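GCC added the -fstack-protector and -fstack-protector-all options in version 4.1 to support stack protection, and version 4.9 added -fstack-protector-strong to widen the range of protected functions. The source article compares -fstack-protector and -fstack-protector-strong in a figure:

[Figure: differences between -fstack-protector and -fstack-protector-strong, from 原创技术干货 | 解读Linux安全机制之栈溢出保护]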

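Linux has three kinds of stacks:

  1. Application stack: runs in Ring 3 and is maintained by the application;

  2. Kernel process-context stack: runs in Ring 0 and is created by the kernel when it creates a thread;

  3. Kernel interrupt-context stack: runs in Ring 0; one is created for each CPU core during kernel initialization.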

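So there is probably a memory overrun somewhere. The customer had not seen this problem before multiple HDMI CEC ports were configured; it appeared only after several ports were configured. The Java-layer code is identical in both cases, so the part that changed is the HAL layer.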

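Locating the change

Since a fairly specific change triggered the problem, let's start the investigation from that change.

This is where the HAL is called, and it is also the function the crash points to: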

Return<void> HdmiCec::getPortInfo(getPortInfo_cb _hidl_cb) {
    struct hdmi_port_info* legacyPorts;
    int numPorts;
    hidl_vec<HdmiPortInfo> portInfos;
    mDevice->get_port_info(mDevice, &legacyPorts, &numPorts);
    portInfos.resize(numPorts);
    for (int i = 0; i < numPorts; ++i) {
        portInfos[i] = {
            .type = static_cast<HdmiPortType>(legacyPorts[i].type),
            .portId = static_cast<uint32_t>(legacyPorts[i].port_id),
            .cecSupported = legacyPorts[i].cec_supported != 0,
            .arcSupported = legacyPorts[i].arc_supported != 0,
            .physicalAddress = legacyPorts[i].physical_address
        };
    }
    _hidl_cb(portInfos);
    return Void();
}

Original version

struct hdmi_cec_context_t {
    hdmi_cec_device_t device;
    /* our private state goes below here */
    event_callback_t event_callback;
    void* cec_arg;
    struct hdmi_port_info port;
    int fd;
    int en_mask;
    bool enable;
    bool system_control;
    int phy_addr;
    bool hotplug;
    bool cec_init;
};

static void hdmi_cec_get_port_info(const struct hdmi_cec_device* dev,
                   struct hdmi_port_info* list[], int* total)
{
...
    list[0] = &ctx->port;
    list[0]->type = HDMI_OUTPUT;
    list[0]->port_id = HDMI_CEC_PORT_ID;
    list[0]->cec_supported = support;
    list[0]->arc_supported = 0;
    list[0]->physical_address = val;
    *total = 1;
}

Faulty version

struct hdmi_cec_context_t {
    hdmi_cec_device_t device;
    /* our private state goes below here */
    event_callback_t event_callback;
    void* cec_arg;
    struct hdmi_port_info port[4];
    int fd;
    int en_mask;
    bool enable;
    bool system_control;
    int phy_addr;
    bool hotplug;
    bool cec_init;
};
static void hdmi_cec_get_port_info(const struct hdmi_cec_device* dev,
				   struct hdmi_port_info* list[], int* total)
{
...

	list[0] = &ctx->port[0];
	list[0]->type = HDMI_INPUT;
	list[0]->port_id = 1;
	list[0]->cec_supported = support;
	list[0]->arc_supported = 0;
	list[0]->physical_address = 0x1000;//CVT_DEF_ARC_PHYSICAL_ADDRESS;

	list[1] = &ctx->port[1];
	list[1]->type = HDMI_INPUT;
	list[1]->port_id = 2;
	list[1]->cec_supported = support;
	list[1]->arc_supported = 0;
	list[1]->physical_address = 0x3000;

	list[2] = &ctx->port[2];
	list[2]->type = HDMI_INPUT;
	list[2]->port_id = 3;
	list[2]->cec_supported = support;
	list[2]->arc_supported = 0;
	list[2]->physical_address = 0x4000;
	
	list[3] = &ctx->port[3];
	list[3]->type = HDMI_INPUT;
	list[3]->port_id = 4;
	list[3]->cec_supported = support;
	list[3]->arc_supported = 1;
	list[3]->physical_address = 0x2000;

	*total = 4;
}

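In testing, adding only two entries (list[0] and list[1]) did not crash either. This looks like a memory overrun. I checked the related count definitions and limits but found nothing that caps the number of ports at two. One report even said a three-port configuration had worked once.

Pay attention to how the following variable is defined and passed: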

struct hdmi_port_info* legacyPorts;
mDevice->get_port_info(mDevice, &legacyPorts, &numPorts);

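The parameter of hdmi_cec_get_port_info

struct hdmi_port_info* list[] reads like an array of pointers, each element pointing to a struct hdmi_port_info. As a function parameter, however, it is adjusted to struct hdmi_port_info** list, a pointer to a pointer, and the callee has no way to know how many elements the caller actually provided. Here the caller passes &legacyPorts, the address of one pointer, so list[0] is the only element with valid storage behind it.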
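The point about the parameter type can be checked with a short C sketch. It uses an invented `struct port` and invented function names rather than the HAL's real definitions, and it only shows two things: the array-style and pointer-style declarations name the same function type, and the callee writes through whatever single pointer the caller supplied.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct hdmi_port_info (hypothetical, for illustration only). */
struct port { int id; };

/* Two declarations of the same function: in a parameter list, T *list[] is
 * adjusted to T **list, so the second line is a legal redeclaration. */
void first_port(struct port *list[], struct port *value);
void first_port(struct port **list, struct port *value);

/* Stores into list[0], which is the same object as *list. */
void first_port(struct port **list, struct port *value)
{
    assert(list != NULL);
    list[0] = value;
}

/* Caller side: one pointer variable gives the callee room for list[0] only. */
int id_via_single_pointer(void)
{
    static struct port only = { 7 };
    struct port *single = NULL;

    first_port(&single, &only);  /* &single has type struct port ** */
    return single->id;
}
```

If the two prototypes were not the same type, the compiler would reject the second one as a conflicting declaration, so the fact that this compiles is itself the demonstration.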

Fixed version

static void hdmi_cec_get_port_info(const struct hdmi_cec_device* dev,
				   struct hdmi_port_info* list[], int* total)
{
	...
	ctx->port[0].type = HDMI_INPUT;
	ctx->port[0].port_id = 1;
	ctx->port[0].cec_supported = 1;
	ctx->port[0].arc_supported = 1;
	ctx->port[0].physical_address = 0x1000;

	ctx->port[1].type = HDMI_INPUT;
	ctx->port[1].port_id = 2;
	ctx->port[1].cec_supported = 1;
	ctx->port[1].arc_supported = 0;
	ctx->port[1].physical_address = 0x2000;

	ctx->port[2].type = HDMI_INPUT;
	ctx->port[2].port_id = 3;
	ctx->port[2].cec_supported = 1;
	ctx->port[2].arc_supported = 0;
	ctx->port[2].physical_address = 0x3000;

	ctx->port[3].type = HDMI_INPUT;
	ctx->port[3].port_id = 4;
	ctx->port[3].cec_supported = 1;
	ctx->port[3].arc_supported = 0;
	ctx->port[3].physical_address = 0x4000;

	*list = &ctx->port[0];
	*total = 4;

}

Problem analysis

Let's walk through the steps involved in the process above:

1. Define the `legacyPorts` pointer:

struct hdmi_port_info* legacyPorts;
   This line defines a pointer named `legacyPorts` of type `struct hdmi_port_info*`, that is, a pointer to a `struct hdmi_port_info`.

2. Call the hdmi_cec_get_port_info function:

hdmi_cec_get_port_info(mDevice, &legacyPorts, &numPorts);
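   This line passes the address of `legacyPorts` (a pointer to the `legacyPorts` pointer) and the address of `numPorts` (a pointer to the `numPorts` variable) to `hdmi_cec_get_port_info`. In the HAL the call goes through `mDevice->get_port_info`.

3. Inside `hdmi_cec_get_port_info`: *list = ctx->port;
   In the implementation, `list` is a pointer to a pointer (it holds the address of the caller's `legacyPorts`), and `ctx->port` is an array of `struct hdmi_port_info` that decays to a pointer to its first element.
   The statement `*list = ctx->port;` stores the start address of the `ctx->port` array into the pointer that `list` points to, which is `legacyPorts`.
   Because `list` holds the address of `legacyPorts`, `legacyPorts` points to the `ctx->port` array once the call returns.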

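To sum up: passing `&legacyPorts` hands the address of `legacyPorts` to `hdmi_cec_get_port_info`, and inside the function the address of `ctx->port` is assigned to `*list`, so `legacyPorts` ends up pointing to the `ctx->port` array. Through `legacyPorts`, the caller can then read the filled-in port information in `ctx->port` from outside the function.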
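The three steps can be reproduced in a small standalone C sketch. The types and names (`struct port`, `struct ctx`, `get_ports`, `sum_port_ids`) are simplified stand-ins invented here, not the real `hdmi_cec` definitions; the point is only the out-parameter pattern, in which the callee publishes the base of its own array through `*list`.

```c
#include <assert.h>
#include <stddef.h>

#define PORT_COUNT 4

/* Simplified, hypothetical stand-ins for hdmi_port_info and the HAL context. */
struct port { int port_id; int physical_address; };
struct ctx  { struct port port[PORT_COUNT]; };

/* Fills the context's own array, then hands its base address to the caller. */
void get_ports(struct ctx *ctx, struct port *list[], int *total)
{
    assert(ctx != NULL && list != NULL && total != NULL);
    for (int i = 0; i < PORT_COUNT; ++i) {
        ctx->port[i].port_id = i + 1;
        ctx->port[i].physical_address = (i + 1) * 0x1000;
    }
    *list = ctx->port;    /* the array decays to &ctx->port[0] */
    *total = PORT_COUNT;  /* only one pointer and one int are written */
}

/* Caller side, mirroring getPortInfo: one pointer, indexed as an array. */
int sum_port_ids(void)
{
    static struct ctx ctx;
    struct port *ports = NULL;
    int count = 0;
    int sum = 0;

    get_ports(&ctx, &ports, &count);
    for (int i = 0; i < count; ++i)
        sum += ports[i].port_id;  /* consecutive structs, not pointers */
    return sum;
}
```

The callee touches exactly the two objects the caller owns (`ports` and `count`), and the caller walks `ports[i]` as consecutive structures, which is how `getPortInfo` reads `legacyPorts[i]`.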

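Summary

With the fixed version analyzed, the cause of the failure in the faulty version becomes clear.

In the faulty version, each element of struct hdmi_port_info* list[] is assigned individually, starting with list[0] = &ctx->port[0]: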

	list[0] = &ctx->port[0];
    ...
	list[3]->physical_address = 0x2000;

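At the call site, however, `legacyPorts` is a single pointer of type `struct hdmi_port_info*`, so `&legacyPorts` gives the callee room for exactly one element, `list[0]`. Writing `list[1]` through `list[3]` stores pointers into the stack memory next to `legacyPorts` inside the frame of `getPortInfo`. Once one of those stores reaches the stack canary, the epilogue check fails and `__stack_chk_fail` aborts the process. Whether a given store hits the canary or only a neighboring local depends on the frame layout, which would explain why two entries did not crash and three entries appeared to work once. In addition, `getPortInfo` reads `legacyPorts[i]` as consecutive structures, which matches the fixed version's `*list = &ctx->port[0]` rather than an array of pointers.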
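The overrun can be modeled without invoking undefined behavior by treating a pretend caller frame as an array of pointer-sized slots. Everything in this sketch is hypothetical: the layout (slot 0 for `legacyPorts`, two neighboring slots, then a guard slot), the sentinel used as the guard, and the name `guard_survives` are invented for illustration. The real frame of `getPortInfo` is laid out by the compiler and was not inspected.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct hdmi_port_info (hypothetical). */
struct port { int id; };

/* Pretend frame layout, chosen for illustration only. */
enum { SLOT_LEGACY_PORTS = 0, SLOT_GUARD = 3, SLOT_COUNT = 4 };

/* Emulates the faulty callee storing n pointers through list, where list
 * models &legacyPorts. Returns 1 if the guard slot is intact, 0 otherwise. */
int guard_survives(int n)
{
    static struct port table[SLOT_COUNT];  /* models ctx->port[] */
    static struct port sentinel;           /* its address models the canary */
    struct port *frame[SLOT_COUNT] = { NULL, NULL, NULL, &sentinel };
    struct port **list = &frame[SLOT_LEGACY_PORTS];

    assert(n >= 0 && n <= SLOT_COUNT);     /* keep the model in bounds */
    for (int i = 0; i < n; ++i)
        list[i] = &table[i];               /* the faulty per-element store */
    return frame[SLOT_GUARD] == &sentinel;
}
```

In this pretend layout, one to three entries only clobber the neighboring slots and the guard survives, while the fourth store replaces the guard. That is one plausible reading of the observed behavior, not a measurement of the actual stack.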

References:

Diagnosing Native Crashes | Android Open Source Project

原创技术干货 | 解读Linux安全机制之栈溢出保护 - 送码网