Netty is an asynchronous, event-driven network framework for rapidly building high-availability, high-performance servers and clients; it greatly simplifies network programming such as TCP and UDP socket servers. Unlike most people, who get to know Netty by building a high-performance server on top of it, my deep dive into Netty began with troubleshooting a direct-memory OOM, and the resulting process hang, in a data pass-through server.

I. Problem and Background
One day, a log pass-through module I had recently taken over began firing a flood of direct-memory OOM alerts, and shortly afterwards the process hung and the service became unavailable. The key error log:
```
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 2147483648, max: 2147483648)
    at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:775)
    at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:730)
    at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:645)
    at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:621)
    at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:204)
    at io.netty.buffer.PoolArena.tcacheAllocateNormal(PoolArena.java:188)
    at io.netty.buffer.PoolArena.allocate(PoolArena.java:138)
    at io.netty.buffer.PoolArena.allocate(PoolArena.java:128)
    at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:378)
    at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
    at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:178)
    at io.netty.channel.unix.PreferredDirectByteBufAllocator.ioBuffer(PreferredDirectByteBufAllocator.java:53)
    at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:114)
    at io.netty.channel.epoll.EpollRecvByteAllocatorHandle.allocate(EpollRecvByteAllocatorHandle.java:75)
    at io.netty.channel.epoll.EpollDatagramChannel$EpollDatagramChannelUnsafe.epollInReady(EpollDatagramChannel.java:485)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$1.run(AbstractEpollChannel.java:388)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:387)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
```
Once the problem hit, the first step was emergency triage and recovery. We found that at that moment the business had burst-uploaded a large volume of data in a short window, which was the direct trigger of the direct-memory OOM. Doubling the direct memory with the JVM flag -XX:MaxDirectMemorySize=4G and restarting restored the service.

II. Problem Analysis

1. Revisiting the Scene
Netty version: 4.1.58.Final
JVM version: 1.8.0_242
Heap size: 2GB
Until then I knew fairly little about Netty or direct memory, so I started with a few basic questions.

1) The default direct-memory limit
In production we had never actually set -XX:MaxDirectMemorySize, so why was the effective direct-memory limit 2GB? The reason is that when this flag is absent, the JVM defaults the direct-memory cap to the maximum heap size (-Xmx); with a 2GB heap, that matches the "max: 2147483648" in the error log.

2) How Netty accounts for direct memory
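The default described above can be double-checked at runtime. A minimal stdlib-only probe (the class and method names are mine; the "cap equals max heap" behavior is HotSpot's default and should be treated as an assumption for other JVMs):

```java
public class DirectLimitCheck {

    /**
     * Returns the JVM's max heap size. On HotSpot, when -XX:MaxDirectMemorySize
     * is not given, this same value is used as the direct-memory cap
     * (it is what sun.misc.VM.maxDirectMemory() falls back to).
     */
    static long defaultDirectLimit() {
        return Runtime.getRuntime().maxMemory();
    }

    public static void main(String[] args) {
        System.out.println("default direct-memory cap ~ " + defaultDirectLimit() + " bytes");
    }
}
```

Run with -Xmx2g and no -XX:MaxDirectMemorySize, and the printed value lines up with the 2147483648 seen in the OOM message.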
```java
private static void incrementMemoryCounter(int capacity) {
    if (DIRECT_MEMORY_COUNTER != null) {
        long newUsedMemory = DIRECT_MEMORY_COUNTER.addAndGet(capacity);
        if (newUsedMemory > DIRECT_MEMORY_LIMIT) {
            DIRECT_MEMORY_COUNTER.addAndGet(-capacity);
            throw new OutOfDirectMemoryError("failed to allocate " + capacity
                    + " byte(s) of direct memory (used: " + (newUsedMemory - capacity)
                    + ", max: " + DIRECT_MEMORY_LIMIT + ')');
        }
    }
}
```
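The guard above can be captured in a tiny stand-alone sketch (class and method names are mine, not Netty's): the counter is optimistically incremented, and rolled back if the limit would be exceeded, which is exactly the point where Netty throws OutOfDirectMemoryError.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch of Netty-style bounded direct-memory accounting (not Netty's actual class). */
public class DirectMemoryCounter {

    private final AtomicLong used = new AtomicLong();
    private final long limit;

    public DirectMemoryCounter(long limit) {
        this.limit = limit;
    }

    /** Mirrors incrementMemoryCounter: add first, roll back on overflow of the limit. */
    public boolean tryReserve(long capacity) {
        long newUsed = used.addAndGet(capacity);
        if (newUsed > limit) {
            used.addAndGet(-capacity); // roll back, as Netty does before throwing
            return false;              // Netty throws OutOfDirectMemoryError here instead
        }
        return true;
    }

    /** Called on buffer release; without this, the counter only ever grows. */
    public void release(long capacity) {
        used.addAndGet(-capacity);
    }

    public long used() {
        return used.get();
    }
}
```

One consequence worth noticing: the accounting is purely a counter, so a "leak" here does not have to be lost memory; any buffer that is allocated but never released keeps the counter high until the limit trips.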
2. Is There a Memory Leak?
Troubleshooting direct-memory leaks is painful and tedious, but don't be intimidated by the pile of OOM logs. Before drawing any conclusion about whether direct memory is really exhausted, let's first print out the relevant metrics and analyze from there.

1) Printing the off-heap memory counter via reflection
As shown above, Netty's PlatformDependent class does its direct-memory bookkeeping in incrementMemoryCounter, so, following the approach from a Meituan engineering article, I used reflection to read DIRECT_MEMORY_COUNTER.
The implementation:
```java
// Using Spring's ReflectionUtils here. Spring for the win!
Field field = ReflectionUtils.findField(PlatformDependent.class, "DIRECT_MEMORY_COUNTER");
field.setAccessible(true);
directMem = (AtomicLong) field.get(PlatformDependent.class);
```
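If Spring is not on the classpath, the same read works with plain java.lang.reflect. A self-contained sketch, where MemoryCounterHolder is a hypothetical stand-in for PlatformDependent so the snippet runs without Netty:

```java
import java.lang.reflect.Field;
import java.util.concurrent.atomic.AtomicLong;

public class ReflectiveCounterRead {

    // Hypothetical stand-in for io.netty.util.internal.PlatformDependent,
    // which declares DIRECT_MEMORY_COUNTER as a private static field.
    static class MemoryCounterHolder {
        private static final AtomicLong DIRECT_MEMORY_COUNTER = new AtomicLong(42);
    }

    /** Reads a private static AtomicLong field from the given class. */
    static AtomicLong readCounter(Class<?> owner, String fieldName) {
        try {
            Field field = owner.getDeclaredField(fieldName);
            field.setAccessible(true);           // the field is private, so open it up
            return (AtomicLong) field.get(null); // static field: no instance needed
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot read field: " + fieldName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readCounter(MemoryCounterHolder.class, "DIRECT_MEMORY_COUNTER").get());
    }
}
```

Against a real Netty service you would pass PlatformDependent.class instead of the stand-in; note that on newer JDKs such reflective access may additionally require --add-opens.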
Author's note, added later: the value of DIRECT_MEMORY_COUNTER can actually be read directly via PlatformDependent.usedDirectMemory(); no reflection is needed.
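Either way the reading is only useful if it is sampled continuously. A sketch of a periodic reporter (class name is mine; the Supplier would be wired to PlatformDependent::usedDirectMemory in a real Netty service, here it is left abstract so the sketch stays self-contained):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Periodically logs a direct-memory reading supplied by the caller. */
public class DirectMemoryReporter {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final Supplier<Long> usedDirectMemory;

    public DirectMemoryReporter(Supplier<Long> usedDirectMemory) {
        this.usedDirectMemory = usedDirectMemory;
    }

    /** Formats one reading; split out so it can be logged or asserted on. */
    public String snapshot() {
        return "direct memory used: " + usedDirectMemory.get() + " bytes";
    }

    /** Starts printing a snapshot every periodSeconds until stop() is called. */
    public void start(long periodSeconds) {
        scheduler.scheduleAtFixedRate(
                () -> System.out.println(snapshot()),
                0, periodSeconds, TimeUnit.SECONDS);
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}
```

Usage in a Netty service would be `new DirectMemoryReporter(PlatformDependent::usedDirectMemory).start(30)`; charting these samples over time is what distinguishes a steady leak from a one-off traffic burst like the one in this incident.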