一则进程写日志导致磁盘IO高的案例分析
Contents
一 背景介绍
源于Linux性能优化的 [26 | 案例篇:如何找出狂打日志的“内鬼”?],一个跑在docker容器中的Python应用,持续不断的写日志,导致磁盘IO一直很高,进而影响系统性能。我这里根据自己的思路重新复盘模拟一次该案例的场景。在一台24C 64GB内存,配备SSD盘的物理机,CentOS 7.4的操作系统上操作。
二 分析步骤和思路
1 启动container
[root@centos-master ~]# docker run -itd --name=app -v /tmp/:/tmp feisky/logapp Unable to find image 'feisky/logapp:latest' locally Trying to pull repository docker.io/feisky/logapp ... latest: Pulling from docker.io/feisky/logapp 4fe2ade4980c: Pull complete 7cf6a1d62200: Pull complete d16bb326b4ba: Downloading [=================> ] 8.379 MB/24.55 MB ed7df901e9a8: Download complete d7e55b1728aa: Download complete e355358dada7: Pull complete Digest: sha256:8dfbc3116680c581d20f58d63a2353b9a7e67a0442dd43803d7dad43c6d4ac0a Status: Downloaded newer image for docker.io/feisky/logapp:latest 7f058667d924e337e8b7db93ecf28048b8226c76050d4c6f936fab77889ddb2f [root@centos-master ~]# docker ps CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 7f058667d924 feisky/logapp "python /app.py" 24 seconds ago Up 23 seconds app 15137d822209 luksa/kubia-unhealthy "node app.js" 12 days ago Up 12 days kubia-unhealthy f1df8f14c584 registry "/entrypoint.sh /e..." 2 weeks ago Up 2 weeks 0.0.0.0:5000->5000/tcp naughty_newton [root@centos-master ~]#
2 top查看系统性能指标
[root@centos-master ~]# top top - 10:36:15 up 1139 days, 21:07, 2 users, load average: 1.58, 0.60, 0.79 Tasks: 320 total, 2 running, 318 sleeping, 0 stopped, 0 zombie %Cpu(s): 1.1 us, 3.0 sy, 0.0 ni, 90.8 id, 5.1 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem : 65928364 total, 7923768 free, 40700180 used, 17304416 buff/cache KiB Swap: 0 total, 0 free, 0 used. 12227536 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ PPID UID COMMAND 14327 root 20 0 963372 932328 3304 R 89.8 1.4 1:10.86 14308 0 python /app.py 11426 root 20 0 0 0 0 S 2.0 0.0 1:45.41 2 0 [kworker/u66:2] 23586 root 20 0 25.8g 25.3g 6832 S 1.0 40.3 9866:02 1 0 mongod -f /etc/mongodb.conf 14418 root 20 0 157988 2488 1568 R 0.7 0.0 0:00.18 14392 0 top
24CPU的机器,负载并不高。内存占的比较大。进程号为14327的进程,占了比较高的CPU资源。top中通过数字键1调出CPU的资源使用情况:
[root@centos-master ~]# top top - 10:50:25 up 1139 days, 21:21, 2 users, load average: 2.67, 2.39, 1.80 Tasks: 317 total, 2 running, 315 sleeping, 0 stopped, 0 zombie %Cpu0 : 0.0 us, 0.3 sy, 0.0 ni, 98.7 id, 0.7 wa, 0.0 hi, 0.3 si, 0.0 st %Cpu1 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu2 : 0.3 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu3 : 0.3 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu4 : 0.3 us, 0.3 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu5 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu6 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu7 : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu8 : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu9 : 20.6 us, 54.2 sy, 0.0 ni, 25.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu10 : 0.7 us, 0.0 sy, 0.0 ni, 25.2 id, 74.2 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu11 : 0.0 us, 1.7 sy, 0.0 ni, 28.4 id, 69.6 wa, 0.0 hi, 0.3 si, 0.0 st %Cpu12 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu13 : 0.3 us, 0.3 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu14 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu15 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu16 : 0.3 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu17 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu18 : 0.3 us, 0.3 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu19 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu20 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu21 : 0.3 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu22 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu23 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem : 65928364 total, 8146796 free, 40467680 used, 17313888 buff/cache KiB Swap: 0 total, 0 free, 0 used. 12452092 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ PPID UID COMMAND 14327 root 20 0 963372 735988 3304 R 74.8 1.1 12:27.35 14308 0 python 11426 root 20 0 0 0 0 S 1.7 0.0 2:03.16 2 0 kworker/u66:2 26598 etcd 20 0 12.4g 60620 11428 S 1.3 0.1 2295:18 1 492 etcd 26773 kube 20 0 2494188 62844 21092 S 1.3 0.1 1864:10 1 493 kube-scheduler 14683 root 20 0 157956 2488 1568 R 0.7 0.0 0:00.11 14392 0 top
有部分CPU的IOwait高一点儿,其它看起来都正常。
其实,到这里我们就基本上知道有异常进程了,14327。可以去看该进程具体在干嘛了???直接到第5步骤了,但是,我们可以多角度完整的跟踪分析。
3 pidstat查看进程的资源消耗
看到占用CPU资源高的进程是14327,我们来观察它的资源消耗:pidstat -p 14327
[root@centos-master ~]# pidstat -p 14327 1 Linux 3.10.0-514.el7.x86_64 (centos-master) 2022年03月18日 _x86_64_ (24 CPU) 10时54分06秒 UID PID %usr %system %guest %CPU CPU Command 10时54分07秒 0 14327 23.00 67.00 0.00 90.00 7 python 10时54分08秒 0 14327 19.00 71.00 0.00 90.00 7 python 10时54分09秒 0 14327 25.00 64.00 0.00 89.00 7 python 10时54分10秒 0 14327 19.00 53.00 0.00 72.00 7 python 10时54分11秒 0 14327 10.00 65.00 0.00 75.00 7 python 10时54分12秒 0 14327 29.00 61.00 0.00 90.00 7 python 10时54分13秒 0 14327 27.72 62.38 0.00 90.10 7 python 10时54分14秒 0 14327 16.00 73.00 0.00 89.00 7 python 10时54分15秒 0 14327 25.00 65.00 0.00 90.00 7 python 10时54分16秒 0 14327 21.00 59.00 0.00 80.00 7 python 10时54分17秒 0 14327 10.00 38.00 0.00 48.00 3 python ^C 平均时间: 0 14327 20.44 61.67 0.00 82.11 - python [root@centos-master ~]#
的确,看到该进程占了较高的CPU资源。我们通过pidstat -d -l -p 14327 1来查看它的磁盘IO信息,同时-l表示打印它执行的命令和参数:
[root@centos-master ~]# pidstat -d -l -p 14327 1 Linux 3.10.0-514.el7.x86_64 (centos-master) 2022年03月18日 _x86_64_ (24 CPU) 11时00分59秒 UID PID kB_rd/s kB_wr/s kB_ccwr/s Command 11时01分00秒 0 14327 0.00 318540.00 0.00 python /app.py 11时01分01秒 0 14327 0.00 0.00 479236.00 python /app.py 11时01分02秒 0 14327 0.00 344500.00 0.00 python /app.py 11时01分03秒 0 14327 0.00 307472.00 0.00 python /app.py 11时01分04秒 0 14327 0.00 269632.00 700420.00 python /app.py 11时01分05秒 0 14327 0.00 384912.00 0.00 python /app.py 11时01分06秒 0 14327 0.00 445024.00 0.00 python /app.py 11时01分07秒 0 14327 0.00 91668.00 312872.00 python /app.py ^C 平均时间: 0 14327 0.00 270218.50 186566.00 python /app.py [root@centos-master ~]#
看到该进程正在执行大量的写操作,平均每秒写差不多270MB。
这里,我们再进一步分析一下系统的整体IO指标。
4 iostat查看系统IO指标
[root@centos-master ~]# iostat -dx 1 Linux 3.10.0-514.el7.x86_64 (centos-master) 2022年03月18日 _x86_64_ (24 CPU) Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdb 0.00 4.20 0.06 4.19 4.78 57.96 29.56 0.04 9.48 3.74 9.56 3.80 1.61 sdc 0.00 0.48 0.09 2.32 4.60 47.78 43.33 0.01 5.24 54.24 3.24 1.25 0.30 sdd 0.00 0.01 0.00 0.00 0.00 0.02 249.21 0.00 60.17 0.52 61.92 0.45 0.00 sda 0.00 0.00 0.13 0.06 15.98 7.16 243.97 0.00 12.52 0.97 36.74 0.32 0.01 dm-0 0.00 0.00 0.06 8.39 4.78 57.96 14.86 0.04 4.36 3.95 4.36 1.91 1.61 dm-1 0.00 0.00 0.00 0.00 0.00 0.00 26.32 0.00 6.82 6.82 0.00 6.07 0.00 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdb 0.00 2.00 0.00 441.00 0.00 55724.00 252.72 54.40 151.57 0.00 151.57 1.05 46.20 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 295.00 0.00 36904.00 250.20 54.80 228.38 0.00 228.38 1.57 46.20 dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdb 0.00 0.00 0.00 567.00 0.00 71212.00 251.19 89.14 139.16 0.00 139.16 1.10 62.30 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 706.00 0.00 89004.00 252.14 89.88 112.40 0.00 112.40 0.88 62.40 dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdb 0.00 1.00 0.00 1006.00 0.00 128768.00 256.00 145.22 142.77 0.00 142.77 0.99 100.00 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 1028.00 0.00 131088.00 255.04 146.82 140.68 0.00 140.68 0.97 100.00 dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ^C [root@centos-master ~]#
系统有大量的IO操作集中在sdb和dm-0设备上,它的%util持续偏高。印证了前面的进程在进行大量IO操作。
5 跟踪进程调用链
当我们知道14327的进程异常时,我们可以通过strace工具来跟踪它的系统调用链。
[root@centos-master ~]# strace -p 14327 strace: Process 14327 attached mmap(NULL, 314576896, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4b7e9c3000 write(4, "2022-03-18 03:08:40,212 - __main"..., 314572844) = 314572844 munmap(0x7f4b7e9c3000, 314576896) = 0 write(4, "\n", 1) = 1 munmap(0x7f4b915c4000, 314576896) = 0 select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout) getpid() = 1 mmap(NULL, 314576896, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4b915c4000 mmap(NULL, 393220096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4b79ec3000 mremap(0x7f4b79ec3000, 393220096, 314576896, MREMAP_MAYMOVE) = 0x7f4b79ec3000 munmap(0x7f4b915c4000, 314576896) = 0 lseek(4, 0, SEEK_END) = 943718535 lseek(4, 0, SEEK_CUR) = 943718535 munmap(0x7f4b79ec3000, 314576896) = 0 close(4) = 0 stat("/tmp/logtest.txt.1", {st_mode=S_IFREG|0644, st_size=943718535, ...}) = 0 unlink("/tmp/logtest.txt.1") = 0 stat("/tmp/logtest.txt", {st_mode=S_IFREG|0644, st_size=943718535, ...}) = 0 rename("/tmp/logtest.txt", "/tmp/logtest.txt.1") = 0 open("/tmp/logtest.txt", O_WRONLY|O_CREAT|O_APPEND|O_CLOEXEC, 0666) = 4 fcntl(4, F_SETFD, FD_CLOEXEC) = 0 fstat(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 lseek(4, 0, SEEK_END) = 0 ioctl(4, TIOCGWINSZ, 0x7ffecd67a520) = -1 ENOTTY (Inappropriate ioctl for device) lseek(4, 0, SEEK_CUR) = 0 ioctl(4, TIOCGWINSZ, 0x7ffecd67a440) = -1 ENOTTY (Inappropriate ioctl for device) lseek(4, 0, SEEK_CUR) = 0 mmap(NULL, 314576896, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4b915c4000 mmap(NULL, 314576896, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4b7e9c3000 write(4, "2022-03-18 03:08:41,104 - __main"..., 314572844) = 314572844 munmap(0x7f4b7e9c3000, 314576896) = 0 write(4, "\n", 1) = 1 munmap(0x7f4b915c4000, 314576896) = 0 select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout) getpid() = 1 mmap(NULL, 314576896, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4b915c4000 mmap(NULL, 393220096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4b79ec3000
看到,它调用系统函数write向文件描述符为4的文件写数据,大小为314572844字节。然后,调用了系统函数stat(),unlink(),rename(),open(),参数指向了/tmp/logtest.txt文件。
[root@centos-master ~]# ll /tmp/logtest.txt* -rw-r--r-- 1 root root 629145690 3月 18 11:11 /tmp/logtest.txt -rw-r--r-- 1 root root 943718535 3月 18 11:11 /tmp/logtest.txt.1 [root@centos-master ~]#
观察一下,该文件不断的在变化,这里其实是在轮询写日志。
6 跟踪进程打开的文件信息
我们通过lsof跟踪进程打开的文件信息:
[root@centos-master ~]# lsof -p 14327 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME python 14327 root cwd DIR 0,60 4096 2301835634 / python 14327 root rtd DIR 0,60 4096 2301835634 / python 14327 root txt REG 253,0 28016 5506213 /usr/local/bin/python3.7 python 14327 root mem REG 253,0 5506942 /usr/local/lib/python3.7/lib-dynload/_queue.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory) python 14327 root mem REG 253,0 5506940 /usr/local/lib/python3.7/lib-dynload/_pickle.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory) python 14327 root mem REG 253,0 5506951 /usr/local/lib/python3.7/lib-dynload/_struct.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory) python 14327 root mem REG 253,0 5506973 /usr/local/lib/python3.7/lib-dynload/select.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory) python 14327 root mem REG 253,0 5506948 /usr/local/lib/python3.7/lib-dynload/_socket.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory) python 14327 root mem REG 253,0 5506943 /usr/local/lib/python3.7/lib-dynload/_random.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory) python 14327 root mem REG 253,0 5506910 /usr/local/lib/python3.7/lib-dynload/_bisect.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory) python 14327 root mem REG 253,0 5506946 /usr/local/lib/python3.7/lib-dynload/_sha3.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory) python 14327 root mem REG 253,0 5506911 /usr/local/lib/python3.7/lib-dynload/_blake2.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory) python 14327 root mem REG 253,0 5505192 /lib/libcrypto.so.43.0.1 (stat: No such file or directory) python 14327 root mem REG 253,0 5506931 /usr/local/lib/python3.7/lib-dynload/_hashlib.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory) python 14327 root mem REG 253,0 5506965 /usr/local/lib/python3.7/lib-dynload/math.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory) python 14327 root mem REG 253,0 5506932 /usr/local/lib/python3.7/lib-dynload/_heapq.cpython-37m-x86_64-linux-gnu.so (stat: No such file or directory) python 14327 root mem REG 253,0 5505134 /etc/localtime (path inode=14157303) python 14327 root mem REG 253,0 5506322 /usr/local/lib/libpython3.7m.so.1.0 (stat: No such file or directory) python 14327 root mem REG 253,0 5505189 /lib/ld-musl-x86_64.so.1 (stat: No such file or directory) python 14327 root 0u CHR 136,2 0t0 5 /2 python 14327 root 1u CHR 136,2 0t0 5 /2 python 14327 root 2u CHR 136,2 0t0 5 /2 python 14327 root 3r CHR 1,9 0t0 2301848644 /dev/urandom python 14327 root 4w REG 253,0 314572845 5123176 /tmp/logtest.txt [root@centos-master ~]#
果然,看到它的FD列,即文件描述符为4w,表示是文件描述符为4的文件,w表示以写的方式操作文件。大小为314572845。
到这里,基本确定该进程在干嘛了。
7 查看docker容器应用源码
查看运行的docker container:
[root@centos-master ~]# docker ps CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 7f058667d924 feisky/logapp "python /app.py" 41 minutes ago Up 41 minutes app 15137d822209 luksa/kubia-unhealthy "node app.js" 12 days ago Up 12 days kubia-unhealthy f1df8f14c584 registry "/entrypoint.sh /e..." 2 weeks ago Up 2 weeks 0.0.0.0:5000->5000/tcp naughty_newton [root@centos-master ~]#
进入container,查看应用代码:
[root@centos-master ~]# docker exec -it app /bin/sh / # pwd / # ls app.py bin dev etc home lib media mnt proc root run sbin srv sys tmp usr var / # cat app.py #!/usr/bin/env python # -*- coding: UTF-8 -*- import logging import random import string import signal import time from logging.handlers import RotatingFileHandler logger = logging.getLogger(__name__) logger.setLevel(level=logging.INFO) rHandler = RotatingFileHandler( "/tmp/logtest.txt", maxBytes=1024 * 1024 * 1024, backupCount=1) rHandler.setLevel(logging.INFO) formatter = logging.Formatter( '%(asctime)s - %(name)s - %(levelname)s - %(message)s') rHandler.setFormatter(formatter) logger.addHandler(rHandler) def set_logging_info(signal_num, frame): '''Set loging level to INFO when receives SIGUSR1''' logger.setLevel(logging.INFO) def set_logging_warning(signal_num, frame): '''Set loging level to WARNING when receives SIGUSR2''' logger.setLevel(logging.WARNING) def get_message(N): '''Get message for logging''' return N * ''.join( random.choices(string.ascii_uppercase + string.digits, k=1)) def write_log(size): '''Write logs to file''' message = get_message(size) while True: logger.info(message) time.sleep(0.1) signal.signal(signal.SIGUSR1, set_logging_info) signal.signal(signal.SIGUSR2, set_logging_warning) if __name__ == '__main__': msg_size = 300 * 1024 * 1024 write_log(msg_size) / #
这里,看了app.py代码,在写日志,日志级别有不同的调整方式。
def set_logging_info(signal_num, frame): '''Set loging level to INFO when receives SIGUSR1''' logger.setLevel(logging.INFO) def set_logging_warning(signal_num, frame): '''Set loging level to WARNING when receives SIGUSR2''' logger.setLevel(logging.WARNING)
8 修改应用的日志级别
[root@centos-master ~]# kill -l 1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL 5) SIGTRAP 6) SIGABRT 7) SIGBUS 8) SIGFPE 9) SIGKILL 10) SIGUSR1 11) SIGSEGV 12) SIGUSR2 13) SIGPIPE 14) SIGALRM 15) SIGTERM 16) SIGSTKFLT 17) SIGCHLD 18) SIGCONT 19) SIGSTOP 20) SIGTSTP 21) SIGTTIN 22) SIGTTOU 23) SIGURG 24) SIGXCPU 25) SIGXFSZ 26) SIGVTALRM 27) SIGPROF 28) SIGWINCH 29) SIGIO 30) SIGPWR 31) SIGSYS 34) SIGRTMIN 35) SIGRTMIN+1 36) SIGRTMIN+2 37) SIGRTMIN+3 38) SIGRTMIN+4 39) SIGRTMIN+5 40) SIGRTMIN+6 41) SIGRTMIN+7 42) SIGRTMIN+8 43) SIGRTMIN+9 44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13 48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12 53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9 56) SIGRTMAX-8 57) SIGRTMAX-7 58) SIGRTMAX-6 59) SIGRTMAX-5 60) SIGRTMAX-4 61) SIGRTMAX-3 62) SIGRTMAX-2 63) SIGRTMAX-1 64) SIGRTMAX [root@centos-master ~]# kill -SIGUSR2 14327 [root@centos-master ~]#
然后,top和iostat监控系统的性能指标,都正常了。
三小结
这个案例较为简单,从top上看到异常进程,我们可以立即跟踪该进程在执行什么,进而找到线索和思路。当然,这里,通过iostat -dx 1和pidstat -d -l -p the_pid 1;来看了一下系统的IO指标,发现有异常的IO请求操作。最后,通过strace -p the_pid;和lsof -p the_pid找到进程的系统调用栈和打开文件句柄。最终,确定进程是在持续记日志,并从app.py源码中,印证定位和猜测。
个人收获,strace -p the_pid;和lsof -p the_pid;以前没有深入的使用,这里值得再研究。