What is the natager task?



  • Hi,

    Does anyone know what “natager” is in the “get os task” output? This task on my NS500 is occupying more than 50% of the CPU. Is it anything related to NAT?

    Thanks!


  • Engineer

    Maxpipe is 100% correct. Natager is the process that cleans up expired sessions in your session table. Basically, every other second it scans the session table and tears down sessions whose timers have expired.

    So from a CPU point of view it will look like this (pulled from your data above):
    59: 80(85  5)**  58: 60(43 27)*  57: 83(91  2)**  56: 74(74 10)**

    At the 56-second mark it ran. As you can see, Natager uses the task CPU (for those of you with two-CPU firewalls). It then ran again at 58 seconds. This is why, when you have a large number of sessions being torn down, your task CPU jumps around every other second: that is when the Natager process triggers and starts its cleanup duties.

    One thing to monitor with Natager: if you start seeing it take longer than one second (in other words, the task CPU sits at, say, 80% for 2-4 seconds in a row), then you could be in bad shape and not tearing down all the sessions. I have seen this happen, and in that case you must trigger Natager manually (via a hidden command).
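    A quick way to keep an eye on this (my usual habit; the session command syntax is from memory, so double-check it): alternate the two commands below and watch whether the session count keeps climbing while the task CPU stays pinned.

    get session info
    get performance cpu all detail

    If the allocated-session count grows steadily while task CPU stays high for several seconds in a row, Natager is probably not keeping up with the teardown work.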

    From the data you provided it appears (once again, just agreeing with Max here) that it’s all flow CPU. I would recommend running a debug flow basic and looking at that data. Please take the utmost caution when running a debug, as it will max out your firewall CPU and impact your network. Run it from the console port and only run it for about 2 seconds. To analyze the debug flow basic output, just use the program in my signature. It will tell you exactly what kind of traffic is passing through your firewall (and hopefully help you pinpoint a root cause).
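    For anyone who has not run one before, a capture sequence along these lines should do it (the ffilter line is optional, and 10.1.1.5 is only a placeholder for a host you want to isolate):

    clear dbuf
    set ffilter src-ip 10.1.1.5
    debug flow basic
    undebug all
    get dbuf stream

    Start the debug, let it run for those 2 seconds, then stop it with “undebug all”; “get dbuf stream” dumps what was captured, and that is what you feed to the analyzer.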

    Good luck!

    Tim Eberhard



    It looks like you are seeing some fragmentation, though not a really huge amount: the delta between your two samples works out to 41,031,686 − 41,031,470 = 216 fragments received in about a minute, or roughly 3-4 per second. Reducing or eliminating that fragmentation could definitely help, as fragmentation has been known to drive up flow CPU. Do you have IPSec tunnels configured? If so, try lowering the tcp-mss for tunneled traffic:

    set flow tcp-mss 1350

    See if that stops the “fragments received” counter from incrementing. If you do not have IPSec VPNs, or if you still see fragmentation even after entering the above command, then also try this command:

    set flow all-tcp-mss

    This will affect all non-encrypted traffic. (If I recall the syntax right, all-tcp-mss also expects an MSS value, e.g. 1350.) Also, I forgot to ask: do you have any counting enabled on your policies? How about any deep inspection or URL filtering? If so, try disabling all of those and see if the CPU drops.

    Finally, when you say 2000 pps, are you including ALL interfaces or only one? 50 Mb/s is not really a useful figure for judging high CPU, because packet sizes vary considerably. For flow CPU performance, the PPS is really the most important value to look at.
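    To illustrate why (example numbers, not from your box): 50 Mb/s of 1500-byte packets is only about 50,000,000 / (1500 × 8) ≈ 4,200 pps, while 50 Mb/s of 64-byte packets is about 50,000,000 / (64 × 8) ≈ 97,700 pps, over 20 times the per-packet work for the same bit rate.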



  • Hi MaxPipeline,

    The traffic passing through the NS500 is quite low, around 50 Mb/s. The PPS is around 2000.

    Get sess frag shows:
    NS500-> get session frag
    Max 65536 fcbs in the system, 0 fcbs are in use.
    Max 957 fragments can be queued, 0 fragments are queued now.
    Total 41031470 fragments received.
    Total 41015349 fragments passed defrag.
    Total 16121 fragments failed in defrag.
    Total 0 fragments overlap happen.
    Total 20392020 are 1st fragments.
    Total 20638354 are non-1st fragments.
    Total 409193 are out-of-order fragments.
    Total 13147 fragments are aged out.

    ====== after 1 minute

    NS500-> get session frag
    Max 65536 fcbs in the system, 0 fcbs are in use.
    Max 957 fragments can be queued, 0 fragments are queued now.
    Total 41031686 fragments received.
    Total 41015565 fragments passed defrag.
    Total 16121 fragments failed in defrag.
    Total 0 fragments overlap happen.
    Total 20392128 are 1st fragments.
    Total 20638462 are non-1st fragments.
    Total 409193 are out-of-order fragments.
    Total 13147 fragments are aged out.
    NS500->

    Anything strange? Thanks!



  • Definitely the flow CPU is high. Example:

    59: 80(85  5)**

    The 59 means this was at second 59, i.e. 1 second ago. The 80 is the overall CPU. The numbers in parentheses (85 5) show how much is flow and how much is task: the first value is flow and the second is task. So as you can see, most of your CPU utilization is a result of flow, i.e. transit traffic.

    That said, the next thing to look for is the packets per second the NS500 is handling across all interfaces. Do you have any network monitoring like MRTG? If not, sample “get counter stats” at regular intervals and divide the change in the “in packets” counter by the number of seconds between samples. Do this for all interfaces. Once you have the PPS calculated, let us know.
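    For example (with made-up numbers): if “in packets” on ethernet1 reads 1,500,000 and a second sample taken 60 seconds later reads 1,620,000, that interface is doing (1,620,000 − 1,500,000) / 60 = 2,000 pps. Repeat for each interface and add them up to get the box-wide figure.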

    Also check for fragmentation with “get sess frag”. If you need help with that, you may want to open a case with JTAC for assistance, as they deal with these types of problems often. You can also search kb.juniper.net for articles on high flow CPU.

    Hope this helps.



  • Hi MaxPipeline,

    Does this mean the flow CPU is high? If so, what can I do to reduce the CPU?

    BTW: We have another NS500 running in a similar situation and the performance is fine.

    Thanks.



    Legato ran for 5 minutes (and also last night); here is the output:

    NS500-> get performance cpu all detail
    Average System Utilization: 52% (50 10)
    Last 60 seconds:
    59: 80(85  5)**  58: 60(43 27)*  57: 83(91  2)**  56: 74(74 10)**
    55: 75(75 10)**  54: 76(77  9)**  53: 65(53 22)*  52: 82(90  2)**
    51: 78(81  7)**  50: 62(47 25)*  49: 71(67 14)**  48: 82(90  2)**
    47: 75(77  8)**  46: 60(44 26)*  45: 81(89  2)**  44: 65(54 21)*
    43: 67(58 19)*  42: 79(85  4)**  41: 81(88  3)**  40: 63(49 24)*
    39: 66(57 19)*  38: 81(88  3)**  37: 66(58 18)*  36: 75(78  7)**
    35: 70(68 12)*  34: 67(64 13)*  33: 67(67 10)*  32: 64(58 16)*
    31: 69(70  9)*  30: 65(59 16)*  29: 65(58 17)*  28: 72(74  8)**
    27: 73(78  5)**  26: 61(51 20)*  25: 67(68  9)*  24: 65(59 16)*
    23: 53(52 11)*  22: 66(62 14)*  21: 74(82  2)**  20: 57(45 22)*
    19: 56(47 19)*  18: 68(70  8)*  17: 64(58 16)*  16: 63(53 20)*
    15: 75(77  8)**  14: 61(46 25)*  13: 68(61 17)*  12: 80(88  2)**
    11: 71(69 12)**  10: 70(69 11)*    9: 64(61 13)*    8: 70(69 11)*
    7: 66(57 19)*    6: 78(84  4)**  5: 69(64 15)*    4: 70(65 15)*
    3: 61(69  2)*    2: 57(45 22)*    1: 65(73  2)*    0: 63(49 24)*

    Last 60 minutes:
    59: 67(64 13)*  58: 70(68 11)*  57: 70(68 12)*  56: 71(69 11)**
    55: 70(67 12)*  54: 73(71 11)**  53: 72(70 11)**  52: 66(64 12)*
    51: 65(62 12)*  50: 38(36 12)    49: 45(42 12)    48: 46(44 12)
    47: 57(55 12)*  46: 52(49 12)*  45: 43(41 12)    44: 44(42 12)
    43: 37(35 11)    42: 44(41 12)    41: 47(45 12)    40: 40(38 12)
    39: 39(37 12)    38: 49(47 12)    37: 51(49 12)*  36: 54(51 12)*
    35: 44(41 12)    34: 45(43 12)    33: 40(37 12)    32: 48(45 12)
    31: 49(47 12)    30: 41(39 12)    29: 40(38 12)    28: 41(39 12)
    27: 47(45 11)    26: 34(32 12)    25: 36(34 12)    24: 39(37 11)
    23: 37(35 12)    22: 41(39 12)    21: 44(42 11)    20: 39(37 12)
    19: 34(32 12)    18: 38(36 12)    17: 40(39 11)    16: 32(30 12)
    15: 34(32 11)    14: 43(40 12)    13: 32(30 11)    12: 35(33 12)
    11: 34(32 11)    10: 32(30 12)    9: 31(29 12)    8: 34(32 12)
    7: 38(37 11)    6: 35(33 12)    5: 33(32 11)    4: 29(27 11)
    3: 35(34 11)    2: 24(22 11)    1: 22(20 11)    0: 24(22 11)

    Last 24 hours:
    23: 28(26 11)    22:  9( 5 10)    21: 49(47 10)    20: 71(70 10)**
    19: 73(72 10)**  18: 74(74  9)**  17: 74(74 10)**  16: 64(63 10)*
    15: 69(68  9)*  14: 75(75  9)**  13: 75(76  9)**  12: 54(53 10)*
    11: 32(30 11)    10: 58(56 11)*    9: 61(59 11)*    8: 66(64 11)*
    7: 59(57 11)*    6: 46(44 11)    5: 45(43 11)    4: 61(58 11)*
    3: 56(54 11)*    2: 39(37 11)    1: 11( 6 10)    0:  9( 4 10)



    Can you run “get perf cpu all detail” while the legato backup is running? I suspect the issue is not with the natager task per se. More likely the flow CPU is high, which can cause packet loss. The command shows both flow and task CPU, so you will know which one is high.



  • Thanks for your reply.

    The problem is that the CPU utilization rises above 80% with around 2% packet loss. It happens when legato is running backups of some servers across the NS. Is there any known problem between legato and the NS500?

    BTW: I did a test passing the legato traffic through a 5XT. While the backup was running, the 5XT only hit 6% CPU utilization.



    The name Natager is somewhat misleading. This task is responsible for the proper aging out of sessions in your session table. It is not uncommon for this task to run high, especially if you have a large session table. Are you experiencing any issues? Keep in mind that since natager is a task, you will likely not see any throughput issues: the CPU yields to flow rather than to tasks, and flow is what handles transit traffic, so even if task CPU is high, traffic will still move OK.

    So what issues are you seeing exactly?



  • My guess is it’s probably the process which monitors NAT allocations and ages them out.


 
