Quick look at Trio ddos-protection with flow-detection

from http://blog.ip.fi/2014/03/quick-look-at-trio-ddos-protection-with.html

Some things are easy to protect with an iACL and a lo0 ACL, but others, like BGP, are really hard: you need to allow BGP from customers and from the core, and it's not convenient or practical to handle them separately in a lo0 ACL + policer. Luckily JunOS has a feature called flow-detection, which you turn on with set system ddos-protection global flow-detection.
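
In configuration mode that is just the one statement; a minimal sketch with a generic prompt (the hostname here is a placeholder, not the test box used below):

{master}[edit]
user@router# set system ddos-protection global flow-detection
user@router# commit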

I'm sending a DoS from a single source to lo0, and my iBGP immediately goes down. After I turn on flow-detection, iBGP connectivity is restored. Looking at the PFE, we can see what is happening:

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flows
pfe  idx  rindex  prot  aggr  IIF/IFD  pkts      bytes       source-info
---  ---  ------  ----  ----  -------  --------  ----------  -----------
0    0    721     1400  sub   338      21        79161       c158ef22 c158ef1f 53571 179
0    1    2679    1400  sub   356      11159404  2187242988  64640102 c158ef1f 179 179
0    2    2015    1400  sub   338      29        112468      c158ef23 c158ef1f 179 65020

Pretty nice and clear: 64640102 => c158ef1f (that is, 100.100.1.2 => 193.88.239.31) is our attack traffic, and it's getting its own policer; iBGP is stable, and the attack traffic is policed separately. Let's check that flow and its policers more closely:

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flow-rindex 0 2679

PFE: 0
Flow Record Index: 2679
Flow Key:
Proto-ID: 1400
Key type: 1
IIF: 356
Src IP addr: 64640102 (100.100.1.2)
Dst IP addr: c158ef1f (193.88.239.31)
Src port: 179
Dst port: 179
Flow Context Data:
Rcvd ack_add: 1
Rcvd ack_del: 0
Rcvd last flow op: 2
Flow state: 2
Aggr level: 0
Proto idx: 79
Policer idx: 3885
Time inserted: 3484405383
Time last violated: 3484485825
Last received: 42480417
Flow Statitics:
Packet Count: 42655279
Byte Count: 8360434488

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos policer bgp stats
DDOS Policer Statistics:

                                                                    arrival  pass  # of
idx  prot  group        proto      on  loc    pass      drop        rate     rate  flows
---  ----  -----------  ---------  --  -----  --------  ----------  -------  ----  -----
79   1400  bgp          aggregate  Y   UKERN  8378805   0           9        9     3
                                       PFE-0  8393411   2310827652  528095   9     3

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos policer bgp configuration
DDOS Policer Configuration:

                                            UKERN-Config  PFE-Config
idx  prot  group        proto      on  Pri  rate   burst  rate   burst
---  ----  -----------  ---------  --  ---  -----  -----  -----  -----
79   1400  bgp          aggregate  Y   Lo   8000   20000  8000   20000

MX104-ABB-0(test13nqa1-re0.dk vty)#

Looks good. What if we increase the sources from a /32 to a /16? Let's see:

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flows
pfe  idx   rindex  prot  aggr  IIF/IFD  pkts  bytes  source-info
---  ----  ------  ----  ----  -------  ----  -----  -----------
0    0     3083    1400  sub   356      80    15484  6464e743 c158ef1f 179 179
0    1     805     1400  sub   356      217   42336  64642df4 c158ef1f 179 179
0    2     672     1400  sub   356      291   56840  6464eb64 c158ef1f 179 179
0    3     3382    1400  sub   356      256   49980  6464ba66 c158ef1f 179 179
0    4     3956    1400  sub   356      82    15876  64643b3d c158ef1f 179 179
0    5     1999    1400  sub   356      92    17836  646403a6 c158ef1f 179 179
0    6     2808    1400  sub   356      22    4116   64642f33 c158ef1f 179 179

<..>

0    3455  866     1400  sub   356      80    15484  646475a1 c158ef1f 179 179
0    3456  4096    1400  sub   356      319   62328  646470a8 c158ef1f 179 179

MX104-ABB-0(test13nqa1-re0.dk vty)#

No mas! I expected to see 65537 or so, but instead I'm seeing just 3456. My iBGP also went down! Let's poke around a bit:

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd global-info
DDOS-SCFD global context
------------------------------------------------------
FLow entry/state/hash size: 288/12/8 bytes
Flow scan: Yes
Send async msg to RE: Yes
Send periodic update to RE: No
Default enabled: No
Enabled: Yes
Last aggr op is: Aggr
Next available flow id: 5b39
Culprit flows: 3902
Culprit flows on scan: 3902
Violated protocols: 1
Violated protocols on scan: 1
Violation report rate: 100(pps)
Flow change report rate: 100(pps)
Scan cookie: 588425
Free SCFD states: 4096
Free flow entries: 194
Free notif blocks: 400
Free re request blocks: 400
Free flow msg blocks: 4096
Free flow policers: 321
Socket notif queue size: 0
Has queued work state items: 0
Has queued re requests: 0
Has queued flow rate modifies: 0
Has queued flow messages: 0
Send packet size: 16384
Send batch size: 1
Last aggr op time: 2942210140
Per PFE flows: 0=3902
Run out of flows: 0
Reuse an entry not freed yet: 0
Run out of state items: 0
Bad proto ID: 0
rindex changed for same flow: 0
Remove flow on an empty proto: 0
Remove non-exist flow: 0
Read ASIC failed: 0
Failed tries write flow params: 0
Failed change flow params: 0
Run out of policers: 0
Run out of msg blocks: 0
Run out of mod flow blocks: 0
SCFD stats for PFE 0
Global configuration
violation report rate: 100
flow report rate: 100
Flow counters read from LU
current suspicious flows: 191
current culprit flows: 3905
discovered suspicious flows: 123802
discovered culprit flows: 23356
deleted culprit flows: 19451
false positives: 100255
hash insertion errors: 148669170
hash deletion errors: 0
max flow tbl scan time(ms): 0
debug values: 0
Flow reports received through PUNT
policer violation: 104836
flow found: 23353
flow timeout: 0
flow return to normal: 115
flow cleared: 19336
unknown reports: 0
bad flow type: 0
Violation indication policer stats
Passed indications: 21346
Dropped indications: 84304888

What immediately caught my attention was 'Free flow entries: 194'. Oops! We're running out of flow entries. But as we wait a few more seconds, my iBGP comes back up. Why?

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flows
pfe  idx  rindex  prot  aggr  IIF/IFD  pkts      bytes        source-info
---  ---  ------  ----  ----  -------  --------  -----------  -----------
0    0    3183    1400  ifl   338      96        220159       0 0 0 0
0    1    1439    1400  ifl   356      70851932  13886978476   0 0 0 0

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flow-rindex 0 3183

PFE: 0
Flow Record Index: 3183
Flow Key:
Proto-ID: 1400
Key type: 1
IIF: 338
Src IP addr: 0 (0.0.0.0)
Dst IP addr: 0 (0.0.0.0)
Src port: 0
Dst port: 0
Flow Context Data:
Rcvd ack_add: 1
Rcvd ack_del: 1
Rcvd last flow op: 3
Flow state: 3
Aggr level: 1
Proto idx: 79
Policer idx: 2105
Time inserted: 3485722214
Time last violated: 3485723825
Last received: 105
Flow Statitics:
Packet Count: 105
Byte Count: 221115

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd asic-flow-rindex 0 1439

PFE: 0
Flow Record Index: 1439
Flow Key:
Proto-ID: 1400
Key type: 1
IIF: 356
Src IP addr: 0 (0.0.0.0)
Dst IP addr: 0 (0.0.0.0)
Src port: 0
Dst port: 0
Flow Context Data:
Rcvd ack_add: 1
Rcvd ack_del: 0
Rcvd last flow op: 2
Flow state: 2
Aggr level: 1
Proto idx: 79
Policer idx: 2155
Time inserted: 3485637825
Time last violated: 3485792825
Last received: 81852797
Flow Statitics:
Packet Count: 82308537
Byte Count: 16132473056

MX104-ABB-0(test13nqa1-re0.dk vty)# show ddos scfd global-info
DDOS-SCFD global context
------------------------------------------------------
FLow entry/state/hash size: 288/12/8 bytes
Flow scan: Yes
Send async msg to RE: Yes
Send periodic update to RE: No
Default enabled: No
Enabled: Yes
Last aggr op is: Aggr
Next available flow id: 5b40
Culprit flows: 2
Culprit flows on scan: 2
Violated protocols: 1
Violated protocols on scan: 1
Violation report rate: 100(pps)
Flow change report rate: 100(pps)
Scan cookie: 588463
Free SCFD states: 4096
Free flow entries: 4094
Free notif blocks: 400
Free re request blocks: 400
Free flow msg blocks: 4096
Free flow policers: 4221
Socket notif queue size: 0
Has queued work state items: 0
Has queued re requests: 0
Has queued flow rate modifies: 0
Has queued flow messages: 0
Send packet size: 16384
Send batch size: 1
Last aggr op time: 2942210140
Per PFE flows: 0=2
Run out of flows: 0
Reuse an entry not freed yet: 0
Run out of state items: 0
Bad proto ID: 0
rindex changed for same flow: 0
Remove flow on an empty proto: 0
Remove non-exist flow: 0
Read ASIC failed: 0
Failed tries write flow params: 0
Failed change flow params: 0
Run out of policers: 0
Run out of msg blocks: 0
Run out of mod flow blocks: 0
SCFD stats for PFE 0
Global configuration
violation report rate: 100
flow report rate: 100
Flow counters read from LU
current suspicious flows: 0
current culprit flows: 2
discovered suspicious flows: 123806
discovered culprit flows: 23360
deleted culprit flows: 23358
false positives: 100446
hash insertion errors: 149071671
hash deletion errors: 0
max flow tbl scan time(ms): 0
debug values: 0
Flow reports received through PUNT
policer violation: 104836
flow found: 23360
flow timeout: 0
flow return to normal: 117
flow cleared: 23241
unknown reports: 0
bad flow type: 0
Violation indication policer stats
Passed indications: 21346
Dropped indications: 84304888

Wow! Only two flows left, and they've changed from sub to ifl level: the keys are all zero, with no source/destination IP match and no source/destination port match! But my iBGP seems to work, as my attack traffic (simulating a BGP customer) is coming in on a separate IFL from my iBGP. I guess it's a lot better than nothing, and a lot better than overflowing the 4k policers.

Yet, waiting a bit longer I see that my INET/INET6 iBGP is happy but my VPNv4/VPNv6 iBGP is unhappy; both are on the same IFL, which is a different one than the attack traffic. It turns out the default flow-level bandwidth for sub and ifl is 10pps, and for ifd it is 20kpps. My iBGP was 5 sessions and was periodically exceeding the 10pps IFL limit, and by default, once the flow limit is hit, you don't drop just the exceeding traffic (flow-control-mode police), you drop everything in the offending flow (flow-control-mode drop), which I think is a great default. But is sub==ifl==10pps a great default? I don't think so. Anyway, my iBGP IFL was hitting 10pps and everything was dropped for the duration of the violation. I'm not sure why it affected the 3x VPNv4/VPNv6 sessions but not INET/INET6 (same hold-time and keepalive), but it doesn't matter, probably a misconfig on my part. Once I added set system ddos-protection protocols bgp aggregate flow-level-bandwidth logical-interface 1000, VPNv4/VPNv6 stabilized.
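
In set form that was the only change at this point; a minimal sketch with a generic placeholder prompt, using the 1000pps value mentioned above:

{master}[edit]
user@router# set system ddos-protection protocols bgp aggregate flow-level-bandwidth logical-interface 1000
user@router# commit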

I noticed that for some reason I could no longer make the platform automatically switch from sub detection to ifl detection during a DoS. This meant my configuration was broken again, as the subscriber flows congested my hardware's 4k slots. I'm not sure why this is or how it should work. I ended up disabling 'sub' level detection with set system ddos-protection protocols bgp aggregate flow-level-detection subscriber off. I guess it would be nice to keep it on, if we could reliably and rapidly switch from sub=>ifl while the policers are congested, but right now it seems that if there is a danger of seeing 4k sub-level flows, it might be safest to just ignore them and protect only at the ifl level. My full config is now:

{master}[edit system ddos-protection protocols bgp]
fisakytt@test13nqa1-re0.dk# show
aggregate {
    bandwidth 8000;
    flow-level-bandwidth {
        logical-interface 100;
    }
    flow-level-detection {
        subscriber off;
    }
}
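
The same thing in set form, assuming the [edit system ddos-protection protocols bgp] hierarchy shown above:

set system ddos-protection protocols bgp aggregate bandwidth 8000
set system ddos-protection protocols bgp aggregate flow-level-bandwidth logical-interface 100
set system ddos-protection protocols bgp aggregate flow-level-detection subscriber off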

It seems to consistently protect my BGP from the attacks I'm interested in, but that's quite a bit of config if you're doing it for 200 DDoS-policer protocols (after first researching, for each relevant one, what you even should do). It might be nice if Juniper had a concept of 'protected flows' which always have priority: flows that have a manual hint in the configuration. Like in Cisco LPTS, where configuring BGP gets you a policer for that configured session, JNPR could easily do the same, configure those persistently, and in case of congestion never break those, only break policers among the dynamically detected flows. I'd also really want to figure out how the sub=>ifl fallback should work, and why it works sometimes and sometimes not.