NAT feature HLD #390

Merged
merged 25 commits into from Nov 6, 2019


kirankella
Contributor

This document describes the high-level design of the NAT feature in SONiC.

Signed-off-by: Kiran Kella kiran.kella@broadcom.com

Signed-off-by: Kiran Kella <kiran.kella@broadcom.com>
@kannankvs
Collaborator

How about the support for NAT ALGs?

@kannankvs
Collaborator

Regarding the reply to Guohan's question about the 5-tuple: I think the issue cannot be solved until ASICs/SAI support 5-tuple matching. At the same time, I think the kernel does not reuse a source port number until the unique port numbers are exhausted. If the user configures a large enough port range, it should work without needing to reuse source port numbers. A quick prototype with just 2 or 3 port numbers and a few TCP connections may help confirm how this Debian kernel actually behaves.
NOTE: I am unable to reply directly under Guohan's reply.

```
iptables -t nat -A POSTROUTING -p tcp -s 20.0.0.0/16 -j SNAT -o Ethernet15 --to-source 65.55.42.1:1024-65535 --random
iptables -t nat -A POSTROUTING -p udp -s 20.0.0.0/16 -j SNAT -o Ethernet15 --to-source 65.55.42.1:1024-65535 --random
iptables -t nat -A POSTROUTING -p icmp -s 20.0.0.0/16 -j SNAT -o Ethernet15 --to-source 65.55.42.1:1024-65535 --random
```
Contributor

Does the current design support ALGs (bump-in-the-wire) on SONiC?
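
The three SNAT rules in the snippet under review differ only in the protocol. As an illustration, a minimal sketch of how such rule strings could be generated from one pool definition. The function name `build_snat_rules` and its parameters are hypothetical and are not taken from the SONiC code; the sketch only builds strings and does not invoke iptables.

```python
def build_snat_rules(subnet, out_if, public_ip, port_range,
                     protocols=("tcp", "udp", "icmp")):
    """Return one iptables SNAT command string per protocol (illustrative only)."""
    low, high = port_range
    return [
        f"iptables -t nat -A POSTROUTING -p {proto} -s {subnet} -j SNAT "
        f"-o {out_if} --to-source {public_ip}:{low}-{high} --random"
        for proto in protocols
    ]


# Values copied from the snippet under review.
rules = build_snat_rules("20.0.0.0/16", "Ethernet15", "65.55.42.1", (1024, 65535))
for rule in rules:
    print(rule)
```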

@kirankella
Contributor Author

@kannankvs @madhupalu

> How about the support for NAT ALGs?

ALG support has to be handled in the slow path (in the Linux kernel) and cannot be handled in hardware. It requires the corresponding ALG kernel modules for the protocols (such as SIP, DNS, ...).
This is added as one of the future features to be supported in NAT.

Currently only the ICMP error reply packets are trapped to CPU to be handled by the ALG logic for ICMP packets in the kernel.

- Updated about random allocation from the port range, to minimize reuse of SNAT'ted port.

Signed-off-by: Kiran Kella <kiran.kella@broadcom.com>
@lguohan
Contributor

lguohan commented Jun 27, 2019

Since the zone concept is not available in the kernel, how do we represent the zone in iptables, or do we need that?

@kirankella
Contributor Author

kirankella commented Jun 28, 2019

> Since the zone concept is not available in the kernel, how do we represent the zone in iptables, or do we need that?

@lguohan We are not mapping the zone to any equivalent attribute in the kernel, since iptables does not take such an attribute. Since SNAT miss packets come up to the CPU only if a zone crossing happens, the zone filtering check is effectively taken care of in the hardware, and only the translation happens in iptables, based on the ACL-matching subnets.

@rlhui
Contributor

rlhui commented Jul 3, 2019

@kirankella, please add this behavior to the HLD. Also, the overall NAT feature behavior is tightly coupled with the NAT SAI pipeline. Wherever the resulting behavior does not exactly match what is described in the SONiC spec, because of differences between SW (SONiC) and HW (logically represented by the SAI pipeline), can we please mention it in the HLD? Thanks.

Handling mismatch between the Linux and Hardware NAT models.
Supporting Loopback IP as Public IP.

Signed-off-by: Kiran Kella <kiran.kella@broadcom.com>
@kirankella
Contributor Author

> @kirankella, please add this behavior to the HLD. Also, the overall NAT feature behavior is tightly coupled with the NAT SAI pipeline. Wherever the resulting behavior does not exactly match what is described in the SONiC spec, because of differences between SW (SONiC) and HW (logically represented by the SAI pipeline), can we please mention it in the HLD? Thanks.

Done. Updated section 3.4.

By default, L3 interface is in NAT zone 0 which we refer to as an inside interface.

NAT/NAPT is performed when packets matching the configured NAT ACLs cross between different zones.
The source zone of a packet is determined by the zone of the interface on which the packet arrived. The destination zone of the packet is determined by the zone of the L3 next-hop interface from the L3 route lookup of the destination.
Contributor

This line is ambiguous to me. From the schema, the zone is an L3 interface attribute. So does it mean that the destination zone is determined purely by the egress interface?

Contributor Author

The egress interface would also be an L3 interface, right? So the user would have configured the zone value on that outbound L3 interface (which happens to be the egress interface through which the nexthop is reached).

Contributor

Ok. I wanted to know where the out zone is configured exactly because it makes a difference for the other applications. We don't want to mix it with the next hops which are configured by other applications.

Contributor Author

I believe this is again SAI pipeline specific. Zone is configured only on L3 interface in the hardware. And the egress nexthop in the hardware picks the zone attribute from the nexthop L3 interface it points to.

Contributor

OK
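
The zone rule discussed in this thread can be sketched as follows: the source zone comes from the ingress L3 interface, the destination zone from the L3 next-hop interface, and translation is considered only when the two differ and the NAT ACL permits it. This is a simplified illustration, not the SONiC or SAI implementation; `needs_nat`, `zone_of`, and the interface name `Vlan3000` are hypothetical.

```python
DEFAULT_ZONE = 0  # per the HLD text, an L3 interface is in NAT zone 0 by default


def needs_nat(in_if, nexthop_if, zone_of, acl_permits=True):
    """Return True when a packet crosses zones and the NAT ACL permits translation."""
    src_zone = zone_of.get(in_if, DEFAULT_ZONE)        # zone of the ingress interface
    dst_zone = zone_of.get(nexthop_if, DEFAULT_ZONE)   # zone of the next-hop interface
    return acl_permits and src_zone != dst_zone


zones = {"Vlan2000": 0, "Ethernet15": 1}
print(needs_nat("Vlan2000", "Ethernet15", zones))  # inside -> outside
print(needs_nat("Vlan2000", "Vlan3000", zones))    # both in the default zone
```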

Currently only 2 zones are supported, which correspond to the inside interfaces and the outside interfaces.

Any inbound traffic ingressing on the outside interface that is L3 forwarded on to the inside interface, is configured by user via Static NAT/NAPT entries to be DNAT translated.
Any outbound traffic ingressing on the inside interface is configured to be dynamically SNAT translated.
Contributor

This statement contradicts the first paragraph of 2.2.4, which says that "the NAT translation happens when the packet crosses between them (zones)".
So what are the exact criteria for DNAT/SNAT?
In my opinion the criterion for DNAT is that the packet DIP matches one of the prefixes in DNAT_POOL, and for SNAT it is crossing the zones.
Can we have the exact criteria listed here?

Contributor Author

The statements above describe the typical use case, that is, the directions in which DNAT and SNAT are usually configured.
It is SAI pipeline specific that the DNAT_POOL match on the DIP should happen first, before NAT happens while crossing the zones. Hence it is not listed in this HLD.

Contributor

OK

NAT_BINDINGS|{{binding-name}}
"nat_pool": {{pool-name}}
"access_list": {{access-list-name}} (OPTIONAL)
"nat_type": {{snat-or-dnat}} (OPTIONAL)
Contributor

Why do we distinguish between NAT types here?

Contributor Author

NAT bindings are used for dynamic NAT/NAPT. A NAT type of SNAT means that the source addresses are dynamically translated when the bindings are applied. We currently support only the SNAT type for dynamic pool bindings.
In the future, we can also add support for dynamic DNAT (a use case being an FQDN service with session distribution at the firewall, see https://docs.paloaltonetworks.com/pan-os/8-1/pan-os-admin/networking/nat/source-nat-and-destination-nat/destination-nat.html).

Contributor

In the ACL there is only a do-not-nat action proposed. So how is the do-not-nat going to distinguish between nat_type for which it should disable translation? E. g. for this configuration:

ACL_TABLE|10
    "stage": "INGRESS", 
    "type": "L3", 
    "policy_desc": "nat-acl", 
    "ports": "Vlan2000"

ACL_RULE|10|1
    "priority": "20", 
    "src_ip": "20.0.1.0/24",
    "packet_action": "do_not_nat"

ACL_RULE|10|2
    "priority": "10", 
    "src_ip": "20.0.0.0/16", 
    "packet_action": "forward"

NAT_BINDINGS|nat1
    "access_list": "10"
    "nat_pool": "pool1"
    "nat_type": "snat"

How do we tell HW to disable only SNAT for packets with SRC_IP "20.0.1.0/24" and to not disable DNAT for the same packet?

Contributor Author

The ACL action do-not-nat disables any NAT that can happen on the packet.
The nat-type attribute in the bindings config is used only by the kernel iptables rules, to decide whether to apply snat or dnat to the NAT miss traffic sent to the CPU.
As mentioned above, only the snat type is supported in the bindings.
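
To make the ACL example in this thread concrete, here is a minimal sketch of how the two rules would classify a source address. It assumes that a larger `priority` value wins, which is consistent with the more specific 20.0.1.0/24 rule carrying priority 20 but is an assumption here; `acl_action` is a hypothetical helper, not a SONiC function.

```python
import ipaddress

# Rules copied from the ACL_RULE|10|1 and ACL_RULE|10|2 entries in the example.
ACL_RULES = [
    {"priority": 20, "src_ip": "20.0.1.0/24", "packet_action": "do_not_nat"},
    {"priority": 10, "src_ip": "20.0.0.0/16", "packet_action": "forward"},
]


def acl_action(src_ip, rules=ACL_RULES):
    """Return the action of the highest-priority rule matching src_ip, or None."""
    addr = ipaddress.ip_address(src_ip)
    # Assumption: rules are evaluated from the largest priority value downwards.
    for rule in sorted(rules, key=lambda r: r["priority"], reverse=True):
        if addr in ipaddress.ip_network(rule["src_ip"]):
            return rule["packet_action"]
    return None  # no rule matched


print(acl_action("20.0.1.5"))   # host inside the excluded /24
print(acl_action("20.0.2.5"))   # host inside the /16 only
```

Per the reply above, a `do_not_nat` result suppresses every kind of translation for the packet, regardless of the `nat_type` in the binding.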

Contributor

So in the example above, a packet with a SRC IP from 20.0.1.0/24 and a DST IP from the DNAT_POOL won't be translated and/or trapped to the CPU because of that ACL with the do-not-nat action.
Is that the desired behavior? If the application controls SNAT and DNAT separately, why not do the same in HW?

Contributor

@rlhui DO_NOT_NAT is global in the SAI pipeline. Once you disable NAT, it is disabled for all the types: SNAT, DNAT, DOUBLE_NAT. I don't get why we should do it differently in SONiC.

Contributor

@marian-pritsak , you're right that if a pkt matches DO_NOT_NAT condition in the SAI pipeline, then DNAT or DOUBLE_NAT is bypassed too. In SONiC, currently it seems acl binding including DO_NOT_NAT is supported with SNAT Only.

@kirankella, can you please confirm? Is there a way to bypass NAT for some pkts e.g. some known protocols for static DNAT?

Contributor Author

@marian-pritsak @rlhui What you said is right. In SONiC, ACL binding including DO_NOT_NAT is currently supported with SNAT only (typically, other vendors' configurations apply ACLs to the dynamic SNAT pool allocation). Along those lines, the iptables rules that are added apply the do-not-nat checks to exclude some hosts during SNAT. Also, the ACL (with do-not-nat) is typically applied on all the inside zone ports where the outbound traffic is being SNAT'ted. That way do-not-nat is not applied on the outside zone ports in the hardware.

As per the SAI pipeline, the DO_NOT_NAT condition appears to be a global condition, but that can be achieved only if this ACL rule is applied on all the ports (both the inside zone and the outside zone ports).
Currently for static DNAT (you are not referring to NAPT, right?), all the traffic is DNAT'ted in the hardware for all protocols by the single DNAT entry. To bypass DNAT for some packets, we have to use the ACL DO_NOT_NAT rule and apply the ACL on those outside zone ports, so that those packets are forwarded normally in the hardware.

To be in sync with the hardware, the iptables rules in SONiC need to be updated to exclude (do-not-nat) the DNAT for the ports in the ACL do-not-nat rules. We will add it.

@kirankella, thanks. For each dynamic SNATP entry going outbound, there's at least one reverse flow, that needs the DNATP back to the source. If DO_NOT_NAT was applied to the outbound flow, same would apply to the corresponding inbound flow. Can this be supported?
For DO_NOT_NAT for static DNAT or DNATP, it could be a future item.

Contributor

OK


The conntrack netlink DESTROY events result in the deletion of the NAT entry from the APP_DB. The DESTROY events are received on the timeouts of the TCP connections in the connection tracking table.

The TCP FIN flagged packets are not trapped to CPU. Hence the NAT entries for the closed TCP connections are not removed immediately from the hardware. They are timed out eventually based on the translation inactivity and removed.
Contributor

This contradicts the statement above, "To have the conntrack TCP entries and the hardware entries in sync", and can lead to the entry table filling up when there are many short-lived connections. What is the main motivation for not trapping FIN/RST packets?

Contributor Author

Using the same DNAPT entry, lots of TCP flows (new connections) are translated in the hardware (based on the full cone NAT model in the hardware). The hardware entry is added for the very first flow detected, and the follow-up flows are not known to the kernel. Hence trapping FIN/RST packets to the kernel may result in them being dropped in the kernel.
We can configure a lower TCP NAT timeout so that inactive sessions are timed out earlier in the hardware.

Contributor

OK

ACL_RULE|10|1
"priority": "20",
"src_ip": "20.0.1.0/24",
"packet_action": "do_not_nat"
Contributor

Currently, in an ACL L3 rule, "packet_action" is translated to SAI_ACL_ENTRY_ATTR_ACTION_PACKET_ACTION and the value is translated to sai_packet_action_t.
To be aligned with SAI, SAI_ACL_ENTRY_ATTR_ACTION_NO_NAT should logically map to a different key, "do_not_nat", with value "true"/"false".
Is there any reason to put "do_not_nat" under the "packet_action" key?

Contributor Author

@stepanblyschak Packet actions are mutually exclusive for a particular match criterion, as far as the ACL schema is concerned. So the "do-not-nat" action is mutually exclusive with forward (permitting NAT), redirect, or other actions. For example, a packet is either forwarded (permitting NAT), not NAT'ted, or redirected.
That's the reason the "packet_action" attribute in the ACL schema is reused to configure the do-not-nat value for a match.

The second traffic flow [SIP=1.0.0.2, SPORT=120] cannot be added in the hardware to translate to the same IP/PORT [SIP=65.55.45.1, SPORT=600], since the reverse traffic flows cannot be uniquely translated to the original Source endpoints.

This mismatch in the NAT models between the ASIC and the Kernel is addressed by:
- Changes in the Linux kernel to do 3-tuple unique translation and full cone NAT functionality in the outbound (SNAT) direction.
Contributor

Can you specify which changes or link to a patch if available?

Contributor Author

A patch file for the kernel changes is added in the sonic-linux-kernel submodule. I will submit a PR for these changes shortly.

Contributor

Ok.
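
The 3-tuple uniqueness requirement discussed in this thread can be illustrated with a toy allocator: each (protocol, source IP, source port) endpoint gets its own public port, so reverse traffic maps back to exactly one inside endpoint. This is only a sketch of the idea, not the kernel patch. The class `FullConeSnat` is hypothetical, ports are handed out sequentially rather than randomly, and the first flow's endpoint (1.0.0.1, port 100) is an assumed value; only the second flow and the public IP come from the text under review.

```python
class FullConeSnat:
    """Toy allocator: one unique public port per (proto, src_ip, src_port)."""

    def __init__(self, public_ip, ports):
        self.public_ip = public_ip
        self.free = list(ports)
        self.forward = {}  # (proto, sip, sport) -> (public_ip, public_port)
        self.reverse = {}  # (proto, public_ip, public_port) -> (sip, sport)

    def translate(self, proto, sip, sport):
        key = (proto, sip, sport)
        if key in self.forward:
            # Full cone: the same inside endpoint reuses its mapping
            # regardless of the destination.
            return self.forward[key]
        if not self.free:
            raise RuntimeError("port range exhausted")
        port = self.free.pop(0)
        self.forward[key] = (self.public_ip, port)
        self.reverse[(proto, self.public_ip, port)] = (sip, sport)
        return self.forward[key]


nat = FullConeSnat("65.55.45.1", range(600, 602))
first = nat.translate("tcp", "1.0.0.1", 100)   # assumed first flow
second = nat.translate("tcp", "1.0.0.2", 120)  # second flow from the text
print(first, second)
```

Because the second flow never receives the port already held by the first, the reverse lookup stays unambiguous, which is the property the hardware model relies on.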

nat_entry_t snat_entry;

nat_entry_attr[0].id = SAI_NAT_ENTRY_ATTR_NAT_TYPE;
nat_entry_attr[0].value = SAI_NAT_TYPE_SOURCE_NAT;
Contributor

From above description, "nat_type": "dnat" means:

- If the "nat_type" is 'dnat':
  - DNAT translation of the DIP in the IP packet from 'global_ip' address to 'local_ip' address when the packet crosses the zones.
  - SNAT translation of the SIP in the IP packet from 'local_ip' address to 'global_ip' address when the packet crosses the zones.

But in the example code you configure SIP from "global_ip" to "local_ip" translation, DIP from "local_ip" to "global_ip"
Is this a typo or am I missing something in SAI API?

Contributor Author

The example is correct. The nat_entry_attr is populated with the translated (global ip) values, and the snat_entry.data (which has the key) is populated with the local_ip.

Contributor

I see

nat_zone_counter_attr[1].value.u32 = 1;

nat_zone_counter_attr[2].id = SAI_NAT_ZONE_COUNTER_ATTR_ENABLE_TRANSLATION_NEEDED;
nat_zone_co
Contributor

Could you please describe the reconciliation in natsyncd in a bit more detail?
Is the reconciliation done after a timer expires like in fpmsyncd warm restart case?
Based on what do you mark an entry as stale?
Do we need reconciliation here? What if we just restore and then dynamic NAT entries will be removed if no traffic hits them.

Contributor Author

Yes, reconciliation starts after the timer expires.
As explained in section 6 of this document:
The warmRestartAssist class is used to cache the data read from APP_DB.
The conntrack entries are then read and pushed to the warmRestartAssist class.
After the timer expires, the reconciliation is done by warmRestartAssist, which removes the stale entries from the APP_DB. Ideally all the entries should be in sync. If for some reason a conntrack entry failed to be restored in the kernel, that entry is marked as stale and removed from APP_DB during reconciliation.
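
The reconciliation described in this reply can be sketched as a set comparison between the entries cached from APP_DB and the conntrack entries actually restored in the kernel. This is a simplified illustration and not the warmRestartAssist implementation; `reconcile` and the key strings are hypothetical.

```python
def reconcile(cached_app_db, restored_conntrack):
    """Return (kept, stale) key sets once the warm-restart timer has expired.

    cached_app_db: keys cached from APP_DB before the restart.
    restored_conntrack: keys read back from the kernel conntrack table.
    """
    cached = set(cached_app_db)
    restored = set(restored_conntrack)
    kept = cached & restored    # entries confirmed by the kernel
    stale = cached - restored   # entries that failed to restore; to be removed
    return kept, stale


cached = ["tcp|20.0.0.1|100", "tcp|20.0.0.2|120", "udp|20.0.0.3|53"]
restored = ["tcp|20.0.0.1|100", "udp|20.0.0.3|53"]
kept, stale = reconcile(cached, restored)
print(sorted(kept), sorted(stale))
```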

...

INTERFACE|Ethernet15
"nat_zone": 1
Contributor

From this example, it seems configuring "nat_zone" on an interface is the way to enable this static DNAT on the interface. Is that correct? If another Ethernet interface is configured with a different nat_zone id, it will not use the DNAT entry. Is that correct? What if we want one more DNAT entry to be bound to Ethernet15, how do we do so? Do we support two different static DNAT entries on two different interfaces with different zone-ids? It is not clear to me how we can associate static NAT entries with specific interfaces.

Contributor Author

The nat_zone mentioned here is used to push the zone attribute to the hardware on the L3 RIF (and only hardware uses it for the zone-id differences when doing the translation).

For binding the interface to the static dnat entry, the L3 IP (65.55.42.1) of the nat entry is matched against any L3 interface IP in the system. [This matching is needed for the application to add an entry to the hardware].
We can add any number of static napt entries with the same L3 IP (and with different ports) that will be matched to the same interface that has that IP (Ethernet15 here).
For example, in addition to the above static dnapt entry, we can add another static dnapt entry bound to the same interface.
STATIC_NAPT|65.55.42.1|TCP|1025
"local_ip": 30.0.0.1
"local_port": 8000
"nat_type": "dnat"

Yes, we can add different static dnat entries matching different interfaces with same or different zone-ids.
For example, we have another L3 interface (Ethernet16) with IP 65.55.43.1, we can add another static dnapt entry
STATIC_NAPT|65.55.43.1|TCP|1024
"local_ip": 40.0.0.1
"local_port": 6000
"nat_type": "dnat"

where
INTERFACE|Ethernet16|65.55.43.1/24
...

and the zone can be
INTERFACE|Ethernet16
"nat_zone": 1

or
INTERFACE|Ethernet16
"nat_zone": 2

Contributor

Did you mean the specific DNAT matching and action happens only if the original DIP matches the IP interface of the packets' incoming interface? In example above, the DNAT action to translate from public IP 65.55.42.1|TCP|1024, will be applied to Ethernet15 only and not any other Ethernet interface, is it correct? Or it is not interface specific? E.g. if the same packet is received at another interface Ethernet17 with nat_zone 1 configured, but Ethernet17 has a different IP address configured, what will be the behavior?

Contributor

And what's the behavior when the configured DNAT global IP is matched to a loopback interface IP only? thanks.

Contributor Author

> Did you mean the specific DNAT matching and action happens only if the original DIP matches the IP interface of the packets' incoming interface? In example above, the DNAT action to translate from public IP 65.55.42.1|TCP|1024, will be applied to Ethernet15 only and not any other Ethernet interface, is it correct? Or it is not interface specific? E.g. if the same packet is received at another interface Ethernet17 with nat_zone 1 configured, but Ethernet17 has a different IP address configured, what will be the behavior?

@rlhui Actually, matching against any interface in the same subnet is a criterion used only by the application, to add the matching iptables rules in the kernel and to push the entry to the hardware.
Once the entry is added in the hardware, the behavior as per the SAI pipeline does not match on the packet's incoming interface. DNAT happens for a packet ingressing with that original IP on any interface, as long as the DNAT criteria are met. The original IP can be that of a loopback interface as well.
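
The application-side matching described here, where the global IP of a static entry is matched against the subnet of an L3 interface, can be sketched as below. This is an illustration only: `interface_for_global_ip` is a hypothetical helper, and the /24 prefix length for Ethernet15 is an assumption (only the Ethernet16 prefix appears in this thread).

```python
import ipaddress

INTERFACES = {
    "Ethernet15": "65.55.42.1/24",  # prefix length assumed for illustration
    "Ethernet16": "65.55.43.1/24",  # from the INTERFACE|Ethernet16 example
}


def interface_for_global_ip(global_ip, interfaces=INTERFACES):
    """Return the first interface whose subnet contains global_ip, else None."""
    addr = ipaddress.ip_address(global_ip)
    for name, cidr in interfaces.items():
        if addr in ipaddress.ip_interface(cidr).network:
            return name
    return None


print(interface_for_global_ip("65.55.42.1"))
print(interface_for_global_ip("65.55.43.1"))
```

As the reply notes, this lookup only decides where the application installs its iptables rules and whether it pushes the entry to hardware; the hardware entry itself does not match on the ingress interface.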

Contributor

@kirankella,
I see the following can happen with DNAT:
If the outside/public IP address to use is a local interface IP address (e.g. an uplink interface IP), then per the SONiC NAT spec the inbound traffic must arrive at this interface; only then is the corresponding rule added to iptables in the kernel and the hardware DNAT entry programmed. Otherwise, neither happens.
But in the SAI pipeline such a check is not present. As long as there is a "DNAT Pool Prefix Lookup" table hit, which indicates the DIP needs a DNAT, and the DNAT lookup is a miss, pkts will be trapped to the CPU.
So if inbound traffic carrying this public IP is received at a different uplink, an interface which does not own that public IP, no hardware DNAT entry is programmed but pkts will keep coming to the CPU.

Will you please document this in the HLD so that those who use an interface IP as the public IP are aware of this and can avoid the situation?

Contributor Author

@rlhui, similar to loopback interfaces, the physical interface IP if used as public IP, can also be handled. Packet destined to interface IP received on any interface in the same zone as the interface that has that IP, is DNAT'ted if trapped to CPU. Section 3.4.1.1 is updated with the same.
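
The hardware behavior described in this thread (a DNAT pool prefix hit followed by a DNAT entry lookup, with a miss trapped to the CPU) can be sketched as a small decision function. This is a reading of the comments above, not the SAI pipeline definition; `dnat_lookup`, the result labels, and the sample values are hypothetical. Note that the ingress interface is deliberately not an input.

```python
import ipaddress


def dnat_lookup(dip, proto, dport, pool_prefixes, dnat_entries):
    """Return (decision, entry) for a packet's destination fields."""
    addr = ipaddress.ip_address(dip)
    if not any(addr in ipaddress.ip_network(p) for p in pool_prefixes):
        return ("forward", None)        # DIP does not need DNAT
    entry = dnat_entries.get((dip, proto, dport))
    if entry is not None:
        return ("translate", entry)     # hardware DNAT entry hit
    return ("trap_to_cpu", None)        # pool hit but entry miss


pool = ["65.55.42.1/32"]
entries = {("65.55.42.1", "tcp", 1024): ("20.0.0.1", 6000)}  # assumed local endpoint
print(dnat_lookup("65.55.42.1", "tcp", 1024, pool, entries))
print(dnat_lookup("65.55.42.1", "tcp", 2000, pool, entries))
print(dnat_lookup("8.8.8.8", "tcp", 53, pool, entries))
```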

nat_zone_counter_attr[1].value.u32 = 1;

nat_zone_counter_attr[2].id = SAI_NAT_ZONE_COUNTER_ATTR_ENABLE_TRANSLATION_NEEDED;
nat_zone_co
Contributor

Did we handle the case that previous L3 Route/neighbor/nexthop entries were gone after restarting? What's the behavior for NAT in this case?

Contributor Author

The DNAT entries are added to the hardware when the translated dip address is either resolved via a neighbor/nexthop entry or via a L3 Route. If unresolved, the DNAT entry is deleted from the hardware.
So, after restart, the DNAT entries are added only after the translated DIP is resolved. This can happen if the packet came up and software forwarded resulting in the ARP entry for connected internal hosts.

Contributor

@kirankella, I was referring to warm boot case where the DNAT entries were already in hardware.
With warm reboot, a DNAT entry was already created prior to rebooting. If next-hop was gone after rebooting, then the DNAT entries in HW or all SW tables (e.g. APP_DB?) should be deleted. Wanted to double check if this case was tested.

Contributor Author

@rlhui If the nexthop is not resolved after warm reboot, the DNAT entry is not added by orchagent in the h/w. The corresponding conntrack entry and the APP_DB entry are timed out and deleted.
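
The rule stated in this thread, that a DNAT entry is programmed only while its translated DIP is resolved through a neighbor or an L3 route, can be sketched as a simple predicate. This is an illustration under simplifying assumptions (a route match is treated as resolved without checking its next hop); `is_resolved` and the sample addresses are hypothetical and do not reflect orchagent code.

```python
import ipaddress


def is_resolved(translated_dip, neighbors, route_prefixes):
    """True if translated_dip has a neighbor entry or falls under a known route."""
    if translated_dip in neighbors:
        return True
    addr = ipaddress.ip_address(translated_dip)
    return any(addr in ipaddress.ip_network(p) for p in route_prefixes)


neighbors = {"20.0.0.1"}
routes = ["30.0.0.0/24"]
print(is_resolved("20.0.0.1", neighbors, routes))  # neighbor present
print(is_resolved("40.0.0.1", neighbors, routes))  # unresolved after reboot
```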

@kirankella
Contributor Author

> "When ACL rule is changed from 'forward' action to 'do_not_nat' action, the matching traffic flows corresponding to the NAT entries that were created before due to the 'forward' action continue to be translated till the NAT entries are timed out."
>
> why could we not delete the NAT entries in HW upon this? If traffic is still there, there would be no way to not do NAT even though CLI is changed.

@rlhui It requires to find out what conntrack entries were created because of which ACLs (iptables in the kernel). To identify such conntrack entries, we need to kind of run the whole ACL rules matching logic in the NAT application.
Once we are able to identify such matched conntrack entries for any ACL, we can delete such entries in the kernel, if the ACL action was changed from 'forward' to 'do_not_nat'. This can be tracked as a future item.

Moreover, apart from the DIP, DPORT, SIP, SPORT fields (that are available in the conntrack entry), we cannot match the conntrack entry against the other ACL matching rules if configured by the user (like length, DSCP, TOS,...).

nat_zone_counter_attr[1].value.u32 = 1;

nat_zone_counter_attr[2].id = SAI_NAT_ZONE_COUNTER_ATTR_ENABLE_TRANSLATION_NEEDED;
nat_zone_co
Contributor

According to the warm reboot design, the table is 'WARM_RESTART_TABLE'. Why do we need a separate table, 'NAT_RESTORE_TABLE'?

Contributor Author

@stepanblyschak This is on similar lines as used by the neighsync application. The purpose of RESTORE_TABLE is different, it is for the state exchange between the restore_nat_entries script and the natsync application that is waiting for the conntrack entries to be restored in the kernel by the script. Following the kernel restore, the natsync application gets the dump of the conntrack entries from the kernel and then uses the warmrestartassist helper class to complete the reconciliation and set the reconciliation state in the WARM_RESTART_TABLE.

@kirankella
Contributor Author

> For each dynamic SNATP entry going outbound, there's at least one reverse flow, that needs the DNATP back to the source. If DO_NOT_NAT was applied to the outbound flow, same would apply to the corresponding inbound flow. Can this be supported?
> For DO_NOT_NAT for static DNAT or DNATP, it could be a future item.

@rlhui Yes it is supported, since there won't be any matching DNAT entries in the hardware for the reverse traffic.

@xinliu-seattle xinliu-seattle merged commit 8fc728e into sonic-net:master Nov 6, 2019
9 participants