nebula

Author	SHA1	Message	Date
Wade Simmons	44cb697552	Add more metrics (#450 ) * Add more metrics This change adds the following counter metrics: Metrics to track packets dropped at the firewall: firewall.dropped.local_ip firewall.dropped.remote_ip firewall.dropped.no_rule Metrics to track handshakes attempts that have been initiated and ones that have timed out (ones that have completed are tracked by the existing "handshakes" histogram). handshake_manager.initiated handshake_manager.timed_out Metrics to track when cached_packets are dropped because we run out of buffer space, and how many are sent once the handshake completes. hostinfo.cached_packets.dropped hostinfo.cached_packets.sent This change also notes how many cached packets we have when we log the final "Handshake received" message for either stage1 for stage2. * separate incoming/outgoing metrics * remove "allowed" firewall metrics We don't need this on the hotpath, they aren't worh it. * don't need pointers here	2021-04-27 22:23:18 -04:00
Nathan Brown	db23fdf9bc	Dont apply race avoidance to existing handshakes, use the handshake time to determine who wins (#451 ) Co-authored-by: Wade Simmons <wadey@slack-corp.com>	2021-04-27 21:15:34 -05:00
Nathan Brown	df7c7eec4a	Get out faster on nil udpAddr (#449 )	2021-04-26 20:21:47 -05:00
Nathan Brown	6f37280e8e	Fully close tunnels when CloseAllTunnels is called (#448 )	2021-04-26 10:42:24 -05:00
Nathan Brown	a0735dd7d5	Add locking around ssh conns to avoid concurrent map access on reload (#447 )	2021-04-23 14:43:16 -05:00
Nathan Brown	1deb5d98e8	Fix tun funcs for ios and android (#446 )	2021-04-22 15:23:40 -05:00
Nathan Brown	a1ee521d79	Fix a failed return in an error case (#445 )	2021-04-17 18:47:31 -05:00
brad-defined	7859140711	Only set serveDns if the host is also configured to be a lighthouse. (#433 )	2021-04-16 13:33:56 -05:00
brad-defined	17106f83a0	Ensure the Nebula device exists before attempting to bind to the Nebula IP (#375 )	2021-04-16 10:34:28 -05:00
Nathan Brown	ab08be1e3e	Don't panic on a nil response from the lighthouse (#442 )	2021-04-15 09:12:21 -05:00
Nathan Brown	710df6a876	Refactor remotes and handshaking to give every address a fair shot (#437 )	2021-04-14 13:50:09 -05:00
John Maguire	20bef975cd	Remove obsolete systemd unit settings (take 2) (#438 )	2021-04-07 12:02:40 -05:00
Nathan Brown	480036fbc8	Remove unused structs in hostmap.go (#430 )	2021-04-01 22:07:11 -05:00
Nathan Brown	1499be3e40	Fix name resolution for host names in config (#431 )	2021-04-01 21:48:41 -05:00
Nathan Brown	64d8e5aa96	More LH cleanup (#429 )	2021-04-01 10:23:31 -05:00
Nathan Brown	75f7bda0a4	Lighthouse performance pass (#418 )	2021-03-31 17:32:02 -05:00
Nathan Brown	e7e55618ff	Include bad backets in the good handshake test (#428 )	2021-03-31 13:36:10 -05:00
Nathan Brown	0c2e5973e1	Simple lie test (#427 )	2021-03-31 10:26:35 -05:00
Nathan Brown	830d6d4639	Start of end to end testing with a good handshake between two nodes (#425 )	2021-03-29 14:29:20 -05:00
Nathan Brown	883e09a392	Don't use a global ca pool (#426 )	2021-03-29 12:10:19 -05:00
Wade Simmons	4603b5b2dd	fix PromoteEvery check (#424 ) This check was accidentally typo'd in #396 from `%` to `&`. Restore the correct functionality here (we want to do the check every "PromoteEvery" count packets).	2021-03-26 15:01:05 -04:00
Wade Simmons	a71541fb0b	export build version as a prometheus label (#405 ) This is how Prometheus recommends you do it, and how they do it themselves in their client. This makes it easy to see which versions you have deployed in your fleet, and query over it too.	2021-03-26 14:16:35 -04:00
Nathan Brown	3ea7e1b75f	Don't use a global logger (#423 )	2021-03-26 09:46:30 -05:00
Nathan Brown	7a9f9dbded	Don't craft buffers if we don't need them (#416 )	2021-03-22 18:25:06 -05:00
Nathan Brown	7073d204a8	IPv6 support for outside (udp) (#369 )	2021-03-18 20:37:24 -05:00
Joe Doss	9e94442ce7	Add fedora dist files. (#413 )	2021-03-18 12:33:43 -07:00
Joe Doss	13471f5792	Remove obsolete systemd unit settings. (#412 )	2021-03-18 12:29:36 -07:00
Thomas Roten	ea07a89cc8	Ensure mutex is unlocked when adding remote IP. (#406 ) Currently, if you use the remote allow list config, as soon as you attempt to create a tunnel to a node that has a blocked IP address, a mutex is locked and never unlocked. This happens even if the node has an allowed remote IP address in addition to the blocked remote IP address. This pull request ensures that the lighthouse mutex is unlocked whenever we attempt to add a remote IP.	2021-03-16 12:41:35 -04:00
Ryan Huber	3aaaea6309	don't allow a useless handshake with yourself (#402 ) * don't allow a useless handshake with yourself * remove helper	2021-03-15 12:58:23 -07:00
Wade Simmons	5506da3de9	Fix selection of UDP remote to use during stage2 (#404 ) The change for #401 incorrectly called HostInfo.ForcePromoteBest in stage2, when we really we want to pick the remote that we received the response from.	2021-03-12 21:43:24 -05:00
Wade Simmons	6c55d67f18	Refactor handshake_ix (#401 ) There are some subtle race conditions with the previous handshake_ix implementation, mostly around collisions with localIndexId. This change refactors it so that we have a "commit" phase during the handshake where we grab the lock for the hostmap and ensure that we have a unique local index before storing it. We also now avoid using the pending hostmap at all for receiving stage1 packets, since we have everything we need to just store the completed handshake. Co-authored-by: Nate Brown <nbrown.us@gmail.com> Co-authored-by: Ryan Huber <rhuber@gmail.com> Co-authored-by: forfuncsake <drussell@slack-corp.com>	2021-03-12 14:16:25 -05:00
Wade Simmons	64d8035d09	fix race in getOrHandshake (#400 ) We missed this race with #396 (and I think this is also the crash in issue #226). We need to lock a little higher in the getOrHandshake method, before we reset hostinfo.ConnectionInfo. Previously, two routines could enter this section and confuse the handshake process. This could result in the other side sending a recv_error that also has a race with setting hostinfo.ConnectionInfo back to nil. So we make sure to grab the lock in handleRecvError as well. Neither of these code paths are in the hot path (handling packets between two hosts over an active tunnel) so there should be no performance concerns.	2021-03-09 09:27:02 -05:00
Ryan Huber	73a5ed90b2	Do not allow someone to run a nebula lighthouse with an ephemeral port (#399 ) * Do not allow someone to run a nebula lighthouse with an ephemeral port * derp - we discover the port so we have to check the config setting * No context needed for this error * gofmt yourself * Revert "gofmt yourself" This reverts commit c01423498e3792f7acd69d7e691dce1edad81bcb. * Revert "No context needed for this error" This reverts commit 6792af6846d1200c564a4ad601a637535dd56c5b. * snip snap snip snap	2021-03-08 12:42:06 -08:00
Wade Simmons	d604270966	Fix most known data races (#396 ) This change fixes all of the known data races that `make smoke-docker-race` finds, except for one. Most of these races are around the handshake phase for a hostinfo, so we add a RWLock to the hostinfo and Lock during each of the handshake stages. Some of the other races are around consistently using `atomic` around the `messageCounter` field. To make this harder to mess up, I have renamed the field to `atomicMessageCounter` (I also removed the unnecessary extra pointer deference as we can just point directly to the struct field). The last remaining data race is around reading `ConnectionInfo.ready`, which is a boolean that is only written to once when the handshake has finished. Due to it being in the hot path for packets and the rare case that this could actually be an issue, holding off on fixing that one for now. here is the results of `make smoke-docker-race`: before: lighthouse1: Found 2 data race(s) host2: Found 36 data race(s) host3: Found 17 data race(s) host4: Found 31 data race(s) after: host2: Found 1 data race(s) host4: Found 1 data race(s) Fixes: #147 Fixes: #226 Fixes: #283 Fixes: #316	2021-03-05 21:18:33 -05:00
Nathan Brown	29c5f31f90	Add a check in the makefile to ensure a minimum version of go is installed (#383 )	2021-03-02 13:29:05 -06:00
Nathan Brown	b6234abfb3	Add a way to trigger punch backs via lighthouse (#394 )	2021-03-01 19:06:01 -06:00
Wade Simmons	2a4beb41b9	Routine-local conntrack cache (#391 ) Previously, every packet we see gets a lock on the conntrack table and updates it. When running with multiple routines, this can cause heavy lock contention and limit our ability for the threads to run independently. This change caches reads from the conntrack table for a very short period of time to reduce this lock contention. This cache will currently default to disabled unless you are running with multiple routines, in which case the default cache delay will be 1 second. This means that entries in the conntrack table may be up to 1 second out of date and remain in a routine local cache for up to 1 second longer than the global table. Instead of calling time.Now() for every packet, this cache system relies on a tick thread that updates the current cache "version" each tick. Every packet we check if the cache version is out of date, and reset the cache if so.	2021-03-01 19:52:17 -05:00
Wade Simmons	d232ccbfab	add metrics for the udp sockets using SO_MEMINFO (#390 ) Retrieve the current socket stats using SO_MEMINFO and report them as metrics gauges. If SO_MEMINFO isn't supported, we don't report these metrics.	2021-03-01 19:51:33 -05:00
Nathan Brown	ecfb40f29c	Fix osx for mq changes, this does not implement mq on osx (#395 )	2021-03-01 16:57:05 -05:00
Wade Simmons	1bae5b2550	more validation in pending hostmap deletes (#344 ) We are currently seeing some cases where we are not deleting entries correctly from the pending hostmap. I believe this is a case of an inbound timer tick firing and deleting the Hosts map entry for a newer handshake attempt than intended, thus leaving the old Indexes entry orphaned. This change adds some extra checking when deleteing from the Indexes and Hosts maps to ensure we clean everything up correctly.	2021-03-01 12:40:46 -05:00
Wade Simmons	73081d99bc	add `make smoke-docker` (#287 ) This makes it easier to use the docker container smoke test that GitHub actions runs. There is also `make smoke-docker-race` that runs the smoke test with `-race` enabled.	2021-03-01 11:15:15 -05:00
Tim Rots	e7e6a23cde	fix a few typos (#302 )	2021-03-01 11:14:34 -05:00
Wade Simmons	a0583ebdca	tun_disabled: reply to ICMP Echo Request (#342 ) This change allows a server running with `tun.disabled: true` (usually a lighthouse) to still reply to ICMP EchoRequest packets. This allows you to "ping" the lighthouse Nebula IP as a quick check to make sure the tunnel is up, even when running with tun.disabled. This is still gated by allowing `icmp` packets in the inbound firewall rules.	2021-03-01 11:09:41 -05:00
Wade Simmons	27d9a67dda	Proper multiqueue support for tun devices (#382 ) This change is for Linux only. Previously, when running with multiple tun.routines, we would only have one file descriptor. This change instead sets IFF_MULTI_QUEUE and opens a file descriptor for each routine. This allows us to process with multiple threads while preventing out of order packet reception issues. To attempt to distribute the flows across the queues, we try to write to the tun/UDP queue that corresponds with the one we read from. So if we read a packet from tun queue "2", we will write the outgoing encrypted packet to UDP queue "2". Because of the nature of how multi queue works with flows, a given host tunnel will be sticky to a given routine (so if you try to performance benchmark by only using one tunnel between two hosts, you are only going to be using a max of one thread for each direction). Because this system works much better when we can correlate flows between the tun and udp routines, we are deprecating the undocumented "tun.routines" and "listen.routines" parameters and introducing a new "routines" parameter that sets the value for both. If you use the old undocumented parameters, the max of the values will be used and a warning logged. Co-authored-by: Nate Brown <nbrown.us@gmail.com>	2021-02-25 15:01:14 -05:00
John Maguire	2bce222550	List possible cipher options in example config (#385 )	2021-02-19 21:46:42 -06:00
Wade Simmons	3dd1108099	Go 1.16 and darwin-arm64 (#381 ) This commit switches to Go 1.16 and adds a release binary for darwin-arm64. Fixes: #343	2021-02-17 13:11:57 -05:00
Nathan Brown	d4b81f9b8d	Add QR code support to `nebula-cert` (#297 )	2021-02-11 18:53:25 -06:00
brad-defined	454bc8a6bb	Check certificate banner during nebula-cert print (#373 )	2021-02-05 14:52:32 -06:00
Wade Simmons	ce9ad37431	fix regression with LightHouseHandler and punchBack (#346 ) The change introduced by #320 incorrectly re-uses the output buffer for sending punchBack packets. Since we are currently spawning a new goroutine for each send here, we need to allocate a new buffer each time. We can come back and optimize this in the future, but for now we should fix the regression.	2020-11-25 17:49:26 -05:00
Wade Simmons	ee7c27093c	add HostMap.RemoteIndexes (#329 ) This change adds an index based on HostInfo.remoteIndexId. This allows us to use HostMap.QueryReverseIndex without having to loop over all entries in the map (this can be a bottleneck under high traffic lighthouses). Without this patch, a high traffic lighthouse server receiving recv_error packets and lots of handshakes, cpu pprof trace can look like this: flat flat% sum% cum cum% 2000ms 32.26% 32.26% 3040ms 49.03% github.com/slackhq/nebula.(*HostMap).QueryReverseIndex 870ms 14.03% 46.29% 1060ms 17.10% runtime.mapiternext Which shows 50% of total cpu time is being spent in QueryReverseIndex.	2020-11-23 14:51:16 -05:00

1 2 3 4 5

228 Commits