Dual-homed firewall, web server on the private network, firewall is doing 1:1 NAT for the web server to the public interface of the firewall. em0 is the public interface, em1 is the private one.

In the exact same setup (same hardware even) I am comparing Linux and OpenBSD for a firewall. Installed Linux on a hard-disc, OpenBSD on another disc, and I'm just swapping discs while I'm testing.

All firewall rules are written as stateless as possible - I don't need stateful filtering, the setup is very simple (allow HTTP inbound, allow a few ICMP types, and that's it).

With Linux, I achieve gigabit transfer speeds through the firewall (saturating the network ports), but the firewall refuses to let any new connection through when I flood it with a bunch of small UDP packets with random source addresses.

I expected OpenBSD 4.1 to do better. But the thing is, even without the UDP flood, the OpenBSD firewall is very slow. I am downloading a huge file through it, via HTTP, and all I get is 4 Mbyte / sec. With Linux I get 112 Mbyte / sec. Something's wrong. Or I'm doing something wrong.

The hardware is AMD64, Tyan Transport, 2 CPUs 2 cores each. I am using the SMP kernel. The network card is Intel Pro/1000 PCI Express 4x dual gigabit port, it carries both em0 and em1.

=========================
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 33192
        groups: lo
        inet 127.0.0.1 netmask 0xff000000
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x8
fxp0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> mtu 1500
        lladdr 00:e0:81:4a:0a:7f
        media: Ethernet autoselect (none)
        status: no carrier
bge0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> mtu 1500
        lladdr 00:e0:81:4a:0a:a8
        media: Ethernet autoselect (none)
        status: no carrier
bge1: flags=8802<BROADCAST,SIMPLEX,MULTICAST> mtu 1500
        lladdr 00:e0:81:4a:0a:a9
        media: Ethernet autoselect (none)
        status: ...
You might want to re-think this, stateless rulesets are usually slower. This is interesting: Try setting net.inet.ip.ifq.maxlen to 256 (sysctl/sysctl.conf), if you still see the congestion count increasing then search for net.inet.ip.ifq.maxlen in the list archives and have a read.
I raised maxlen to 300. I also enabled ACPI. It's still slow. The congestion counter is still not zero - currently at 386.5/s.

One good thing is that there used to be a big pause when the kernel was booting up, probably waiting for some device or something - now with ACPI the pause is smaller. It's still waiting for something, just not as much.

I am watching the system with top, set to update every 1s, and I noticed there are a lot of interrupt load bursts on CPU0. The percentage of interrupt load is very uneven, sometimes as low as 15%, sometimes as high as 75%.

I unleashed the UDP flood and the firewall is totally frozen - can't do anything even on the local keyboard. Not even the display (running top) gets updated anymore. The machine is frozen solid. All network traffic stops immediately. Kill the UDP flood and OpenBSD resumes normal operations.

I tried the uniprocessor kernel and it's exactly the same.

Comparison with Linux on the exact same hardware: HTTP download speed through the firewall is 112 Mbyte / sec (saturating the GigE ports) and the interrupt load is relatively low and constant - about 30%. Under UDP flood with Linux as a firewall, the current download finishes up, but a new one cannot get started. The system is not frozen at all, it's quite usable, in fact I can heavily overload it (running a bunch of CPU hogs) to the point where userspace becomes sluggish and load average is up to 250 or so, yet the firewall is not influenced at all.

So what's the deal here? The heavy interrupt load percentage seems to indicate an issue with the network driver if I'm not mistaken. But these are good and quite popular network cards - Intel Pro/1000 PCI Express 4x dual-port gigabit, seen by kernel as em0 and em1.

-- Florin Andrei http://florin.myip.org/
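For scale, the two download rates quoted in this message can be put in line-rate terms. This is only a rough sketch: the thread does not say whether "Mbyte" means 10^6 or 2^20 bytes, so the unit is a parameter here, and framing overhead (Ethernet, IP, TCP headers) is ignored. The function names are made up for the illustration.

```python
# Convert the HTTP download rates quoted in the thread into Mbit/s and
# compare them with the nominal 1 Gbit/s line rate.
# Assumption: payload only, no Ethernet/IP/TCP overhead is accounted for.

GIGE_BITS_PER_SEC = 1_000_000_000


def to_mbit_per_sec(rate_mbyte: float, bytes_per_mbyte: int = 10**6) -> float:
    """Payload rate in Mbyte/s -> Mbit/s (1 Mbit = 10**6 bits)."""
    return rate_mbyte * bytes_per_mbyte * 8 / 1_000_000


def share_of_gige(rate_mbyte: float, bytes_per_mbyte: int = 10**6) -> float:
    """Fraction of the nominal gigabit line rate taken by the payload."""
    bits_per_sec = rate_mbyte * bytes_per_mbyte * 8
    return bits_per_sec / GIGE_BITS_PER_SEC


for label, rate in (("Linux", 112), ("OpenBSD 4.1", 4)):
    for unit_name, unit in (("10^6", 10**6), ("2^20", 2**20)):
        print(f"{label} ({unit_name}-byte Mbyte): "
              f"{to_mbit_per_sec(rate, unit):.1f} Mbit/s, "
              f"{share_of_gige(rate, unit):.1%} of GigE")
```

With decimal megabytes, 112 Mbyte/s is 896 Mbit/s and 4 Mbyte/s is 32 Mbit/s; with binary megabytes the Linux figure comes out near 940 Mbit/s. Either way the OpenBSD 4.1 result is only a few percent of what the same hardware forwards under Linux.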
I guess you need to "enable acpi" with config(8) as the system is quite new and most newer systems have busted MP BIOS info. The effect is bad interrupt routing and other craziness -- which is often felt as slow systems.
First, you want to run 4.2 or -current, that should about double your throughput. then, an i386 kernel should perform considerably better than amd64 for firewalling/routing/... next, you don't want SMP for such tasks. take out the second CPU and give it to somebody who can use it, and run the uniprocessor kernel. last, increase net.inet.ip.ifq.maxlen until you see the congestion counter not increasing much any more under load. should not exceed 2500 by too much. as a rule of thumb, 256 per gigE interface aren't too far off. -- Henning Brauer, email@example.com, firstname.lastname@example.org BS Web Services, http://bsws.de Full-Service ISP - Secure Hosting, Mail and DNS Services Dedicated Servers, Rootservers, Application Hosting - Hamburg & Amsterdam
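The rule of thumb in this message (about 256 per gigE interface, not much above 2500) can be written down as a starting value. This is a minimal sketch, not a tuning procedure: the function name is hypothetical, and treating 2500 as a hard cap is my simplification of "should not exceed 2500 by too much". The real criterion given above is still to raise the value until the congestion counter stops climbing under load.

```python
# Starting point for net.inet.ip.ifq.maxlen from the rule of thumb above.
# Assumptions: 256 slots per gigabit interface, capped at 2500.

PER_GIGE_INTERFACE = 256
SOFT_CEILING = 2500


def starting_ifq_maxlen(gige_interfaces: int) -> int:
    """Suggested initial queue length for the given number of gigE ports."""
    if gige_interfaces < 1:
        raise ValueError("need at least one interface")
    return min(PER_GIGE_INTERFACE * gige_interfaces, SOFT_CEILING)


# The firewall in this thread forwards between two ports, em0 and em1.
print(f"net.inet.ip.ifq.maxlen={starting_ifq_maxlen(2)}")
```

For the two-port box discussed here this gives 512 as a first value to try in sysctl.conf.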
Yes, I was looking at a paragraph in the 4.2 release notes and I thought all those things might be related exactly to the problem I'm seeing:

##############
Huge performance improvements in the network stack, including:
* In pf, store routing table ID, queue ID etc directly in the packet header mbuf instead of using mbuf tags (which use malloc'd memory). This yields a 100% improvement in pf performance.
* Skip TCP/UDP/ICMP/ICMP6 checksumming when not necessary. This yields a further 10% improvement in pf performance.
* A change in the way the kernel random pool is stirred greatly increases performance with network interface cards that support interrupt mitigation, especially on architectures where reading the clock is expensive (such as amd64).
##############

That is surprising. What is the reason?

So, assuming the box is a pure firewall / static router (so just pf and static routes), even with multiple interfaces, all those tasks run in a single kernel thread?

Now here's the second thing: if this firewall needs to be integrated in an environment with dynamic routing, it will need to run some kind of dynamic routing daemon(s). For that, I'd like to have at least two cores on the system, and a kernel that can take advantage of them. If the SMP kernel does not actually hurt performance, I might have to use it.

-- Florin Andrei http://florin.myip.org/
we dunno really. it hasn't been benched in some time

so it might not even

the required locking will cost you more than the second cpu/core

it does. seriously. locking is not free.

-- Henning Brauer
Why is this? Is there a security reason why the kernel is single-thread; is it OBSD resource limitations (no developer time, no hardware, etc); is it not enough interest yet? With interface speeds and bus bandwidth going up, how many interfaces is it possible to handle at full interface bandwidth on the fastest UP CPU and how much memory does that take? If you need more performance, do you build multiple boxes and CARP them? Virtualization to run multiple OBSDs, each on its own core (ignoring security issues of virtualization; crack one client is no worse than having a single OBSD running all interfaces getting cracked)? Or do you start assembling a big box with multiple MBs each with a UP hooked up to a pair of drives, all co-located in one box with dual/triple/quad redundant PSUs? Not that I'm personally in need of the technology; I'm the one trying to keep a 486 patched on dialup. I'm just interested. Doug.
actually, i think henning wanted to say that the network stack runs entirely as interrupts. if there were a thread, we could add another, but going from 0 to 1 is more work than 1 to 2. networking workloads do not always divide up among CPUs nicely. assuming the code is written, just turning 2 or more CPUs loose on a stream of packets is likely to result in reordering, which is bad. to avoid reordering, you need lots of queueing, which hurts performance and drives up latency. the problem is unfortunately not as simple as add a lock here, a thread there, and presto.
Right, I see that multiple threads dealing with one interface would be a problem, but if you had a box with several interfaces, couldn't a multi-threaded stack work? Yes, I agree that 1 to 2 threads is totally different than 2 to n. I'm just concerned with what I perceive as two converging trends: 1) the trend for hardware per-interface bandwidth to increase; 2) the slowing of advances in single-processor speed. We're getting multiple cores on a chip and multiple chips on a board, and multiple interfaces on a box. What is the answer when the primary to-the-world interface is faster than the OBSD firewall can handle on a single CPU? Doug.
Even more offtopic - on Linux I saw there's a kernel thread for each interface. Interestingly, while routing 1 Gbps of traffic through the system (just a single download of a huge file over HTTP), on Linux kernel 2.6.18 both kernel threads are at 35% CPU usage, while on OpenBSD 4.1 the single kernel thread is at 70...80%. Maybe a coincidence, maybe the numbers don't usually translate linearly like that, I don't know. I like pf, it's a really clever firewall, that's why I'll keep testing with 4.2 -- Florin Andrei http://florin.myip.org/
I'm not an OpenBSD developer, but I'd bet that the reason is that BSD was originally written single-threaded (both because that's much easier than multi-threaded and because multi-CPU systems were rare back then) and has not [yet] been changed because changing to a multi-threaded kernel requires a lot of very finicky work (with innumerable opportunities to introduce very subtle bugs). Dave -- Dave Anderson <email@example.com>
Then I will do some tests with 4.2 on gigabit-capable hardware. If anything noteworthy comes out, I'll post the results. Don't expect something too fancy, but I guess anything is better than

Hmmm. Please correct me if I'm wrong:

Let's say a firewall is connected to a pretty fast Internet pipe (in the gigabit range). Let's say there's a DDoS against this environment. In theory, the firewall would need lots of RAM so that it can deal with the incoming nasty packets, create an entry for each packet in the state table (don't know the correct name for it in OpenBSD, sorry), then expire it after a while. In theory, the firewall could be tweaked to expire unused states quickly, but still, more RAM is better when dealing with a DDoS.

What's still not clear to me is how much RAM I should provision per 1Gb of bandwidth on OpenBSD, assuming there's an incoming worst-case-scenario DDoS, that consumes RAM (and other resources) on the firewall yet leaves some bandwidth open for legitimate traffic (so the firewall must be able to continue to let the good traffic pass through). Also assuming some tweaking has been done on the firewall to expire the bad stuff quickly without affecting legitimate traffic. But all that depends on the actual legitimate traffic and on the firewall rules.

Aw, damn. I was hoping that's not quite the case.

Well, then hopefully the dynamic routing daemons won't get too greedy and DoS the firewall from within. :-) Or I may have to re-think the whole environment and forget the idea of doing any kind of dynamic routing on the firewall - from a security perspective, dynamic routing on the firewall sucks anyway.

+-----+-------+-------+
| \   | i386  | amd64 |
+-----+-------+-------+
| SMP |       |       |
+-----+-------+-------+
| UP  |       |       |
+-----+-------+-------+

-- Florin Andrei http://florin.myip.org/
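The sizing question in this message can at least be framed as arithmetic. The sketch below is a back-of-the-envelope model, not a statement about pf internals: the only hard number is that a minimum-size Ethernet frame occupies 84 bytes on the wire (64-byte frame, 8-byte preamble, 12-byte inter-frame gap). The state lifetime and the per-state memory cost are hypothetical placeholders to be replaced with values measured on the actual system, and the function names are invented for the example.

```python
# Worst-case packet rate on a link and the state memory it would imply
# IF every packet created one state that lives for timeout_s seconds.
# bytes_per_state and timeout_s are assumptions, not OpenBSD constants.

WIRE_BYTES_MIN_FRAME = 64 + 8 + 12  # frame + preamble + inter-frame gap


def max_packets_per_sec(link_bits_per_sec: float) -> float:
    """Upper bound on packets/s when every frame is minimum size."""
    return link_bits_per_sec / (WIRE_BYTES_MIN_FRAME * 8)


def state_table_bytes(link_bits_per_sec: float,
                      timeout_s: float,
                      bytes_per_state: float) -> float:
    """Memory needed if each flood packet holds one state until timeout."""
    live_states = max_packets_per_sec(link_bits_per_sec) * timeout_s
    return live_states * bytes_per_state


pps = max_packets_per_sec(1_000_000_000)
print(f"worst case on 1 Gbit/s: {pps:,.0f} packets/s")

# Placeholder inputs only: 10 s state lifetime, 256 bytes per state.
need = state_table_bytes(1_000_000_000, timeout_s=10, bytes_per_state=256)
print(f"state memory under those assumptions: {need / 2**30:.2f} GiB")
```

A gigabit link tops out at roughly 1.49 million minimum-size packets per second. Note that the model only applies to traffic that is allowed to create state; packets dropped by a block rule, like the UDP flood aimed at a blocked port in this thread, do not add states at all.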
nope. the kernel will not ever use more than 1 GB (or were it 768MB? memory fuzzy). more than 1 GB of memory on a firewall even hurts. ok, not much. but a

no, they won't. they only get the cpu cycles not required for packet forwarding (well,

no, not really, not if done right.

-- Henning Brauer
I thought running an amd64 kernel would get me twice the speed of an i386 kernel on an amd64 machine, since one is 64 bit processing and the other is just 32 bit :-( How about on sparc64 systems? Do you get twice the speed compared to its 32 bit counterpart? Thank you so much Kind Regards Siju
so you think a 20 ton truck is twice as fast as a 10 ton truck? -- Henning Brauer
horizontal or vertical motion? assuming a perfectly spherical truck? -- Peter N. M. Hansteen, member of the first RFC 1149 implementation team http://bsdly.blogspot.com/ http://www.datadok.no/ http://www.nuug.no/ "Remember to set the evil bit on all malicious network traffic" delilah spamd: 22.214.171.124: disconnected after 42673 seconds.
And is it in a vacuum?
O.K I get it :-) So when does changing from 32 bit to a 64-bit processor actually help? Kind Regards Siju
Siju George wrote: Quoting Paul de Weerd, "In short: There is no short answer. It depends on what you're doing." ( Not to mention how you do it ;-) Short answer: When you *might* need more than a GB or so of RAM/swap. Most anything is faster than stuck. Easy: 2:1 ratio *either direction* which is faster. Hard: 10:1 ratio (again either direction). (figure in loading/unloading times on the truck analogy)
There are other changes between i386/amd64 than the number of bits (e.g. amd64 has more registers, which allows some other changes that can improve performance for some things), so it depends a lot on the code being run. You can't even always say, "software X is faster on arch Y", since the way you use that software can give different results. If you're looking for "fastest", just benchmark as close to real-life use on both, it's the easiest way. You also often need to test whether what you're trying to run does work correctly on !i386 arch (it's not uncommon for code to make assumptions which don't hold true on !i386). Of course, there are reasons other than "fastest" you might choose

I'm not too sure I understand what you're saying here.
64 bit processors (combined with 64 bit capable operating systems) have the ability to address more RAM than 32 bit processors because 64^2 is a much larger number than 32^2... lots more RAM addresses). This does not speed things up, though, until you run out of RAM, and start having to access the swapfile. The processor's speed... MHz, GHz, etc., will determine how fast the processor itself can process instructions. -- -wittig http://www.robertwittig.com/ http://robertwittig.net/ http://robertwittig.org/ .
Actually 2^64 vs 2^32 (64^2 is only 2^12; 64 is 2^6, 32 is 2^5). Other things equal, 64-bit should take twice as long because it takes 64 bits to do anything instead of 32 bits. Not really that simple, because accessing 32 bits can involve 1) accessing the 64 bits that the 32 bits are in. The 64-bits does affect how big the swap file can be without
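The difference between squaring the bit width and raising 2 to it is easy to check. The snippet below is a small illustration of the theoretical address-space limits only; it assumes a flat byte-addressed space and ignores extensions such as PAE and the fact that real amd64 CPUs implement fewer than 64 address bits. The helper name is made up for the example.

```python
# Number of distinct byte addresses is 2**bits, not bits**2.

def addressable_bytes(pointer_bits: int) -> int:
    """Theoretical count of byte addresses for a given pointer width."""
    return 2 ** pointer_bits


# Squaring the widths gives tiny numbers: 4096 and 1024.
print(64 ** 2, 32 ** 2)

# Powers of two give the real limits.
print(addressable_bytes(32) // 2**30, "GiB with 32-bit addresses")
print(addressable_bytes(64) // 2**60, "EiB with 64-bit addresses")

# The 64-bit space is 2**32 times larger than the 32-bit one.
print(addressable_bytes(64) // addressable_bytes(32) == 2**32)
```

So a 32-bit pointer reaches 4 GiB and a 64-bit pointer 16 EiB in principle, which is why the address-space argument only matters once a workload actually needs more than a few GB.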
On Wed, Oct 10, 2007 at 09:24:25AM -0500, Robert C Wittig wrote:
| Siju George wrote:
|
| >I thought by running an amd64 kernel will get me twice the speed than
| >an i386 on an amd64 machine since one is 64 bit processing and the
| >other is just 32 bit :-(
| >
|
| 64 bit processors (combined with 64 bit capable operating systems) have
| the ability to address more RAM than 32 bit processors because 64^2 is a
| much larger number than 32^2... lots more RAM addresses).
|
| This does not speed things up, though, until you run out of RAM, and
| start having to access the swapfile.
|
| The processor's speed... MHz, GHz, etc., will determine how fast the
| processor itself can process instructions.

Depending on your software, 64 bit processors can be quite a bit faster. If you're dealing with 64bit integers, using 64bit registers, etc., a lower clocked 64bit CPU might be faster than a 32bit CPU clocking at a higher rate.

In short: There is no short answer. It depends on what you're doing. From what Henning tells us (and what sounds logical to me), grabbing an ethernet frame from a NIC and putting it on another NIC doesn't really change much from 32bit to 64bit.

Your compiler also comes into play. If that is more tuned towards a certain 32bit architecture (such as i386) than a certain 64bit arch (because it's less popular, such as sparc64 or hppa64 or mips64), this will impact your performance quite a bit.

Cheers,

Paul 'WEiRD' de Weerd
+++++++++++>-]<.>++[<------------>-]<+.--------------.[-]
http://www.weirdnet.nl/
Paul de Weerd wrote: Oops! that should have read: If you had to choose between, say, 2 gig RAM and a 32 bit CPU, or 1 gig RAM and a 64 bit CPU, which would be a better choice, in general? -- -wittig http://www.robertwittig.com/ http://robertwittig.net/ http://robertwittig.org/ .
On Wed, Oct 10, 2007 at 12:34:48PM -0500, Robert C Wittig wrote:
| If you had to choose between, say, 2 gig RAM and a 32 bit CPU, or 1 gig
| RAM and a 64 bit CPU, which would be a better choice, in general?

There is no such generalization. The amount of RAM you need depends on the task. For firewalling, you don't need lots. For a high-traffic, caching webserver you do need much.

If, in general, you are firewalling .. you won't need much RAM. If, in general, you are doing something else, you might need it.

Like I said in my previous mail, there is no short answer. No quick solution. Everything has advantages and disadvantages. In some cases you may not even want to run OpenBSD (*shock* !). In general, you should look at the specific problem at hand and solve it with the means available.

Cheers,

Paul 'WEiRD' de Weerd
+++++++++++>-]<.>++[<------------>-]<+.--------------.[-]
http://www.weirdnet.nl/
for a packet filter/router/...? 32bit 2Gig and take a gig out. for a database server? 64bit and add ram when required. there is no "in general". -- Henning Brauer
64-bit and 1 GB. it's much easier to add another GB RAM later than to add 32-bits.
The increase from 2^32 to 2^64 is even more impressive. ;-) --Jon Radel
HOLY SH*T! I tried 4.2. It rocks!

Just the first test that I tried after installing it:
- switched gigabit network
- web server behind 1:1 NATing firewall
- firewall is AMD64 X2 2.4GHz
- downloading 2GB file via HTTP through the firewall in infinite loop
- flooding the firewall with small UDP packets, random source IPs, generated as fast as my workstation (AMD64 X2 6400, Intel Pro/1000 PCI Express card, Linux Fedora 7, running the kernel-level "pktgen" packet generator which is very fast) can crank them out. The packets are directed to the NATed address of the web server, to a port that's blocked by the firewall.

Under these conditions, OpenBSD 4.1 as a firewall just keels over and dies. All traffic through the firewall just stops in an instant. Linux 2.6.18 fares slightly better, the current download finishes up, but another one won't start.

But the default OpenBSD 4.2 i386 uniprocessor kernel doesn't seem to care. The download just keeps going. New downloads are initiated OK through the firewall. There are even spare CPU cycles left :-) not many (10%) but still. There's a very large percentage of CPU (80...90%) used for interrupts.

Good job folks, I'm impressed. Anyone building gigabit routers and firewalls, don't delay, upgrade to 4.2. Heck, do that even for 100Mbit systems, this type of DoS doesn't need much bandwidth to be effective.

I'll keep doing tests. If anything interesting shows up, I'll post the results in a new thread.

-- Florin Andrei http://florin.myip.org/
First, thanks for sharing your findings. Secondly, does anyone on the mailing list know of an OpenBSD equivalent to pktgen? Thanks. Jim
Not in-kernel, but netblast from the netrate package is somewhat useful.
Disabled all pf rules including NAT, now it's just "pass in ; pass out". Now the download is able to saturate the gig ports, about 112 Mbyte / sec. But it's still not constantly at 112, it sometimes drops about 10% below that. When that happens, CPU0 has 0% idle cycles. A lot of interrupts, always above 70% on CPU0, going to 99% when the download slows down. The congestion counter is now 0. The UDP flood still freezes the system solid (but I discovered that the system clock continues to work more or less fine, it's just the text console and the firewall that are not responsive). I still can't match the performance I get from Linux. Any suggestion is appreciated. -- Florin Andrei http://florin.myip.org/
while it is dreadfully obvious that there is some weirdness happening, you'll definitely get more performance by switching to the latest snapshot or waiting for your 4.2 cd if it hasn't come yet. What model Transport do you have and what's the mainboard's BIOS rev?
there were postings on this list in the past about problems with quad-port em NICs. I am absolutely not in a position to tell whether they are relevant for this situation. If I remember correctly, there was a problem with TCP checksum offloading, and a suggested fix in one instance was jumpering the card down to 66 MHz. I can't tell if this is related in *any* way. I think there are some people here who *could* tell if you'd post a dmesg. greetings, knitti
# dmesg
OpenBSD 4.1 (GENERIC.MP) #1152: Sat Mar 10 19:22:57 MST 2007
    firstname.lastname@example.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 3220754432 (3145268K)
avail mem = 2757828608 (2693192K)
using 22937 buffers containing 322281472 bytes (314728K) of memory
mainbus0 (root)
bios0 at mainbus0: SMBIOS rev. 2.3 @ 0xf97e0 (61 entries)
bios0: empty empty
acpi0 at mainbus0: rev 2
acpi0: tables DSDT FACP APIC OEMB SRAT
acpitimer at acpi0 not configured
acpimadt0 at acpi0 addr 0xfee00000: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Dual-Core AMD Opteron(tm) Processor 2216, 2394.33 MHz
cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,CX16,NXE,MMXX,FFXSR,LONG,3DNOW2,3DNOW
cpu0: 64KB 64b/line 2-way I-cache, 64KB 64b/line 2-way D-cache, 1MB 64b/line 16-way L2 cache
cpu0: ITLB 32 4KB entries fully associative, 8 4MB entries fully associative
cpu0: DTLB 32 4KB entries fully associative, 8 4MB entries fully associative
cpu0: apic clock running at 205MHz
cpu1 at mainbus0: apid 1 (application processor)
cpu1: Dual-Core AMD Opteron(tm) Processor 2216, 2465.82 MHz
cpu1: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,CX16,NXE,MMXX,FFXSR,LONG,3DNOW2,3DNOW
cpu1: 64KB 64b/line 2-way I-cache, 64KB 64b/line 2-way D-cache, 1MB 64b/line 16-way L2 cache
cpu1: ITLB 32 4KB entries fully associative, 8 4MB entries fully associative
cpu1: DTLB 32 4KB entries fully associative, 8 4MB entries fully associative
cpu2 at mainbus0: apid 2 (application processor)
cpu2: Dual-Core AMD Opteron(tm) Processor 2216, 2465.82 MHz
cpu2: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,CX16,NXE,MMXX,FFXSR,LONG,3DNOW2,3DNOW
cpu2: 64KB 64b/line 2-way I-cache, 64KB 64b/line 2-way D-cache, 1MB 64b/line 16-way L2 cache
cpu2: ITLB 32 4KB entries fully associative, 8 4MB entries fully associative
cpu2: ...