Archives by Category
Contact
- Hagen Paul Pfeifer
- http://jauu.net
- hagen@jauu.net (encrypted preferred)
- KeyId: 0x98350C22
- Telephone: +49 174 5455209
Follow this blog
ACK as Grep Replacement
- Published in: programming
- | Time: 00:42:47 CET
- | SHA1: cc715c6013586498a3ea4eb14f93d0ae04c218ca
Not often but sometimes I discover new tools. Today I found ACK:”http://betterthangrep.com/”. There are some nice features:
- exclude git, svn directories
- perl regex
- better highlighting support
- per default 5 lines context
- pattern to exclude searched files
E.g to search the Linux Kernel network subdirectory for TCP SACK code the following command can be used:
ack-grep -CC SACK /usr/src/net-next/net
The CC option specifies that only *.c and *.h files are searched. Under Debian the package the program is called ack-grep.
New TCPM Co Chairs
- Published in: ietf
- | Time: 22:48:30 CET
- | SHA1: 87fa67f0029e36ce454ec3834539c720452a56dd
IETF TCMP will have a co-chair change: David Borman has stepped down as a TCPM co-chair and Wesley Eddy announced 3 co-chairs, instead of just 2: Yoshifumi Nishida, Pasi Sarolahti and still Wesley Eddy. Michael Scharf still acting as the chair.
Git Interactive Rebase and Stashing
- Published in: programming
- | Time: 21:51:25 CET
- | SHA1: 2e0c05c4b0b59c4874ea937642854b0caae0f4af
Sometimes I have to respin a patch-serie to address some new concerns. This time was a little bit more difficult: I first thought the a new feature can be applied on top of HEAD, later I realized that this particular feature should be applied on a previous commit. The workflow with git is really easy:
# save changes in stash stack git stash # start to rewrite history # change the addressed patch to 'edit' git rebase --interactive HEAD # re-apply the changes git stash pop # eventually fix merge commits vim file-to-fix.c git add file-to-fix.c # recommit git commit --amend # finish rebasing git rebase --continue
Linux Network Emulator Extensions
- Published in: networking
- | Time: 14:47:31 CET
- | SHA1: 87413c07fcbf1a37b0a7c74f6cf8399a8e4cdd35
Seems that the merge window for the upcomming 3.2 close. In davem’s net-next for 3.3 there are a couple of network emulator improvements:
rate extension allows to shape the traffic to n bytes/second:
Currently netem is not in the ability to emulate channel bandwidth. Only static delay (and optional random jitter) can be configured.
To emulate the channel rate the token bucket filter (sch_tbf) can be used. But TBF has some major emulation flaws. The buffer (token bucket depth/rate) cannot be 0. Also the idea behind TBF is that the credit (token in buckets) fills if no packet is transmitted. So that there is always a “positive” credit for new packets. In real life this behavior contradicts the law of nature where nothing can travel faster as speed of light. E.g.: on an emulated 1000 byte/s link a small IPv4/TCP SYN packet with ~50 byte require ~0.05 seconds – not 0 seconds.
Netem is an excellent place to implement a rate limiting feature: static delay is already implemented, tfifo already has time information and the user can skip TBF configuration completely.
This patch implement rate feature which can be configured via tc. e.g:
tc qdisc add dev eth0 root netem rate 10kbit
To emulate a link of 5000byte/s and add an additional static delay of 10ms:
tc qdisc add dev eth0 root netem delay 10ms rate 5KBps
Note: similar to TBF the rate extension is bounded to the kernel timing system. Depending on the architecture timer granularity, higher rates (e.g. 10mbit/s and higher) tend to transmission bursts. Also note: further queues living in network adaptors; see ethtool(8).
Second patch was the cell concept extension:
This extension can be used to simulate special link layer characteristics. Simulate because packet data is not modified, only the calculation base is changed to delay a packet based on the original packet size and artificial cell information.
packet_overhead can be used to simulate a link layer header compression scheme (e.g. set packet_overhead to -20) or with a positive packet_overhead value an additional MAC header can be simulated. It is also possible to “replace” the 14 byte Ethernet header with something else.
cell_size and cell_overhead can be used to simulate link layer schemes, based on cells, like some TDMA schemes. Another application area are MAC schemes using a link layer fragmentation with a (small) header each. Cell size is the maximum amount of data bytes within one cell. Cell overhead is an additional variable to change the per-cell-overhead (e.g. 5 byte header per fragment).
Example (5 kbit/s, 20 byte per packet overhead, cell-size 100 byte, per cell overhead 5 byte):
tc qdisc add dev eth0 root netem rate 5kbit 20 100 5
Eric also fixed classful handling of netem. Some month back someone removed the feature and netem was only able to act as a leave module. With Erics patch it is again now possible to add SFQ to netem. Sure this feature is more sophisticated but at least I will have a use case for this.
A couple of fixes and Stephen added userspace support for iproute to use the new loss model (markov-chain based). So a lot of changes for netem in this merge window. Last but not least I have another larger patch for the next merge window … ;-)
Git Diff Context
- Published in: programming
- | Time: 01:01:36 CET
- | SHA1: 08b024ee38f509f9d9e112da6956b8f825a860b1
Wow, two new git diff options:
--minimal to spend extra CPU cycles to generate a shorter diff
--function-context to display the whole function that was affected by the change. Nice for reviews by people without the whole context in mind.
Cisco ASA and DNS Security
- Published in: networking
- | Time: 23:19:19 CET
- | SHA1: d8b50877c4dadecda053a4b848a5e3dda0d61c43
I started to inform how Cisco ASA, Cisco PIX and Cisco FWSM firewall appliance secure their domain from DNS traffic. What is possible, what can I transport over DNS without increased drop probability. I question myself what DNS flags can be touched without any flaw.
I must admit that I’m no Cisco expert – not at all. If I look at the configuration possibilities I have to say “wow”:
class-map inspection_default
match default-inspection-traffic
policy-map type inspect dns preset_dns_map
parameters
dns-guard
id-randomization
message-length maximum 512
id-mismatch count 10 duration 2 action log
exit
match header-flag RD
drop
policy-map global_policy
class inspection_default
inspect dns preset_dns_map
service-policy global_policy global
What administrator knows DNS at this level? I mean that are no default values
(I think so), that are the recommendation of the official Cisco webpage. Let me
pick the message-length option: this means that no DNS request/reply larger
as 512 byte can be received! Today in a world of EDNS0, DNSSEC and several
AAAA answers in one packet this limit can trigger erroneous function.
Especially because the “configuration error” will show up rarely.
Virtual Processor IDs and TLB
- Published in: programming
- | Time: 20:53:16 CET
- | SHA1: cb696be96518e1e61739e561d7f2833b64ca7dea
The translation lookaside buffer is a high-speed memory page cache for virtual
to physical address translation. It follows the local principle to avoid time
consuming lookups for recently used pages. But what happened in a virtual
environment (e.g. kvm, xen, vmware)? Host mappings are not coherent to the guest
and vice versa. Each guest has it’s own address space, the mapping table cannot
be re-used in another guest (or host). Therefore first generation VMs like
Intel Core 2 (VMX) flush the TLB on each vm-enter (resume) and vm-exit. But
flushing the TLB is a show-stopper, it is one of the most critical components
in a modern CPU. The relevant code for Linux is located under
arch/x86/kvm/x86.c:vcpu_enter_guest().
But Intel engineers started to think about that. Intel Nehalem TLB entries have changed by introducing a Virtual Processor ID. So each TLB entry is tagged with this ID. VPID’s are not specified by the CPU, they are allocated by the hypervisor whereas the host VPID is 0. Starting with Intel Nehalem the TLB must not be flushed. When a process tries to access a mapping where the actual VPID does not match with the TLB entry VPID a standard TLB miss occur. Some Intel numbers show that the latency performance gain is 40% for a VM round trip transition compared to Meron, a Intel Core 2.
Current Kernel (at least e56c57d0d3fdb from net-next) provides no VIDS for
nested VM’s. This is no limitation of Intel’s concept
(arch/x86/kvm/vmx.c:prepare_vmcs02()). It is rather a proof that nested VMs
are not in wide-use and nobody seems to miss that behavior.
Ccontrol and Specific Make Environments
- Published in: programming
- | Time: 14:57:26 CET
- | SHA1: 5941c67f7586ba0706406ec351b135df48273280
Davem wrote that he sometimes accidentally type make -j 128 on his Intel
platform (btw: davem is Sparc maintainer). Rusty wrote that this was the main purpose to write ccontrol. A wrapper to control distcc, ccache and make.
Via ccontrol you can configure the current setup for the local machine. After that you use ccontrol instead of make and ccontrol will use the configured setting for this environment.
A standard configuration, generated via ccontrol-init on a Dual Core Processor looks like the following (~/.ccontrol/default):
[*]
cc = /usr/bin
c++ = /usr/bin
ld = /usr/bin
make = /usr/bin
cpus = 2
ccache = /usr/bin
A envorinment with several distcc hosts may have two additional lines:
distcc = /usr/bin/distcc distcc-hosts = hostname1 hostname2
On Debian ccontrol is available as a package: aptitude install ccontrol
WGET and HTTP Certificates
- Published in: neworking
- | Time: 02:13:42 CEST
- | SHA1: 50ff801b05b73f9a35cccbe8512c5bac1f283c13
@hal:~ $ wget https://fit.nokia.com/lars/papers/2010-imc-hgw-study.pdf --2011-10-27 02:23:29-- https://fit.nokia.com/lars/papers/2010-imc-hgw-study.pdf Resolving fit.nokia.com (fit.nokia.com)... 2001:708:40:fff1::1, 212.213.221.39, 195.148.124.194 Connecting to fit.nokia.com (fit.nokia.com)|2001:708:40:fff1::1|:443... connected. ERROR: The certificate of `fit.nokia.com' is not trusted.
I mean not that Nokia has IPv6 connectivity, no I mean why is the default
behavior of wget to verify certificates? I thought everybody use wget as a
handy download tool and to make some fix bandwidth measurements. (--no-check-certificate do the trick)
UEFI Secure Booting
- Published in: uncategorized
- | Time: 00:00:58 CEST
- | SHA1: f0e2026ccc005b9571bf3a7eb8932f6c637be505
Matthew Garrett blogged one more time about UEFI isses. To cite the most criticial problem:
“This is a little awkward for a couple of reasons. First, it means that any updates to the bootloader would require the user to manually accept the binary again. Second, as we’ve seen with https, there’s the risk of uesrs just blindly accepting things. If it’s acceptable to do this off internal media5 then malware could just rely on some number of users saying “yes”. And thirdly, we’ve been told that Microsoft’s Windows logo requirements forbid this option.”
ELF Symbol Name Length
- Published in: programming
- | Time: 22:28:00 CEST
- | SHA1: 6284940addab5c7192c5f233fe01d17b7dd1bc10
Category: unlimited. The ELF spec has no real (artificial) symbol name limitation. It only restricts offsets in string table by 4 bytes. And gcc/gdb are consequent: they also introduce no artificial limitation.
Of course, resolving dynamic linked object requires hashing and strcmp’ing these symbol names. Especially larger projects with lots of symbols will suffer from this (you may notice this for C++ symbols because of namespace bloat).
GCC versus LLVM
- Published in: programming
- | Time: 21:07:33 CEST
- | SHA1: 7bd545cd31cf3704a15f7d76e93060bbd8e52ccf
Today is one of these wacky days where nothing works (problems with kvm-tool, problems with RAM backed block device driver and kvm interaction and so on. But today was also the day of another Round of Vladimir Makarov gcc-versus-llvm round (gcc mailing list). To summary the highlights:
- LLVM is not faster as GCC (this is a often repeated lie): If you need the same generated code quality and compilation speed as LLVM -O2/-O3 you should use GCC with -O
- If you want 10%-40% faster generated code, you should use GCC with -O2/-O3 and you need 20%-40% more time for compilation (150%-200% if you use GCC LTO)
- Vladimir believe that LLVM code performance is far away from GCC because it is sufficiently easy to get first percents of code improvement, it becomes much harder to get subsequent percents
Vladimir used this year -Ofast -flto -fwhole-program instead of -O3 for GCC and -O3 -ffast-math for LLVM. To see the full report follow the link:
Head Of Line Blocking
- Published in: programming
- | Time: 19:26:58 CEST
- | SHA1: 7c84832ef9c8abbba242e52ef452b223fc404c53
The last two days I programmed an system with n input channels and n output channels. The system multiplex the input packet to the actual output channel, depending on packet IP destination address (and do some packet mangling, but this does not matter here). But the output channel can block (the socket returns EAGAIN, e.g. if TCP peer close the window). Each received packet is enqueued in the input queue. If the output channel is blocked then the input queue is stalled, no new packet can be transmitted, although the packet is indent for another output queue – this is bad.
The problem is also typical for switches: multiple input, multiple output and sometime the destination is the same output, thus buffering is required. The following image illustrate the layout of a Cisco Catalyst 6500. Catalyst do NOT suffer from HOL blocking because buffering is handled in egress queues – which is superior solution if the multiplex process is fast enough and do not consume to much cycles.
The idea to use a input queue was performance driven. At line rate – when CPU processing is at 100% load – I don’t wanted to spend the cycles in packet processing if later on the packet is dropped with a high probability. So from a performance point of view I don’t want to drop the design. But the performance issues that arise from the HOL blocking are much more relevant. So I started to look for workaraounds. The following papers look quit interesting:
RFIFO: Retreat FIFOs for the Head-of-Line Blocking problem
Head of Line blocking prevention technique based on multiple input queues per priority
Host Protocol for the ARPA Network
- Published in: networking
- | Time: 02:17:58 CEST
- | SHA1: e6d9056241bcdeff0e1f2f8f7203ec56488b44c7
Today a fresh I-D was published, titled: “Host/Host Protocol for the ARPA Network” from Alexander McKenzie and Steve Crocker; covering Internet history. Great I-D:
This document reproduces the Host/Host Protocol developed by the ARPA Network Working Group during 1969, 1970 and 1971. It describes a protocol used to manage communication between processes residing on independent Hosts. It addresses issues of multiplexing multiple streams of communication over a single hardware interface including addressing, flow control, connection establishment/disestablishment, and other signaling. It was the official protocol of the ARPA Network from January 1972 until the switch to TCP/IP in January 1983. It is offered as an “intended RFC” at this late date to help complete the historical record available through the RFC series.
Its Fairness Stupid
- Published in: ietf
- | Time: 21:06:46 CEST
- | SHA1: 1ad860df090a5f4b6f2c0a6bb1c719973b6a76b4
Some thoughts about recent activity here at TCPM and some other places.
Since several years vendors and web companies try to address web performance problems by adjusting TCP. Congestion control, slow start (IW10), timeouts and the like are addressed like a function of time: IW10 seems adequate for 2011, IW15 for 2013, IW20 for 2016 and so on. Timeouts are adjusted to current “consumer” networks. But the actual network characteristic is hardly a function of date, it is a function of the effective link characteristic between two endpoints in a packet switched networks with no a priori bandwidth guarantee. X.25, multi-hop sensor networks and low bandwidth radio links are _still employed_. Links with just a BDP of a few thousands byte per second, links cope a few packets in flight and with a large intrinsic jitter.
IW10, IW14, IWx are proposed to speed things up. Don’t get me wrong: I am generally ok with IW10 – I believe that the (potential) negative impact is manageable, overshooting the link is unlikely. I believe that this window is in a lower range of noise so that a flow – and all other flows – reach a stable equilibrium. But I am concerned about larger IW values. Last but not least: I cherish the work of Jerry, Joe, Michael, Mark, Ilpo, Matt, Nandita at al.
The argument that modern web browsers – or FTP client’s – actually open several parallel connections to speed up the process is no valid argument. It make things worse, because it de-facto disable congestion control. I.e. in extreme to transfer 1Gbyte, a client can spawn 10000 TCP connections in parallel, but with each connection you make the global congestion control mechanism less effective. To forecast something: I don’t believe that Chrome and Co disable the “parallel connection attempt” if IW10 is approved and becomes a standard track. In the end we will see several parallel connection attempts AND IW10. Simultaneously open n connections and shortly after the three way handshake receive n*10 packets in one flush. Because clients can spawn several connection in parallel does not mean that they should or is fair regarding Jain’s fairness index.
Larger vendors should not try to modify TCP in their horizon – TCP is for everyone. To gain milliseconds on “their” side probably mean loss of several seconds on the “other” side. Modify TCP is a short term solution. The long term solution is somewhere else. HTTP for example could cache more information: nearly 95% of all web-pages on my daily basis are identical: images, logos, web-page text fragments, java script files and so on are identical. I download them over and over again. But it is easier for vendors to tune TCP. But tuning TCP will not scale infinitely, you have a limited bandwidth and with a packet switch network you have the probe the available bandwidth – that’s a fact.
This is a plea to keep the spirit of a conservative TCP in the standardization body.
Justitia, a little bit uncommon without a pair of balances. Somewhere in Brazil:
Arabic Grid
- Published in: uncategorized
- | Time: 00:59:21 CEST
- | SHA1: 376457f903a332344038672ec4b0b88670946760
PCI Express Network Adapter Performance
- Published in: programming
- | Time: 01:54:44 CEST
- | SHA1: 6282c78b6cebe64fc29e21b1b74dfd46d56b44e7
Today I spent quite some hours in spotting some packet drops at Gigabit line rate. Stress-testing the hardware at line-rate reveal a lot of noise and disagreements: it is difficult to determine what really is the source of packet drops. Due to some netperf and perf analysis and some new tracepoints the eyes focus on PCI express bus.
Quite a lot of analysis until that, first I interpreted the raw values as a CPU
limitation. E.g. the CPU may not fast enough to keep sk_buff’s to the
adapter. Adjusting the interrupt rate or adjusting the RX/TX ring buffer size
are adequate actions, but as I indicated, the CPU was not the limiting
factor in the end.
I started to adjust the e1000e driver with some PCI driver modification to set the maximum read request size. I hope I can provide some graphs tomorrow for 1 byte packets and Ethernet Jumbo Frames.
TCP Fairness
- Published in: networking
- | Time: 22:14:25 CEST
- | SHA1: 60c5353c1cfb3f8710baa251317dae5a156508a9
TCP Fairness Debate
Last time as I posted about TCP fairness aspects of TCP I don’t touched the topic in great detail. But TCP fairness and protocol fairness in general is a important requirement in the whole IP communication zoo. In this blog posting I will write some sentences about the fairness aspects of a network layer protocol and particular TCP.
UDP and Datagram based Protocols
To start with UDP – as a unreliable, unordered lightweight datagram based service – provides no fairness mechanisms at all. If the ratio UDP versus TCP would shifts to UDP the whole Internet structure would break at a certain point. This is a factum. The Internet protocol suite is designed in a manner that a correct operational requires well behaved middle boxes, end systems the protocol interplay. UDP does not fulfil this requirement because UDP provides no mechanism to react of network condition. Even worse, UDP provides no mechanism to even receive any network feedback. Actual1 analysis shows that the ratio UDP to TCP lies somewhere between 5% and 20%.
Current application often seems to prefer UDP over TCP. This includes media streaming applications like VoIP or Video applications, all kind of tunnel applications (IPSec, OpenVPN, GRE, IPv6 and so on) and gaming application (gaming application employ often TCP because of NAT problems, but this is another topic). But we known that UDP cannot displace TCP from his prominent position. The IETF started several years ago a initiative to develop a protocol that can fill the gap. It is called Datagram Congestion Control Protocol (DCCP) and can be described as a bidirectional, congestion-controlled, unreliable unicast connection protocol. The main protocol is described in RFC 4340 some protocol extension are described in RFC 4341 and RFC 4342 .
Some years ago I wrote an article for a German magazine about DCCP. This article is now freely available: IX DCCP Article
At the time of writing the usage of DCCP is quite low. I know some application which are prepared but the employment suffers from the chicken-egg problem. Furthermore, decision makers act with caution because some middle-box (black-box) problems are to be expected. The employment of IPv6 will relax this problem a little bit. I will re-read this article in some years and update it. ;)
Congestion Collapse and Van Jacobson
The previously mentioned congestion collapse is not a theoretical consideration. No, back in 1988 TCP also did not include any congestion control mechanism! In April a massive collapse was noticeable – caused by the lack of functionality. Van Jacobson
TCP and Streaming Protocols
But let us come back and discuss the actual topic: TCP fairness.
TCP analyse the current network congestion status: it treats lost segments as an indicator of congestion, it can treat packet receive variations as an congestion indicator. Beside these passive recognition techniques TCP employs several active mechanisms to prevent the congestion collapse. Mechanism include the famous slow start to probe the available bandwidth, congestion avoidance to converge carefully to the bandwidth if TCP detects that the pipe is nearly at the capacity limit.
[1] http://www.cs.auckland.ac.nz/~brian/udptcp-ratio-TechReport.pdf
Back From Berlin
- Published in: uncategorized
- | Time: 23:11:41 CEST
- | SHA1: 9f90cd21754accc1231ff465dd30ae3f1fbe3a48
Traveling back from Berlin to Munich. This time via City-Night-Train booked in a six-person wagon. Reading “der Freitag”, a Berlin weekly newspaper produced here in Berlin until midnight. One of Kitas professor suggest the paper.
I took this extended weekend (from Thursday) to arrange our new home in Weissensee/Berlin. Paint the rooms white and the floor in a special mixed color. Next time I will establish the parquet (84m^2). After that all ground work is done and the furniture can be ebayed™. ;-)
Back From Holidays
- Published in: uncategorized
- | Time: 00:23:16 CEST
- | SHA1: 9f8cb8b963acc81b75a54fb4fee56d06cf0cdb7d
Between 15th and 25th of July I was at Majorca/Spain for one week of relaxing (including a little bit of climbing). A really nice spot!
Intel E5645
- Published in: uncategorized
- | Time: 00:47:20 CEST
- | SHA1: 3895887e2474308114484a9feb848a7f42e875da
Quit interesting Westmere-EP CPU:
| # of cores | 6 |
| # of threads | 12 |
| Bus speed | 2933 MHz QPI |
| System Bus | 5.86 GT/s |
| Level 1 cache size | 6×32 KB icache, 6×32 KB dcache |
| Level 2 cache size | 6×256 KB |
| Level 3 cache size | 12 MB |
| Features | SSE4.2, Turbo Boost, AES New Instructions, Enhanced Intel SpeedStep Technology. ... |
| Max Memory Bandwidth | 32 GB/s |
| Memory Types | DDR3-800/1066/1333 |
| Lithography | 32 nm |
In the end dual sockel boxed charges will amount to approx 1000.- Euro.
Email Standards
- Published in: networking
- | Time: 02:29:20 CEST
- | SHA1: 90af160c15f2c0de0d7c6e58443482d5976c445f
WTF? I stumbled over http://www.email-standards.org/ they call it Email standard? HTML Email? I mean Email and HTML are two totally different things. Apart from the fact that I don’t need another channel which is so commercialized as WEB nowadays. For people accustomed to receive text email because of information exchange the project name email-standard is … I don’t know … ;-)
Finally
- Published in: ietf
- | Time: 00:59:33 CEST
- | SHA1: 057974f8af30dd7df1b04c6d8a25bb0f0d9c1e14
Mobility Support in IPv6 – Request for Comments: 6275
Goes the way of the Dodo
- Published in: networking
- | Time: 01:06:45 CEST
- | SHA1: 2242bd22733eb7f304496517cd057bae2ee4119c
Jesper Juhl today on LKML questioning about the new Kernel naming scheme: ”... There are a few scripts that need fixing if SUBLEVEL goes the way of the Dodo, but I didn’t want to start fixing those without a clear indication of whether or not SUBLEVEL is going to/should die.”
Libhashish Development
- Published in: algorithm
- | Time: 08:37:14 CEST
- | SHA1: 224c7d33f7b5176321abcc4ec3cf75cf833252a3
Libhashish is one of my long term project developed with git. Today I stumbled over a tool to visualize commits, called gource:
NS 3 Quote
- Published in: networking
- | Time: 22:33:47 CEST
- | SHA1: 92767d343f327f2a9b266f80567d2e225ab54040
.h2 Move VisualSimulatorImpl into the core module?
move VisualSimulatorImpl into the core or network module.
Does this remind anyone else of Microsoft moving graphics into the kernel in NT 4.0, so that it was much more unreliable than NT 3.5 and could be crashed by any 3D screensaver?
-1
L.
No Cache Copy
- Published in: networking
- | Time: 23:32:12 CEST
- | SHA1: 67cf61c741c173b5f60dff466df93b753933305f
Tom posted these days a patch called “Allow no-cache copy from user on
transmit”. The idea behind is the following: in net/ipv4/tcp.c:tcp_sendmsg()
will copy date from user buffer to sk_buff via copy_from_user() or
csum_and_copy_from_user() (which does what the name implies: copy the data
and calculate a (partial) checksum). Tom realized that when data is copied
data caches are touched which are often not used afterwards.
The feature is enabled if the device signals that he is doing some kind of checksum offloading. So that the driver must not touch any data – calculate the checksum. In other words: this feature is disabled if software checksumming is required. There some other interplay (e.g. with loopback interface) but these are covered. The feature is also configurable via ethtool.
Some measurements (200 instances of netperf TCP_RR):
No-cache copy disabled: 672703 tps, 97.13% utilization 50/90/99% latency:244.31 484.205 1028.41 No-cache copy enabled: 702113 tps, 96.16% utilization, 50/90/99% latency 238.56 467.56 956.955 Using 14000 byte request and response sizes demonstrate the effects more dramatically: No-cache copy disabled: 79571 tps, 34.34 %utlization 50/90/95% latency 1584.46 2319.59 5001.76 No-cache copy enabled: 83856 tps, 34.81% utilization 50/90/95% latency 2508.42 2622.62 2735.8
The patch seems fine but Davem take them out because of some coding issues.
Shinny KVM Userspace
- Published in: programming
- | Time: 13:15:07 CEST
- | SHA1: 52c7445e77decebacbda652e3babbaa47c453370
Wow, today Pekka Enberg announced a new KVM userland (e.g. a qemu pendant). Techical it is really raw at the moment: no networking support, no graphic support and so on. The list of missing things is long. The most prominent change is another one: it is aligned with the Linux development, e.g. it is placed under tools/kvm and because of it freshness it can mature into a more powerful environment as todays qemu (e.g. SMP support).
To build the tool you can try this:
cd /usr/src/linux git checkout -b kvm/tool git pull git://github.com/penberg/linux-kvm.git # Compile the tool cd tools/kvm&& make # Download a raw userspace imag wget http://wiki.qemu.org/download/linux-0.2.img.bz2 bunzip2 linux-0.2.img.bz2 # Build a kernel with CONFIG_VIRTIO_BLK=y CONFIG_SERIAL_8250_CONSOLE=y CONFIG_EXT2_FS=y CONFIG_EXT4_FS=y # lunch the tool ./kvm --image=linux-0.2.img --kernel=../../arch/x86/boot/bzImage
Staging WTF
- Published in: programming
- | Time: 15:54:10 CET
- | SHA1: 9f84b391ee00b73c4c6737eb7ab14e91d831f95e
[...]
len = dwrq->length;
ext = _malloc(len);
- if (!_malloc(len))
+ if (!ext)
return -ENOMEM;
if (copy_from_user(ext, dwrq->pointer, len)) {
kfree(ext);
[...]
Initial Congestion Window and NS2 Testbed
- Published in: networking
- | Time: 23:09:54 CET
- | SHA1: 6354a367f845ea0ef69bcb37578a7b47bf0df2a9
The increase of the Initial Congestion Window (IW) throughout the land: netdev ML ((Linux Network Development mailing list), TCPM (TCP Modification WG) ML, ICCRP (Congestion Control Research Group WG) ML, TMRG (Traffic Modeling Research Group) ML are a few, undoubtedly the most significant bodies related to TCP all have one thing in common: they all discuss about the (initial) Google proposal to increase IW to 10 or even 16.
I think my position is to this topic is more or less known, this time I want to sum up arguments of other folks. At the end it is a long process to shift the IW to whatever value seems adequate and a lot of research is required to prove/back up the modified IW in the wild.
IW10/IW16 has only verified in the Google (especially Jerry Chu) and Ilpo Jarvinen testlab with SACK enabled sender. There are no analysis how the behavior shifts if SACK is disabled or not available, due to a receiver lack of SACK. But this is only one point of critique. To back up (or revise) my assumption I started today to setup a ns-2 testbed to analyze other corner cases. Especially multihop networks (e.g. MANET) with a bandwidth of a few kbits are of interest. The current simulation provides a OLSR based multihop infrastructure. Now I will start digin into patching the Initial Congestion Window of the network stack as well as think about useful analysis scripts.
And because it is Friday: a Internet LOL-Cat Meme

IETF 80 Prague Registered
- Published in: ietf
- | Time: 00:26:50 CET
- | SHA1: 458496a1377402857292534bafce02b3f86a8086
IETF 80 will be held in Prague – and as a early bird I registered today. The advantage of Prague is that I am familiar with the city which makes it easy for me to find a cheaper hotel compared to the standard conference hotels (Hilton: 148 Euro/day, Inter Continental: 140 Euro/day). The hotel in Beijing (IETF 79) for example costed ~1500.- Euro for 9 days. I hope to find a hotel for about 350.- Euro. Looking forward to meet colleagues and IETF maillinglist parter.
Linux TCP Quick ACK 2th Attempt
- Published in: networking
- | Time: 19:38:22 CET
- | SHA1: 8fc30614b01554ceeca28c1c36ec55b81bdbf412
Back in August I submitted a
patch
to enable/disable the TCP Quick ACK behavior. Linux default Quick ACK
behavior is in many cases counterproductive and increase the packet count.
Especially short-lived interactive protocols like HTTP will suffer of TCP
Quick ACK. For example a interactive HTTP flow of 16 data packets will
send one additional but unnecessary Quick ACK packet – 1/16. Accumulated
this is not to underestimated! Not sure why big HTTPD users like facebook or
$BIGCOMPANY do not tune their stack – do they not question their tcpdump
traces?
At that time where I submitted the patch Davem rejected the sysctl based
approach with the suggestion to use a per-route approach to reduce sysctl
pollution. The big disadvantage is that it is now a little bit tricky to
enable/disable Quick ACK’s, because each route must be addressed separately—
I am not happy with the idea to ban nearly all new sysctl extensions.
Anyway, I modified the patch, I will tests the patch a couple of days and
afterwards it should goes into net-next, hopefully.
Mass Parallel Network Processing Architecture
- Published in: programming
- | Time: 20:30:03 CET
- | SHA1: e7015a55176a45d49e07d11b5179430b8ecff710
Yesterday and today I wrote an ANSI C program to demonstrate how a non-blocking,
threaded network server for current SMP/CMP systems may be structured. The
design of the server is constructed in a way that all CPU may stressed and load
is fair balanced. All thread local clients are multiplexed via a event loop
management wrapper (epoll() or select()).
The program demonstrate the most effective architecture for current CMP/SMP
server systems. One or more passive server sockets is shared by all threads.
These file-descriptors are put into all thread local event loop and monitored
for EV_READ. Each a new client connect() the server will trigger a EV_READ
in one till probably all threads! The actual number of awaken threads depends
on the Kernel internal handling. Anyway, next step for all threads is to call
accept(), but only one thread will return with a new file-descriptor. The
other threads will return -1 and errno is set to EWOULDBLOCK. If this happens
the thread knows that another thread was a little bit faster.
The next step for the lucky thread is to process the client request and put the new client descriptor in the local event management loop. Voila, the next events are handled by this thread.
This time the program is not programmed in a portable way. On the contrary the
program is really Linux centric to gain the maximum performance and use Linux
specific hooks. With some modifications the program should be ported to *BSD
too.
What the current design not support is active load-balancing of work. But as more unfair a CPU is penalized the more likely is that new connections are handled by the idle CPU and therefore after a while a equilibration is reached.
ps ax -L -o pid,tid,psr,pcpu,comm | grep -r parallel - 17020 17020 0 0.0 parallel-net 17020 17021 0 0.0 parallel-net 17020 17022 1 0.0 parallel-net 17020 17023 2 0.0 parallel-net 17020 17024 3 0.0 parallel-net 17020 17025 4 0.0 parallel-net 17020 17026 5 0.0 parallel-net
In the next couple of days I will publish the git repository and give a short notice here in the blog.
Epoll and Select Overhead
- Published in: programming
- | Time: 21:48:30 CET
- | SHA1: d9d5c7c35a997103861053f9e303a5de5914bb6b
I benchmarked epoll() and select() one more time, this time restricted to
one FD which becomes readable. This benchmark reflect therefore the most crucial
possible performance measurement. And which came as no surprise, select()
perform worse. A possible poll() graph should be equal to the select()
graph. But as I said before: I am not interested in poll().
Link Equilibrium and Oversized Router Buffer
- Published in: networking
- | Time: 17:54:30 CET
- | SHA1: 4c8ec6ccee7c576642c2c0776f8a171cb097888c
Jim Gettys wrote an blog entry where he describe problems with a oversized router buffer – he called the problem bufferbloat. Buffer in network equipment exist to catch temporary load peaks and allow proper internal packet exchange from one linecard to another linecard in coexist with the max. bandwidth.
In the case where buffers are overestimated the effect get worse during periods of network congestion. Overestimated buffer will shift the problem in time, or in other words: TCP will overestimate the available capacity and normal congestion avoidance mechanisms do not timely take effect. Overestimated buffers concrete shift the congestion avoidance mechanisms in time – packets are dropped to late and the packet drop is not aligned with the link capacity but rather with the buffer space. The problem with the artificial delayed congestion avoidance is that the congestion avoidance mechanism will overestimate the available bandwidth therefore. User impact is a high latency and jitter. For a proper functioning the mechanism must take promptly and should reflect the link capacity.
Epoll versus Select
- Published in: realtime
- | Time: 23:07:21 CET
- | SHA1: b7cef6896655c3d730a9fbc010a857c6d82cf432
I finished the select() backend for libeve Thus libeve can be compiled under nearly every operating system in this solar system (I didn’t even tried it, but the functional components are ready). Primary focus was and is Linux, I am not necessarily interested in other systems. Under Linux epoll/timer_fd is the best what you can do if you want to do. Under *BSD kqueue should be the most performant implementation, but doe to lack of time and interest it isn’t implemented and select() should be fine too.
Some analysis draw another picture on the wall:
The good value for select() are a little bit misleading: epoll will outperform select and poll if a sub-set of descriptors are ready. The presented test here make all descriptors in the set readable. So the negative performance impact is caused due to massive context switched (e.g. each context switch for epoll() to get information, where the select() set is already set with one system call). Increasing the EVE_EPOLL_ARRAY_SIZE should relax the context switch overhead (currently 64).
IPv6 Address Assignment to End Sites
- Published in: ietf
- | Time: 19:58:05 CET
- | SHA1: 3990837196b53577160b58e07698359476008042
IAB/IESG Recommendations on IPv6 Address Allocations to Sites (RFC 3177)
provides recommendations that for end sites a /48 block should be provided in normal case. A /64 block when it is
absolutely sure that only one subnet is needed and /128 for the case that
only one device is connected. The requirements for IPv6 in 1993 included
the plan that the next IP version should address approximately 240
networks and 250 hosts. Therefore the currently IPv6 address can be
loosely splitted in a 64 bit network number (including subneting) and
64 bit host number (including flat EUI-64 host part and randomly self
autoconfigured host number) → 240 & 250 goal accomplished.
/48 provides end sites the ability to subnet 216 subnet numbers for
internal routing infrastructure each with theoretical max. 264 unique
hosts. Most enterprises should be happy with this. Very large enterprises
should be provided with a /47 or with multiple/48. RFC 3177 constitute a more or less hard
default setting and Regional Internet Registries (RIRs) should bear on
that. The idea of the IAB/IESG recommendation was that a hard structure
will reduce among other things maintainership (e.g. address restructuring, see paragraph
3 for more information).
IPv6 Address Assignment to End Sites
started now to obsolete RFC 3177.
Thomas Narten et. al. stated that the RIRs originally started with /48
but began to switch to other policies in 2005. Namely APNIC, APNIC and
RIPE encourage the assignment of smaller (e.g. /56) blocks to end sites.
One concern is that the hard suggestion can lead to a classfull routing
where CIDR continues to apply to all bits of the routing prefixes. Another
aspect is that RIRs may have other policies which fit better in their model.
Chi Square Test and Cryptographic Hash Functions
- Published in: algorithm
- | Time: 01:00:31 CET
- | SHA1: 3896f1a23e647651c956e5c656f7b88db5a410fc
The Chi-Square probe is a statistical hypothesis test. The here presented values show the distribution of several hash algorithms compared to the cryptographic SHA1. Currently I am too lazy to explain the algorithm in great detail. Wikipedia is a great source for an introduction.

The analysis is part of libhashish and available in the analysis directory.
![]()
SMP and too Optimistic Compiler Optimization
- Published in: programming
- | Time: 23:10:48 CET
- | SHA1: 6721af97076fa3410b4e63ecd25aaf3a8a3d53a9
In a nowadays common SMP/CMP environment with more then one CPU it is necessary
to protect global/shared data in the kernel. One prominent example are
sysctl_ variables. Sysctl variables are protected via a spin lock
(sysctl_lock). This look protect the /proc/sysctl interface path, the look
itself does not protect the usage path itself. Consider the following example:
int global_val;
int foo(void) {
if (global_val != 0) {
int res = bar();
return res / global_val;
}
return -1;
}
Here global_val is a global accessible variable. If you are fit in spotting
errors and remember that we are in a SMP/CMP environment you will notice that
global_val is not protected. Thus during the execution of bar(), global_val
may might be change, in a worst case to zero.
One solution might be the following code fragment:
int global_val;
int foo(void) {
int tmp_cpy = global_val;
if (tmp_cpy != 0) {
int res = bar();
return res / tmp_cpy;
}
return -1;
This may looks fine, but as you can imagine there is another flaw in the code!
The problem is that the compiler may eliminate tmp_cpy and read the
global_val twice! The correct code may look like this:
int global_val;
int foo(void) {
int tmp_cpy = *(volatile int *) &(global_val);
if (tmp_cpy != 0) {
int res = bar();
return res / tmp_cpy;
}
return -1;
This guard prevents the compiler from merging or refetching accesses – no more, no less.
The bottom line is either you protect every piece of shared data costly or you know how your compiler act and you can rely on you tests (regardless if automated or you crowd source your testing ;-).
Hoist Function Subexpression Elimination
- Published in: programming
- | Time: 16:06:21 CET
- | SHA1: e8bf5d30d255877d8eba6006e19bce068c6925ea
Function calls embedded in loop constructs are a area of optimization. As
a general statement: it is clever to hoist any method calls from loop
statments like for, while and the like where the result is constant
(eliminating redundant computation is a standard optimization mechanism). For
example, often you will see constructs like for (i = 0; i < strlen(str); i++) { whatever }.
strlen() is evaluated every time, although str does
never change. A optimized way would be for(i = 0, max_str = strlen(str); i < max_str, i++) { whatever }
so that strlen() is called once.
Modern compilers on the other hand implement strlen() as a build-in
function, the glibc function is not called. GCC provides
his own replacement. The compiler is therefore in the ability to detect that
strlen() exhibit no side-effects (e.g. change no global memory). The compiler can
now hoist this function like the handcrated version. Via -fno-builtin all
build-in functions can be disabled and the compiler cannot optimize this
construct. But wait! Gcc provides a possibility to mark functions to state that
they have any side-effects: the pure attribute (gcc since version 2.96
supports the pure attribute). Gcc can now optimize (hoist) unknown
functions just as builtin functions:
/* naiv strlen implementation - just demonstration of pure */
static size_t __attribute__ ((pure)) strlen(const char *str)
{
const char *s;
for (s = str; *s; ++s) ;
return s - str;
}
To mark the argument with the ANSI const keyword is not enough, on the one side it provides gcc the hint that this function does not modify the argument. It does not provide the compiler enough information that the function body is free of side effects.
From the GCC manual:
Many functions have no effects except the return value and their return value depends only on the parameters and/or global variables. Such a function can be subject to common subexpression elimination and loop optimization just as an arithmetic operator would be.
Another similar technique is known as constant folding. But constant folding provides another area: it evaluates expressions at compile time where the result is calculateable at compile time. One example is htonl() where the argument is a constant (e.g. htonl(0xbeef)).
Jump Table Optimization For If Else Constructs
- Published in: programming
- | Time: 01:01:26 CET
- | SHA1: 50bfcf8c73644685156efb40184e441d7a48daeb
This blog post is not about micro-optimization nor about what you have to use. Neither, it is about how compilers can optimize two common flow constructs. The knowledge may provide enough background information to sharpen the view and provides enough insights what construct to use for which usage.
The following if-else statement is compiled on x86_64 with -O6 into the following ASM sequence:
int main(int ac, char **av) {
if (a 0) foo();
else if (a 1) bar();
else if (a == 2) shu();
}
[...] main: .LFB35: .cfi_startproc subq $8, %rsp .cfi_def_cfa_offset 16 cmpl $0, %edi je .L19 cmpl $1, %edi je .L20 cmpl $2, %edi je .L21 .p2align 4,,5 jne .L13 [...]
The if-else statement is translated into a conditional sequence of cmp and conditional jump instruction.
A identical switch/case construct can be optimized because the compiler can generate a jump table of complexity(1) compared to a complexity of O(n) where the compiler generate a ordinary conditional construct. So if you had to branch into multiple targets and the argument is a integer a switch/case statement is superior. Requirement is that labels are dense. If not the fallback is to use the conditional jump chain.
switch (ac) {
case 1: foo(ac); break;
case 2: bar(ac); break;
case 3: sho(ac); break;
}
L18 is the label where the jump tables is saved. The jmp instruction will jump to the right label:
main: .LFB35: .cfi_startproc subq $8, %rsp .cfi_def_cfa_offset 16 cmpl $5, %edi ja .L12 mov %edi, %edi jmp *.L18(,%rdi,8) .section .rodata .align 8 .align 4 .L18: .quad .L12 .quad .L13 .quad .L14 .text .p2align 4,,10 .p2align 3
If you have some items the if-else statement is sufficient fast and no differences are measurable. But for many entries you should definitely use a switch/case statement. And by the way: a switch/case statement is often more eye pleasant to understand and can be considered as a list of coequal options. But anyway it is a micro-optimizations and does not affect your code (exceptions are omitted). If you write code often and you are free to choose the right construct why not to pick up the more efficient one? But as always in programming: readability and maintainability are of higher interest.
Another tip: sometimes I see code something like this:
if (!strcmp(arg, "foo")) foo(); else if (!strcmp(arg, "bar")) bar(); else if (!strcmp(arg, "shu")) shu(); [...]
The delicate issue of this construct is that the key (here arg) is compared multiple time in the program – not in one place but several times arg must be considered. The key is constant and do never change and strcmp() isn’t the lightweight function. The more intelligent approach is to map the key to a enum/int type at program start and use a switch/case construct afterwards:
int map(const char *arg) {
if (!strcmp(arg, "foo")) return TYPE_FOO;
else if (!strcmp(arg, "bar")) return TYPE_BAR;
else if (!strcmp(arg, "shu")) retrn TYPE_SHU;
}
int main(int ac, char **av) {
const int type = map(av[0]);
switch (type) {
case TYPE_FOO: return foo();
case TYPE_BAR: return bar();
case TYPE_SHU: return shu();
}
}
A last sidenote: if you are mainly interested on the function call than another superior construct is a function pointer array where the type is the index into the array. But this is more specific but on the other hand the most efficient usage for this scenario.
KVM QEMU Kernel Debugging
- Published in: programming
- | Time: 01:06:45 CET
- | SHA1: fae984f835bf9b35a8bd93a850c6cc3a134a0db7
Sometimes it is unavoidable to single step through the kernel because the code flow is complicated and systemtap and other tools are not helpful. This comes true when a lot of code must be conditionally analysed, without any prior knowledge. KVM and GDB provides a nice combination for this. I use my standard qemu setup with two additional qemu flags: -s and -S. Both flags instrument qemu to start a qemu gdb server and to break at the beginning. On the other side, the debugger side the following gdb commands are required to bring the environment in a suitable state:
gdb /usr/src/linux/vmlinux target remote localhost:1234 c bt set architecture i386:x86-64:intel
set architecture i386:x86-64:intel fix a bug where gdb cannot detect that the target is x86_64 one (adjust this for your needs). After this the common commands like setting breakpoints can be applied.
My Baby Is Gone
My baby was my lovely Canon EOS 7D. And I lost the camera during the flight back from the IETF meeting in Beijing/China to Frankfurt/Germany. It is a story about how scatterbrained a man can be. ;-)

I deposited the camera under the seat in front of me. After the plane landed and all passengers surrounding me left the plane I left the plane too – without the camera …
At the customs check I realized that I missed my camera and started to run to the plane. Unfortunately it was not possible anymore. I instantly called the baggage check service and approx. after 15 minutes we got a call that the camera wasn’t there. 15 minutes after I left the plane!
Passengers are out of the question because I sat at the window and nobody saw the camera. Additionally fact: I was the last person in the row – so nobody saw the camera. Now there are two groups who can robed the camera: the stewards and the cleaning staff.
I don’t think that the stewardess stole the camera because normally all stewardess leave the plane in a group, such a big camera would focus the colleagues. Furthermore, stewards do not check seats for forgotten baggage.
On the other hand cleaning stuff members are clean up all corners – including the place under the seat – and they are normally alone. E.g. each of then has an assigned area …
It is so sad that this audacious burglary can happened at a Lufthansa flight. I am really disappointed and unhappy with this incident. I booked 95% of my flights via Lufthansa, I prefer to spend some more money and enjoy a warm atmosphere as well as the sense of a technical maintained airline. Maybe I should reconsider my position and switch to another airline (e.g. Emirates).
- Published in: programming
- | Time: 18:38:15 CET
- | SHA1: 9c4435752e2e0218875bcab03d212bd7a021e364
A error source for ANSI C (ISO/IEC 9899:TC2; 6.3 Conversions) is the implicit type conversion mechanism. The following code fragment illustrates a bogus usage of the shift operation in realtion with the type of the operands:
void bit_mark(unsigned long *seq_bits, int32_t n)
{
int32_t word_offset, word_num;
word_num = n / sizeof(unsigned long) * 8;
word_offset = n % sizeof(unsigned long) * 8;
seq_bits[word_num] |= 1 << word_offset;
}
The error source is the type of “1” which is implicit casted to an integer (Section: 6.3.1.1. “If an int can represent all values of the original type the value is converted to an int; [..] These are called the integer promotions.All other types are unchanged by the integer promotions.). seq_bits on the other hand is a array of unsigned long types. On 32 bit architectures this may work, but on 64 architectures like x86_64 where int is 32 bit wide and unsigned long is 64 bit wide the upper 32 bit are left out! The solution is to define 1 as 1UL to promote this type as the general processing type. Another sidenote: character types are inplicit converted to a integer type, this includes both operand to the shift statement. The formal ANSI C definition can be found under: 6.5.7 Bitwise shift operators:
The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand. If the value of the right operand is negative or is greater than or equal to the wide of the promoted left operand, the behavior is undefined.
Libhashish Release
- Published in: algorithm
- | Time: 19:00:13 CET
- | SHA1: 12dd9819a176c4d77beff60128861006020bc00a
New are pleased to announce realease version 1.1 of libhashish. Major features are, ... no new features. The release includes bugfixes for (u)int{8,16,32}_t hash key collision, a memory leak fix for the collision array implementation – thanks to Emmanuel Roullit, CFLAGS adjustments from Florian Westpahl, and Debian packageizing fixes from Michael Stapelberg.
Advanced Encryption Standard Crypto Speed
- Published in: programming
- | Time: 00:26:12 CEST
- | SHA1: 0512a635bdfcf8ca0bac78984ddbdf273fece973
Some time ago I benchmarked Rijndal at CPU Cycle Level. I held a small presentation where I presented the raw values. The presentation foils itself are a little bit sparse but the important value are cognizable.
![]()
- Published in: algorithm
- | Time: 19:50:03 CEST
- | SHA1: ec678f83cb16e42aa20f69342be519efc1bccde9
Several month ago I spend some months to analyse the behaviour of Genetic Algorithms. The research focused on mutation, selection and crossover algorithms and their behaviour. Genetic Algorithms are probabilistic search heuristics that emulate the process of evolution. Problem domains for genetic algorithms are complex scheduling problems (e.g. flight scheduling at the airport), Protein structure prediction or academic problems like traveling salesman problem. Genetic algorithms are useful where other search algorithms collapse through an almost infinite solution space.
People not familiar with complex search problems of this kind should imagine a graph where the task is to find the minimum of the function. The naive approach to scan the graph from start till end and remember the smallest value will result in a solution – no doubt. At the end of the process the result is the minimum of the function. Now imagine that the graph is not two dimensional (x and y), but rather a three or even n-dimensional problem. The naive approach to scan each section is simple far too much. Genetic algorithms will spread his probe-points, check if a pre-defined minimum is found and if not the probe-points are new-adjusted. Adjusted through mutation, selection and crossover with other probe-points. One major characteristic is that superior probe-points are favored (superior probe-points in our example will have the smaller value – because we are searching for the minimum in a n-dimension solution space) in the process of crossover. The idea is that mutation from a good and a good parent probe-point will finally result in the superior solution – hopefully.
genetic-algorithm.pdf present focus on the Recombination and Mutation alternatives. Whose impact on the algorithm and some other often underrated conditions.All used and cited code is licenced under the GPLv2, see git.jauu.net
- Published in: networking
- | Time: 20:09:44 CEST
- | SHA1: 8698aee5cf0b182a5ce2f9fbb2c9f26518b5bd3d
Due to the lack of money some month ago – now sounds funny – I skipped a conference where I wanted to present an approach to model group mobility of nodes. The idea is to use a form of terrain map (normally used to generate landscape models) as a basis to emulate real world motion characteristics similar. The focus are groups and their motion characterstic within the group. The terrain map is interpreted as what I called penalty map: the object wish is to evade areas with higher penalties to move from start to end. The intra-group mobility is minted through a “virtual magnet”. Nodes are placed in a position around a virtual center of the nodes. Nodes are appointed to stay there but the terrain map will also shift nodes to a more penalty free zone. For example: within a valley all nodes will come closer to each other because the hill have a higher penalty. In a flat environemt all nodes will stay at the pre-defined relative position because all penalty factors are equal.
![]()
The previous section is a summary of the algorithm. See
TCP Quick ACK versus Packet Overhead
- Published in: networking
- | Time: 02:39:51 CEST
- | SHA1: 86cde63331bdff31052389eac43b3382a9226b0d
Since several weeks I deal with TCP Quick ACK mechanism. TCP Quick ACK where introduced to bypass the disadvantages of delayed ACKs. If a flow is not interactive and send fully Maximum Segment Sized (MSS) packets from one endpoint to the other there is no real need to artificially delay any packets. In this situation the receiving host will never send any data and the TCP instance can instantly trigger an ACK packet. This allows the other hand to even increase the congestion window (CWND) faster compared to TCP instances where the ACK is always delayed.
But there are some challenges with this mechanism. First, the TCP instance does not instantly know if a local TCP application will piggy back some data or not. The TCP does not know if a application behaves interactive like HTTP or like a FTP bulk application. The TCP stack must first analyze the connection pattern. Quick ACK can be employed in two ways:
- a new vanilla connection can start with QUICK ACK enabled and deactivate the mechanism if interactive characteristic is detected
- a new vanilla connection is started with Quick ACK disabled and if the connection is not interactive the QUICK ACK is enabled.
Currently under Linux the former mechanism is employed: TCP Quick ACK is enabled per default and only if the heuristic detects that the connection is interactive (e.g. HTTP) the QUICK ACK is disabled!
The drawback is not negligible: the first ACK packet is always generated and transmitted, although the server may have instantly some data. The prominent example is HTTP! Here the client ask for a file (via HTTP GET) and the server will hand-out the required file. But because of QUICK ACK the TCP instance will generate an ACK first and shortly afterwards will generate a additional DATA packet.
The additional ACK is solely generated the first time till the heuristic detects DATA from the server to the client. But especially short-lived-flows like HTTP will suffer from this. The typical HTTP flow includes only 10 till 20 data packets. One additional ACK packet is not to underrated!
The last two packets in the image illustrates this characteristic.
Future Proof
- Published in: ietf
- | Time: 23:43:12 CEST
- | SHA1: f3e9ca6309ba83d48e552ce28a7f556cd38733bf
Top IPv6 Working Group posting today regarding /127 anycast:
> to future-proof their network. future-proof == adding crap we do not understand or need
Identify Minimum or Maximum of a Value
- Published in: programming
- | Time: 01:02:41 CEST
- | SHA1: bcde2ad7490e741836be7d82d500653945ba1541
To identify the minimum or maximum of two values often a simple if else construct is used:
if (a < b) min = a; else min = b;
or more compact using the ternary operator:
min = (a < b) ? a : b;
The generated instructions for the first if/else construct and the ternary
operator does not differs. At least if the compiler is instructed to optimize
the code. If the code is not optimized the compiler generates a little bit
awkward code with a branch for the former example. The later use a branch free
cmovle instruction on a x86_64 architecture:
[...] movl %edi, -20(%rbp) movl %esi, -24(%rbp) movl -20(%rbp), %eax cmpl %eax, -24(%rbp) cmovle -24(%rbp), %eax movl %eax, -4(%rbp) movl -4(%rbp), %eax leave
The kernel provided macros for min/max extend this by execute an apparently
unnecessary pointer comparison via:
(void) (&_max1 == &_max2);
This check was introduced to enable strict type-checking to make sure that both arguments are of the same type. I build a patch to extend the current two operand limited macro and enable the use for three operands too. This will save some bytes on the stack as well as some processing cycles:
#define min3(x, y, z) ({ \
typeof(x) _min1 = (x); \
typeof(y) _min2 = (y); \
typeof(z) _min3 = (z); \
(void) (&_min1 &_min2 &_min3); \
_min1 < _min2 ? (_min1 < _min3 ? _min1 : _min3) : \
(_min2 < _min3 ? _min2 : _min3); })
Main Memory Corrupt
- Published in: uncategorized
- | Time: 00:40:54 CEST
- | SHA1: 4806c977462dbfc21c34a49fb0acd76ec2eb9a0f
Since several days strange things happen:
Delta compression using up to 2 threads. Compressing objects: 100% (8/8), done. error: inflate: data stream error (incorrect header check) error: corrupt loose object 'ebcab349c392aadd151101e6b2651ed451ff4bf7' fatal: object ebcab349c392aadd151101e6b2651ed451ff4bf7 is corrupted error: pack-objects died with strange error
as well as mysterious compile errors and stuff like that. One similarity is a raised memory access pattern, ... starting memtest to verify my assumption.
- Published in: networking
- | Time: 00:08:44 CEST
- | SHA1: 55497ae1d4cb5e857c04e1c92caf8456ea58f7a8
TCP’s window scale factor is determined at the connection start-up. Among other
things local memory constraints are checked and used as the basis to calculate
the window scale shift factor. The exact procedure is complex and not of interest
here. Currently the Linux network stack does not consider a reduced rcv buffer
which can be reduced via setsockopt(). Today I sent a patch which fix this
with some lines of code:
if (sk->sk_userlocks & SOCK_RCVBUF_LOCK &&
(req->window_clamp > tcp_full_space(sk) || req->window_clamp == 0))
req->window_clamp = tcp_full_space(sk);
I waive explaining text in the patch why I fixed the bug in this way and not
modified the main function (tcp_select_initial_window()) and hope that Dave
and Ilpo trust me … ;-)
- Published in: ietf
- | Time: 21:15:49 CEST
- | SHA1: 4b9c3224c99aca580b58e901105a4b3cc83c352d
Two apparently not related terms, but wait. Network neutrality can be described as a principle that no restriction by internet service providers and governments on content is applied. Restrictions can be limited network resources: Google News experience a higher Quality of Service as my Blog or even worse: you cannot connect to YouTube with your standard ISP contract. In other words: ISP’s start to consider network traffic based on their content. This is not really new, ISP’s for example restrict the usage of peer-to-peer protocols by injecting TCP RST packets to throttle the connection. GSM/UMTS operators on the other hand often prohibit any VoIP application.
The rational behind can be split into two big aspects: a) economic, ISP’s want to participate, they stopped to recognize themself as a pure ISP. ISP’s start to realize that the content is the key to economic success. Think about Youtube and all the shiny upcoming services. And b) resource hungry applications like peer-to-peer congest the network and requires even higher and higher link capacities. Restrict the usage can contribute to reduce their equipment cost greatly.
But where is the interaction between Network Neutrality and Protocol Agnostic Congestion Control? Well, Comcast, one of the biggest ISP in the U.S. enrolled this new congestion control – a large scale congestion management system – in response to dissatisfaction of customers “as well as complaints to the U.S. Federal Communications Commission (FCC) regarding Comcast’s old system, which targeted specific peer-to-peer (P2P) applications”. Comcast customers share their up- and downstream bandwidth with their neighbors. If a individual using most of the bandwidth at times where the network experience congestion then this individual is penalized. A high level overview how the mechanism works is outlined in [1] (section 7, “Implementation and Configuration” provides an detailed description):
- Software installed in the Comcast network continuously examines aggregate traffic usage data for individual segments of Comcast’s HSI network. If overall upstream or downstream usage on a particular segment of Comcast’s HSI network reaches a pre- determined level, the software moves on to step two.
- At step two, the software examines bandwidth usage data for subscribers in the affected network segment to determine which subscribers are using a disproportionate share of the bandwidth. If the software determines that a particular subscriber or subscribers have been the source of high volumes of network traffic during a recent period of minutes, traffic originating from that subscriber or those subscribers temporarily will be assigned a lower priority status.
- During the time that a subscriber’s traffic is assigned the lower priority status, their packets will not be delayed or dropped so long as the network segment is not actually congested. If, however, the network segment becomes congested, their packets could be intermittently delayed or dropped.
- The subscriber’s traffic returns to normal priority status once his or her bandwidth usage drops below a set threshold over a particular time interval.
At the end: for challenge b) there are technical solutions to counteract the problem. There is no excuse why an ISP harm network neutrality because of network bandwidth problems. But we are living in a commercialized world and ISP realize that a lot of money can be earned with content, so a) will be the driver to violate network neutrality.
[1] http://tools.ietf.org/html/draft-livingood-woundy-congestion-mgmt-07
Linux TCP Receive Buffer
- Published in: programming
- | Time: 03:50:21 CEST
- | SHA1: b551dad53efd1d79ab7e70578085b9969e0fddc6
just for the protocol: Linux memory management is terrible complex. Many hidden relations to and from other components in an optimized form that prevent anyone from beeing productive. {Free,Net,Open}BSD code is much more coherent and easier to understand. It is even difficult to write a proper commit message in a understandably way that more then 4 people understand why I modified the code in this way.
Recursion
- Published in: programming
- | Time: 00:59:07 CEST
- | SHA1: bc7da74e57dbb50654ccc32546b45ae7f167c479
Since several days I started to verify the Silly Window Syndrome (SWS)
avoidance algorithms and mechanism. To trigger one peculiarity (receiver side
SWS) it is more or less unavoidable to shift the network stack into a special
state. One requirement is that the socket buffer should be as small as possible
to reduce the initial analysis delay. Via setsockopt() it is possible to tune
the receiver buffer. But during some analysis I spotted an error in the current
network stack, the bug is hidden in the TCP window scale option and the dynamic
memory management component. This bug isn’t a trivial bug and a lot effort is
required to validate my patch.
The validation requires a exact knowledge of the network socket state. During the development I use a more or less a hacky KVM/QEMU setup. But this time it is necessary to verify the patch in a real-world-system too. Why? Because the behaviour in a full memory loaded system differs from my 192 MB KVM setup. There are some other constraints that prefer a life system.
Anyway, to instrument my kernel I use systemtap the Linux pendant to Solaris dtrace. During kernel instrumentation I spotted a weak point:
probe kernel.function("tcp_select_initial_window").return {
printf ("return %s(%s) %u\n", probefunc(), execname(), $rcv_wnd[0]);
}
The TCP receive window should 5840 byte (4 * maximum segment size), but stap
return always 0 – but 0 is definitely not returned! Maybe I make a failure in using stap(1), not sure but the IRC #systemtap crowd is also not sure …
BTW: it is possible to dereference the arguments of the function but this is
only possible in the call path (not in return path) via kernel_int(ulong_arg(3)) ...
UPDATE: the guys at freenodes #systemtap are quite helpful! In the return
path and with systemtap >= 1.3 the following works:
kernel_int(entry($rcv_wnd))@ ...
<fe> please let us know if you get some interesting results with or despite the tool :)
Architectur specific TLB Optimizations
- Published in: programming
- | Time: 23:40:55 CEST
- | SHA1: dc2984c474826a0d5b53aedd5167ee2c402c46cc
I thought about how I would design the Translation
Lookaside Buffer of the Memory Management Unit (MMU) if I was the CTO of Intel,
respective AMD. ;-) In one of my last postings I talked about an potential
optimization for x86/x64_64 but at the end I decided that the effort of
implementing and testing compared with the benefit is to high so I skipped
this. But as said this I thought about how would I design the TLB regardless
of any existing implementations …
As I already wrote
the TLB is a fully associative cache because most programs exhibit so called
“locality of reference”. This means that the working set will be used many
times. The most recently used page mappings will be stored. The TLB cache
virtual addresses to physical addresses on a page basis (e.g. 4096 byte). With
a few page entries a full program can be mapped: .text, .data, heap, ...
Fully associative caches are costly because entries must be searched in parallel. Any address can be stored everywhere in the cache and no hashing or indexing is applied. n-way set associative caches on the other hand use a index.
So far so good! From my last posting we know that the TLB logic is not entirely spilled into silicon. At least some processor architectures swap some logic to the kernel. This is the point where we start!
When a TLB hit occurs the cache returns the physical address. If a miss occur two things depending on the architecture can happen:
- the processor start looking up the page table by himself – called page walk – until he find the physical address and replace the TLB by himself
- the processor generates a trap and shift the task to the operating system. The kernel must now walk over the page table, check the page permission and finally replace the TLB entry with the right physical address.
The former approach is implemented by x86, SPARC and some others. The later
is implemented by the MIPS family.
But that’s not all, there are some subtle differences. As I wrote in the other postings Linux does not flush the TLB if a Kernel thread is scheduled because a kernel thread does not touch user pages and kernel pages are unique. But if the kernel scheduled another user space process the virtual to physical mappings become obsolete. Each new page walk is a costly operation and should be avoided. Some architectures extend the concept of a plain address space by (re)introduce a key – surprise! ;-) This key is bound to the task and only if the task differs the TLB entry is flushed. One example is the TI-SPARC where a context switch does not require a TLB flush.
![]()
I just started to understand how the context switch works for the SPARC is implemented. But it is hard to read SPARC assembler code (not as hard as ia64 asm ;-):
arch/sparc/kernel/tsb.S:
/* At this point we have:
* %g1 -- TSB entry address
* %g3 -- FAULT_CODE_{D,I}TLB
* %g5 -- valid PTE
* %g6 -- TAG TARGET (vaddr >> 22)
*/
tsb_reload:
TSB_LOCK_TAG(%g1, %g2, %g7)
TSB_WRITE(%g1, %g5, %g6)
/* Finally, load TLB and return from trap. */
tsb_tlb_reload:
cmp %g3, FAULT_CODE_DTLB
bne,pn %xcc, tsb_itlb_load
nop
Especially the label tsb_miss_page_table_walk_sun4v_fastpath: is of
interesst. Also the division between instruction and data.
Linux Socket Minimum Receiver Buffer
- Published in: programming
- | Time: 00:16:25 CEST
- | SHA1: fc44a674a3818d9802f07d0758065a4fc7a560db
Via setsockopt(socket, SOL_SOCKET, SO_RCVBUF, n, 4) it is possible to increase/decrease the socket receive buffer where n is the number of byte. Internally the specified number is doubled to cover bookkeeping overhead. The minimum value is defined as SOCK_MIN_RCVBUF (256 byte). It is not possible to specify any smaler value. Even if the user specified something smaller the actual buffer is silently increased to SOCK_MIN_RCVBUF (the man page is a little bit outdated and stated that setsockopt will return with a failure – this is not correct).
if ((val * 2) < SOCK_MIN_RCVBUF)
sk->sk_rcvbuf = SOCK_MIN_RCVBUF;
The default value for the socket receiver buffer is specified in /proc/sys/net/core/rmem_default, the maximum value is specified by /proc/sys/net/core/rmem_max. sysctl_rmem_default is initialized as to:
__u32 sysctl_rmem_default __read_mostly = (sizeof(struct sk_buff) + 256) * 256;
The size of struct sk_buff is not constant a
tcp_select_initial_window() determine a initial window to offer and the corresponding window scale factor. This function is a little bit
include/net/tcp.h:tcp_full_space()
sysctl_tcp_adv_win_scale is initialized to 2.
SOCK_MIN_RCVBUF 256
Finally, the man page entry is not correct: if the user specified a too small value the setsockopt call will return with no error but silently set the buffer to the smallest legal value:
Translation Lookaside Buffer Flush Optimization
- Published in: programming
- | Time: 03:23:31 CEST
- | SHA1: be46a6d53fd93c09d4fe4faf30a68d6ce6801085
The Translation Lookaside Buffer is a small associative memory that caches virtual to physical page table addresses. When page tables have been updated, such as after a page fault, the processor may need to update the TLB for that virtual address mapping. May, because some processor architectures implement this in hardware logic some other architectures need support by the operating system. Flushing TLB is a really expensive operation – depending on the architecture (e.g. PowerPC) – and should be avoided if possible in any way. Linux for example does not invalidate the TLB if a context switch occurs, Linux rather tries to partial invalidate the TLB. Kernel threads for example access only the kernel space and do not require an invalidation of the user space TLB cache, Linux does not invalidate the cache in that case.
Today I played with some TLB optimization. Current kernel code can flush one page cache for a given page or all pages. Although the kernel API provides the possibilities to invalidate a range of pages. If a range is requested for invalidation, all pages are invalidated and not a limited range of addresses.
flush_tlb_range(vma, start, end) for a given mm context from start till
end all pages are declared as invalidate via this method. Partial
invalidation can happend if a memory region has moved or permissions are
changed (via mprotect()). munmap() or mprotect_fixup() for example use
this function.
flush tlb all() on the other hand invalidates the TLB on all processors
running in the system. flush tlb mm() flushes the TLB for a given
userspace context, always below PAGE_OFFSET. fork() is a user of this
function.
Some words about the invalidation procedure on x86: the INTEL manual
stated that all non-global TLB entries are flushed after writing to CR3 .
The CR3 is used to translate the virtual addresses into physical addresses
by pointing to the page directory and page tables of the current task.
Read and re-write the CR3 register was the only way to flush the TLB
after a page table update. Newer CPU have TLB flush instructions to
partial flush a given address, i386 and i486 CPU’s didn’t have these
instruction; via invlpg it is on x86 >i486 possible to flush a given
memory address.
static inline void invalid_tlb_old(void)
{
unsigned long val;
asm volatile("mov %%cr3,%0\n\t" : "=r" (val), "=m" (__force_order));
asm volatile("mov %0,%%cr3": : "r" (val), "m" (__force_order));
}
static inline void invalid_tlb_addr(unsigned long addr)
{
asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
}
UPDATE: I decided to stop digging into this optimization. After a while of analyzing the callees I realized that the effort is much higher as the potential benefit.
Netsend Artificial Read Delay
- Published in: programming
- | Time: 01:38:57 CEST
- | SHA1: 80da6274840cac2353f628280956eb7299a64730
Today I invest my time in adding a new option for
netsend to introduce an artificial read() delay
in the receiver path. The new option can be enabled via:
netsend -v loudish -B 10,1 -b 250 -s SO_RCVBUF 1 tcp receive
The options states that the receiver should initial block for 10 seconds and
then read 250 byte chunks with a delay of one second: -B 10,1. If you waive
the initial delay you can also skip the coma separated list and hand over only
one argument, e.g. -b 1. This will delay the read() operation with a one
second delay.
IPv6 Flow Label Usage Scenarios
- Published in: networking
- | Time: 23:16:46 CEST
- | SHA1: 28cdf980ffab78aa8a41896fb26f80e3d9f1ceb6
The 20 bit wide Flow Label field in the IPv6 header is a integral part of the protocol specification, standardized in RFC 2460 The field may be used to mark sequences of packets for which the sender request special handling at intermediate routers. The standard explicitly does not define the what, it exemplary provides two examples (real-time and non-default quality of service routing) and that’s all. At the specification time nobody was in the ability to specify how this field may be used in the future or may not used for. As so often the time should breed possible use cases. Good designs are often – not always – characterized by a unseen potential which is discovered later on. One famous example is the C++ template system – the powerful scope was identified after the language was already released: Modern C++ Design: Generic Programming and Design Patterns Applied
Time goes by and the use cases are still rare for Flow Label. This may be founded in the low diffusion of IPv6 at the time of writing or because larger carriers already use Multiprotocol Label Switching (MPLS) beside the common 5-tuple used to classify a packet flow (later on is a rare exception in the core). The 5-tupel has it own problems such as fragmentation or next header parsing due to the fact that the port information is at the transport layer. Anyway, use case questions accumulate over time and an I-D address some possible use cases of this 3-tuple {source address, destination address, flow label}: Survey of proposed use cases for the IPv6 flow label The I-D is worth reading!
Mencoder Video Editing
- Published in: uncategorized
- | Time: 03:11:39 CEST
- | SHA1: f39552777e1635108e22b6d1f8aa928ce51fcce9
Today it was a lazy evening: no productive hacking, no deeper technical discussions, no email answer sessions, no peer reviews, nothing—just passive reading and … command-line video editing—vacation for the brain.
Sometimes I am to cushy to use my camcorder or my EOS 7D to make a film. Instead I use my handy Canon IXUS and shoot some pictures, one after another. Later on at home I use mencoder to stick these single frames together, add some music and voilà.
Mencoder is a command line tool and I use the following options to generate the movie. After same experience with vimeo I think these settings are suitable. But hey, I am no multimedia guru … ;-)
opt="vbitrate=2160000:mbd=2:keyint=132:vqblur=1.0:cmp=2:subcmp=2:dia=2:mv0:last_pred=3" mencoder mf://\*.JPG -mf w=1920:h=1440:fps=1.5:type=jpg -ovc lavc -lavcopts vcodec=msmpeg4v2:vpass=1:$opt -oac copy -o /dev/null mencoder mf://\*.JPG -mf w=1920:h=1440:fps=1.5:type=jpg -ovc lavc -lavcopts vcodec=msmpeg4v2:vpass=2:$opt -oac copy -o output.avi # add an mp3 mencoder output.avi -o out.avi -ovc copy -oac copy -audiofile ~/travel.mp3 # generate index, this enables the possibility to play the video without waiting # until the video is fully loaded - vimeo requires this mencoder -idx out.avi -ovc copy -oac copy -o final.avi
The created video can be found on vimeo: Thailand Climbing Trip – 2010
It is already 3am, time to go to bed.
- Published in: networking
- | Time: 22:44:55 CEST
- | SHA1: 5e42c15e4e8b0389b25f89382aef585d462fd48f
For some analysis I plotted the inter-frame delay in microseconds within the transmitter must send a frame to archive the desired transmission rate. For a 10Gbit network adapter and a 10 byte frame the time between subsequent transmission must be 0.008 microseconds (or 8 nanoseconds or 8e-09 seconds) The illustration with a logarithmic y scale:
![]()
Not really thrilling or exciting, I generated the images to illustrate the difficulties to generate a bandwidth exact bytestream where the operating system as well as hardware components are working at the edge.
Interactive TCP Traffic and Delayed ACKs
- Published in: networking
- | Time: 23:29:27 CEST
- | SHA1: 7083142b3967ce93a7a0914ff7e20b94154eb63a
Interactive TCP traffic – like telnet – is often single byte character interactions: a pressed key, encoded via a char data type, is transmitted to the server, processed at the server side and echoed back to the client. These days telnet can be considered as antiquated because of security shortcomings and the Secure SHell step into telnets shoes. The one byte transmission is expanded to a ~32 byte chunk because of cryptography overhead. Anyway, the fundamental communication characteristics remains identical. Within short time-slots each pressed character is transmitted to the server, the server process the input and echo’s back the character. The interactive characteristic is unchanged.
The following sections provides a rudimentary analysis of the timing behavior of a characteristic ssh session – larger breaks are not included. The analysis rather focus on “microscopic” timing interaction and the TCP delayed ACK mechanism.
Firstly some words about the delayed ACK timer. This timer is started at receiver side with each incoming data packet and was introduced to avoid the silly window syndrome. The suggested timeout is 500ms where no specific time is standardized. The majority of network stacks use 200ms as a compromise of additional delay and efficiency.
The next image illustrates the interplay between the echoed data and delayed ACK mechanism

00:00:00.045855 IP h1.47028 > h2.22: Flags [P.], seq 684:720, ack 925, win 304, length 36 00:00:00.017156 IP h2.22 > h1.47028: Flags [P.], seq 925:961, ack 720, win 91, length 36 00:00:00.000048 IP h1.47028 > h2.22: Flags [.], ack 961, win 304, length 0 00:00:00.783306 IP h1.47028 > h2.22: Flags [P.], seq 720:756, ack 961, win 304, length 36 00:00:00.017316 IP h2.22 > h1.47028: Flags [P.], seq 961:997, ack 756, win 91, length 36 00:00:00.000054 IP h1.47028 > h2.22: Flags [.], ack 997, win 304, length 0 00:00:00.178118 IP h1.47028 > h2.22: Flags [P.], seq 756:792, ack 997, win 304, length 36 00:00:00.017426 IP h2.22 > h1.47028: Flags [P.], seq 997:1033, ack 792, win 91, length 36 00:00:00.000051 IP h1.47028 > h2.22: Flags [.], ack 1033, win 304, length 0 00:00:00.182521 IP h1.47028 > h2.22: Flags [P.], seq 792:828, ack 1033, win 304, length 36 00:00:00.017470 IP h2.22 > h1.47028: Flags [P.], seq 1033:1069, ack 828, win 91, length 36 00:00:00.000044 IP h1.47028 > h2.22: Flags [.], ack 1069, win 304, length 0 00:00:00.170489 IP h1.47028 > h2.22: Flags [P.], seq 828:864, ack 1069, win 304, length 36 00:00:00.017774 IP h2.22 > h1.47028: Flags [P.], seq 1069:1105, ack 864, win 91, length 36 00:00:00.000055 IP h1.47028 > h2.22: Flags [.], ack 1105, win 304, length 0
The maximum limit for the delayed ACK is still 200 ms, the minimum can be
smaller – depending on the current analysis of the TCP stream. Therefore
the TCP analysis the packet inter-arrival time and doubles this value. Note
that every other data packet is always immediately acked. The Linux Delayed
ACK is implemented in tcp_delack_timer():net/ipv4/tcp_timer.c.
Drawbacks of the delayed ACK are a reduction of the transmission rate of the sender.This is because the sender increases the congestion windows – CWND – based on the rate of incoming ACK packets. TCP is still an ACK clocked protocol. To increase the CWND even faster an instant ACK is superior and actual Linux implement an algorithm, called Quick ACK, to ACK packets in the start phase of TCP instantly, without any artificial delay. To prevent the Silly Window Syndrom the transmitted quick ACK’s from the receiver is limited to n segments. Where n is defined as the half of the number of segments to reach the receivers advertised window. Additionally, if the network stack detects that the communication is bidirectional is completely disabled. This is an optimization because the receiver knows that he also transmit data and therefore should wait for local data. This reduces the number of vanilla ACK packets.
If delayed ACK’s are disabled the behavior is similar to this illustration. An ACK segment is instantly generated and the echoed packet is transmitted separately, as identifiable this generate a lot more traffic:

UDP TX and RX Delays
- Published in: networking
- | Time: 01:22:07 CEST
- | SHA1: 65279940581377c6a6e8b63f001d507268b5458b
The following image illustrates the accumulated delays during the transmission of UDP packets. Starting with a 1 second sleep with theoretical no delay, but practical delay of ~200 microseconds on a un-busy box till due to timer granularity and concurrent processes (not in this scenario), till transmission delays via intermediate routers, NIC interrupt moderation and finally RX side scheduler latency.
![]()
Increasing TCPs Initial Congestion Window
- Published in: networking
- | Time: 00:44:04 CEST
- | SHA1: 76a37c1b77057ebdbe9a66c2e8e7b8c022bd45f5
Noting really special today, ... except that it is time for another debate about the increase of the initial CWND from currently max. 4 to N. Where N differs from month to month and vendor to vendor. Currently standard TCP is not seasoned to meet the new network environments. There are still networks with a BDP from 1980 – MANET, Sensor Networks and so on are typical examples. Google press ahead the thematic mainly because a increased initial CWND instantly enhance the user experience for a typical web session. Web sessions are a prominent example where the slow start algorithms start to increase the CWND while after ~3 iterations the slow start is interrupted because every byte is transmitted. So our HTTP biased world suffers from the initial CWND – no doubt. Google congestion control and CWND analysis on the other hand does not address topics beyond the scope of HTTP.
There is a lot of additional research required to adjust these TCP intrinsic values and tcpm already began to address this topic.
The discussion at lkml is a little bit more specific but it is also quite late (01:34:00 AM) in the night so I interrupted the lkml discussion as well as this blog post … ;-)
UDP TX Buffer Behaviour and Egress Queueing
- Published in: networking
- | Time: 23:13:58 CEST
- | SHA1: 8fc2eb54d24556684f8747742a01c2df8cea1b40
UDP sockets – if corking is disabled – always push packets directly to the
lower layers. In the case of IPv4, ip_output() will forward the packet to
Netfilter where the firewall rules are applied. ip_finish_output() calls
ip_finish_output2() which on his part calls neigh_hh_output() which put the
cached layer 2 Ethernet header in front of skb and finally call
dev_queue_xmit().
dev_queue_xmit() queues the packet in the local egress queue – default
is a FIFO queue (pfifo_fast) but sophisticated queuing strategies are available and often selected.
Before the actual packet is enqueued the function dev_queue_xmit()
linearize the skb if necessary, do checksumming if necessary, and
finally calls enqueue() which places the packet into the queue. This
function fails if the queue is (temporarily) deactivated or a overflow
happens. In the common cases the function returns NET_XMIT_SUCCESS. Note
that the standard queue has a short cut: if the queue length is 0 – no
packet is queued – the packet is directly scheduled via sch_direct_xmit().
If something goes wrong (somewhere in the driver), the packet is
pushed several times to the NIC put on the wire via qdisc_restart() until the
NIC accepts the new packet. If this fails the one element queue is saved
and a SOFTIRQ is raised on the local CPU. In the hope that next time the
SOFTIRQ is executed the NIC is in the ability to accept the packet.
If the queue supports no short cut or the queue contains at least one element the packet must be enqueued via qdisc_enqueue_root(). This enqueue function is scheduler specific. For FIFO queueing the packet is added at the end of the list, more complicated queues implement a more sophisticated queueing.
dev_queue_xmit() {
netif_needs_gso();
skb_needs_linearize();
if (q->enqueue) {
if (TCQ_F_CAN_BYPASS && qdisc_qlen(q) == 0) {
if (sch_direct_xmit())
__qdisc_run(q);
return NET_XMIT_SUCCESS;
} else {
ret = qdisc_enqueue_root()
qdisc_run(q);
return ret;
}
}
}
__qdisc_run() {
while (qdisc_restart()) {
if (need_resched() || to_often_restarted) {
raise_softirq_irqoff(NET_TX_SOFTIRQ);
}
}
}
qdisc_restart() try to call dev_hard_start_xmit() instantly to put the
packet on the NIC TX descriptor ring – if possible. __qdisc_run() enabled
the SOFTIRQ – if not already enabled.
Finally note that the return code of dev_queue_xmit() make no statement if
the packet can be transmitted. Subsequent congestion or the queue policy can
decide to drop the packet. A positive return code only signals that the
enqueuing was successful.
IPv6 is similar except that neighbor resolution is done by IPv6 neighbor discovery mechanism (ND).
Last but not least some network devices have no associated queues. The
loopback device and all kind of pseudo tunnel devices are common examples.
These devices have no queues and instead of placing the packet in a queue the
function dev_hard_start_xmit() is called directly.
CRC 16 CCITT Performance
- Published in: programming
- | Time: 00:45:23 CEST
- | SHA1: 4df1db1489ab9239be48873421c81ab91b50bda9
The CRC-16-CCITT is a cyclic redundancy check – one among many – defined and derived from polynomial x16 + x12 + x5 + 1. It is used by X.25, HDLC, XMODEM, bluetooth, phy header verification and many others. During some analysis I stumbled across the following implementation:
unsigned char ser_data; static unsigned int crc; crc = (unsigned char)(crc >> 8) | (crc << 8); crc ^= ser_data; crc ^= (unsigned char)(crc & 0xff) >> 4; crc ^= (crc << 8) << 4; crc ^= ((crc & 0xff) << 4) << 1;
The expression (crc << 8) << 4 quizzicaled me a little bit because crc<<12
seems more efficient and is mainly equivalent. The advice on the
site is to implement the above
algorithm with the two shift operation because the latter generates much more
code and executes slower! I compiled the above code fragment, disabled any
optimization and voila:
[...] shrb $4, %al movzbl %al, %eax xorw %ax, -2(%rbp) movzwl -2(%rbp), %eax sall $12, %eax movl %eax, %edx movzwl -2(%rbp), %eax xorl %edx, %eax movw %ax, -2(%rbp) movzwl -2(%rbp), %eax andl $255, %eax sall $5, %eax [...]
Independent of the compiler optimization level (-O0 .. -O7) the compiler
always generates a 12 bit shift. It is even not possible to instruct the
compiler not to do so. But why does the compiler does not optimize the
construct more aggressively? It does, if the user tell him to do so. ;-) But
these optimization are more cosmetic and restricted to reduce the instruction
set size.
[...] shrb $4, %dl movzbl %dl, %edx xorl %eax, %edx movl %edx, %eax sall $12, %eax xorl %edx, %eax movzbl %al, %edx sall $5, %edx [...]
In average the processing time on my x86_64 take 2.06076e^-05^ us per byte:
![]()
When the CRC and TCP Checksum Disagree
- Published in: networking
- | Time: 23:44:37 CEST
- | SHA1: aaf9b3ecec11eea850dd1573ed975902cb5f4548
This post recycles the title of a famous paper from Craig Partridge and Jonathan Stone from 2000. Traces shows that between 1 packet in 1100 and 1 packet in 32000 fails the TCP checksum – even thought a link layer CRC was used. The TCP/UDP checksum is also ineffective at detecting bus specific errors since these errors with simple summations tend to be self canceling. The paper is worth to read it: it talk about the sources of errors, how TCP react and how to reduce the error rate.
But what about the Ehternet CRC? The weakness of the CRC becomes acute in network environments with a MTU larger then 1500 bytes – data center, research networks, cluster systems and so on. Until 1500 byte the CRC is strong. Currently a MTU of 9k is possible without special NICs – almost all current gigabit network adapters support jumbo frames. Some older gigabit NICs from Realtek failed because the CRC routine failed. Intel’s e1000/e1000e adapter can even transmit and receive frames up to 16128 bytes.
Corruption could occur at the source in software, in the network interface card, out on the link, on intermediate routers or at the destination network interface card or node. Maybe I forgot some sources. The actual assumption is that one packet in 10 billion will have an error that goes undetected for Ethernet MTU frames.
Ethernet uses a 32 bit CRC that loses its effectiveness above about 11455 bytes – after this limit the CRC-32 the probability of undetected errors per frame increase. A white paper with the title Extended Frame Sized for Next Generation Ethernets discuss further issues why a frame size should not exceed 9000 byte.
Beside the standard Ethernet CRC the Castagnoli CRC (CRCNc) is worth to mention it. Standard Ethernet CRC was selected because of his performance. CRC32c – the Castagnoli counterpart to CRC32 used by iSCSI or SCTP –
CRC-32-IEEE 802.3 Polynom:
x32 + x26 + x23 + x22 + x16 + x12 + x11 + x10 + x8 + x7 + x5 + x4 + x2 + x + 1
Castagnoli CRC 32C Polynom:
x32 + x28 + x27 + x26 + x25 + x23 + x22 + x20 + x19 + x18 + x14 + x13 + x11 + x10 + x9 + x8 + x6 + 1
crctool is a online tool is available which generates VHDL code for a wide number of CRC algorithms. Support for Stronger Error Detection Codes in TCP for Jumbo Frames provides some text about this topic.

Last but not least Intel core i7 processors have the CRC32c function contained within their new SSE4 math coprocessor.
Bluethooth Over WiFi
- Published in: networking
- | Time: 23:29:05 CEST
- | SHA1: a2fa978dac1db9c6d33cd791ac49d4f73ded4551
Bluetooth (IEEE 802.15.1) is a clean wireless industry standard using the ISM band (between 2,402 GHz and 2,480 GHz). Version 2.0 + EDR (Enhanced Data Rate) provides a practical data transfer rate up to 2.1 megabits per second. This sounds huge compared to Version 1.0 but practical is can be really awkward if you want to transfer large files.
Version 3 of the core specification add a clever extension to increase the data rate: it enables the possibility to use the (hopefully available) WiFi link to transfer data. Bluetooth is still used for device discovery, connection setting (negotiating, channel establishment) and profile configuration.
As usual for Bluetooth many components are implemented in hardware and the software core is more or less lightweight. But several v3.0 software details are required, especially the WiFi interaction. Therefore the Linux Kernel Bluetooth development is now located under the Wifi components. Of course, this affect the source code management at first glance. The main development is still done on netdev.
Tracing and Time Measurements
- Published in: programming
- | Time: 00:47:14 CEST
- | SHA1: 1815d7b555afee3ce8ef464dac0968065a4f66fc
In user-space the answer is often simple: use gettimeofday if microseconds
resolution is sufficing. Another often used mechanism is to use the time stamp
counter – but as mentioned in another blog-post in the lion share of all use
cases the advice is often too narrow because of clock variances in SMP/CMP
systems and hibernation issues.
#define rdtscll(val) \
__asm__ __volatile__("rdtsc" : "=A" (val))
In kernel-space the timing capabilities are a little bit more complex and several functions are provided. Three new kinds functions are available, both with different scalability and precision:
- trace_clock—this clock is good compromise of the other clocks. The clock is not completely serialized but try to level CPU boundaries.
- trace_clock_local—complete un-serialized, lowest latency.
- trace_clock_global—use clock at core with id 0, highest latency.
All three clocks provide u64 nanoseconds granularity (but not precision ;). At the end it is often more crucial to get exact timings as a precise timestamping mechanism with a small latency but variances of several magnitudes due to SMP/CMP issues.
- Published in: programming
- | Time: 01:13:49 CEST
- | SHA1: 40f40cf6b16206db776ad6c7930c3faa57525e65
Today I spoted some cacheline misses triggered due to some false sharings in libhashish
My CPU model (athlon 64 X2 dual-core) has the following cache and memory structure: instruction cache, data cache, instruction TLB, data TLB, L2 data TLB, L2 instruction TLB and a unified L2 cache, each core (no shared L3-Cache):
L1-DATA: 64Kb, 2-way associative, 64 byte line size L1-INSTRUCTION: 64Kb, 2-way associative, 64 byte line size L1-DTLB: 4K byte pages, fully associative 32 entries (2M/4M pages: 8 entries) L1-ITLB: 4K byte pages, fully associative 32 entries (2M/4M pages: 8 entries) L2 DTLB: 4K byte pages, 4-way associative. 512 entries L2 ITLB: 4K byte pages, 4-way associative. 512 entries L2 cache: 512Kb, 16-way associative, 64 byte line size, unified
If I illustrate the memory architecture especially the cache structure they would look like this:
But let us come back to the optimization. If you allocate memory via malloc()
your memory allocator guarantee suitably aligned for any kind of data. 8/16 byte
alignment is common value these days. “Any kind of data” is related to
natural data types like int, long double, long long and so on. But this is no
requirement that this must be 8 or even 16. On my x86 based platform an 8 alignment is
also fine: double and long long only requires a 8 alignment. But be warned: SSE,
MMX and the whole SIMD instruction set often require a 16 byte alignment!
posix_memalign() will help you if your application requires a more
sophisticated alignment. Alignment issues are depends highly on the
architecture. x86 architectures for example can handle unaligned accesses in
hardware, PowerPC not and will raise a SIGBUS. But: it has been shown that
unaligned DMA accesses can be expensive on some architectures like INTEL
Nehalem – it is always good to align these on the natural alignment. I fly off
on a tangent again! ;-)
The main reason I started to write about alignment was to get the correlation between data type placement and data type placement mapped into the cacheline. A common technique to optimize cache performance – especially in hot paths in the kernel – is to align data structures at cache line boundaries in memory.
Main intention is to arrange program code that frequently fetch data structures to fit nicely within a cache line and not anywhere in the cachline. As an example consider the following structure (derived from include/net/inet_hashtables.h, not the best example because here the alignment is forced to avoid SMP/CMP false sharing but I hope the example make the basic mechanism clearer):
#define INET_LHTABLE_SIZE 32
struct inet_hashinfo {
struct inet_ehash_bucket *ehash;
spinlock_t *ehash_locks;
unsigned int ehash_mask;
unsigned int ehash_locks_mask;
struct inet_bind_hashbucket *bhash;
unsigned int bhash_size;
/* 4 bytes hole on 64 bit */
struct kmem_cache *bind_bucket_cachep;
struct inet_listen_hashbucket listening_hash[INET_LHTABLE_SIZE]
____cacheline_aligned;
atomic_t bsockets;
In include/linux/cache.h and some other architecture specific header files
the following macros are defined:
/* CONFIG_X86_L1_CACHE_SHIFT=6 -> 64 byte alignment for my configuration */ define L1_CACHE_SHIFT (CONFIG_X86_L1_CACHE_SHIFT) define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT) #define ____cacheline_aligned __attribute__((__aligned__(SMP_CACHE_BYTES)))
So listening_hash is alignment at a 64 byte boundary, each element! Now we
will consider struct inet_listen_hashbucket more deeply:
struct inet_listen_hashbucket {
spinlock_t lock;
struct hlist_nulls_head head;
};
Now just imagine that struct inet_listen_hashbucket has some more values, like int’s,
double’s and so on. Think about a program and the access pattern. The typical
flow will be a index into the listening_hash array, get the structure (via lea
instruction) and read/write the elements of inet_listen_hashbucket. The
following image illustrates the memory layout of these structure with and
without alignment:
The above illustration show an unaligned alignment. It is easy spotable that if Struct 1 is accesses is is possible that two memory loads are necessary. Especially if the first memory load access the data elements in the beginning there another access tries to get data from the end of the structure.
The lower illustration show the aligned data structure. Each structure fits in their own cacheline and no additional L2 or even main memory load is required! An important aspect is also spotable: you will eventually waste valuable cache if you align each structure and this can be worse! Think about the size of the structure and the 64 byte cacheline. If the structure is a multiple of 64 byte you are a lucky guy, but everything else will punch big holes into your cache.
As smaller the data structure is this optimization becomes nonsense. Because the probability that everything is already in the cacheline (cause through the first read/write) is high. Furthermore, if the structure is really big, it is often superior to restructure the elements so that often accesses variables are close together where not so often touched variables are at the end of the structure. To separate read-only variables from write-often variables is also a nice optimization, especially on SMP/CMP systems. Nowadays alignment and cache issues hurt performance especially for SMP/CMP systems – but this topic is for an additional posting.
I reticent the fact that each structure has is own alignment too. So it is also highly unlikely that the structure is aligned as illustrated in the above image. Normally their will be wholes between the structures, but at maximum 7 byte wholes nowadays.
BTW: is cache line one word? I mixed it in this posting to reduce the error count! ;-)
UPDATE: lstopo(1) from the hwloc package provides a way to visualize the cache and memory
hierarchy. It parses sys/devices/system/cpu/cpu1/cache/* and
sys/devices/system/cpu/cpu0/topology/*. Another useful utility is x86info(1).
Crisis of Capitalism
- Published in: uncategorized
- | Time: 00:25:09 CEST
- | SHA1: 6dae30f78d83fcd2625da3ede4e1cda2a2752ee7
Animated presentation held by sociologist David Harvey and Professor Philip Zimbardo for Royal Society for the encouragement of Arts, Manufactures and Commerce (RSA):
Nothing Absorbing, Just Hacking
- Published in: programming
- | Time: 23:11:05 CEST
- | SHA1: 0421c67d84c7f8f473d0bd27dcb1ab8be4ea557f
Nothing really interesting these evening – just the usual hacking session. Linus is back from vacation and pent-up patches should be soaked up the next days. Eric submitted one patch after the other and the comment of Davem take the biscuit: “Slow down Eric, you’re on fire :-)”.
Mathieu Lacage – let’s call him the ns-3 main developer ;-) – posted a patch where he fixed a uninitialized memory access, spotted by valgrind.
By the way: you cannot use valgrind directly to spot failures in the network stack/kernel. Mathieu use the Network Simulation Cradle (NSC) to execute the Linux Network Stack on top of ns-3. With this combination it is possible to use valgrind to spot some errors. Florian already spotted some errors triggered by valgrind ~2 years ago where he ported nsc to ns-3.
The posted patch was not really perfect because tcp_check_req() calls
tcp_parse_options() only to update the timestamp indication (saw_tstamp). The
valdgrind message is triggered by a uninitialized opt_rx->user_mss which is
not used here. Eric’s patch tried to fix it and Davem replied:
> - struct tcp_options_received tmp_opt;
> + struct tcp_options_received tmp_opt = {0};
> u8 *hash_location;
> struct sock *child;
That's a 28 byte memset() in the connect fast-path. We shouldn't eat this
just to placate a valgrind miscue. :-)
:-)
Unjoyous OO Python
- Published in: programming
- | Time: 03:39:37 CEST
- | SHA1: 113a3df852183e677cd5febeea6f6c14ef2765fc
Today I spend some hours to complete a python project. Because the project should expandable and manageable the functionality is splitted into classes, packages and so on – python best practices.
But, as long as I program with python I don’t understand why this language is so hyped! Several years back I wanted to leave Perl as the scripting language of choice (sometimes I still use Perl). As far as I remember I looked at Python and Ruby and finally I decided to use Ruby. Nowadays, When I reexamine this decision I must say that this decision was absolutely right!
There are several Python characteristics that bother me currently:
- no really clean object oriented design. Under the hood Python and Ruby are closer together then it seems but Python integrates a lot of extension that make this OO design not visible – why? Ruby on the other hand goes another way: it provides a clean OO interface and do a lot that this OO interface is usable. Why must the OO design artificial hidden from the programmer?
- classes are expandable: with ruby I can add or remove class methods at runtime or even modify already existing class methods. A really great feature which is really handy in some circumstances – where is this feature in python?
Sure, I am no python hacker but from the programmer experience the ruby interface seems quite more matured and cleaner.
666.factorial
NoMethodError: undefined method `factorial' for 666:Fixnum
class Fixnum
def factorial
(1..self).inject { |a,b| a * b }
end
end
23.factorial
25852016738884976640000
Computer Weekend and ...
- Published in: uncategorized
- | Time: 21:06:57 CEST
- | SHA1: 9a8f880c1a37cb67b373c3b907d915d8d5daeb41
... Münchner Stadtlauf as a distraction (but only 10 kilometers).
PS: the compiler optimization patch is now mainline.
TCP Window Scaling and SYN Cookies
- Published in: networking
- | Time: 23:11:11 CEST
- | SHA1: 3189e37dc4a5203f069f0ba70079499460514134
During some off-line discussion with Florian – one of the main developers of TCP SYN cookies – I was a little bit skeptic about the mechanism and the interplay with the TCP window scaling option.
First I will describe these two mechanism; later on I will discuss their relationship and interplay. At the end I will discuss the regression and possible solutions. I shifted this off-line discussion to the kernel ml because it is not that trivial as it sounds.
TCP Window Scaling
This TCP extension was introduced by RFC 1323 (TCP Extensions for High Performance) and expands the 16 bit window to an effective 32 bit window. This option specify a logarithmic scale factor which is applied to the received and transmitted window. Receive and send window scale factor are established separately in each direction. This factor is fixed at the three way handshake (in the SYN and SYN/ACK packet ) and cannot be changed during the TCP session. The scale factor and therefore the maximum receive window is determined by the maximum receive buffer space. Linux for example check the maximum possible receive memory in bytes and level the window scale factor based on this value (sysctl_rmem_max and sysctl_tcp_rmem).
The actual window size is calculated each time a TCP packet is transmitted via
tcp_output.c:tcp_select_window() and advertise the amount of free space in
the receive buffer (under consideration of RFC1323 scaling is applied). The
algorithm never shrink the offered window – conforming to the RFC 793. This
buffer is sticked to exactly one socket. Expanding the window is more
complicated, RFC 1122 says:
the suggested [SWS] avoidance algorithm for the receiver is to keep RECV.NEXT + RCV.WIN fixed until: RCV.BUFF - RCV.USER - RCV.WINDOW >= min(1/2 RCV.BUFF, MSS)
Means that the window is never raised on the right side until at least memory is available to increase it at least MSS bytes. This RFC statement is a little bit unlovely because it breaks the header prediction algorithm but this is another topic. ;)
Last but not least the actual window calculation is not purely based on actual available and advanced allocated memory, it is also based on window clamping. Which is roughly 3/4 the size of the receive buffer minus the size of the application buffer minus the maximum segment size. In the presence of dynamic window sizing the window clamping is a little bit more complicated because the memory control is more dynamic and so on …
SYN Cookies
SYN cookies are one answer to reduce the “effective area” of an network stack
socket listening in the wild. Back in 1990 TCP SYN flood attacks where the new
black and form a not irrelevant problem for network serves. The attack is extreme simple: craft a
lot of SYN packets, sent it to a open listening socket – leave the socket in
the half-open state – and wait until the server resources are exhausted. The
problem is that the initial SYN packet signals the network stack to initiate a
connection which requires a lot of resources. This normally includes all
received TCP socket options, transmitted (in the SYN/ACK packet) socket options
and a lot of other bookkeeping stuff. Under Linux struct tcp_sock is nearly
2000 byte big! It is imaginable that 200k SYN packets can exhaust the whole
memory pool of the server (we are talking about kernel memory which is bound to
1GB at 32 bit architectures without EMT64 support) and no other legitimate
connection can established.
D. J. Bernstein come to the conclusion that holding the initial state is the problem and his solution was to hold no state at the server side until the third packet in the three way handshake is exchanged. The idea was to encode the important data received in the initial SYN packet in the TCP timestamp option in the second (SYN/ACK) packet. The peer is engaged to piggy back this timestamp in the third packet (the ACK packet) where the server can reconstruct the information. Assumption: the peer supports the TCP timestamp option, but this is standard today and was more or less standard back in the 90’s.
But encode all information in the timestamp option is not trivial because RFC 1323 requires that the clock increases, a dump replacement is not sufficient. So the values are encoded in the following manner: top 5 bits encodes the timer counter, followed by 3 bits which encodes the MSS of the client side and finally a 24 bit server-selected secret function of the client IP address and port number. This is required because it must be prevented that the client can control the network stack by sending crafted timestamp options, therefore some “cryptography” must be employed. It is obvious that in few bits not the whole information can be encoded – this is a shortcoming of the mechanism.
This was the original idea of Dan. But ideas involve and nowadays the situation
is more well-thought-out: Linux per default uses mini sockets (struct tcp_request_sock;
~100 byte) to hold the minimal initial state after SYN
packet. This provides the whole feature set but requires only a few bytes. The
cookie mechanism – if enabled – is inactive if the network stack don’t
experience any memory pressure. If the network stack suffers from memory the
cookie mechanism becomes active. Some final word: Linux nowadays encodes
additional values in the SYN cookie: window scaling, selective acknowledgements
and so on.
BTW: if you are not trained you will not even realize that SYN cookies are active:
13:51:04.582464 IP 127.0.0.1.57985 > 127.0.0.1.4050: S 1061746051:1061746051(0) win 32792 <mss 16396,sackOK,timestamp 0xfffea013 0,nop,wscale 6> 13:51:04.582478 IP 127.0.0.1.4050 > 127.0.0.1.57985: S 2800702917:2800702917(0) ack 1061746052 win 32768 <mss 16396,sackOK,timestamp 0xfffe9f66 0xfffea013,nop,wscale 6> 13:51:04.582480 IP 127.0.0.1.57985 > 127.0.0.1.4050: . ack 1 win 513 <nop,nop,timestamp 0xfffea013 0xfffe9466> 13:59:19.047306 IP 127.0.0.1.45979 > 127.0.0.1.4050: S 218483035:218483035(0) win 32792 <mss 16396,sackOK,timestamp 0x0001bed4 0,nop,wscale 6> 13:59:19.047320 IP 127.0.0.1.4050 > 127.0.0.1.45979: S 1141094138:1141094138(0) ack 218483036 win 32768 <mss 16396,sackOK,timestamp 0x0001bd66 0x0001bed4,nop,wscale 6> 13:59:19.047322 IP 127.0.0.1.45979 > 127.0.0.1.4050: . ack 1 win 513 <nop,nop,timestamp 0x0001bed4 0x0001bd66>
Do you realize the “conspicuousness” in the lower nine bits of the timestamp in thesecond SYN/ACK packet? ;-)
Network Stack Regression
The detected regression in the current network stack arise from the circumstance that that their is a race between the SYN/ACK where we initial force a particular window scale and the next time where we recalculate the window via tcp_select_initial_window().
If the user change net.core.rmem_max or net.ipv4.tcp_rmem in between this time, the recalculated window scale (rcv_wscale) can be smaller. But the receiver still operates with the initial window scale and can overshot the granted window – and bang.
There are several solutions:
- encode
rcv_wscaleinto the SYN cookie and don’t recalculate the scaling factor viatcp_select_initial_window()or
*disable window scaling and don’t transmit any scaling option when SYN cookies are active. The later option is not that defective as it sounds. Even if the server suffers from memory the window scaling becomes insignificant.
Namespace Police
- Published in: uncategorized
- | Time: 00:44:59 CEST
- | SHA1: ef74ce47cee184e50b375d1f303046f90a1f670e
Eric Dumazet <eric.dumazet@gmail.com> wrote: > ss is kind of "netstat" with advanced features. Someone call the namespace police!
especially true for a German … ;-) But the tool is old as hell, strange that akpm hear from this tool the very first time …
IETF: TCP Authentication Option
- Published in: ietf
- | Time: 01:51:05 CEST
- | SHA1: f04c9f38cd00b25ee685ea252458abbfe5924c85
Just now, two new Request for Comments are now online available:
- RFC 5925 – The TCP Authentication Option
- RFC 5926 – Cryptographic Algorithms for the TCP Authentication Option (TCP-AO)
These two standards (and probably upcomming enhancements in several years ;) are the replacement for TCP MD5 Option (RFC 2385). TCP-AO specifies stronger Message Authentication Codes to protect against replay attacks for long lived connections like BGP sessions. It is a generic contaier where other authentication codes can be used. Several other aspects are adjusted too like an extended sequence number mechanism (imaginable as shadow registers) and IPv6 support.
Florian Westphal and I held a presentation where we mentioned TCP-AO at that time upcoming standard: Trends und Neuerungen bei der Protokollentwicklung
The Internet is now a more safer place … ;)
Finally BPF Optimization Is On The Run
- Published in: networking
- | Time: 01:58:17 CEST
- | SHA1: 1e62996ee1ba240fb9e85e15bd1cb981f22f4e83
Yesterday I submitted all patches, one for linux-kernel and one for netdev. Normally git send-email saves me and automatically cc all stakeholders (if you construct you patch correctly) but this time I missed the whole kvm maillingslist and all involved stakeholders – shit happens. ;-) I furthermore asked davem how about to use of a special, but proprietary gcc extension. As I think we can gain some performance gains if we permit this extension – let’s wait and see …
![]()
- Published in: uncategorized
- | Time: 23:48:15 CEST
- | SHA1: 1142e2a0332119785b44a8bd13b123b49ed20205
The Munich Symphony Orchestra will perform during Apil 2011 the Lord of the Rings Epos with chorus, soloists and more then 250 assistants – sounds great!
![]()
Source Code Management to the finest, git
- Published in: uncategorized
- | Time: 02:10:57 CEST
- | SHA1: a7ce89c8644768b501e58ccdc0021a98c636686a
I never realized it because it is courteous in open source projects with more then one programmer to write a meaningful commit message (this is no matter of course in commercial projects). I just noticed that git prints the following message:
# Please enter the commit message for your changes. Lines starting # with '#' will be ignored, and an empty message aborts the commit. # On branch kvm-accelerator [...]
Quite nice … ;)
How to Cc in Patches
- Published in: uncategorized
- | Time: 02:20:35 CEST
- | SHA1: a4632081727505810a04f10fb42d27d5ed62216f
Nearly 99 percent of all kernel patches touch someones code. Someone is often the maintainer of the subsystem or another developer who is responsible for special tweaks. So the patch should not limited to some maillinglist but should also be addressed directly to the stakeholders of the code. If you are familiar with the current subsystem you know how is the stakeholder and it easy to Cc the right people. On the other hand if you are working within complete unfamiliar code it is not that easy.
The most simple strategy is to look at source code and annotate the lines. Beside that it is of interest who contributes the lion shares. The following git command displays the top 3 contributors since kernel version 2.6.12 for the KVM directory:
git shortlog --email v2.6.12.. -sn arch/x86/kvm/ | cat -n | head -n 3
The kernel version as well as the restriction of the source code files must be adjusted for the personal needs – of course.
- Published in: programming
- | Time: 15:14:27 CEST
- | SHA1: 41a546cb5742abb073acacaebf3674e0c0cb6bf2
Occasionally I am forced to build larger parts of kernel, not only the “hot area”. This is mainly
caused trough modified header files at a prominent position (e.g. tcp.h) or through some
updates where the timestamps are updated. Depending on the current workload I
restrict the CPU usage by not provide the full SMP load via -j N.
But this is one kind of load – the other is IO/disk load. So as I said sometimes I
want restrict IO usage too. Via ionice it is possible to set the program IO
scheduling class:
- Idle, get only get disk time when no other program has asked for disk IO
- Best effort (aka none class), standard IO class for a program. A feature of
this class is that an IO priority can be assigned. Range from 0 (highest) to
7 (lowest). Within the same priority IO is scheduled in a round robin fashion.
Without any modification the standard IO priority is derived from the CPU nice level:
(cpu nice level + 20) / 5 - Real time, behaves similar to the realtime CPU class and can starve the IO subsystem. I have no use case and therefore never used it.
If I want to reduce the IO impact of the compile process I type the following
command to set the IO scheduling class of the shell to idle. All derived
processes (via fork()/exec()) will inherit the scheduling class.
ionice -c3 -p$$
PS: it works only with the completely fair queueing IO scheduler (cat /sys/block/*/queue/scheduler)
L7 Filter IRC
- Published in: uncategorized
- | Time: 22:15:53 CEST
- | SHA1: a49a6b8d93edba3a414e4fa35dda180d9a564712
(17:17) < fw> pakete kriegt das ding via libnetfilter_queue, was prinzipiell schon richtig ist (17:18) < fw> Typische loesung ist, nur "noch-nicht-klassifiziert" pakete zu queuen, und nach sagen wir 4k aufhoeren (17:18) <hgn> das wird eine baumstruktur (17:19) < fw> so sollte es sein (17:19) <hgn> genau (17:19) < fw> aber fuer l7filter sind alle pakete gleich, und alle pattern auch (17:19) < fw> und das ist schlecht[tm] (17:19) <hgn> ok (17:19) <hgn> das ist wirklich schlecht (17:19) < fw> und es ist c++ ;)
Hypervisor Time Machine Call Graph
- Published in: programming
- | Time: 00:06:45 CEST
- | SHA1: deeedfd8e42dbafdd7f0d2701b7ed014909ded29
Finally: a dream comes true. Doubling or even increase tenfold the clock frequency of your CPU without costs by modifying some variables. Drawback it just works under use of a hypervisor; a virtual environments under kvm. ;-)
Just kidding, but as explained yesterday in a blog posting my idea was to
introduce a decoupled behavior of the virtual clock. Actually I had no notion
to explain the implementation in detail. Just some aspects: the mechanism does
not use a utilize a virtual interrupt, on the contrary a memory page is mapped
into the guest. The patch modifies some places, in the hypervisor as well as in
the guest implementation. Currently a multiplier (called time_machine_mult)
control the clock rate but there are some improvements possible, especially if
several guest instances are run in parallel. Network test setups for example
connected via VDE with a unify time source requirements will need a more
synchronized environments.
After some beautyfications I will push the patch upstream.
![]()
Virtual Machine Time Lapse
- Published in: realtime
- | Time: 20:10:17 CEST
- | SHA1: 2cbbbc31193d1f6eaeeb084749b961febd0678e0
Today I had the idea of accelerate respective decelerate the time for virtual guest machines. The idea comes into mind with time discrete simulators where the wall clock behavior is not required or rather cumbersome. Linux sophisticated kernel virtualization infrastructure is kvm (Kernel-based Virtual Machine, where Intels VT-x or AMD-V native virtualization technique is used).
Several timing systems are supported by QEMU and the kernel side:
- PIT, an emulation of
Intels i8254Programmable Interval Timer, normally with three 16 bit counters, modern back ini8086CPUs days - RTC, the real time clock with a interval of 32768 ticks per second to advance the time system reported by interrupt number 8. A really comprehensive documentation can be found at Documentation/rtc.txt
- TSC, a 64 bit wide counter (time step counter) and present on all x86 processors. Based on the famous
rdtscinstruction, but as usual the disadvantages ofrdtscalso employees for kernel side clock keeping: not CPU hibernation aware, core migration problems and so on. The Kernel on the other hand will do everything to analyze the behavior of the RTC. If the analyze results in a inaccurate behavior the RTC clock is not used. Some days ago Thomas Gleixner talked about the problems of TSC and why he dissuade user space applications from using it. - HPET, the high precision event timer is the new kind on the block and take over the RTC. HPET has a ~10MHz 64 bit wide counter and 3 independent 64-bit comparators as well as 29 additional 32-bit comparators for random interrupt generation. My hardware HPET announce itself as
hpet0: 4 comparators, 32-bit 14.318180 MHz counter(seex86/kernel/hpet.c)
My KVM environment provides no stable RTC clock source so it switch from RTC to HPET emulation mode.
VMWare provides a good overview about timekeeping and virtualization: vmware_timekeeping.pdf
So currently I am working at the mentioned feature but high resultion timer implementation as well as the clock diversity zoo make life not easy. The code lacks of documentation too, so often only reverse engineering provides the required information and this decelerate the flow …
Hypervisor IRC Fun
(22:37) <fw > hast du nen kvm im kvm, oder fummelst du nur am guest kernel rum? (22:37) <hgn> modular, ich lade die hypervisor abstraktion dynamisch via kvm.ko und kvm-amd.ko (22:38) <hgn> was kann schon passieren, jetzt bleib mal auf den boden! ;) (22:38) <fw > harhar (22:38) <fw > *hgn has quit irc [connection reset by peer] (22:39) <fw > hrhr
GCC generated Switch Jump Tables
- Published in: programming
- | Time: 22:47:12 CEST
- | SHA1: 181a9d3b50a4eab19a2019f64bf4c9da448fcb3f
The disassembled output of filter.o looks not cache-line friendly, with the
current generated code two cache-lines must be hit—at least. A switch
statement with two case statements is translated into a sequence of conditional
branches like this (arch: x86_64):
4b8: 8b 06 mov (%rsi),%eax 4ba: 66 83 f8 35 cmp $0x35,%ax 4be: 0f 84 d8 02 00 00 je 79c <sk_run_filter+0x325> 4c4: 0f 87 07 01 00 00 ja 5d1 <sk_run_filter+0x15a> 4ca: 66 83 f8 15 cmp $0x15,%ax 4ce: 0f 84 cd 02 00 00 je 7a1 <sk_run_filter+0x32a> 4d4: 77 73 ja 549 <sk_run_filter+0xd2> 4d6: 66 83 f8 04 cmp $0x4,%ax 4da: 0f 84 1f 02 00 00 je 6ff <sk_run_filter+0x288> 4e0: 77 29 ja 50b <sk_run_filter+0x94>
The whole switch/case jump construct in run_filter() eat exactly 567 byte .text memory! But
why does gcc not generate a jump table (also known as branch table)?
The basic gcc algorithm for switch constructs is simple: less then 5 case statements forces gcc to generate code as explained above. More then 5 case statements command gcc to use a jump table – if the labels are in some degree stick together(dense). A heuristic will calculate if the degree is sufficient. And this kicks in here, the labels are exposed and filling the gaps with defaults is to costly. So keep the labels close to each other to make this optimization possible for gcc! gcc generates a small data section where the addresses of all labels are stored and the finally switch construct leads to a simple jump (I simplified the code):
cmpl $4, %eax ja .L8 jmp *.JUMPTABLE(,%eax,4)
The advantage is a reduced complexity from O(n) to O(1). The threshold is 5 cases if less then 5 a binary search is applied, more then 5 leads to the jump table.
The alternative is illustrated in the first listing where each label is a sequence of compare and jump instructions. The labels are GCC internally encoded as a binary tree data structure which results in the end in a ordered tree. Finally, small switch statements can generate a series of bit instructions and branches.
gcc permits to disable jump table generation via -fno-jump-tables. For
example if you do branch analysis or program other code analyzers the jump
table trick is not easy to parse.
See gcc/stmt.c, especially expand_case() is quite interesting. A simple
jump table optimization by subtracting the minimal label value is only applied
when optimization for speed is enabled—strange!
CPU Cycles versus Cacheline Miss
- Published in: programming
- | Time: 22:44:24 CEST
- | SHA1: ed5270619a581e39dfa90db4432100871b564bca
Optimization changed over time – optimization today does not address CPU
cycle/instruction, rather reduced memory transaction are the key to success.
Keep the code small and keep the code in the cacheline and reduce memory loads.
A cache miss is equivalent to 100 instructions during a CPU stall. Therefore
keeping the .text footprint small can boost your application more then some
sophisticated CPU tweaks. Pahole provides a feeling how the .data segment is
constructed, how structs are aligned in memory, it provides information if
holes in structs exists and how the linker align the data at boundaries. This
scratches the surface of optimization techniques and the kernel uses highly
sophisticated techniques to optimize for memory transactions: false sharing of
elements which are used mainly readonly (put it in another .data section),
align data on different cachelines to avoid false sharing, UNinline functions
to reduce the memory footprint and so on.
One of the best books to understand this kind of optimization is called “UNIX Systems for Modern Architectures – Symmetric Multiprocessing and Caching for Kernel Programmers” from Curt Schimmel. This is by far the best book in this area. Beside this the optimization manuals from AMD and INTEL are worth to read it too, because they provide a in deep understanding about the actual processor specific tweaks.
IPSec and Quagga Automating
- Published in: networking
- | Time: 02:58:37 CEST
- | SHA1: cf438a14f5afa90ea060f273731423d5a091e2d3
A tiny drawback of quagga is a missing feature to signal the host environment when
routes change. So it is not simple to start/stop scripts if routes are added or
deleted. For example: if you want to add or delete a security association if a
route changes then it is not directly possible via quagga. But Linux and other
operating systems provides a monitor mode where all relevant network events can
be caught by the user space. Linux uses netlink sockets for network related
communication between user-space and kernel-space. ip, the command, can be
forced to print all network related messages to STDOUT via ip monitor all
(under BSD route -n monitor will do the same).
IRC License Discussions
Lately in the IRC after looking at some Linuxtag photos
(19:43) < h**> http://www.flickr.com/photos/linuxtag/4699311928/ (19:44) < h**> thomas und drei linux chicks! ;-) (20:02) < n***> open-source-babes sind halt immer nur son community-produkt ;) (20:04) < l***> he, das waren nur die angestellten von Beach at the box (20:14) < c***> n***: na dafuer darfste open-source-babes nach belieben nutzen und weiterreichen ;) (20:17) < k**> yeah, und wenn sie unter einer bsd-lizenz steht darfst du die auch kommerziell weitervertreiben (20:17) < w*****> :D (20:21) < h**> ;) (20:31) < n***> jo, GPL-babes haben den weiteren nachteil, dass man jegliche modifikationen, die man an ihnen vornimmt, publizieren muss ;) (20:36) < h**> warum n0-1? wenn du sie nicht weitergibst musst du die modifikation auch nicht publizieren (20:39) < n***> h**: stimmt. also am besten GPL fuer daheim (da weisste, was de hast) und BSD fuers "business" ;) (20:50) < h**> ack ;-)
On the Generating of IP Fragments
- Published in: networking
- | Time: 22:43:07 CEST
- | SHA1: 15e788d60922f821ceb64aef8cca1d0d0f5fa15a
If the IP packet size exceed the link layer maximum transmission unit (MTU, 1500 byte for Ethernet v2) a packet must be fragmented. Yesterday Changli Gao submited a patch that optimize the reassembling process in the kernel. Changli assumption was that fragments arrived in accureate ascending order. He optimized the processing behavior by optimizing the list processing of inet_frag_queue for IPv4 and IPv6 by introducing an additional pointer, called fragments_tail that points to the end of the list. So nothing sophisticated here.
The main point against his patch was the following argument: Linux generates
since more then 10 years fragments in descending, in other words reverse order
– starting with fragment n, n - 1, n - 2, .... So for Linux this is a noop.
But Linux isn’t the only OS on the earth and as stated by Changli all other
major operating systems generate fragments in ascending order.
I don’t want to write about the Path MTU discovery (PMTUD) mechanism, but to manually
probe for a path MTU it is quite comfortable to use ping or ping6, simple
specify a packet size and set the DF bit in the case of IPv4: ping -c 1 -M do -s 1464 google.com.
1506 bytes on wire, 14 byte Ethernet header, 20 byte IPv4 header, 8 byte ICMP
header, 1464 byte payload; this is a PPP limitation, Ethernet v2 can transport
1514 bytes on wire and if you do local PMTUD you can discover this limit.
BPF Analysis II
- Published in: networking
- | Time: 11:28:04 CEST
- | SHA1: 5b3afa188071e52fe3b70f7c434bd28b278fbffd
| Filter Rule | Processing Time |
|---|---|
| no filter | 469.231535129662 us |
| “icmp” | 524.893505343705 us |
| “icmp and host 127.0.0.1” | 367.231761850006 us |
| “icmp and host 127.0.0.1 and ‘ip6 = 64’” | 598.401917125821 us |
| “icmp and host 127.0.0.1 and ‘ip6 = 64’ and ‘ip[2:2] > 1’” | 649.767235335359 us |
Processing delay sampling (needs smoothing, of course):
![]()
The generated OPCODES:
icmp
(000) ldh [12] (001) jeq #0x800 jt 2 jf 5 (002) ldb [23] (003) jeq #0x1 jt 4 jf 5 (004) ret #65535 (005) ret #0
icmp and host 127.0.0.1
(000) ldh [12] (001) jeq #0x800 jt 2 jf 9 (002) ldb [23] (003) jeq #0x1 jt 4 jf 9 (004) ld [26] (005) jeq #0x7f000001 jt 8 jf 6 (006) ld [30] (007) jeq #0x7f000001 jt 8 jf 9 (008) ret #65535 (009) ret #0
icmp and host 127.0.0.1 and ‘ip6 = 64’
(000) ldh [12] (001) jeq #0x800 jt 2 jf 11 (002) ldb [23] (003) jeq #0x1 jt 4 jf 11 (004) ld [26] (005) jeq #0x7f000001 jt 8 jf 6 (006) ld [30] (007) jeq #0x7f000001 jt 8 jf 11 (008) ldb [20] (009) jeq #0x40 jt 10 jf 11 (010) ret #65535 (011) ret #0
icmp and host 127.0.0.1 and ‘ip6 = 64’ and ‘ip[2:2] > 1’
(000) ldh [12] (001) jeq #0x800 jt 2 jf 13 (002) ldb [23] (003) jeq #0x1 jt 4 jf 13 (004) ld [26] (005) jeq #0x7f000001 jt 8 jf 6 (006) ld [30] (007) jeq #0x7f000001 jt 8 jf 13 (008) ldb [20] (009) jeq #0x40 jt 10 jf 13 (010) ldh [16] (011) jgt #0x1 jt 12 jf 13 (012) ret #65535 (013) ret #0
Skipping Adobe Flash
- Published in: uncategorized
- | Time: 11:05:12 CEST
- | SHA1: a4cadd1ef8fbc14eb38259d0ef872415ebaf8a1b
The posted flash advisory list is really long so I tried to update the player. But unfortunately Adobe skipped their 64 bit support (what a piece of software is this anyway – in 2010?) which actually means I had no change to run flash any more! 32 bit combat mode – no thank you, buggy software – no thank you. I installed the dev version of firefox with webm support which works great. OK, many videos aren’t rendered yet but this is a question of time.
Let say goodby to Adobe Flash! ;-)
- Published in: networking
- | Time: 00:19:43 CEST
- | SHA1: 9f210315c9b27ed69f3b314beef2b1bb42c76c20
Stumbling over the Urgent Pointer code in tcp_recvmsg() and reading some
specs. Urgent data allows the sender to signal the receiver that “urgent
data” of some form has been placed into the packet. The receiver on the other
hand must deal with this condition and is forced by himself to handle this
condition. If not handled the data is silently ignored. Therefore, it must be
negotiated at a higher level that urgent data is transmitted and properly
handled at the receiver side.
The urgent bit in the TCP header is set and the Urgent pointer is set to the relevant data. The urgent pointer is a 16 bit value. I wrote “pointer is set to the relevant data”, and here comes the fun into the game! First: the urgent field is an offset that points to the last byte of urgent data. It is up to the receiving application where the urgent data starts. A real world problem is the terminus “points to the last byte of urgent data”. RFC 1011: “The urgent pointer points to the last octet of urgent data (not to the first octet of non-urgent data)”. At data before can be threated as urgent – but this is application level specific.
The problem is the following: the definition of the Urgent Pointer was declared in the RFC 793 – the original TCP standard. Due to some contradictions RFC 1122 updates and correct the behavior (problem was where the urgent pointer points to, the last byte of the urgent data or the byte after the last urgent data). But: all TCP implementations in the wild implement the behavior as specific in RFC 793. The advice by Gont and Yourtchenko is that new applications employing TCP should not use the urgent mechanism at all.
BPF Filter Complexity versus Execution Time
- Published in: networking
- | Time: 00:23:11 CEST
- | SHA1: c65eec415c4e449932fc2b1c6862aff01cfe946f
Today after some time-killing IETF debates I started to analyze the in-kernel BPF filter
execution time for different BDP filters. Starting with no filter, which is
translated into a simple BPF_RET|BPF_K OPCODE till some more complex
instructions. The average execution time lies somewhere at 300ns for no filter
and somewhere above 350ns for a simple ICMP filter with 17 CPU instructions on
my x86_64 (excluding call overhead).
The next image illustrates this (statistically sampled data):
![]()
BPF Optimizer
- Published in: networking
- | Time: 00:23:11 CEST
- | SHA1: bbf84711ad4eb502c6c74f4e07ff84a4be4b6016
I started to analyse the BPF optimizer. Several options helped me: “-d” to dump the generated instructions and “-O” to disable the packet-matching code optimizer (normally only useful if you suspect a bug in the optimizer).
So my modified kernel (I will post the kernel patch after I reworked the tracing ring buffer implementation) and the tcpdump possibilities we are now in the ability to analyse exactly how the optimizer works.
So here we go
A null filter is a plain ret instruction evaluated into(000) ret #65535“ip” is translated into (no difference with w/o or w/ optimization):
(000) ldh [12] (001) jeq #0x800 jt 2 jf 3 (002) ret #65535 (003) ret #0A “tcp” filter is encoded into the following instructions:
(000) ldh [12] (001) jeq #0x86dd jt 2 jf 4 (002) ldb [20] (003) jeq #0x6 jt 7 jf 8 (004) jeq #0x800 jt 5 jf 8 (005) ldb [23] (006) jeq #0x6 jt 7 jf 8 (007) ret #65535 (008) ret #0A “tcp” filter unoptimized is encoded into this:
(000) ldh [12] (001) jeq #0x86dd jt 2 jf 4 (002) ldb [20] (003) jeq #0x6 jt 8 jf 4 (004) ldh [12] (005) jeq #0x800 jt 6 jf 9 (006) ldb [23] (007) jeq #0x6 jt 8 jf 9 (008) ret #65535 (009) ret #0
IETF TCPM Historicize
- Published in: networking
- | Time: 00:23:11 CEST
- | SHA1: 9141bbab2a4d475c6ca7d45eacf6cf16fc293d63
Lars Eggert posted today a Draft where RFC1106, RFC1110, RFC1145, RFC1146, RFC1263, RFC1379, RFC1644 and RFC1693 are declared as historic documents. But RFC1146 – TCP Alternate Checksum Options – was not superseded by a new standard nor is he defective by any sense. These both arguments are normally the statement why a RFC is declared as historic. In my eyes this is not true for this standard. In the debate Lars argued that the already assigned code points do remain assigned. This guarantee that implementations – not seen in the wild for RFC1146 – are not touched or affected.
I mentioned that TCP can have a legal desire towards a more strong checksum and as an example I mentioned interplanetary TCP. Lloyd Wood pointed out that TCP isn’t suitable for Interplanetary communication. The fact is known sure, but I know several modifications of TCP where the RTO mechanism is adjusted just because to meet interplanetary requirements!
Anantha Ramaiah pointed out a Draft that address exactly the same shortcomings: enhancing TCP checksums mechanism. Lars noticed that it Transport Layer Security (TLS/SSL) can be used to guarantee the integrity of the data even more stronger then as with a plain CRC.
The discussion with Lloyd slided away from the actual topic to some general discussion about TCP and the interplay with RTO. His research background focus on interplanetary communication and it is quite interesting to know about some real world problems.
BPF Opcode Analysis
- Published in: networking
- | Time: 00:23:11 CEST
- | SHA1: e53e544ce70703072b82ba91b97b4b45cb97df5c
The following paragraphs explain the correlation between filter rules provided by any PCAP based filter program, the resulting intermediate OPCODE representation and the kernel side interpretation. The most brilliant logic within PCAP is not the sniffing functionality nor the dump file format, it is rather the optimization logic. To eliminate useless calculations, to generate efficient instruction, to skip possible IPv4 options, jump over IPv6 extension headers and so on. At the same time the optimizer must be able to eliminate useless/duplicate expressions like “IP AND IP” (this is the most trivial example, but it can be quite complex).
Processing Sequence for some common filter expressions
No Expression
A stub filter return immediately by processing BPF_RET|BPF_K and returning
fentry->k. Where fentry->k is the requested packet length in bytes. You will
see this expression in nearly all subsequent filter.
IP Expression
- taken (ICMP generator)
BPF_LD|BPF_H|BPF_ABSBPF_JMP|BPF_JEQ|BPF_KBPF_RET|BPF_K
ICMP Expression
- taken (ICMP generator)
BPF_LD|BPF_H|BPF_ABSBPF_JMP|BPF_JEQ|BPF_KBPF_LD|BPF_B|BPF_ABSBPF_JMP|BPF_JEQ|BPF_KBPF_RET|BPF_K
- not taken (IPv6/ICMP generator)
BPF_LD|BPF_H|BPF_ABSBPF_JMP|BPF_JEQ|BPF_KBPF_RET|BPF_K
TCP Expression
- taken (TCP generator)
BPF_LD|BPF_H|BPF_ABSBPF_JMP|BPF_JEQ|BPF_KBPF_JMP|BPF_JEQ|BPF_KBPF_LD|BPF_B|BPF_ABSBPF_JMP|BPF_JEQ|BPF_KBPF_RET|BPF_K
- not taken (ICMP generator)
BPF_LD|BPF_H|BPF_ABSBPF_JMP|BPF_JEQ|BPF_KBPF_JMP|BPF_JEQ|BPF_KBPF_LD|BPF_B|BPF_ABSBPF_JMP|BPF_JEQ|BPF_KBPF_RET|BPF_K
UDP Expression
- not taken (ICMP generator)
BPF_LD|BPF_H|BPF_ABSBPF_JMP|BPF_JEQ|BPF_KBPF_JMP|BPF_JEQ|BPF_KBPF_LD|BPF_B|BPF_ABSBPF_JMP|BPF_JEQ|BPF_KBPF_RET|BPF_K
Conclusions
There are several possibilities to optimize the generated BPF filter especially in coherence with the kernel interpreter. Next step is to analyse the cache line behavior and try to align the structure for the common case and reduce memory loads.
Network Stack Hash Table Memory Consumption
- Published in: networking
- | Time: 00:23:11 CEST
- | SHA1: 0fd4142b470fc0e1ef739f46908cf161ae953ac2
I stumbled across the default hash size for the different hash tables used in the network subsystem. Hash tables are used as a efficient container for different network subsystems – compared to let say list containers. The optimal complexity of a hash table is O(1), this means access in constant time, no matter how many entries are in the container. The optimum is a theoretical value and requires a hash bucket fill level from maximum ~60 percent as well as good hashing algorithms. Higher workloads tends to collisions, this again shift the complexity of the container from O(1) to O(n). Linux provides the Jenkins Hash function as one of the best hashing algorithm. Some network components use their own hash function because they can do it better because of a good key knowledge.
The fact the the fill level should not be higher they ~60 percent is a little bit problematic. Some major hash table implementation dynamically extend or shrink the hash table size depending on the current workload to guarantee this fill level. Within the Linux kernel this is not as easy as it requires a clever locking mechanism and some other difficult quirks, therefore no dynamic approach is applied. As a consequence the default hash table size is sized for worst case workloads (e.g. server systems, ...).
If I sum up all network related hash tables nearly 0.5 percent of the main memory are use for containers:
TCP established hash table: 4194304 bytes TCP bind hash table: 1048576 bytes IP route cache hash table: 524288 bytes UDP hash table: 32768 bytes UDP-Lite hash table: 32768 byte
The .text segment of my current kernel consumes only 6241407 bytes! Free memory is bad memory—sure, but wasted memory is bad memory too.
Back In Munich
- Published in: travel
- | Time: 00:23:11 CEST
- | SHA1: 3b56bab14db4662e5a719bbe01451a45eab684bb
Three wonderful days in Paris/France: nice people, great food and gorgeous historic sites. A really lovely place on earth!
TCP Minimum RTO
- Published in: networking
- | Time: 00:23:11 CEST
- | SHA1: f60582f9d0434e66c538d38033bcecdc4b8727df
Actually $user posted a regression that he measured a RTO of less then 200ms via tcpdump. Normally this is not possible because Linux bounds the minimum to 200ms. So lets see what the actual trace offers.
RFC 2988 specifies that the minimum TCP Retransmission Timeout (RTO) SHOULD be 1 second. The relative large value was selected to keep TCP conservative and avoid spurious retransmissions. The RFC was written back in 2000 and things changed. The timer and clock granularity of current operating systems is more accurate and allow a more fine grained minimum. Furthermore, analysis had demonstrated that a RTO of 1 second badly breaks throughput in environments faster then 33kB with minor packet loss rate (e.g. 1%).
Therefore current Operating Systems use a RTO smaller then 1 second. Linux defaults to 200ms and FreeBSD to even 30ms. At least Linux implement a algorithm called Forward RTO Recovery to detect spurious timeouts. This also permits the use of a reduced RTO and make the 1 second suggestion superfluous.
Also of impact is the interplay with Delayed ACKs: The TCP Minimum RTO Revisited
UPDATE: $user emailed the tcpdump traces and no spurious timeout triggered packet is identifiable:
000027 IP 1.46799 > 2.47500: 168809236:168812132(2896) ack 1 win 46 <nop,nop,timestamp 1375977402 2077240628>
000024 IP 1.46799 > 2.47500: 168812132:168815028(2896) ack 1 win 46 <nop,nop,timestamp 1375977402 2077240628>
000029 IP 1.46799 > 2.47500: 168815028:168817924(2896) ack 1 win 46 <nop,nop,timestamp 1375977402 2077240628>
000077 IP 2.47500 > 1.46799: ack 168786068 win 16200 <nop,nop,timestamp 2077240629 1375977402>
000006 IP 2.47500 > 1.46799: ack 168787516 win 16200 <nop,nop,timestamp 2077240629 1375977402,nop,nop,sack 1 {168801996:168804892}>
000002 IP 2.47500 > 1.46799: ack 168787516 win 16200 <nop,nop,timestamp 2077240629 1375977402,nop,nop,sack 1 {168801996:168806340}>
000001 IP 2.47500 > 1.46799: ack 168787516 win 16200 <nop,nop,timestamp 2077240629 1375977402,nop,nop,sack 1 {168801996:168809236}>
000002 IP 2.47500 > 1.46799: ack 168787516 win 16200 <nop,nop,timestamp 2077240629 1375977402,nop,nop,sack 1 {168801996:168812132}>
000003 IP 2.47500 > 1.46799: ack 168787516 win 16200 <nop,nop,timestamp 2077240629 1375977402,nop,nop,sack 1 {168801996:168815028}>
000001 IP 2.47500 > 1.46799: ack 168787516 win 16200 <nop,nop,timestamp 2077240629 1375977402,nop,nop,sack 1 {168801996:168817924}>
000203 IP 1.46799 > 2.47500: 168817924:168819372(1448) ack 1 win 46 <nop,nop,timestamp 1375977403 2077240629>
000028 IP 1.46799 > 2.47500: 168819372:168822268(2896) ack 1 win 46 <nop,nop,timestamp 1375977403 2077240629>
000025 IP 1.46799 > 2.47500: 168822268:168825164(2896) ack 1 win 46 <nop,nop,timestamp 1375977403 2077240629>
000024 IP 1.46799 > 2.47500: 168825164:168826612(1448) ack 1 win 46 <nop,nop,timestamp 1375977403 2077240629>
UPDATE II: so after a more exact analysis the posted regression looks like false alarm – so ignore this post.
Label as Values and Interpreter Optimizations
- Published in: programming
- | Time: 00:23:11 CEST
- | SHA1: b8717648e84ea2892532b5f047dfb94dfca3c1dc
The GNU Compiler supports since several releases a unofficial C feature called Label as Values. This feature offers the ability to retrieve the address of a label and use it in a unconditional branch construct. foo, bar and hack are labels:
static void *array[] = { &&foo, &&bar, &&hack };
[...]
goto *array[i];
The saved condition comes with its one costs: this form is not simple to read compared to simple switch/case statements, especially in real world examples. Anyway this feature offers the ability to get rid of some CPU cycles by substitute the conditional branch. A really nice employment for Label as Values are interpreters, the following code fragment illustrate this:
static int interpret(Opcodes opcodes) {
static const void *codetable[] =
{ &&RETURN, &&INCREMENT, &&DECREMENT, &&DOUBLE };
int result = 0;
goto *codetable[*(opcodes++)];
RETURN:
return result;
INCREMENT:
result++;
goto *codetable[*(opcodes++)];
DECREMENT:
result--;
goto *codetable[*(opcodes++)];
SWAPWORD:
result = (result << 16) | (result >> 16);
goto *codetable[*(opcodes++)];
}
I found this example on the bugzilla site for llvm (http://llvm.org/bugs/show_bug.cgi?id=3120).
llvm support this language feature since some time but internally implemented it using a switch table. So no performance gains are expected for LLVM. Sidenote, LLVM internal generated switch tables where often superior to the hand crafted switch/case counterpart. This changes over time and llvm now supports full Label as Values support, current values depending on the architecture are quite impressive. To sum up: interpreted switch is almost always faster on all workloads – exception: llvm on x86_64. The duel between GCC and llvm are quit variable, depending on the architecture, if switched or threaded and so on.
I looked at the Berkely Packet Filter (http://lxr.linux.no/#linux+v2.6.34/net/core/filter.c) implementation at the Linux kernel but the current structure allows no straight forward modification of the code. Labels are currently compile time calculated bit expressions.
new blog software
- Published in: uncategorized
- | Time: 00:23:11 CEST
- | SHA1: 61fce0352996e5a91b5e8f6e814936d421bb7bfa
Yesterday as well as today I worked meticulously to build a own blog software. This post is the first post on a self-hosting basis. In the next couple of days I will publish the software via git as usual. The technique is simple: blog posts are written with your editor of choice (e.g. vim), after that the file is checked into the git repository (database and versioning is done via git) and after that a simple ruby script generates all html pages (via git post hooks). Simple, lightweight and functional.
One tiny extension is image recognition. Each blog post have impact on two pages: the post specific page and the index.html. To link management for image links is done by textile – path substitution is required.
Anyway, the big blocks are already implemented: html file generation, category archive, year archive and last but not least RSS feed (xml 2.0) generator.
Linux ML TCP Fairness Controversy
- Published in: networking
- | Time: 00:23:11 CEST
- | SHA1: 4ba150ee880fde192de7fe34a4acb0dc7b7ff3f7
TCP fairness discussions are quite frequent for the Linux network maillingslist (netdev). This time Tom Herbert from google posted a patch where the TCP Congestion Window (CWND) can be modified through setsockopt(3). It is not easy to explain the impact of this change because it requires a lot of background information about how TCP works. To shorten the story: this option can – if employed to aggressive – lead to serious performance problems of the Internet. Network components these days are forced to behave good natured. If network components ignore the feedback or information from the network the Internet can collapse. This knowledge is not new at all – in contrast – the Internet was collapsed in the 80s because no congestion control mechanism was integrated. Since this time the cooperative behavior is a basic requirement for the functioning of the Internet.
Over time people tend to neglect the obvious and current network stack implementation soften the requirements. Some months ago another discussion about the plugable congestion control algorithms where similar. Some congestion control algorithms are unfair by default (e.g. BIC in environment with small RTT). I posted a patch to restrict the unfair behavior to root and provide only superuser the capabilities and freedom to do what he want. This patch effective disabled any changes by any other user. Stephen Hemminger negate and pushed another patch. Now the situation is that user who can compile and replace the kernel can enforce a strict or less-strict behavior. Less strict means that every user can select a congestion controll algorithm via setsockopt() his own congestion control algorithm, strict means that the user is forced to use the default congestion control algorithm (e.g. NewReno, Westwood). By the way: congestion control algorithms aren’t the only place where unfair protocol behavior is possible.
But back to the current debate: Tom Herbert patch where declined and some fundamental Internet philosophies where discussed. The most impressive post for me was from Denys Fedorysychenko:
In Lebanon i have around 30k users behind few IP addresses(around 6, for web). Because backbone here $1200/Mbit, and satellites mostly(rtt 400+ ms)... so TCP accelerators and caching proxy a must. Tproxy doesn’t work well yet to use full set of ip’s.
The whole debate can is archived here: comments.gmane.org
IPv6 Again and Again
- Published in: networking
- | Time: 00:23:11 CEST
- | SHA1: 591c5fde59c5aa8180fda1ad59a09858f2f2ff28
Because I see it again and again (to highlight it: I mean in NEW software):
- you MUST use sockaddr, sockaddr_in and sockaddr_in6 since the beginning.Don’t use uint32_t or even worse unsigned long for IPv4 storage. But as you realize this is not protocol independent: sockaddr_in and sockaddr_in6 are container for IPv4(IP Version 4) and IPv6(IP Version 6). You will end up in switch/case statements to distinguish between both protocols. The superior solution is to use struct sockaddr_storage (big enough to also handle UNIX sockets). Currently there is nearly no excuse to use the protocol dependent containers. There are a few exceptions where (e.g. where sockaddr_storage will eat up to much memory) but this is really, really rare.
- gethostbyname() and gethostbyaddr() must be replaced (and also getipnodebyname and getipnodebyaddr)
- MUST also be replaced inet_ntop(), inet_pton() and inet_aton() (e.g. they they do not support scoped IPv6 address)
- in6_addr isn’t guaranteed sufficient to identify a node. The scope information (the interface information) is missing
- Name resolution? use getaddrinfo() and getnameinfo()!
and last but not least: do not hardcode AF_INET or AF_INET6!
MPI, Infiniband and IB cards
- Published in: uncategorized
- | Time: 00:23:11 CEST
- | SHA1: 7f0acd5dfbd49ba8599fb74cecb350f3e296ffd6
Normally you should interconnect cluster nodes via links with a short delay characteristic. For this purpose protocols like infiniband or IB cards are invented. OpenMPI’s default assumption is that nodes are connected in this manner. Therefore, if someone use other link interconnects (like GigE or local CMP/SMP work mode) the openmpi warns about that circumstance.
If you want to shut up mpi messages call your program with “—mca btl \^udapl,openib” as an argument to mpirun:
mpirun --np 4 --mca btl \^udapl,openib build/debug/examples/adhoc-olsr
ns3 Mercurial Repository
- Published in: networking
- | Time: 00:23:11 CEST
- | SHA1: 0dd1677cc7603cb75c3804f571fe194d3f7dd5aa
To clone the parallelized ns-3 branch you can type the following:
hg clone http://code.nsnam.org/pfeifer/ns-3-para
- after some days you can update the repository via hg pull http://code.nsnam.org/pfeifer/ns-3-para
- and update the local copy hg update
BTW: to push the changes upstream type:
hg -v push ssh://pfeifer@code.nsnam.org//home/pfeifer/repositories/pfeifer/ns-3-para
Netsend TCP Options
- Published in: networking
- | Time: 00:23:11 CEST
- | SHA1: 842e0927bec4b6fee1a8b7ab3febad01a28e268c
Because of massive flaming I reintroduced the TCP Option to netsend again. But this time not via the normal commandline options (this SHOULD be implemented as well), in opposite I introduced a new option “-O ”. The “O” option (like Output) is introduced to replace the current machine parseable output.
The advantages are clear: the current machine output use a fixed standard format that cannot be reordered or shrinked because of backward compatibility. The new O option let the user the ability to output only requested values. The output is more easy to parse for gnuplot (e.g.) and also to replace the human output because the user can focus on the really interesting values. Examples? NULL problemo:
while true; do ./netsend tcp rec;done &
./netsend -O "%n bytes transmitted/received in %t seconds\n" tcp t file localhost
./netsend -O "ECN: %Te SACK: %Ts" tcp t /etc/Muttrc localhost
2am and still bug fixing
- Published in: programming
- | Time: 00:23:11 CEST
- | SHA1: ba557b6b355efdf8c65ee8101a2caed68c196658
today in the evening I received the following bug report for libhashish (especially the bloom filter implementation):
[...] ../include/libhashish.h:116: syntax error before "pthread_rwlock_t" ../include/libhashish.h:116: warning: no semicolon at end of struct or union [...]
Looks pretty forward, but hold on! One line above in the spotted code are some tricky compiler forward declarations so we checked this first with some older GCC versions – no success. After that we focused on glibc and their threading support for older versions and voilà: some today really outdated glibc versions (especially the linuxthread ones, the NPTL predecessor) require to define a additional constant.
At the end: thank you for spotting the bug and why for god’s sake use CISCO premature software?
MD5 Options Support
- Published in: networking
- | Time: 00:23:11 CEST
- | SHA1: 95ebf7a12beb04aa8140922de94ddd08af84f08a
Netsend supports now the MD5 socket option (TCP_MD5SIG) for TCP based sockets. This options was introduced to protect long running TCP sessions like BGP RFC 2385. The exchanged digest are introduced to hold a shared secret and prevent spoofing attacks. Only several applications utilize this option – netsend nowadays do also! ;)
Hash Library Locked Now Thread Clean
- Published in: algorithm
- | Time: 00:23:11 CEST
- | SHA1: a8c1094854f2291d93fadc29773123ad1a58da1d
Libhashish is a hash library for UNIX witch support nearly all functionality seen in other hash table implementations. Furthermore, libhashish implement a lot of different ideas beside the normal hash list chaining strategy. It implement array chaining, list and array chaining with reordering functionality (most often referred elements first) and double key hashing (to lower O(n) complexity in the hash key compare function (normally strcmp()).
Yesterday I added mutex lock support so that the library is now thread clean!
Skeleton Parser For DCCP and TIPC
- Published in: networking
- | Time: 00:23:11 CEST
- | SHA1: 6a0bcc8cd7b458863812e33711273a35c69ff766
Netsend RX and TX code-paths for the DCCP and TIPC protocols are now re-implemented. They where adjusted because of the new parser engine within netsend.
svn diff -r 177:178 https://svn.berlios.de/svnroot/repos/netsend/trunk
For the small patch.
CLI options parsing
- Published in: networking
- | Time: 00:23:11 CEST
- | SHA1: 89026f14001ebe6ace1a20ea2c95f6f663d8e71c
I reworked the netsend commandline parsing code. It now support a more network specific user input. See the following screenshot for netsend help:
Usage: netsend [OPTIONS] PROTOCOL MODE { COMMAND | HELP }
OPTIONS := { -T FORMAT | -6 | -4 | -n | -d | -r RTTPROBE | -P SCHED-POLICY | -N level
-m MEM-ADVISORY | -V[version] | -v[erbose] LEVEL | -h[elp] | -a[ll-options] }
PROTOCOL := { tcp | udp | dccp | tipc | sctp | udplite }
MODE := { receive | transmit }
FORMAT := { human | machine }
RTTPROBE := { 10n,10d,10m,10f }
MEM-ADVISORY := { normal | sequential | random | willneed | dontneed | noreuse }
SCHED-POLICY := { sched_rr | sched_fifo | sched_batch | sched_other } priority
LEVEL := { quitscent | gentle | loudish | stressful }
rdtscll cleanup netsend
- Published in: networking
- | Time: 00:23:11 CEST
- | SHA1: 720b90a0e6cf0859cb5342a74cb4748057045773
We removed the rdtscll instruction support in netsend:
- userland header files doesn’t define the macro anymore (sanitized header files)
- functional aspect is limited (ACPI modes and the increased time/cycle time)
- nobody used it ;-)
Point Of Interest
- Published in: programming
- | Time: 00:23:11 CEST
- | SHA1: 4c41284f82bfd090efe3ae95a2442ffeebe1836c
Through some additional performance measurements I realize some interesting plateau. Two points are of interest, first at ~200 byte and the seconds at 600 byte (the x scale is denoted as DWORD size (uint32_t)
![]()
Also quit interesting: the long duration to “warm” the cache, tlb, etc .. (BTW: we talk about microseconds)
mlock and SCHED_FIFO
- Published in: realtime
- | Time: 00:23:11 CEST
- | SHA1: f4425426e61ece4460151fc939592913fab57356
mlock() allows to lock the (current or further used) address space to physical memory. It therefor disables paging for the selected memory (or all for mlockall()). Real-time programs often uses the ability to fix their memory to avoid unpredictable situations. Think about a welding- or laser robot with out-swapped memory …
mlockall() is the big brother of mlock(): you specify that all currently (MCL_CURRENT) or future touched (MCL_FUTURE) pages are locked. This includes code (text segment), data and stack, shared libraries, user space kernel data, shared memory and memory-mapped areas.
Occasional I utilize mlockall() for performance measurements to ensure test reliability. Bundled with SCHED_FIFO (a realtime scheduling policy) you obtain a much cleaner measurement ( exclude IO operations completely (writes to disks, network, whatever)). If you follow this tips you will get clean measurements – but that’s another topic).
But one surprising fact appear by deeper analyses of an algorithm (it was a fast, SIMD enriched Hamming distance calculation – but this doesn’t matter here): the mlock() free version of the program achieve better results then the mlock() version.
w/o mlock():
![]()
w mlock():
![]()
See the irregular area in the lower figure? The peaks are all higher (and therefore disadvantages for the algorithm).
mlockall versus Latency
- Published in: uncategorized
- | Time: 00:23:11 CEST
- | SHA1: 65ec214e63ae54a3a6c5382875a74cc1a98c592e
New measurements for the mentioned, increased latency if you lock pages physically. As you can see, the violet line (without memory locking) reflect a superior latency behavior (lesser is better).
On the other side, there is one negative peek with mlockall() – but it shouldn’t. After all: mlockall() prevent worst case scenario – the principal task for real-time application. On the other hand, it introduces a small overhead – but why? More kernel studies needed!
![]()
recv return constraints
- Published in: networking
- | Time: 00:23:11 CEST
- | SHA1: 90abe3b688289080745f5402e3489c91f2faa6f8
If you call recv() to fetch data, the syscall will block if nothing happens. But what are the constraints for tcp_recvmsg() to return? This posting focus on the major factors (other aspects like OOB data, signals, non-blocking IO is ignored in this posting):
- Timeout
- Amount of data
tcp_recvmsg() first calculate the timeout for this function via sock_rcvtimeo(). This refers to return noblock ? 0 : sk->sk_rcvtimeo – in the common therefore sk->sk_rcvtimeo. This is initialized as MAX_SCHEDULE_TIMEOUT (LONG_MAX).
On the other hand sock_rcvlowat() calculate the minimum amount of data. If the user specify MSG_WAITALL then the minimum is the length argument. In all other cases sock_rcvlowat() uses sk_rcvlowat. That values is standard 1, but can be changes via setsockopts(SO_RCVLOWAT). After all the minimum amount refers to 1 byte!
tcp_recvmsg() will now peek the skbuffs until the amount the data is >= 1. If there is no data available (no one send something) the process sleep until some data is send. Naturally the code doesn’t read explicit 1 byte. skb_peek() is greedy! Normally you recvmsg() will return with all data which is instantly available.
After all: tcp_recvmsg() will block forever until one byte of data is available!
MSG_WAITALL and recv
- Published in: networking
- | Time: 00:23:11 CEST
- | SHA1: 6e2d49e5f7f60a01aac3e24c59e30746ebe3b5bb
The recv() function has a really rare argument: MSG_WAITALL. It tells that the syscall should not return before length bytes are read. The problem is that normally nobody knows how much data is send by the peer node. So if you rely on a particular amount of data and the data isn’t send, this call blocks infinity! On the other hand, a programmer must also handle this kind of failure, because a simple read() of a socket can also block forever. A timeout handler is always needed (alarm() as the simplest solution).
The background is that recv() normally return after length bytes or if a threshold values is raised. sock_rcvlowat() is the function who look if the user want to wait until length bytes are ready or returns earlier. sk_rcvlowat is the socket specific variable which determine the break out szenario.
return (waitall ? len : min_t(int, sk->sk_rcvlowat, len)) ? : 1;
sk_rcvlowat can be set on a socket basis via setsockopt(). Normally sk_rcvlowat is initialized with 1. Therefore a recv() call does block minimum for 1 byte. If you increase sk_rcvlowat beyond the recv() length parameter then min(sk_rcvlowat, length) → length is taken – of course!
recv() will also return if OOB data is received, an error occured or an signal arrived us.
SIMD++, SSE4 and SIMD16
- Published in: programming
- | Time: 00:23:11 CEST
- | SHA1: fa2a11e548db85dba39e9f0179b17669f56a98c8
The SSE4 programming reference is out – a opportunity to study the improvements.
Besides 47 (plus 7 additional for the nehalem microarchitecture) new instructions which mainly focus on multimedia acceleration. MPSADBW (Sum of Absolute Differences), PHMINPOSUW Minimum Search (find minimum uint16_t from eight elements) (if you invite the source you had an fast max() ;-), ROUND (round floating point types) and other instructions too.
Dot product matrix calculation, load hint instruction (MOVNTDQA) to store aligned data in a small data-set, packed integer format conversions (convert in wider data types), IEEE 754 Compliance operations.
A nice giveaway are also the string instructions:
- max 16byte
- explicit length declaration or NULL termination
- string compare function support PCMPxSTRx)
- strlen()
The nehalem microarchitecture also support:
- CRC32
- POPCNT – for searching bit pattern
Recently I read a paper about the increased register width of 512bit for SIMD instructions (called SIMD16). More direct addressable register are also planned. Intel will introduce themin Larrabee.
GCC Optimization Aliasing
- Published in: programming
- | Time: 00:23:11 CEST
- | SHA1: 5c9136936dafe6733654d096e8601652890f76dc
In talks I often noted that in the a big part, pointer casts are dump and futile. The C Standard stated out that a cast to another type raise an “undefined behavior”:
float a = 23.f; uint32_t b = *(uint32_t *)&a;
Depending on the current gcc version this leads to different results in b. Sometimes a is interpreted as a uint32_t type (that what you want – probably) – sometimes the result is 0;
The interesting part come now into play: why 0, why should a compiler catch that user failure and invest CPU cycles here? Why not interpret an values of type x as type y?
The answer is: it is not a service for the user, it is an compiler optimization technique called aliasing.
The root of many optimization challenges in C are pointers. Often it is impossible to predict a values, because a pointer had always the change to modify a memory location. Nor does somebody know how many pointers, points to a particular value. This circumstance prevent the compiler from a lot of optimization.
The optimization is, that the C standard document stated out, that a pointer cannot point to a locations of different type (with the exception of char *) and gcc utilize this circumstance! The compiler can therefor optimize code because he can ensure that they doesn’t interfere.
If you really need transparent access and interpret a value as a values of another types take a union type:
union conv {
float f;
uint32_t ui32;
};
SIMD To Use Intrinsics Or Not To Use Intrinsics
- Published in: programming
- | Time: 00:23:11 CEST
- | SHA1: dbb09835fecb2534c2a63ce890ccc1a4e9debcfb
At a high level, intrinsics is something like a different syntax for vanilla SIMD asm. For example to add two 128 bit width vectors you can use _mm_add_ps() (thats leads to a SSE instructions).
The advantage of using intrinsic is that both, gcc and icc understand them. No need to #ifdef GCC and other compile-time hacks. On the other side, i386 intrinsics are limited. For every SIMD extension (mmx, sse, ...) you must implement a own specialized version via
#ifdef __SSE2__ [...] #else # error Programmed Error #endif
If you use it with gcc, gcc will instantly expand them to __builtin_ia32_addps – a gcc built-in (search the gcc header search path for xmmintrin.h – type gcc -print-search-dirs if you doesn’t know the current search paths).
The other opinion is to let the compiler optimize your code. But often the code is to complex that the compiler isn’t able to do this task!
At the end: if you MUST use SIMD extension you SHOULD use intrinsics. They are more portable, more readable compared to asm and the introduced loss of control, also compared to vanilla asm is evanescent humble.
Netsend Improvements
- Published in: networking
- | Time: 00:23:11 CEST
- | SHA1: 5b6beb46ec64adc8b99640dbc429150631a50793
due a productive weekend we implement three additional protocols for netsend
- DCCP ( RFC 4340 )
- SCTP ( RFC 2960 )
- UDP-LITE ( RFC 3828 )
So nowadays there are seven supported protocols (if you consider ip and transport layer protocols). TCP, IPv4, IPv6, TIPC and UDP.
Some minor bugs are fixed and code rearanged – the daily work …
