Re: [RFC][PATCH 1/1] cxgb3i: cxgb3 iSCSI initiator

From: Jeff Garzik
Date: Wednesday, July 30, 2008 - 12:35 pm

Comments:

* SCSI drivers should be submitted via the linux-scsi@vger.kernel.org 
mailing list.

* The driver is clean and readable, well done

* From a networking standpoint, our main concern becomes how this 
interacts with the networking stack.  In particular, I'm concerned based 
on reading the source that this driver uses "TCP port stealing" rather 
than using a totally separate MAC address (and IP).

Stealing a TCP port on an IP/interface already assigned is a common 
solution in this space, but also a flawed one.  Precisely because the 
kernel and applications are unaware of this "special, magic TCP port" 
you open the potential for application problems that are very difficult 
for an admin to diagnose based on observed behavior.

So, additional information on your TCP port usage would be greatly 
appreciated.  Also, how does this interact with IPv6?  Clearly it 
interacts with IPv4...

	Jeff




--

From: Roland Dreier
Date: Wednesday, July 30, 2008 - 2:35 pm

> * From a networking standpoint, our main concern becomes how this
 > interacts with the networking stack.  In particular, I'm concerned
 > based on reading the source that this driver uses "TCP port stealing"
 > rather than using a totally separate MAC address (and IP).
 > 
 > Stealing a TCP port on an IP/interface already assigned is a common
 > solution in this space, but also a flawed one.  Precisely because the
 > kernel and applications are unaware of this "special, magic TCP port"
 > you open the potential for application problems that are very
 > difficult for an admin to diagnose based on observed behavior.

That's true, but using a separate MAC and IP opens up a bunch of other
operational problems.  I don't think the right answer for iSCSI offload
is clear yet.

 - R.
--

From: Divy Le Ray
Date: Thursday, July 31, 2008 - 5:51 pm

Hi Jeff,

We've considered the approach of having separate IP/MAC addresses to manage
iSCSI connections. In such a context, the stack would have to be unaware of
this iSCSI specific IP address. The iSCSI driver would then have to implement
at least its own ARP reply mechanism. DHCP too would have to be managed
separately. Most network setting/monitoring tools would also be unavailable.

The open-iscsi initiator is not a huge consumer of TCP connections, so
allocating a TCP port from the stack would be reasonable in terms of resources
in this context. It is however unclear if it is an acceptable approach.

Our current implementation was designed to be the most tolerable one
within the constraints - real or expected - aforementioned.

Cheers,
Divy

--

From: Divy Le Ray
Date: Thursday, August 7, 2008 - 11:45 am

Hi Jeff,

Mike Christie will not merge this code until he has an explicit 
acknowledgement from netdev.

As you mentioned, the port stealing approach we've taken has its issues.
We consequently analyzed your suggestion to use a different IP/MAC address for 
iSCSI and it raises other tough issues (separate ARP and DHCP management, 
unavailability of common networking tools).
On these grounds, we believe our current approach is the most tolerable.
If the stack provided a TCP port allocation service, we'd be glad to use it 
to solve the current concerns.
The cxgb3i driver is up and running here, its merge is pending our decision.

Cheers,
Divy
--

From: Mike Christie
Date: Thursday, August 7, 2008 - 1:07 pm

If the iscsi tools did not have to deal with networking issues that 
are already handled by other networking tools, it would be great for 
iscsi users, who would not have to learn new tools. Maybe we could 
somehow hook into the existing network tools so they support these iscsi 
hbas as well as normal NICs. Would it be possible to have the iscsi hbas 
export the necessary network interfaces so that existing network tools 
can manage them?

If it comes down to it and your port stealing implementation is not 
acceptable, just as Broadcom's was not, I will be ok with doing some 
special iscsi network tools. Or instead of special iscsi tools, is there 
something that the RDMA/iWarp guys are using that we can share?
--

From: Steve Wise
Date: Friday, August 8, 2008 - 11:09 am

Hey Dave/Jeff,

I think we need some guidance here on how to proceed.   Is the approach 
currently being reviewed ACKable?  Or is it DOA? If it's DOA, then what 
approach do you recommend?  I believe Jeff's opinion is a separate 
ipaddr.  But Dave, what do you think?  Let's get some agreement on a high 
level design here. 

Possible solutions seen to date include:

1) reserving a socket to allocate the port.  This has been NAK'd in the 
past and I assume is still a no go.

2) creating a 4-tuple allocation service so the host stack, the rdma 
stack, and the iscsi stack can share the same TCP 4-tuple space.  This 
also has been NAK'd in the past and I assume is still a no go.

3) the iscsi device allocates its own local ephemeral ports (port 
stealing) and uses the host's ip address for the iscsi offload device.  
This is the current proposal and you can review the thread for the pros 
and cons.  IMO it is the least objectionable (and I think we really 
should be doing #2).

4) the iscsi device will manage its own ip address thus ensuring 4-tuple 
uniqueness.

Unless you all want to re-open considering #1 or #2, then we're left 
with 3 or 4.  Which one?

Steve.
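
Option 2 above (a shared 4-tuple allocation service) can be sketched roughly as follows. This is only an illustration of the idea, not code from the cxgb3i patch or from the kernel: the `FourTupleRegistry` class, its method names, and the owner labels are hypothetical, and the default port range is merely illustrative. The point is that every consumer (host stack, RDMA, iSCSI offload) asks one registry for a local port, so no two of them can end up holding the same 4-tuple.

```python
import random
from typing import Dict, Optional, Tuple

# (local_ip, local_port, remote_ip, remote_port)
FourTuple = Tuple[str, int, str, int]


class FourTupleRegistry:
    """Hypothetical single owner of the TCP 4-tuple space."""

    def __init__(self, port_range=(32768, 61000), seed=None):
        self._owners: Dict[FourTuple, str] = {}
        self._lo, self._hi = port_range
        self._rng = random.Random(seed)

    def allocate(self, owner, local_ip, remote_ip, remote_port) -> Optional[FourTuple]:
        """Reserve a free local port for this destination, or return None."""
        ports = list(range(self._lo, self._hi + 1))
        self._rng.shuffle(ports)  # ephemeral ports are picked in random order
        for port in ports:
            candidate = (local_ip, port, remote_ip, remote_port)
            if candidate not in self._owners:
                self._owners[candidate] = owner
                return candidate
        return None  # every port for this destination is already taken

    def release(self, four_tuple: FourTuple) -> None:
        self._owners.pop(four_tuple, None)

    def owner_of(self, four_tuple: FourTuple) -> Optional[str]:
        return self._owners.get(four_tuple)
```

With port stealing (option 3) there is no such shared table: the offload device and the host stack each pick local ports independently, which is where the collisions discussed in this thread come from.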
--

From: Jeff Garzik
Date: Friday, August 8, 2008 - 3:15 pm

Conceptually, it is a nasty business for the OS kernel to be forced to 
co-manage an IP address in conjunction with a remote, independent entity.

Hardware designers make the mistake of assuming that firmware management 
of a TCP port ("port stealing") successfully provides the illusion to 
the OS that that port is simply inactive, and the OS happily continues 
internetworking its merry way through life.

This is certainly not true, because of current netfilter and userland 
application behavior, which often depends on being able to allocate 
(bind) to random TCP ports.  Allocating a TCP port successfully within 
the OS, that then behaves different from all other TCP ports (because it 
is the magic iSCSI port) creates a cascading functional disconnect.  On 
that magic iSCSI port, strange errors will be returned instead of proper 
behavior.  Which, in turn, cascades through new (and inevitably 
under-utilized) error handling paths in the app.

So, of course, one must work around problems like this, which leads to 
one of two broad choices:

1) implement co-management (sharing) of IP address/port space, between 
the OS kernel and a remote entity.

2) come up with a solution in hardware that does not require the OS to 
co-manage the data it has so far been managing exclusively in software.

It should be obvious that we prefer path #2.

For, trudging down path #1 means

* one must give the user the ability to manage shared IP addresses IN A 
NON-HARDWARE-SPECIFIC manner.  Currently most vendors of "TCP port 
stealing" solutions seem to expect each user to learn a vendor-specific 
method of identifying and managing the "magic port".

Excuse my language, but, what a fucking security and management 
nightmare in a cross-vendor environment.  It is already a pain, with 
some [unnamed system/chipset vendors] management stealing TCP ports -- 
and admins only discover this fact when applications behave strangely on 
new hardware.

But...  it's tough to notice because stumbling ...
From: Jeff Garzik
Date: Friday, August 8, 2008 - 3:20 pm

grrr.   but the point is that the solution is not at all complete, with 
feature disconnects and security audit differences still outstanding, and 
non-hw-specific management apps still unwritten.

(I'm not calling for their existence, merely trying to strike the 
justification that the current capability to limp along exists)

	Jeff


--

From: David Miller
Date: Saturday, August 9, 2008 - 12:28 am

From: Jeff Garzik <jgarzik@pobox.com>

I agree with everything Jeff has stated.

Also, I find it ironic that the port abduction is being asked for in
order to be "compatible with existing tools" yet in fact this stuff
breaks everything.  You can't netfilter this traffic, you can't apply
qdiscs to it, you can't execute TC actions on them, you can't do
segmentation offload on them, you can't look for the usual TCP MIB
statistics on the connection, etc. etc. etc.

It is broken from every possible angle.
--

From: Steve Wise
Date: Saturday, August 9, 2008 - 7:04 am

I think a lot of these _could_ be implemented and integrated with the 
standard tools.





--

From: Roland Dreier
Date: Saturday, August 9, 2008 - 10:14 pm

> Also, I find it ironic that the port abduction is being asked for in
 > order to be "compatible with existing tools" yet in fact this stuff
 > breaks everything.  You can't netfilter this traffic, you can't apply
 > qdiscs to it, you can't execut TC actions on them, you can't do
 > segmentation offload on them, you can't look for the usual TCP MIB
 > statistics on the connection, etc. etc. etc.

We already support offloads that break other features, eg large receive
offload breaks forwarding.  We deal with it.

I'm sure if we thought about it we could come up with clean ways to fix
some of the issues you raise, and just disable the offload if someone
wanted to use a feature we can't support.

 - R.
--

From: David Miller
Date: Saturday, August 9, 2008 - 10:47 pm

From: Roland Dreier <rdreier@cisco.com>

We turn it off.  If I want to shape or filter one of these iSCSI
connections can we turn it off?

It's funny you mention LRO because it probably gives most of whatever
gain these special iSCSI TCP connection offload things get.
--

From: Herbert Xu
Date: Saturday, August 9, 2008 - 11:34 pm

Actually one of my TODO items is to restructure software LRO
so that we preserve the original packet headers while aggregating
the packets.  That would allow us to easily refragment them on
output for forwarding.

In other words LRO (at least the software variant) is not
fundamentally incompatible with forwarding.

I'd also like to encourage all hardware manufacturers considering
LRO support to provide a way for us to access the original headers
so that it doesn't have to be turned off for forwarding.
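
The restructuring described above (aggregate the payloads but keep every original header) might look something like this in miniature. It is a toy model, not the kernel's LRO code: `Segment`, `Aggregate`, and `refragment` are hypothetical names, and headers are treated as opaque bytes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    header: bytes   # original packet header, kept opaque in this sketch
    payload: bytes


@dataclass
class Aggregate:
    """One merged 'super packet' that remembers how it was built."""
    headers: List[bytes] = field(default_factory=list)
    lengths: List[int] = field(default_factory=list)
    payload: bytearray = field(default_factory=bytearray)

    def add(self, seg: Segment) -> None:
        # Merge the payload, but keep the header and the original length.
        self.headers.append(seg.header)
        self.lengths.append(len(seg.payload))
        self.payload += seg.payload

    def refragment(self) -> List[Segment]:
        # Rebuild the original packets, e.g. when forwarding.
        out, pos = [], 0
        for hdr, n in zip(self.headers, self.lengths):
            out.append(Segment(hdr, bytes(self.payload[pos:pos + n])))
            pos += n
        return out
```

Because nothing from the original packets is discarded, the aggregate can be handed to the local stack as one large buffer or split back into the packets that arrived on the wire.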

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Steve Wise
Date: Sunday, August 10, 2008 - 10:57 am

Sure.

Seems to me we _could_ architect this all so that these devices would 
have to support a method for the management/admin tools to tweak, and if 
nothing else kill, offload connections if policy rules change and the 
existing connections aren't implementing the policy.  IE: if the offload 
connection doesn't support whatever security or other facilities that 
the admin requires, then the admin should have the ability to disable 
that device.  And of course, some devices will allow doing things like 
netfilter, qos, tweaking vlan tags, etc even on active connections, if 
the OS infrastructure is there to hook it all up.

BTW:  I think all these offload devices provide MIBs and could be pulled 
in to the normal management tools.


Steve.
--

From: Roland Dreier
Date: Monday, August 11, 2008 - 9:09 am

> We turn it off.  If I want to shape or filter one of these iSCSI
 > connections can we turn it off?

That seems like a reasonable idea to me -- the standard thing to do when
a NIC offload conflicts with something else is to turn off the offload
and fall back to software.

 - R.
--

From: David Miller
Date: Monday, August 11, 2008 - 2:09 pm

From: Roland Dreier <rdreier@cisco.com>

But as Herbert says, we can make LRO such that turning it off
isn't necessary.

Can we shape the iSCSI offload traffic without turning it off?
--

From: Roland Dreier
Date: Monday, August 11, 2008 - 2:37 pm

> But as Herbert says, we can make LRO such that turning it off
 > isn't necessary.
 > 
 > Can we shape the iSCSI offload traffic without turning it off?

Sure... the same way we can ask the HW vendors to keep old headers
around when aggregating for LRO, we can ask HW vendors for hooks for
shaping iSCSI traffic.  And the Chelsio TCP speed record seems to show
that they already have pretty sophisticated queueing/shaping in their
current HW.

 - R.
--

From: David Miller
Date: Monday, August 11, 2008 - 2:51 pm

From: Roland Dreier <rdreier@cisco.com>

You don't get it, you can't add the entire netfilter and qdisc
stack into the silly firmware.

And we can't fix bugs there either.
--

From: Steve Wise
Date: Monday, August 11, 2008 - 4:20 pm

With Chelsio's product you can do this.  Maybe Divy can provide details?



--

From: Divy Le Ray
Date: Monday, August 11, 2008 - 4:45 pm

The T3 adapter is capable of performing rate control and pacing based on RTT 
on a per-connection basis.

Cheers,
Divy
--

From: David Miller
Date: Monday, August 11, 2008 - 5:22 pm

From: Steve Wise <swise@opengridcomputing.com>

When I say shape I mean apply any packet scheduler, any netfilter
module, and any other feature we support.
--

From: Roland Dreier
Date: Saturday, August 9, 2008 - 10:12 pm

> * however, giving the user the ability to co-manage IP addresses means
 > hacking up the kernel TCP code and userland tools for this new
 > concept, something that I think DaveM would rightly be a bit reluctant
 > to do? You are essentially adding a bunch of special case code
 > whenever TCP ports are used:
 > 
 > 	if (port in list of "magic" TCP ports with special,
 > 	    hardware-specific behavior)
 > 		...
 > 	else
 > 		do what we've been doing for decades

I think you're arguing against something that no one is actually
pushing.  What I'm sure Chelsio and probably other iSCSI offload vendors
would like is a way to make iSCSI (and other) offloads not steal magic
ports but actually hook into the normal infrastructure so that the
offloaded connections show up in netstat, etc.  Having this solution
would be nice not just for TCP offload but also for things like in-band
system management, which currently lead to the same hard-to-diagnose
issues when someone hits the stolen port.  And it also would seem to
help "classifier NICs" (Sun Neptune, Solarflare, etc) where some traffic
might be steered to a userspace TCP stack.

I don't think the proposal of just using a separate MAC and IP for the
iSCSI HBA really works, for two reasons:

 - It doesn't work in theory, because the suggestion (I guess) is that
   the iSCSI HBA has its own MAC and IP and behaves like a separate
   system.  But this means that to start with the HBA needs its own ARP,
   ICMP, routing, etc interface, which means we need some (probably new)
   interface to configure all of this.  And then it doesn't work in lots
   of networks; for example the ethernet jack in my office doesn't work
   without 802.1x authentication, and putting all of that in an iSCSI
   HBA's firmware clearly is crazy (not to mention creating the
   interface to pass 802.1x credentials into the kernel to pass to the
   HBA).

 - It doesn't work in practice because most of the existing NICs that
   are capable of iSCSI offload, eg ...
From: David Miller
Date: Saturday, August 9, 2008 - 10:46 pm

From: Roland Dreier <rdreier@cisco.com>

Why show these special connections if the user cannot interact with or
shape the stream at all like normal ones?

This whole "make it look normal" argument is entirely bogus because
none of the standard Linux networking facilities can be applied to
these things.

And I even wonder, these days, if you probably get 90% or more of the
gain these "optimized" iSCSI connections obtain from things like LRO.
And since LRO can be done entirely in software (although stateless
HW assistance helps), it is even a NIC agnostic performance improvement.
--

From: Roland Dreier
Date: Monday, August 11, 2008 - 9:07 am

> Why show these special connections if the user cannot interact with or
 > shape the stream at all like normal ones?

So that an admin can see what connections are open, so that the stack
doesn't try to reuse the same 4-tuple for another connection, etc, etc.

 > And I even wonder, these days, if you probably get %90 or more of the
 > gain these "optimized" iSCSI connections obtain from things like LRO.

Yes, that's the question -- are stateless offloads (plus CRC32C in the
CPU etc) going to give good enough performance that the whole TCP
offload exercise is pointless?  The only issue is that I don't see how
to avoid the fundamental 3X increase in memory bandwidth that is chewed
up if the NIC can't do direct placement.

 - R.
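
One way to read the "3X" figure above, under the assumption (mine, not stated in the message) that it counts how many times each payload byte crosses the memory bus: with direct placement the NIC's DMA write is the only crossing, while without it the CPU must also read the data out of the skb and write it to its final location. A small worked calculation under that assumption:

```python
def memory_traffic_bytes(payload_bytes: int, direct_placement: bool) -> int:
    """Bytes moved across the memory bus to receive `payload_bytes`.

    Assumed model: one NIC DMA write always happens; a CPU copy adds
    one read of the source buffer and one write of the destination.
    """
    dma_write = payload_bytes
    if direct_placement:
        return dma_write
    copy_read = payload_bytes
    copy_write = payload_bytes
    return dma_write + copy_read + copy_write


# A fully loaded 10 Gb/s link delivers 1.25e9 payload bytes per second.
link_bytes_per_sec = 10_000_000_000 // 8
placed = memory_traffic_bytes(link_bytes_per_sec, direct_placement=True)
copied = memory_traffic_bytes(link_bytes_per_sec, direct_placement=False)
ratio = copied / placed
```

Caches can hide some of the copy traffic in practice, so this is an upper-bound style estimate rather than a measurement.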
--

From: David Miller
Date: Monday, August 11, 2008 - 2:08 pm

From: Roland Dreier <rdreier@cisco.com>

This is by definition true, over time.  And this has steadfastly proven
itself, over and over again.

That's why we call stateful offloads a point in time solution.
They are constantly being obsoleted by time.
--

From: Roland Dreier
Date: Monday, August 11, 2008 - 2:39 pm

> > Yes, that's the question -- are stateless offloads (plus CRC32C in the
 > > CPU etc) going to give good enough performance that the whole TCP
 > > offload exercise is pointless?
 > 
 > This is by definition true, over time.  And this has stedfastly proven
 > itself, over and over again.

By the definition of what?

 - R.
--

From: David Miller
Date: Monday, August 11, 2008 - 2:52 pm

From: Roland Dreier <rdreier@cisco.com>

By definition of time always advancing forward, and cpus always
getting faster, and memory (albeit more slowly) increasing in
speed too,

--

From: Rick Jones
Date: Monday, August 11, 2008 - 11:13 am

Probably depends on whether or not the iSCSI offload solutions are doing 
zero-copy receive into the filecache?

rick jones
--

From: David Miller
Date: Monday, August 11, 2008 - 2:12 pm

From: Rick Jones <rick.jones2@hp.com>

That's a data placement issue, which also can be solved with
stateless offloading.
--

From: Roland Dreier
Date: Monday, August 11, 2008 - 2:41 pm

> > Probably depends on whether or not the iSCSI offload solutions are doing 
 > > zero-copy receive into the filecache?
 > 
 > That's a data placement issue, which also can be solved with
 > stateless offloading.

How can you place iSCSI data properly with only stateless offloads?

 - R.
--

From: David Miller
Date: Monday, August 11, 2008 - 2:53 pm

From: Roland Dreier <rdreier@cisco.com>

By teaching the stateless offload how to parse the iSCSI headers
on the flow and place the data into pages at the correct offsets
such that you can place the pages hanging off of the SKB directly
into the page cache.
--

From: Divy Le Ray
Date: Tuesday, August 12, 2008 - 2:57 pm

Hi Dave,

iSCSI PDUs might span multiple TCP segments, so it is unclear to me how to 
do placement without keeping some state of the transactions.

In any case, such a stateless solution is not yet designed, whereas 
accelerated iSCSI is available now, from us and other companies.
The accelerated iSCSI streams benefit from the performance TOE provides, 
outlined in the following third party papers:
http://www.chelsio.com/assetlibrary/pdf/redhat-chelsio-toe-final_v2.pdf
http://www.chelsio.com/assetlibrary/pdf/RMDS6BNTChelsioRHEL5.pdf

iSCSI is primarily targeted to the data center, where the SW stack's traffic 
shaping features might be redundant with specialized equipment. It should 
however be possible to integrate security features on a per offloaded 
connection basis, and TOEs - at least ours :) - are capable of rate control 
and traffic shaping.

While CPU and - to a far lesser extent - memory performance improves, so does 
ethernet's. 40G, 100G are not too far ahead. It is not obvious at all that 
TOE is a point in time solution, especially for heavy load traffic as in a 
storage environment. It is quite the opposite actually.

There is room for co-existence of the SW managed traffic and accelerated 
traffic. As our submission shows, enabling accelerated iSCSI is not intrusive 
code wise to the stack. The port stealing issue is solved if we can grab a 
port from the stack.

Cheers,
Divy
--

From: David Miller
Date: Tuesday, August 12, 2008 - 3:01 pm

From: Divy Le Ray <divy@chelsio.com>

You keep a flow table with buffer IDs and offsets.

The S2IO guys did something similar for one of their initial LRO
implementations.

It's still strictly stateless, and best-effort.  Entries can fall out
of the flow cache which makes upcoming data use new buffers and
offsets.

But these are the kinds of tricks you hardware folks should be
more than adequately able to design, rather than me. :-)
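
A rough sketch of the flow table described above, assuming a small LRU-style cache keyed by the TCP 4-tuple. The `FlowCache` class and its eviction policy are illustrative guesses, not a description of the S2IO hardware or of any real adapter. It shows the "best-effort" property: when an entry falls out of the cache, the next data for that flow simply starts in a fresh buffer at offset 0.

```python
from collections import OrderedDict


class FlowCache:
    """Hypothetical flow table: flow key -> [buffer_id, next_offset]."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._flows = OrderedDict()
        self._next_buffer = 0

    def place(self, flow, length: int):
        """Return (buffer_id, offset) where `length` payload bytes land."""
        entry = self._flows.pop(flow, None)
        if entry is None:
            # Cache miss: start a new buffer for this flow.
            entry = [self._next_buffer, 0]
            self._next_buffer += 1
            if len(self._flows) >= self.capacity:
                self._flows.popitem(last=False)  # evict least recently used
        buffer_id, offset = entry
        entry[1] = offset + length
        self._flows[flow] = entry  # re-insert as most recently used
        return buffer_id, offset
```

Whether a table like this counts as "stateless" is exactly what the rest of the thread argues about; the sketch only shows that losing an entry costs contiguity, not correctness.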
--

From: David Miller
Date: Tuesday, August 12, 2008 - 3:02 pm

From: Divy Le Ray <divy@chelsio.com>

So, WHAT?!

There are TOE pieces of crap out there too.

It's strictly not our problem.

Like Herbert said, this is the TOE discussion all over again.
The results will be the same, and as per our decisions wrt.
TOE, history speaks for itself.
--

From: Divy Le Ray
Date: Tuesday, August 12, 2008 - 3:21 pm

Well, there is demand for accelerated iscsi out there, which is the driving 

Herbert requested some benchmark numbers, I consequently obliged.

Cheers,
Divy



--

From: Herbert Xu
Date: Tuesday, August 12, 2008 - 6:57 pm

Have you posted a hardware-accelerated iSCSI vs. LRO comparison?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Vladislav Bolkhovitin
Date: Wednesday, August 13, 2008 - 11:35 am

As an iSCSI target developer, I'm strongly voting for hardware iSCSI 
offload. The possibility of direct data placement is a *HUGE* 
performance gain.

For example, according to measurements done by one iSCSI-SCST user in a 
system with an iSCSI initiator and an iSCSI target (running iSCSI-SCST, 
http://scst.sourceforge.net/target_iscsi.html), both with 
identical modern high speed hardware and 10GbE cards, the _INITIATOR_ is 
the bottleneck for READs (data transfers from target to initiator). This 
is because the target sends data in a zero-copy manner, so its CPU is 
able to deal with the load, but on the initiator there are additional 
data copies from skb's to page cache and from page cache to application. 
As a result, in the measurements the initiator got near 100% CPU load and 
only ~500MB/s throughput. The target had ~30% CPU load. For the opposite 
direction (WRITEs), where there is no application data copy on the 
target, throughput was ~800MB/s, also with near 100% CPU load, but in 
this case on the target. The initiator ran Linux with open-iscsi. The 
test was with real backstorage: the target ran BLOCKIO (direct BIOs to/from 
backstorage) with a 3ware card. Locally on the target the backstorage was 
able to provide 900+MB/s for READs and about 1GB/s for WRITEs. The 
command queue in both cases was sufficiently big to eliminate the link 
and processing latencies (20-30 outstanding commands).

Vlad

--

From: Jeff Garzik
Date: Wednesday, August 13, 2008 - 12:29 pm

Well, two responses here:

* no one is arguing against hardware iSCSI offload.  Rather, it is a 
problem with a specific implementation, one that falsely assumes two 
independent TCP stacks can co-exist peacefully on the same IP address 
and MAC.

* direct data placement is possible without offloading the entire TCP 
stack onto a firmware/chip.

There is plenty of room for hardware iSCSI offload...

	Jeff


--

From: David Miller
Date: Wednesday, August 13, 2008 - 1:13 pm

From: Jeff Garzik <jgarzik@pobox.com>

I've even described in this thread how that's possible.
--

From: Vladislav Bolkhovitin
Date: Thursday, August 14, 2008 - 11:24 am

Sure, nobody is arguing against that. My points are:

1. All those are things not for the near future. I don't think they can be 
implemented earlier than in a year's time, but there is a huge demand for 
high speed and low CPU overhead iSCSI _now_. Nobody is satisfied by the 
fact that with the latest high end hardware he can saturate less than 
50%(!) of a 10GbE link. Additionally, for me, as an iSCSI target 
developer, it looks especially annoying that hardware requirements for 
_clients_ (initiators) are significantly higher than for the _server_ 
(target). To me this situation looks like nonsense.

2. I believe that the iSCSI/TCP pair is a sufficiently heavyweight 
protocol to be completely offloaded to hardware. Partial offloads 
will never make it comparably efficient. It would still consume a lot of 
CPU. For example, consider digests. Even if they are computed by the new 
CRC32C instruction, the computation would still need a chunk of CPU power, I 
think at least as much as copying the computed block to a new location. 
Can we save it? Sure, with hardware offload. The additional CPU load can 
be acceptable if only data are transferred and there are no other 
activities, but in real life this is quite rare. Consider, for instance, 
a VPS server, like VMware. It always lacks CPU power and 30% CPU load 
during data transfers makes a huge difference. Another example is a 
target doing some processing of transferred data, like encryption or 
de-duplication.

Note, I'm not advocating this particular cxgb3 driver. I have not 
examined it closely enough and don't have sufficient knowledge about the 
hardware to judge it. But I'm advocating the concept of full offload 
HBAs, because they provide a real gain, which IMHO can't be reached by 
any partial offloads.

Actually, in the Fibre Channel world from the very beginning the entire 
FC protocol has been implemented in hardware and everybody has been 
happy with that. Now FCoE is coming, which means that Linux kernel is 
going to have ...
From: Nicholas A. Bellinger
Date: Thursday, August 14, 2008 - 2:59 pm

Well, the first step wrt to this for us software folks is getting the
Slicing by 8 algorithm CRC32C into the kernel..  This would be a great
benefit for not just traditional iSCSI/TCP, but Linux/SCTP and

I have always found this to be the historical case wrt iSCSI on x86
hardware.  The rough estimate was that given identical hardware and
network configuration, an iSCSI target talking to a SCSI subsystem layer
would be able to handle 2x throughput compared to an iSCSI Initiator,

Heh, I think the period of designing new ASICs for traditional iSCSI
offload is probably slowing.  Aside from the actual difficulty of
doing this and competing with software iSCSI on commodity x86 4x & 8x
core (8x and 16x thread) microprocessors with highly efficient software
implementation, that can do BOTH traditional iSCSI offload (where
available) and real deal OS independent connection recovery
(ErrorRecoveryLevel=2) between multiple stateless iSER iWARP/TCP

With traditional iSCSI, I definitely agree on this.

With iWARP and iSER however, I believe the end balance of simplicity is
greater for both hardware and software, and allows both hardware and
software to scale more effectively because  The simple gain of having a
Framed PDU on top of legacy TCP with RFC 504[0-4] in order to determine
the offload of the received packet that will be mapped to storage
subsystem later memory for eventual hardware DMA on a vast array of

So yes, we are talking about quite a few possible cases:

I) Traditional iSCSI:

1) Complete hardware offload for legacy HBAs

2) Hybrid of hardware/software 

As mentioned, reducing application layer checksum overhead for current
software implementations is very important for our quickly increasing user
base.  Using the Slicing by 8 CRC32C will help the current code, but I
think the only other real optimization by network ASIC design folks
would be to do something along the lines with traditional iSCSI with the
application layer that, say, the e1000 driver does with ...
From: David Miller
Date: Wednesday, August 13, 2008 - 1:23 pm

From: Vladislav Bolkhovitin <vst@vlnb.net>

If you've actually been reading at all what I've been saying in this
thread you'll see that I've described a method to do this copy
avoidance in a completely stateless manner.

You don't need to implement a TCP stack in the card in order to do
data placement optimizations.  They can be done completely stateless.

Also, large portions of the cpu overhead are transactional costs,
which are significantly reduced by existing technologies such as
LRO.
--

From: Vladislav Bolkhovitin
Date: Thursday, August 14, 2008 - 11:27 am

Sure, I read what you wrote before writing (although, frankly, didn't 
get the idea). But I don't think that overall it would be as efficient 

The test used Myricom Myri-10G cards (myri10ge driver), which support 
LRO. And from ethtool -S output I conclude it was enabled. Just in case, 
I attached it, so you can recheck me.

Thus, apparently, LRO doesn't make a fundamental difference. Maybe this 
particular implementation isn't too efficient, I don't know. I don't 
have enough information for that.

Vlad

From: Vladislav Bolkhovitin
Date: Thursday, August 14, 2008 - 11:30 am

Also, there wasn't big difference between MTU 1500 and 9000, which is 

--

From: Roland Dreier
Date: Wednesday, August 13, 2008 - 2:27 pm

> > How can you place iSCSI data properly with only stateless offloads?

 > By teaching the stateless offload how to parse the iSCSI headers
 > on the flow and place the data into pages at the correct offsets
 > such that you can place the pages hanging off of the SKB directly
 > into the page cache.

I don't see how this could work.  First, it seems that you have to let
the adapter know which connections are iSCSI connections so that it
knows when to try and parse iSCSI headers.  So you're already not
totally stateless.  Then, since (AFAIK -- I'm not an expert on iSCSI and
especially I'm not an expert on what common practice is for current
implementations) the iSCSI PDUs can start at any offset in the TCP
stream, I don't see how a stateless adapter can even find the PDU
headers to parse -- there's not any way that I know of to recognize
where a PDU boundary is without keeping track of the lengths of all the
PDUs that go by (ie you need per-connection state).

Even if the adapter could find the PDUs, I don't see how it could come
up with the correct offset to place the data -- PDUs with response data
just carry an opaque tag assigned by the iSCSI initiator.  Finally, if
there are ways around all of those difficulties, we would still have to
do major surgery to our block layer to cope with read requests that
complete into random pages, rather than using a scatter list passed into
the low-level driver.
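
The PDU-boundary argument above can be made concrete with a small sketch. It relies on the iSCSI Basic Header Segment layout from RFC 3720 (48 bytes, TotalAHSLength in byte 4 counted in 4-byte words, DataSegmentLength in bytes 5-7, data padded to a 4-byte boundary) and assumes header and data digests are not negotiated. `PduTracker` and `make_pdu` are hypothetical helpers. The `_header` and `_skip` fields are precisely the per-connection state the paragraph says cannot be avoided.

```python
BHS_LEN = 48  # iSCSI Basic Header Segment size in bytes


class PduTracker:
    """Per-connection state needed to locate iSCSI PDU boundaries."""

    def __init__(self):
        self._header = bytearray()  # partially received BHS, if any
        self._skip = 0              # AHS + data + padding bytes still to pass
        self.stream_pos = 0         # bytes of the TCP stream seen so far
        self.pdu_starts = []        # stream offsets where a PDU begins

    def feed(self, segment: bytes) -> None:
        i = 0
        while i < len(segment):
            if self._skip:
                n = min(self._skip, len(segment) - i)
                self._skip -= n
                i += n
                continue
            if not self._header:
                self.pdu_starts.append(self.stream_pos + i)
            take = min(BHS_LEN - len(self._header), len(segment) - i)
            self._header += segment[i:i + take]
            i += take
            if len(self._header) == BHS_LEN:
                ahs = self._header[4] * 4
                data = int.from_bytes(self._header[5:8], "big")
                self._skip = ahs + (data + 3) // 4 * 4
                self._header.clear()
        self.stream_pos += len(segment)


def make_pdu(data: bytes) -> bytes:
    """Build a bare PDU (zeroed header fields except the data length)."""
    hdr = bytearray(BHS_LEN)
    hdr[5:8] = len(data).to_bytes(3, "big")
    return bytes(hdr) + data + bytes(-len(data) % 4)
```

If the tracker loses its state in the middle of a connection, nothing in the byte stream itself marks where the next header starts, which is the resynchronization problem raised above.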


But I think all this argument is missing the point anyway.  The real
issue is not hand-waving about what someone might build someday, but how
we want to support iSCSI offload with the existing Chelsio, Broadcom,
etc adapters.  The answer might be, "we don't," but I disagree with that
choice because:

 a. "No upstream support" really ends up being "enterprise distros and
    customers end up using hacky out-of-tree drivers and blaming us."

 b. It sends a bad message to vendors who put a lot of effort into
    writing a clean, mergable driver and responding to review if ...
From: David Miller
Date: Wednesday, August 13, 2008 - 3:08 pm

From: Roland Dreier <rdreier@cisco.com>



Like I said, you retain a "flow cache" (say it a million times, "flow
cache") that remembers the current parameters and the buffers
currently assigned to that flow and what offset within those buffers.
--

From: Roland Dreier
Date: Wednesday, August 13, 2008 - 4:03 pm

> Like I said, you retain a "flow cache" (say it a million times, "flow
 > cache") that remembers the current parameters and the buffers
 > currently assigned to that flow and what offset within those buffers.

OK, I admit you could make something work -- add hooks for the low-level
driver to ask the iSCSI initiator where PDU boundaries are so it can
resync when something is evicted from the flow cache, have the initiator
format its tags in a special way to encode placement data, etc, etc.
The scheme does bring to mind Alan's earlier comment about pigs and
propulsion, though.
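
For illustration only, one way a tag could be formatted to carry placement data is to pack a buffer-table slot and a generation counter into the 32-bit initiator task tag. The 16/16 bit split and the function names are assumptions made for this sketch; the only protocol fact relied on is that 0xffffffff is a reserved tag value in iSCSI.

```python
RESERVED_TAG = 0xFFFFFFFF  # reserved initiator task tag value in iSCSI


def encode_tag(slot, generation):
    """Pack a buffer-table slot (high 16 bits) and a generation (low 16 bits)."""
    if not (0 <= slot < 0xFFFF and 0 <= generation <= 0xFFFF):
        raise ValueError("slot or generation out of range")
    # slot is capped below 0xFFFF, so the result can never be RESERVED_TAG
    return (slot << 16) | generation


def decode_tag(tag):
    """Recover (slot, generation) from a tag built by encode_tag."""
    return tag >> 16, tag & 0xFFFF


tag = encode_tag(slot=7, generation=3)
```

The generation field is there so that a stale tag for a reused slot can be told apart from the current one.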

In any case, as I said in the part of my email that you snipped, the
real issue is not designing hypothetical hardware, but deciding how to
support the Chelsio, Broadcom, etc hardware that exists today.

 - R.
--

From: David Miller
Date: Wednesday, August 13, 2008 - 4:12 pm

From: Roland Dreier <rdreier@cisco.com>

There would need to be _NO_ hooks into the iSCSI initiator at all.

The card would land the block I/O data onto the necessary page boundaries
and the iSCSI code would just be able to thus use the pages directly
and as-is.

It would look perfectly like normal TCP receive traffic.  No hooks,

The same way we support TOE hardware that exists today.  That is, we
don't.


--

From: Tom Tucker
Date: Wednesday, August 13, 2008 - 6:26 pm

Is there any chance you could discuss exactly how a stateless adapter
can determine if a network segment is in-order, next expected, minus
productive ack, paws compliant, etc... without TCP state?

I get how you can optimize "flows", but "flows" are a fancy name for a
key (typically the four-tuple) that looks into a TCAM to get the
"information" necessary to do header prediction.

Can you explain how this "information" somehow doesn't qualify as 
"state". Doesn't the next expected sequence number at the very least 
need to be updated? una? etc...?
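
The point about the next expected sequence number can be sketched as follows. This is a minimal illustration of the bookkeeping involved, not a description of any adapter: the class name and the three labels are made up, and only the modulo-2**32 sequence arithmetic is standard TCP behaviour.

```python
MOD = 1 << 32


def seq_before(a, b):
    """True if sequence number a comes before b, modulo 2**32."""
    return ((a - b) & 0xFFFFFFFF) >= 0x80000000


class FlowState:
    """The minimum a device must remember per flow to spot in-order segments."""

    def __init__(self, rcv_nxt):
        self.rcv_nxt = rcv_nxt  # next expected sequence number

    def classify(self, seq):
        """Label a segment as 'next', 'old' or 'future' relative to rcv_nxt."""
        if seq == self.rcv_nxt:
            return "next"
        return "old" if seq_before(seq, self.rcv_nxt) else "future"

    def accept(self, seq, length):
        """Advance rcv_nxt only for the exact next expected segment."""
        if seq != self.rcv_nxt:
            return False
        self.rcv_nxt = (self.rcv_nxt + length) % MOD
        return True


flow = FlowState(rcv_nxt=0xFFFFFF00)
first = flow.accept(0xFFFFFF00, 0x200)   # in order, wraps past 2**32
replay = flow.accept(0xFFFFFF00, 0x200)  # same segment again: rejected
```

Whatever it is called, rcv_nxt has to be updated on every accepted segment, which is the question being put here.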

Could you also include the "non-state-full" information necessary to do 
iSCSI header digest validation, data placement, and marker removal? 

Thanks,

--

From: David Miller
Date: Wednesday, August 13, 2008 - 6:37 pm

From: Tom Tucker <tom@opengridcomputing.com>

It's stateless because the full packet traverses the real networking
stack and thus can be treated like any other packet.

The data placement is a side effect that the networking stack can
completely ignore if it chooses to.
--

From: Steve Wise
Date: Wednesday, August 13, 2008 - 6:52 pm

How do you envision programming such a device?  It will need TCP and 
iSCSI state to have any chance of doing useful and productive placement 
of data.  The smarts about the iSCSI stateless offload hw will be in the 
device driver, probably the iscsi device driver.  How will it gather the 
information from the TCP stack to insert the correct state for a flow 
into the hw cache?


--

From: David Miller
Date: Wednesday, August 13, 2008 - 7:05 pm

From: Steve Wise <swise@opengridcomputing.com>


The card can see the entire TCP stream, it doesn't need anything
more than that.  It can parse every packet header, see what kind
of data transfer is being requested or responded to, etc.

Look, I'm not going to design this whole friggin' thing for you guys.

I've stated clearly what the base requirement is, which is that the
packet is fully processed by the networking stack and that the card
merely does data placement optimizations that the stack can completely
ignore if it wants to.

You have an entire engine in there that can interpret an iSCSI
transport stream, you have the logic to do these kinds of things,
and it can be done without managing the connection on the card.
--

From: Steve Wise
Date: Wednesday, August 13, 2008 - 7:44 pm

Thanks for finally stating it clearly.

--

From: Tom Tucker
Date: Wednesday, August 13, 2008 - 6:57 pm

Ok. Maybe we're getting somewhere here ... or at least I am :-)

I'm not trying to be pedantic here but let me try and restate what I 
think you said above:

- The "header" traverses the real networking stack
- The "payload" is placed either by the hardware if possible or by 
the native stack if on the exception path
- The "header" may aggregate multiple PDUs (RSO)
- Data ready indications are controlled entirely by the software/real 
networking stack

Thanks,
Tom

--

From: David Miller
Date: Wednesday, August 13, 2008 - 7:07 pm

From: Tom Tucker <tom@opengridcomputing.com>

SKB's can be paged, in fact many devices already work by chopping
up lists of pages that the driver gives to the card.  NIU is one
of several examples.

The only difference between what a device like NIU is doing now and
what I propose is smart determination of at what offset and into
which buffers to do the demarcation.
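
As a sketch of the "at what offset and into which buffers" decision, the following splits a run of payload bytes into per-page fragments given a starting buffer offset. The 4096-byte page size and the (page index, offset in page, length) tuple layout are assumptions for illustration; this is not how NIU or any other driver represents its fragments.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def page_fragments(buffer_offset, length, page_size=PAGE_SIZE):
    """Split `length` bytes starting at `buffer_offset` into page-bounded pieces.

    Returns a list of (page_index, offset_in_page, nbytes) tuples.
    """
    frags = []
    while length > 0:
        page, off = divmod(buffer_offset, page_size)
        n = min(length, page_size - off)  # never cross a page boundary
        frags.append((page, off, n))
        buffer_offset += n
        length -= n
    return frags


frags = page_fragments(buffer_offset=4000, length=5000)
```

In this example the data starts 96 bytes before the end of the first page, fills the second page completely, and ends 808 bytes into the third.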
--

From: David Miller
Date: Wednesday, August 13, 2008 - 7:09 pm

From: Tom Tucker <tom@opengridcomputing.com>

If you're getting packets out of order, data placement optimizations
are the least of your concerns.

In fact this is exactly where we want all of the advanced loss
handling algorithms of the Linux TCP stack to get engaged.
--

From: Herbert Xu
Date: Saturday, August 9, 2008 - 11:24 pm

We've been here many times before.  This is just the same old TOE
debate all over again.  The fact with TOE is that history has shown
that Dave's decision has been spot on.

So you're going to have to come up with some really convincing
evidence that shows we are all wrong and these TOE-like hardware
offload solutions are the only way to go.  You can start by collecting
solid benchmark numbers that we can all reproduce and look into.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Alan Cox
Date: Sunday, August 10, 2008 - 2:19 am

It's another system so surely SNMP ;)

More seriously I do think iSCSI is actually a subtly special case of TOE.
Most TOE disintegrates under carefully chosen "malicious" workloads
because of the way it is optimised, and the lack of security integration
can be very, very dangerous. A pure iSCSI connection is generally
private, single purpose and really is the classic application of "pigs fly
given enough thrust" - which is the only way to make the pig in question
(iSCSI) work properly.





--

From: Jeff Garzik
Date: Sunday, August 10, 2008 - 5:49 am

Indeed.

Just like with TOE, from the net stack's point of view, an iSCSI HBA is 
essentially a wholly asynchronous remote system [with a really fast 
communication bus like PCI Express].

As such, the task becomes updating the net stack such that 
formerly-private resources are now shared with an independent, external 
system...  with all the complexity, additional failure modes, and 
additional security complications that come along with that.

	Jeff


--

From: James Bottomley
Date: Sunday, August 10, 2008 - 7:54 am

What's wrong with making it configurable identically to current software
iSCSI?  i.e. plumb the thing into the current iscsi transport class so
that we use the standard daemon for creating and binding sessions?
Then, only once the session is bound do you let your iSCSI TOE stack
take over.

That way the connection appears to the network as completely normal,
because it has an open socket associated with it; and, since the
transport class has done the connection login, it even looks like a
normal iSCSI connection to the usual tools.  iSCSI would manage
connection and authentication, so your TOE stack can be simply around
the block acceleration piece (i.e. you'd need to get the iscsi daemon to
do relogin and things).

I would assume net will require some indicator that the opened
connection has been subsumed, so it knows not to try to manage it, but
other than that I don't see that it will need any alteration.  The usual
tools, like netfilter, could even use this information to know the limits
of their management.

If this model works, we can use it for TOE acceleration of individual
applications (rather than the entire TCP stack) on an as needed basis.

This is like the port stealing proposal, but since the iSCSI daemon is
responsible for maintaining the session, the port isn't completely
stolen, just switched to accelerator mode when doing the iSCSI offload.

James


--

From: Mike Christie
Date: Monday, August 11, 2008 - 9:50 am

This is what Chelsio and Broadcom do today more or less. Chelsio did the 
socket trick you are proposing. Broadcom went with a different hack. But 
in the end both hook into the iscsi transport class (the current iscsi 
transport class works for this today), userspace daemon and tools, so 
that the iscsi daemon handles iscsi login, iscsi authentication and all 
--
