From the Nehalem processor onward, Intel processors support a hardware-accelerated CRC32c algorithm through the new CRC32 instruction in the SSE 4.2 instruction set. The patch detects the availability of the feature and chooses the most appropriate way to calculate the CRC32c checksum. Byte code instructions are used for compiler compatibility. No MMX / XMM registers are involved in the implementation.

Signed-off-by: Austin Zhang <austin.zhang@intel.com>
Signed-off-by: Kent Liu <kent.liu@intel.com>
---
 arch/x86/crypto/Makefile       |    2
 arch/x86/crypto/crc32c-intel.c |  192 +++++++++++++++++++++++++++++++++++++++++
 crypto/Kconfig                 |   11 ++
 include/asm-x86/cpufeature.h   |    2
 4 files changed, 207 insertions(+)

diff -Naurp linux-2.6/arch/x86/crypto/crc32c-intel.c linux-2.6-patch/arch/x86/crypto/crc32c-intel.c
--- linux-2.6/arch/x86/crypto/crc32c-intel.c 1969-12-31 19:00:00.000000000 -0500
+++ linux-2.6-patch/arch/x86/crypto/crc32c-intel.c 2008-08-04 01:59:00.000000000 -0400
@@ -0,0 +1,192 @@
+/*
+ * Using hardware provided CRC32 instruction to accelerate the CRC32 disposal.
+ * CRC32C polynomial:0x1EDC6F41(BE)/0x82F63B78(LE)
+ * CRC32 is a new instruction in Intel SSE4.2, the reference can be found at:
+ * http://www.intel.com/products/processor/manuals/
+ * Intel(R) 64 and IA-32 Architectures Software Developer's Manual
+ * Volume 2A: Instruction Set Reference, A-M
+ */
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/kernel.h>
+#include <crypto/internal/hash.h>
+
+#include <asm/cpufeature.h>
+
+#define CHKSUM_BLOCK_SIZE 1
+#define CHKSUM_DIGEST_SIZE 4
+
+#ifdef CONFIG_X86_64
+#define REX_PRE "0x48, "
+#define SCALE_F 8
+#else
+#define REX_PRE
+#define SCALE_F 4
+#endif
+
+u32 crc32c_intel_le_hw_byte(u32 crc, unsigned char const *data, size_t length)
+{
+	while (length--) {
+		__asm__ __volatile__(
+			".byte 0xf2, 0xf, 0x38, 0xf0, 0xf1"
+			:"=S"(crc)
+			:"0"(crc), ...
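For readers who want to see what the instruction computes, here is a minimal userspace sketch of CRC32c done in software with the little-endian polynomial 0x82F63B78 quoted in the patch header. It is not code from the patch; the function names are made up for illustration, and the all-ones start value with a final inversion is the conventional CRC32c framing rather than something the patch excerpt states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reflected (little-endian) CRC32C polynomial, as quoted in the patch header. */
#define CRC32C_POLY_LE 0x82F63B78u

/* Hypothetical helper: feed one byte into a running CRC, one bit at a time. */
static uint32_t crc32c_sw_byte(uint32_t crc, uint8_t byte)
{
	int bit;

	crc ^= byte;
	for (bit = 0; bit < 8; bit++)
		crc = (crc >> 1) ^ ((crc & 1u) ? CRC32C_POLY_LE : 0u);
	return crc;
}

/* Hypothetical helper: update a running CRC over a buffer, no inversion. */
static uint32_t crc32c_sw_update(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	assert(p != NULL || len == 0);
	while (len--)
		crc = crc32c_sw_byte(crc, *p++);
	return crc;
}

/* Conventional framing (assumption): start from all ones, invert at the end. */
static uint32_t crc32c_sw(const void *buf, size_t len)
{
	return ~crc32c_sw_update(0xFFFFFFFFu, buf, len);
}
```

With this framing, `crc32c_sw("123456789", 9)` yields the standard CRC-32C check value 0xE3069283. The hardware instruction replaces the inner eight-step loop with a single operation.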
You could perhaps just use 'unsigned long' here, to avoid the ifdef. And it would be nice if we could make libcrc32c use this too, rather than just the 'crypto' users. -- David Woodhouse Open Source Technology Centre David.Woodhouse@intel.com Intel Corporation --
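A rough sketch of what the 'unsigned long' suggestion could look like: the buffer is split into `sizeof(unsigned long)` chunks plus a byte tail, so no per-architecture `SCALE_F` is needed. This is a userspace model, not the patch's code; `crc_step_byte` and `crc_step_word` are hypothetical software stand-ins for the byte-wide and word-wide forms of the instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software stand-in (assumption) for the byte-wide crc32 instruction. */
static uint32_t crc_step_byte(uint32_t crc, uint8_t byte)
{
	int bit;

	crc ^= byte;
	for (bit = 0; bit < 8; bit++)
		crc = (crc >> 1) ^ ((crc & 1u) ? 0x82F63B78u : 0u);
	return crc;
}

/* Assemble one word from memory, lowest address in the lowest byte. */
static unsigned long load_word_le(const uint8_t *p)
{
	unsigned long word = 0;
	size_t i;

	for (i = 0; i < sizeof(word); i++)
		word |= (unsigned long)p[i] << (8 * i);
	return word;
}

/* Software stand-in (assumption) for the word-wide crc32 instruction. */
static uint32_t crc_step_word(uint32_t crc, unsigned long word)
{
	size_t i;

	for (i = 0; i < sizeof(word); i++) {
		crc = crc_step_byte(crc, (uint8_t)(word & 0xff));
		word >>= 8;
	}
	return crc;
}

/* The word size comes from the type itself, so no #ifdef is required. */
static uint32_t crc32c_update_words(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;
	size_t words = len / sizeof(unsigned long);
	size_t tail = len % sizeof(unsigned long);

	assert(p != NULL || len == 0);
	while (words--) {
		crc = crc_step_word(crc, load_word_le(p));
		p += sizeof(unsigned long);
	}
	while (tail--)
		crc = crc_step_byte(crc, *p++);
	return crc;
}
```

In the real driver the REX prefix for the 64-bit form of the instruction would still have to be selected somehow; the sketch only shows that the loop arithmetic can follow `sizeof(unsigned long)`.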
I'm not sure if I remember correctly but I think Herbert was planning to convert all users over to the crypto API to avoid compile time dependency. Sebastian --
That's one way of doing it, although it seems a little bit like overkill in this particular case. -- David Woodhouse --
Yes that's the plan. I've been busy with the crypto testing stuff but I'll get onto this soon. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
... interface by new crypto because there were few users of the current libcrc32c interface. --
Are we deprecating libcrc32c, then? Or just turning it into a wrapper around the crypto code? Either way, does it really make sense to force all crc32 users to pull in the whole crypto framework? Some may get fractious about that... -- David Woodhouse --
Maybe I can move crc32c_intel_le_hw_byte and crc32c_intel_le_hw into one arch-specific file and make the current new crypto interface and libcrc32c ... If there are really few (or no) users of the previous interface, it would be reasonable to merge crc32c entirely into the crypto subsystem as a digest method. And I remember Herbert mentioning that he will convert the callers of the previous interface to the new crypto API. For crc32c, he has already done so. BTW, why do I always get each email on this thread twice :(? (the same email twice) --
You're probably subscribed to the linux-kernel list, and you're also being Cc'd directly. Normally, your filters should notice the copy which has a Return-Path: matching 'linux-kernel-owner.*@vger.kernel.org', and put that into your lkml folder with a few hundred other messages each day -- while the copy which is direct will still have the original sender, and would go into your inbox where you'll see it. -- dwmw2 --
There are only three crc32c users in the kernel tree and the crypto interface will serve them perfectly. Cheers, -- Herbert Xu --
(In reply to a message of Mon, 4 Aug 2008 22:04:35 +0800.) Isn't it a bit heavy for something as simple as a crc? (which after all is one instruction now ;0) -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
It does seem that way. For users who care about 'bloat' and are _only_ interested in crc32, it's yet another chunk of extra infrastructure which offers no benefit to them. And even for people who don't care about that, it doesn't look particularly good. It looks like btrfs would need either to keep setting up a crypto context and then tearing it down, or have a pool of long-standing contexts and do some kind of locking on them -- neither of which seem particularly optimal compared with just calling into libcrc32c. We can't even set up one context per cpu and disable preempt while we use it, can we? The routines are allowed to sleep? (Although I have to admit I do like the fact that it'd only be available through EXPORT_SYMBOL_GPL if we do force people to use the crypto API...) -- David Woodhouse --
No, you don't have to set things up every time you use crc32c. The crypto interface lets you have a single tfm that can be used by multiple users simultaneously. For ahash algorithms all the state is stored in the request, which can stay on the stack. Cheers, -- Herbert Xu --
Well AES on the PadLock is also a single instruction and nobody has ever complained :) Seriously, the crypto code is extremely small on the data path. The heaviest part is the indirect function call but you have to have that in order to support multiple implementations cleanly. All the fat is on the control path, i.e., tfm allocation. For crc32c you only need a single tfm since all the state is stored in the request object. Note that you should ignore the existing crc32c user, iSCSI, as it was written before the new crypto hash interface was available. I will be converting it along with the other two crc32c users to the new ahash interface. Cheers, -- Herbert Xu --
Long term I'd like to switch btrfs to the crypto api, but right now I'm ... code wrong, but it looks like my choices are to either have a long standing context and use locking around the digest/hash calls to protect internal crypto state, or create a new context every time and take a perf hit while crypto looks up the right module. Either way it looks slower than just calling good old libcrc32c. -chris --
You're looking at the old hash interface. New users should use the ahash interface which was only recently added to the kernel. It lets you store the state in the request object which you pass to the algorithm on every call. This means that you only need one tfm in the entire system for crc32c.

BTW, don't let the a in ahash intimidate you. It's meant to support synchronous implementations such as the Intel instruction just as well as asynchronous ones.

And if you're still not convinced here is the benchmark on the digest_null algorithm:

testing speed of stub_digest_null
test  0 (   16 byte blocks,   16 bytes per update,   1 updates):    190 cycles/operation,   11 cycles/byte
test  1 (   64 byte blocks,   16 bytes per update,   4 updates):    367 cycles/operation,    5 cycles/byte
test  2 (   64 byte blocks,   64 bytes per update,   1 updates):    192 cycles/operation,    3 cycles/byte
test  3 (  256 byte blocks,   16 bytes per update,  16 updates):   1006 cycles/operation,    3 cycles/byte
test  4 (  256 byte blocks,   64 bytes per update,   4 updates):    378 cycles/operation,    1 cycles/byte
test  5 (  256 byte blocks,  256 bytes per update,   1 updates):    191 cycles/operation,    0 cycles/byte
test  6 ( 1024 byte blocks,   16 bytes per update,  64 updates):   3557 cycles/operation,    3 cycles/byte
test  7 ( 1024 byte blocks,  256 bytes per update,   4 updates):    365 cycles/operation,    0 cycles/byte
test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):    191 cycles/operation,    0 cycles/byte
test  9 ( 2048 byte blocks,   16 bytes per update, 128 updates):   6903 cycles/operation,    3 cycles/byte
test 10 ( 2048 byte blocks,  256 bytes per update,   8 updates):    574 cycles/operation,    0 cycles/byte
test 11 ( 2048 byte blocks, 1024 bytes per update,   2 updates):    259 cycles/operation,    0 cycles/byte
test 12 ( 2048 byte blocks, 2048 bytes per update,   1 updates):    192 cycles/operation,    0 cycles/byte
test 13 ( 4096 byte blocks,   16 bytes ...
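The "one tfm, state in the request" idea can be illustrated with a small userspace model. This is not the kernel's ahash API: `model_tfm`, `model_req` and the `model_*` functions are hypothetical names chosen for the sketch. The point is only that the shared object is read-only after setup, while each caller keeps its running state in its own object, which can live on the stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a transform: read-only after setup, so shareable. */
struct model_tfm {
	uint32_t (*update)(uint32_t state, const uint8_t *data, size_t len);
};

/* Hypothetical model of a request: all mutable state lives here. */
struct model_req {
	const struct model_tfm *tfm;
	uint32_t state;
};

/* Bitwise CRC32c update, standing in for whichever implementation is bound. */
static uint32_t sw_update(uint32_t crc, const uint8_t *data, size_t len)
{
	int bit;

	while (len--) {
		crc ^= *data++;
		for (bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1u) ? 0x82F63B78u : 0u);
	}
	return crc;
}

/* The single shared instance; nothing in it changes per call. */
static const struct model_tfm crc32c_model_tfm = { .update = sw_update };

static void model_init(struct model_req *req, const struct model_tfm *tfm)
{
	assert(req != NULL && tfm != NULL);
	req->tfm = tfm;
	req->state = 0xFFFFFFFFu;
}

static void model_update(struct model_req *req, const void *data, size_t len)
{
	req->state = req->tfm->update(req->state, data, len);
}

static uint32_t model_final(const struct model_req *req)
{
	return ~req->state;
}
```

Two callers can interleave `model_update()` calls on separate `model_req` objects bound to the same `crc32c_model_tfm` without locking, because neither touches shared mutable data.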
Great to hear, that solves my main concern then. There is still the embedded argument against needing all of crypto api just for libcrc32c. It does make sense to me to have a libcrc32c that does the HW detection and uses HW assist when present, and just have the crypto api call that. -chris --
The existing users are iSCSI, SCTP, Infiniband, all of which are ... Well then you're going to have to do the check on every call. Seriously, I'm happy to trim off any fat from the crypto API for the embedded space. For a start, if you only needed hashing then we could do without the legacy cipher/compress support. That shaves off 800 bytes on i386. There is also still some legacy code in api.c itself. Getting rid of them should get us to around 2K. On the other hand, one of the advantages of doing it through the crypto API is that this kind of selection is useful for quite a few operations, e.g., xor or even memcpy. Cheers, -- Herbert Xu --
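To make the trade-off concrete, here is a hedged userspace sketch of avoiding a feature check on every call by choosing an implementation once and calling through a pointer afterwards, at the cost of an indirect call. All names are hypothetical, and `crc32c_accel` is only a placeholder that delegates to the generic routine; a real version would issue the CRC32 instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*crc32c_fn)(uint32_t crc, const void *buf, size_t len);

/* Generic bitwise update, usable on any CPU. */
static uint32_t crc32c_generic(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;
	int bit;

	assert(p != NULL || len == 0);
	while (len--) {
		crc ^= *p++;
		for (bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1u) ? 0x82F63B78u : 0u);
	}
	return crc;
}

/* Placeholder for the accelerated path (assumption: same results). */
static uint32_t crc32c_accel(uint32_t crc, const void *buf, size_t len)
{
	return crc32c_generic(crc, buf, len);
}

/* Chosen once; callers never test the CPU feature themselves. */
static crc32c_fn crc32c_impl = crc32c_generic;

/* Called once at start-up with the result of a feature probe. */
static void crc32c_select_impl(bool have_hw)
{
	crc32c_impl = have_hw ? crc32c_accel : crc32c_generic;
}

/* Hot path: one indirect call, no feature test. */
static uint32_t crc32c_update(uint32_t crc, const void *buf, size_t len)
{
	return crc32c_impl(crc, buf, len);
}
```

The probe result is passed in as a parameter so the sketch stays portable; how the feature is actually detected is outside its scope.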
(In reply to a message of Tue, 05 Aug 2008 00:45:34 +0800.) Well, you still have that indirect function call; for libcrc32 we could alternatives() that... -- If you want to reach me at my work email, use arjan@linux.intel.com --
I don't see why you couldn't do that for the crypto API too if you wanted to. That way it would benefit all crypto users rather than just the crc32c (note the extra c) users. Cheers, -- Herbert Xu --
Anyway, the point here is that crc32c is nothing special. It's just one out of many algorithms that have or will have hardware acceleration support. Rather than doing ad-hoc implementations and optimising them whenever such a thing pops up, let's spend our effort on creating a common platform that can be reused. Cheers, -- Herbert Xu --
How about making crc32c an inline function then? On processors that have this feature, this compiles to that single instruction, plus whatever setup it needs. Nice and efficient. On other processors, either inline the algorithm or inline a call to an out of line function, depending on how bulky this is. Similar for any other functions that may or may not have hw support. Helge Hafting --
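As an illustration of the inline-function idea, a userspace sketch might pick the instruction at compile time when the build targets SSE4.2 and fall back to a software step otherwise. This is an assumption-laden sketch, not proposed kernel code: it relies on the compiler defining `__SSE4_2__` and on the `_mm_crc32_u8` intrinsic from `<nmmintrin.h>`, and the function names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef __SSE4_2__
#include <nmmintrin.h>	/* _mm_crc32_u8 */
#endif

/* One CRC32c step: a single instruction if the build allows it. */
static inline uint32_t crc32c_byte(uint32_t crc, uint8_t byte)
{
#ifdef __SSE4_2__
	return _mm_crc32_u8(crc, byte);
#else
	int bit;

	crc ^= byte;
	for (bit = 0; bit < 8; bit++)
		crc = (crc >> 1) ^ ((crc & 1u) ? 0x82F63B78u : 0u);
	return crc;
#endif
}

/* Conventional framing (assumption): all-ones seed, inverted result. */
static inline uint32_t crc32c(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = 0xFFFFFFFFu;

	assert(p != NULL || len == 0);
	while (len--)
		crc = crc32c_byte(crc, *p++);
	return ~crc;
}
```

Note the limitation: the choice here is fixed when the code is compiled, so a binary built this way either requires SSE4.2 or never uses it. A kernel image that must boot on older CPUs needs a run-time decision instead, which is what the patch and the rest of the thread are about.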
Please read the thread carefully. Being a single instruction is nothing special. The same thing applies for other algorithms too, e.g., AES is also just a single instruction with the VIA PadLock (and Intel in future). The crypto API has handled this just fine. Cheers, -- Herbert Xu --
Since I couldn't find any ahash user in the tree (outside of tcrypt.c), can you provide some example source code showing how to use it (especially synchronously)? For example the code for the digest_null testing would be fine. regards, Benoit --
Sure, here is the async hash speed test. I haven't pushed it
yet because I'm thinking of picking up on David Howells' idea
of creating a sync hash interface that doesn't use scatterlists.
Note that you'll need the appended patch for this to compile as
the partial ahash functions were missing prototypes.

static int test_hash_cycles(struct ahash_request *req, struct scatterlist *sg,
			    int blen, int plen, char *out)
{
	unsigned long cycles = 0;
	int i, pcount;
	int ret;

	if (plen == blen)
		return test_hash_cycles_digest(req, sg, blen, out);

	ahash_request_set_crypt(req, sg, out, plen);

	local_bh_disable();
	local_irq_disable();

	/* Warm-up run. */
	for (i = 0; i < 4; i++) {
		ret = crypto_ahash_init(req);
		if (ret)
			goto out;
		for (pcount = 0; pcount < blen; pcount += plen) {
			ret = crypto_ahash_update(req);
			if (ret)
				goto out;
		}
		ret = crypto_ahash_final(req);
		if (ret)
			goto out;
	}

	/* The real thing. */
	for (i = 0; i < 8; i++) {
		cycles_t start, end;

		start = get_cycles();

		ret = crypto_ahash_init(req);
		if (ret)
			goto out;
		for (pcount = 0; pcount < blen; pcount += plen) {
			ret = crypto_ahash_update(req);
			if (ret)
				goto out;
		}
		ret = crypto_ahash_final(req);
		if (ret)
			goto out;

		end = get_cycles();

		cycles += end - start;
	}

out:
	local_irq_enable();
	local_bh_enable();

	if (ret)
		return ret;

	printk("%6lu cycles/operation, %4lu cycles/byte\n",
	       cycles / 8, cycles / (8 * blen));

	return 0;
}

static void test_hash_speed(const char *algo, unsigned int sec,
			    struct hash_speed *speed)
{
	struct scatterlist sg[TVMEMSIZE];
	struct crypto_ahash *tfm;
	char output[1024];
	int i;
	int ret;

	printk("\ntesting speed of %s\n", algo);

	tfm = crypto_alloc_ahash(algo, 0, CRYPTO_ALG_ASYNC);

	if (IS_ERR(tfm)) {
		printk("failed to load transform for %s: %ld\n", algo,
		       PTR_ERR(tfm));
		return;
	}

	{
		struct {
			struct ahash_request ...

Am I missing something here, or are you registering the crypto algorithm _unconditionally_ and then just causing init requests for it to fail on older hardware? Wouldn't it be better to register the driver _only_ when the hardware is capable? Or at least "if at least one cpu is ...

I think that should depend on CONFIG_X86? -- David Woodhouse --
Yes I think this is a show-stopper :) Thanks, -- Herbert Xu --
Thanks. --
This check needs to be moved to the module init function, and if it fails the module should not register the algorithm. Cheers, -- Herbert Xu --
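The requested shape, probing in the init function and registering only on success, can be modelled in plain C as follows. This is a userspace sketch under stated assumptions, not the kernel's registration API: `model_register_alg`, `crc32c_hw_mod_init`, the registry array and the driver name string are all hypothetical, and the feature probe result is passed in as a parameter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an algorithm descriptor. */
struct model_alg {
	const char *name;
};

/* Hypothetical registry standing in for the crypto core. */
#define MODEL_MAX_ALGS 4
static const struct model_alg *registered[MODEL_MAX_ALGS];
static size_t nr_registered;

static int model_register_alg(const struct model_alg *alg)
{
	assert(alg != NULL);
	if (nr_registered == MODEL_MAX_ALGS)
		return -ENOMEM;
	registered[nr_registered++] = alg;
	return 0;
}

/* Illustrative name only; the real driver name is not shown in the thread. */
static const struct model_alg crc32c_hw_alg = { "crc32c-intel" };

/*
 * Init entry point: refuse to load when the CPU lacks the instruction,
 * so the algorithm never appears on hardware that cannot run it.
 */
static int crc32c_hw_mod_init(bool cpu_has_crc32)
{
	if (!cpu_has_crc32)
		return -ENODEV;
	return model_register_alg(&crc32c_hw_alg);
}
```

With this arrangement, users on older hardware never see the accelerated algorithm at all, rather than finding it and then having each init request fail.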
You need some sort of a dependency here. See how the other assembly algorithms do it. Cheers, -- Herbert Xu --
How about:

+config CRYPTO_CRC32C_INTEL
+	tristate "CRC32c INTEL hardware acceleration"
+	depends on X86
+	select CRYPTO_ALGAPI
+	help
+	  In Intel processor with SSE4.2 supported, the processor will ......

It should only depend on X86. --
Yes that looks good. Thanks, -- Herbert Xu --
and don't end lines with spaces... --- ~Randy Linux Plumbers Conference, 17-19 September 2008, Portland, Oregon USA http://linuxplumbersconf.org/ --
Thanks a lot:) --
