Hi All, Please review the next iteration of the patch set of the new (although, in fact, the oldest) SCSI target framework for Linux SCST with a set of dev handlers and 3 target drivers: for local access (scst_local), for Infiniband SRP (srpt) and IBM POWER Virtual SCSI . SCST is the most advanced and features rich SCSI target subsystem for Linux. It allows to build the fastest devices and clusters delivering millions of IOPS. Many companies are using it as a foundation for their storage products and solutions. List of some of them you can find on http://scst.sourceforge.net/users.html. There are also many other who not yet added there or prefer for some reasons to not be listed on this page. The previous iterations of the SCST patch set you can find on http://lkml.org/lkml/2008 and http://lkml.org/lkml/2010/4/13/146. We believe that code is fully mainline ready (considering that ibmvstgt driver for SCST not yet tested on hardware). This iteration for simplicity contains only 2 target drivers: for local access (scst_local) and SRP (ib_srpt). If SCST accepted, we will prepare and submit patches for other target drivers: for iSCSI, FCoE, QLogic Fibre Channel QLA 2xxx, Emulex Fibre Channel/FCoE and Marvell 88SE63xx/64xx/68xx/94xx SAS hardware. Also this patchset contains port of ibmvstgt driver for SCST, many thanks to Bart Van Assche for performing it! This patchset is for kernel 2.6.35. On request we will rebase it for any other kernel tree. Since SCST is quite self-contained and almost doesn't touch outside kernel code, there are no problems in it. As Dmitry Torokhov requested (thanks for the review!), in this iteration we added more detail descriptions of each patch. Since the space for description is too limited, we can't describe full SCST internals (it's worth a book). But we hope that those description will be a good starting point. More detail description of SCST, its drivers and utilities you can find on SCST home page http://scst.sourceforge.net. ...
This patch contains SCST core's Makefile and Kconfig. Signed-off-by: Vladislav Bolkhovitin <vst@vlnb.net> --- Kconfig | 244 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Makefile | 11 ++ 2 files changed, 255 insertions(+) diff -uprN orig/linux-2.6.35/drivers/scst/Kconfig linux-2.6.35/drivers/scst/Kconfig --- orig/linux-2.6.35/drivers/scst/Kconfig +++ linux-2.6.35/drivers/scst/Kconfig @@ -0,0 +1,244 @@ +menu "SCSI target (SCST) support" + +config SCST + tristate "SCSI target (SCST) support" + depends on SCSI + help + SCSI target (SCST) is designed to provide unified, consistent + interface between SCSI target drivers and Linux kernel and + simplify target drivers development as much as possible. Visit + http://scst.sourceforge.net for more info about it. + +config SCST_DISK + tristate "SCSI target disk support" + default SCST + depends on SCSI && SCST + help + SCST pass-through device handler for disk device. + +config SCST_TAPE + tristate "SCSI target tape support" + default SCST + depends on SCSI && SCST + help + SCST pass-through device handler for tape device. + +config SCST_CDROM + tristate "SCSI target CDROM support" + default SCST + depends on SCSI && SCST + help + SCST pass-through device handler for CDROM device. + +config SCST_MODISK + tristate "SCSI target MO disk support" + default SCST + depends on SCSI && SCST + help + SCST pass-through device handler for MO disk device. + +config SCST_CHANGER + tristate "SCSI target changer support" + default SCST + depends on SCSI && SCST + help + SCST pass-through device handler for changer device. + +config SCST_PROCESSOR + tristate "SCSI target processor support" + default SCST + depends on SCSI && SCST + help + SCST pass-through device handler for processor device. + +config SCST_RAID + tristate "SCSI target storage array controller (RAID) support" + default SCST + depends on SCSI && SCST + help + SCST pass-through device handler for raid ...
This patch contains changes in drivers/Kconfig and drivers/Makefile to integrate SCST into the Linux kernel tree. Signed-off-by: Vladislav Bolkhovitin <vst@vlnb.net> --- Kconfig | 2 ++ Makefile | 1 + 2 files changed, 3 insertions(+) diff -upkr -X linux-2.6.35/Documentation/dontdiff linux-2.6.35/drivers/Kconfig linux-2.6.35/drivers/Kconfig --- orig/linux-2.6.35/drivers/Kconfig 01:51:29.000000000 +0400 +++ linux-2.6.35/drivers/Kconfig 14:14:46.000000000 +0400 @@ -22,6 +22,8 @@ source "drivers/ide/Kconfig" source "drivers/scsi/Kconfig" +source "drivers/scst/Kconfig" + source "drivers/ata/Kconfig" source "drivers/md/Kconfig" diff -upkr -X linux-2.6.35/Documentation/dontdiff linux-2.6.35/drivers/Makefile linux-2.6.35/drivers/Makefile --- orig/linux-2.6.35/drivers/Makefile 01:51:29.000000000 +0400 +++ linux-2.6.35/drivers/Makefile 14:15:29.000000000 +0400 @@ -43,6 +43,7 @@ obj-$(CONFIG_ATM) += atm/ obj-y += macintosh/ obj-$(CONFIG_IDE) += ide/ obj-$(CONFIG_SCSI) += scsi/ +obj-$(CONFIG_SCST) += scst/ obj-$(CONFIG_ATA) += ata/ obj-y += net/ obj-$(CONFIG_ATM) += atm/ --
This patch contains declarations of all externally visible SCST constants, types and functions prototypes. Particularly, it has definitions of all core SCST structures: - scst_tgt_template - defines SCST target driver, its callbacks and SCST core behavior dealing with it. - scst_dev_type - defines SCST backend driver, its callbacks and SCST core behavior dealing with it. - scst_tgt - defines SCST target (SCSI target port). Analog of Scsi_Host on the initiator side. - scst_session - defines SCST session (SCSI I_T nexus) - scst_cmd - defines SCST command (SCSI I_T_L_Q nexus) - scst_device - defines SCST device (SCSI device) - scst_tgt_dev - defines SCSI I_T_L nexus - scst_acg - defines SCST access control group. Such groups define which initiators see which LUNs. as well as prototypes for all the functions to manage those objects. Signed-off-by: Vladislav Bolkhovitin <vst@vlnb.net> --- scst.h | 3642 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ scst_const.h | 376 ++++++ 2 files changed, 4018 insertions(+) diff -uprN orig/linux-2.6.35/include/scst/scst_const.h linux-2.6.35/include/scst/scst_const.h --- orig/linux-2.6.35/include/scst/scst_const.h +++ linux-2.6.35/include/scst/scst_const.h @@ -0,0 +1,376 @@ +/* + * include/scst_const.h + * + * Copyright (C) 2004 - 2010 Vladislav Bolkhovitin <vst@vlnb.net> + * Copyright (C) 2007 - 2010 ID7 Ltd. + * + * Contains common SCST constants. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation, version 2 + * of the License. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#ifndef __SCST_CONST_H +#define ...
This patch contains main management files and private headers. File scst_main.c contains management functions to create and destroy target drivers, targets, sessions, device drivers, devices and threads. A typical life-circle of a target driver: 1. scst_register_target_template() 2. scst_register_target() for each target 3. scst_register_session() for each incoming session. 4. Serving commands: 4.1. scst_rx_cmd() on a new command 4.2. scst_cmd_init_done() after the command's initialization finished 4.3. For WRITE-direction commands only: 4.3.1. rdy_to_xfer() callback to receive data into prepared buffer 4.3.2. scst_rx_data() to notify SCST core that all the data received 4.4. xmit_response() callback to transfer response with (possibly) data to the initiator 4.5. scst_tgt_cmd_done() - to notify SCST core that the response has sent 4.6. on_free_cmd() callback to notify that the command is about to be freed 5. scst_unregister_session() when initiator disconnects 6. scst_unregister_target() 7. scst_unregister_target_template() A typical lifecircle of a virtual dev handler: 1. scst_register_virtual_dev_driver() 2. scst_register_virtual_device() for each virtual device as requested user space configurator 3. Serving commands: 3.1. parse() callback to ensure that command initialized correctly and initialize what's not initialized yet. Particularly, data transfer direction and data transfer size. 3.2. exec() callback to execute the command 3.3. Calling cmd->scst_cmd_done() to notify SCST core that the command finished. 3.4. dev_done() callback to notify that SCST core finished post processing of the command 3.5. on_free_cmd() callback to notify that the command is about to be freed 4. scst_unregister_virtual_device() 5. scst_register_virtual_dev_driver() Signed-off-by: Vladislav Bolkhovitin <vst@vlnb.net> --- scst_main.c | 2105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ scst_module.c | 63 + ...
This patch contains implementation of the SCSI target state machine as well as functions to interact with target drivers and dev handlers during commands processing. The core of SCSI target state machine is scst_process_active_cmd(), which depending on the current processing state (cmd->state) schedules the next processing function. All the processing supposed to be asynchronous, i.e. with no sleeping states. Each command can be processed only by a single thread at time. There are the following processing states: 1. SCST_CMD_STATE_INIT_WAIT - waiting for the target driver to call scst_cmd_init_done() 2. SCST_CMD_STATE_INIT - LUN is being translated to the corresponding SCST device 3. SCST_CMD_STATE_PRE_PARSE - internal parsing of the command to do the most common work of it in scst_pre_parse() 4. SCST_CMD_STATE_DEV_PARSE - the dev handler's parse() is going to be called 5. SCST_CMD_STATE_PREPARE_SPACE - allocation of the cmd's data buffer 6. SCST_CMD_STATE_PREPROCESSING_DONE - calling the target driver's preprocessing_done(), if requested by it. 7. SCST_CMD_STATE_PREPROCESSING_DONE_CALLED - waiting for the target driver to call scst_restart_cmd() 8. SCST_CMD_STATE_RDY_TO_XFER - the target driver's rdy_to_xfer() is going to be called 9. SCST_CMD_STATE_DATA_WAIT - waiting for the target driver to call scst_rx_data() 10. SCST_CMD_STATE_TGT_PRE_EXEC - the target driver's pre_exec() is going to be called 11. SCST_CMD_STATE_SEND_FOR_EXEC - cmd is going to be sent for execution 12. SCST_CMD_STATE_LOCAL_EXEC - cmd is being checked if it should be executed locally (e.g. REPORT LUNS or reservation commands) and, if yes, execute them 13. SCST_CMD_STATE_REAL_EXEC - cmd is being sent for execution 14. SCST_CMD_STATE_REAL_EXECUTING - waiting for the dev handler to call cmd->scst_cmd_done() 15. SCST_CMD_STATE_PRE_DEV_DONE - internal post-exec checks 16. SCST_CMD_STATE_MODE_SELECT_CHECKS - internal MODE SELECT pages related checks for MODE SELECT command to ...
This patch contains SCST internal library functions. For instance to create
and destroy various internal objects, like commands, targets, sessions, etc.
Signed-off-by: Vladislav Bolkhovitin <vst@vlnb.net>
---
scst_lib.c | 7007 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 7007 insertions(+)
diff -uprN orig/linux-2.6.35/drivers/scst/scst_lib.c linux-2.6.35/drivers/scst/scst_lib.c
--- orig/linux-2.6.35/drivers/scst/scst_lib.c
+++ linux-2.6.35/drivers/scst/scst_lib.c
@@ -0,0 +1,7007 @@
+/*
+ * scst_lib.c
+ *
+ * Copyright (C) 2004 - 2010 Vladislav Bolkhovitin <vst@vlnb.net>
+ * Copyright (C) 2004 - 2005 Leonid Stoljar
+ * Copyright (C) 2007 - 2010 ID7 Ltd.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation, version 2
+ * of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/kthread.h>
+#include <linux/cdrom.h>
+#include <linux/unistd.h>
+#include <linux/string.h>
+#include <linux/ctype.h>
+#include <linux/delay.h>
+#include <linux/vmalloc.h>
+#include <asm/kmap_types.h>
+#include <asm/unaligned.h>
+
+#include <scst/scst.h>
+#include "scst_priv.h"
+#include "scst_mem.h"
+#include "scst_pres.h"
+
+struct scsi_io_context {
+ void *data;
+ void (*done)(void *data, char *sense, int result, int resid);
+ char sense[SCST_SENSE_BUFFERSIZE];
+};
+static struct kmem_cache *scsi_io_context_cache;
+
+/* get_trans_len_x extract x bytes from cdb as length starting from off ...This patch contains SCST Persistent Reservations implementation. SCST implements Persistent Reservations with full set of capabilities, including "Persistence Through Power Loss". The "Persistence Through Power Loss" data are saved in /var/lib/scst/pr with files with names the same as the names of the corresponding devices. Also this directory contains backup versions of those files with suffix ".1". Those backup files are used in case of power or other failure to prevent Persistent Reservation information from corruption during update. Signed-off-by: Alexey Obitotskiy <alexeyo1@open-e.com> Signed-off-by: Vladislav Bolkhovitin <vst@vlnb.net> --- scst_pres.c | 2498 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ scst_pres.h | 159 +++ 2 files changed, 2657 insertions(+) diff -uprN orig/linux-2.6.35/drivers/scst/scst_pres.c linux-2.6.35/drivers/scst/scst_pres.c --- orig/linux-2.6.35/drivers/scst/scst_pres.c +++ linux-2.6.35/drivers/scst/scst_pres.c @@ -0,0 +1,2498 @@ +/* + * scst_pres.c + * + * Copyright (C) 2009 - 2010 Alexey Obitotskiy <alexeyo1@open-e.com> + * Copyright (C) 2009 - 2010 Open-E, Inc. + * Copyright (C) 2009 - 2010 Vladislav Bolkhovitin <vst@vlnb.net> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation, version 2 + * of the License. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include <linux/init.h> +#include <linux/kernel.h> +#include <linux/errno.h> +#include <linux/list.h> +#include <linux/spinlock.h> +#include <linux/slab.h> +#include <linux/sched.h> +#include <linux/smp_lock.h> +#include <linux/unistd.h> +#include <linux/string.h> +#include ...
This patch contains SCST debugging support routines.
Signed-off-by: Vladislav Bolkhovitin <vst@vlnb.net>
---
drivers/scst/scst_debug.c | 223 ++++++++++++++++++++++++++++++++++++
include/scst/scst_debug.h | 284 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 507 insertions(+)
diff -uprN orig/linux-2.6.35/include/scst/scst_debug.h linux-2.6.35/include/scst/scst_debug.h
--- orig/linux-2.6.35/include/scst/scst_debug.h
+++ linux-2.6.35/include/scst/scst_debug.h
@@ -0,0 +1,284 @@
+/*
+ * include/scst_debug.h
+ *
+ * Copyright (C) 2004 - 2010 Vladislav Bolkhovitin <vst@vlnb.net>
+ * Copyright (C) 2004 - 2005 Leonid Stoljar
+ * Copyright (C) 2007 - 2010 ID7 Ltd.
+ *
+ * Contains macroses for execution tracing and error reporting
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation, version 2
+ * of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __SCST_DEBUG_H
+#define __SCST_DEBUG_H
+
+#include <generated/autoconf.h> /* for CONFIG_* */
+
+#include <linux/bug.h> /* for WARN_ON_ONCE */
+
+#ifdef CONFIG_SCST_EXTRACHECKS
+#define EXTRACHECKS_BUG_ON(a) BUG_ON(a)
+#define EXTRACHECKS_WARN_ON(a) WARN_ON(a)
+#define EXTRACHECKS_WARN_ON_ONCE(a) WARN_ON_ONCE(a)
+#else
+#define EXTRACHECKS_BUG_ON(a) do { } while (0)
+#define EXTRACHECKS_WARN_ON(a) do { } while (0)
+#define EXTRACHECKS_WARN_ON_ONCE(a) do { } while (0)
+#endif
+
+#define TRACE_NULL 0x00000000
+#define TRACE_DEBUG 0x00000001
+#define TRACE_FUNCTION 0x00000002
+#define TRACE_LINE 0x00000004
+#define TRACE_PID 0x00000008
+#define TRACE_BUFF ...This patch contains SYSFS interface implementation.
This interface provides possibility for a user to configure his/her SCST
server configuration: add/delete/manage target drivers, targets, dev handlers,
virtual devices and access control to them as well as it allows to see current
SCST configuration with necessary statistical and debug info (e.g. SGV cache
statistics).
For some management events processing is redirected to dedicated thread for
2 reasons:
1. It naturally serializes all SYSFS management operations, so allows to
simplify locking in target drivers and dev handlers. For instance, it allows
in add_target() callback don't worry if del_target() or another add_target()
for the same target name called simultaneously.
2. To make processing be done outside of the internal SYSFS locking.
All internal SCST management is done for simplicity under scst_mutex.
It's simple, robust and worked well even under the highest load for ages.
But in 2.6.35 sysfs was improved to make lockdep to check s_active related
deadlocks and we discovered potential circular locking dependency
between scst_mutex and s_active. On some management operations lockdep triggered
output like:
[ 2036.926891] =======================================================
[ 2036.927670] [ INFO: possible circular locking dependency detected ]
[ 2036.927670] 2.6.35-scst-dbg #15
[ 2036.927670] -------------------------------------------------------
[ 2036.927670] rmmod/4715 is trying to acquire lock:
[ 2036.927670] (s_active#230){++++.+}, at: [<78240a24>] sysfs_hash_and_remove+0x63/0x67
[ 2036.927670]
[ 2036.927670] but task is already holding lock:
[ 2036.927670] (&scst_mutex){+.+.+.}, at: [<fefd7fe2>] scst_unregister_virtual_device+0x58/0x216 [scst]
[ 2036.927670]
[ 2036.927670] which lock already depends on the new lock.
[ 2036.927670]
[ 2036.927670]
[ 2036.927670] the existing dependency chain (in reverse order) is:
[ 2036.927670]
[ 2036.927670] -> #2 (&scst_mutex){+.+.+.}:
[ 2036.927670] ...Nice, but you forgot to document it. All sysfs changes need to be documented in Documentation/ABI/ And here. thanks, greg k-h --
Thanks for the review. In all those functions kobjects for simplicity are embedded into the outer objects, so they will be freed as part of the outer objects free. Hence, kfree() for the kobjects in the release functions are not needed. Thanks again, Vlad --
Sweet, you now have opened yourself up to public ridicule as per the documentation in the kernel for how to use kobjects! Nice job :) Seriously, you CAN NOT DO THIS! If you embed a kobject in a different structure, then you have to rely on the kobject to handle the reference counting for that larger structure. To do ANYTHING else is a bug and wrong. Please read the kobject documentation and fix this code up before submitting it again. thanks, greg k-h --
Sure, I have read it and we rely on the kobject to handle the reference counting for the larger structure. It's only done not in a straightforward way, because the way it is implemented is simpler for us + for some other reasons. For instance, for structure scst_tgt it is done using tgt_kobj_release_cmpl completion. When a target driver calls scst_unregister_target(), scst_unregister_target() in the end calls scst_tgt_sysfs_del(), which calls kobject_put(&tgt->tgt_kobj) and wait for tgt_kobj_release_cmpl to complete. At this point tgt_kobj can be taken only by the SYSFS. Scst_tgt_sysfs_del() can wait as much as needed until the SYSFS code released it. As far as I can see, it can't be forever, so it's OK. Then, after scst_tgt_sysfs_del() returned, scst_unregister_target() will free scst_tgt together with embedded tgt_kobj. Sure, if you insist, I can convert tgt_kobj and other similar kobjects to pointers, but it would be just a formal code introducing additional kmalloc()/kfree() pair per each kobject without changing any logic anywhere. Thanks, Vlad --
I don't understand, why can't you just free the memory, what are you having to wait for? You are only having one kobject for your structure, right? If so, then free the memory in the release, to wait for something else to free the As no other kernel code is like this, I don't think it's valid to be doing so, sorry. Please fix this. good luck, greg k-h --
I'm sorry, but after I started implementing it I'm confused..
Originally I thought you are asking to make tgt_kobj be not embedded in
struct scst_tgt, but a pointer in it, so scst_tgtt_release() will
kfree() tgt_kobj. Hence, all the above I wrote about why we have
tgt_kobj embedded.
But now I feel like you are asking that scst_tgtt_release() should
kfree() tgt, not tgt_kobj.
Is it correct?
If yes, we did it this way to make errors processing uniform and
straightforward. In the code (simplified) our current errors recovery
looks like:
void scst_tgt_sysfs_del(struct scst_tgt *tgt)
{
...
kobject_put(&tgt->tgt_kobj);
wait_for_completion(&tgt->tgt_kobj_release_cmpl);
}
int scst_tgt_sysfs_create(struct scst_tgt *tgt)
{
init_completion(&tgt->tgt_kobj_release_cmpl);
res = kobject_init_and_add(&tgt->tgt_kobj, &tgt_ktype,
&tgt->tgtt->tgtt_kobj, tgt->tgt_name);
if (res != 0)
goto out;
res = sysfs_create_file(&tgt->tgt_kobj,
&tgt_enable_attr.attr);
if (res != 0)
goto out_err;
...
out:
return res;
out_err:
scst_tgt_sysfs_del(tgt);
goto out;
}
struct scst_tgt *scst_register_target()
{
struct scst_tgt *tgt;
int rc = 0;
TRACE_ENTRY();
rc = scst_alloc_tgt();
if (rc != 0)
goto out;
...
tgt->tgt_name = kmalloc(strlen(target_name) + 1, GFP_KERNEL);
if (tgt->tgt_name == NULL)
goto out_free_tgt;
...
mutex_lock();
rc = scst_tgt_sysfs_create(tgt);
if (rc < 0)
goto out_unlock;
tgt->default_acg = scst_alloc_add_acg();
if (tgt->default_acg == NULL)
goto out_sysfs_del;
...
out:
return tgt;
out_sysfs_del:
mutex_unlock();
scst_tgt_sysfs_del(tgt)
goto out_free_tgt;
out_unlock:
mutex_unlock();
out_free_tgt:
scst_free_tgt(tgt);
}
We have a simple and straightforward errors recovery semantic: if
scst_tgt_sysfs_create() failed, then tgt_kobj returned deinited as if
scst_tgt_sysfs_create() has never been called. This is a regular
practice in the kernel: don't return half-initialized ...I am asking that ANY kobject release function call kfree to release the memory that object is embedded in. That is how kobjects work, please That's not "error prone", it's "people don't read the provided documentation about how to use the API". And yes, one could argue to make the API easier to use, and patches are There are good reasons, but the most important one being, if you pass off an object, as of that moment in time, you had better handle the reference counting correctly. If you think of it this way, cleanup logic is even simpler as that's the only rule you ever need to think Again, yes. Please read the documentation. thanks, greg k-h --
Here is the patch. It converts all the cases except 3:
- struct scst_tgt_template and struct scst_dev_type, because they are supplied by
target drivers and dev handlers correspondingly and usually static.
- struct scst_session, because sometimes we need to create several sessions for the
same initiator (for iSCSI sessions reinstatement, for instance) and for such cases need
a logic to have different SYSFS names for them. To have this logic functioning correctly,
we need to keep sessions alive longer than life of their kobject (see scst_sess_sysfs_create()
and scst_free_session() for more details).
I don't like this patch, because it makes the code in almost 100 lines of code bigger and
worse by making objects initialization/freeing a lot more complicated and harder to audit.
But, since you insist, I applied it.
It isn't late to revert it, just say a word.
Thanks,
Vlad
Signed-off-by: Vladislav Bolkhovitin <vst@vlnb.net>
---
drivers/scst/scst_lib.c | 102 ++++++++++++++----
drivers/scst/scst_main.c | 26 +---
drivers/scst/scst_priv.h | 37 +++++-
drivers/scst/scst_sysfs.c | 247 ++++++++++++++++++++++++----------------------
include/scst/scst.h | 22 +++-
5 files changed, 261 insertions(+), 173 deletions(-)
diff -upr linux-2.6.35/include/scst/scst.h linux-2.6.35/include/scst/scst.h
--- include/scst/scst.h
+++ include/scst/scst.h
@@ -1431,8 +1431,10 @@ struct scst_tgt {
char *default_group_name;
#endif
+ unsigned int tgt_kobj_initialized:1;
+
/* sysfs release completion */
- struct completion tgt_kobj_release_cmpl;
+ struct completion *tgt_kobj_release_cmpl;
struct kobject tgt_kobj; /* main targets/target kobject */
struct kobject *tgt_sess_kobj; /* target/sessions/ */
@@ -2077,6 +2079,9 @@ struct scst_device {
/* If set, dev is read only */
unsigned short rd_only:1;
+ /* If set, dev_kobj was initialized */
+ unsigned short dev_kobj_initialized:1;
+
...It's the middle of the merge window, and I'm about to go on vacation, so
I didn't read this patch after this line.
It's obvious that this patch is wrong, you shouldn't need to worry about
this. And even if you did, you don't need this flag.
Why are you trying to do something that no one else needs? Why make
things harder than they have to be.
{sigh}
greg k-h
--
I tried to explain that to you in http://lkml.org/lkml/2010/10/14/291 and mentioned there the need to create this flag to track half-initialized kobjects. You agreed (http://lkml.org/lkml/2010/10/14/299) that don't return half-initialized objects is a regular kernel practice, but then requested to strictly bound the larger object freeing to its kobject release(), which means that all SYSFS creating functions now have to return half-initialized SYSFS hierarchy in case of any error. Hence the flag to track it. Simply, any SCST object has a lot of other things to initialize besides its kobject, hence the need either to free it independently from its kobject, or track by a flag if its kobjects initialized. For illustration, here is the simplified code from my previous example, i.e. without this patch. I added to it scst_unregister_target() to make it more complete. void scst_tgt_sysfs_del(struct scst_tgt *tgt) { ... kobject_put(&tgt->tgt_kobj); wait_for_completion(&tgt->tgt_kobj_release_cmpl); } int scst_tgt_sysfs_create(struct scst_tgt *tgt) { init_completion(&tgt->tgt_kobj_release_cmpl); res = kobject_init_and_add(&tgt->tgt_kobj, &tgt_ktype, &tgt->tgtt->tgtt_kobj, tgt->tgt_name); if (res != 0) goto out; res = sysfs_create_file(&tgt->tgt_kobj, &tgt_enable_attr.attr); if (res != 0) goto out_err; ... out: return res; out_err: scst_tgt_sysfs_del(tgt); goto out; } struct scst_tgt *scst_register_target() { struct scst_tgt *tgt; int rc = 0; rc = scst_alloc_tgt(); if (rc != 0) goto out; ... tgt->tgt_name = kmalloc(strlen(target_name) + 1, GFP_KERNEL); if (tgt->tgt_name == NULL) goto out_free_tgt; ... mutex_lock(); rc = scst_tgt_sysfs_create(tgt); if (rc < 0) goto out_unlock; tgt->default_acg = scst_alloc_add_acg(); if (tgt->default_acg == NULL) goto out_sysfs_del; ... out: return tgt; out_sysfs_del: mutex_unlock(); scst_tgt_sysfs_del(tgt) goto ...
I agreed that you needed to do something about it, not that this is the correct way to do it. Think for a second as to why your code path looks different from EVERY other kobject user in the kernel. Perhaps it is incorrect? You don't need all this completion mess, in fact, it's wrong. Just do what everyone else does please, as that is the simpler, and correct, way to do it. Oh, and why are you using a kobject at all anyway? Shouldn't you be using a 'struct device'? Anyway, I don't have time for this anymore for the next 2 weeks, good luck. greg k-h --
Hello Greg, Why SCST objects are different from most other kernel objects and why they can't be implemented the same way as them is exactly what I'm trying to explain you. Let me try again from a bit different angle. SCST objects are different from the most other kernel objects, because they are very complex, hence complex to initialize and delete with dependencies in order how initialization and delete actions should be performed. Particularly, SCST objects have a lot of attributes and sub-objects, so their kobjects can't be inited and exposed to SYSFS or removed from it until they reach some point during initialization or delete correspondingly, otherwise their attributes' reads and writes can crash. I can elaborate all those on examples, if you request. I understand you are very busy and appreciate very much your review and comments, so be glad to implement what you are requesting. But I don't see how to implement that in an acceptable manner with neither the completion-based delete as now, nor with x_initialized flag as in the previous patch. Note, SCST's kobjects are not the only kobjects in the kernel using the completion based delete. See, for instance, ktype_cpufreq, ktype_cpuidle or ext4_ktype. Also, elevator (struct elevator_queue) uses "registered" flag to see if it was added to the SYSFS. Thus, you can see, both the completion and the flag approaches already used in the kernel. Hence, I believe, the way how SCST is doing it should also be acceptable. I'm not a person who, as you, probably, guessed, at first doing, only then thinking and reading docs. I before writing a code line at first read all the available docs and carefully consider possible alternatives, so there isn't a well thought bit in SCST. We need only to represent our internal SCST objects on the SYSFS. The objects are not devices, but special entities for special purposes. For instance, struct scst_session is a representation of SCSI I_T nexuses. Struct scst_tgt_dev - I_T_L ...
Then don't abuse kobjects with this "different" type of kobject, as that is not how the kobject code was designed to be used. That sounds like an implementation error. No other kernel code has that ext4 is using this to get stuff under the /sys/fs/ext4 location. And even there, I don't think it is using kobjects correctly, but I really don't want to go audit that code at the moment. cpufreq and cpuidle is also probably incorrect, but again, I don't feel like auditing it right now. No, you should be using a struct device as you are putting stuff into the device tree. NEVER put a kobject into the device tree, that is just wrong and will cause problems. thanks, greg k-h --
SCST objects are generally 1:1 mapping of SCSI Architecture Model (SAM) objects. Is SAM insane and should have been designed with kobjects in "To get stuff under /sys/..." is exactly for what SCST needs kobjects. SCST objects are user interface agnostic and can be equally well exposed to user space via SYSFS, PROCFS, or anything else, like CONFIGFS or Sorry, but what is incorrect in the working implementation without any bugs doing its job in the simplest, smallest and clearest way? If those objects remade to free themselves in the kobjects release(), what value would it add to them? Would the implementation be simpler, smaller or clearer? Not, I believe, new implementation would be only Hmm, we never put any stuff into the device tree. Why do you think we do? None of SCST objects has any relation to any hardware device, hence there's no need to touch the device tree. Thanks, Vlad --
Totally theoretically speaking, since I have not inspected the code. If today you wait for the count to reach zero, then unregister and send an event to some other subsystem to free the object. Is it not the same as if you take an extra refcount, unregister and send the event at count=1. Then at that other place decrement the last count to cause the object to be freed. I agree that it is hard to do lockless. what some places do is have an extra kref. The kobj has a single ref on it. everything takes the other kref. when that reaches zero the unregister and event fires and at free you decrement the only kobj ref to deallocate. This is one way. In some situations you can manage with a single counter it all depends. Boaz --
Thanks for sharing your thoughts with us. But the question isn't about
if it's possible to implement what we need locklessly. The question is
in two approaches how to synchronously delete objects with entries on SYSFS:
1. struct object_x {
...
struct kobject kobj;
struct completion *release_completion;
};
static void x_release(struct kobject *kobj)
{
struct object_x *x;
struct completion *c;
x = container_of(kobj, struct object_x, kobj);
c = x->release_completion;
kfree(x);
complete_all(c);
}
void del_object(struct object_x *x)
{
DECLARE_COMPLETION_ONSTACK(completion);
...
x->release_completion = &completion;
kobject_put(&x->kobj);
wait_for_completion(&completion);
}
and
2. struct object_x {
...
struct kobject kobj;
struct completion release_completion;
};
static void x_release(struct kobject *kobj)
{
struct object_x *x;
x = container_of(kobj, struct object_x, kobj);
complete_all(&x->release_completion);
}
void del_object(struct object_x *x)
{
...
kobject_put(&x->kobj);
wait_for_completion(&completion);
...
kfree(x);
}
Greg asserts that (1) is the only correct approach while (2) is
incorrect, and I'm trying to justify that (2) is correct too and
sometimes could be better, i.e. simpler and clearer, because it
decouples object_x from SYSFS and its kobj. Then kobj becomes an
ordinary member of struct object_x without any special treatment and
with the same lifetime rules as other members of struct object_x. While
in (1) all lifetime of struct object_x is strictly attached to kobj, so
it needs be specially handled with additional code for that if struct
object_x has many other members which needed to be initialized/deleted
_before and after_ kobj as we have in SCST.
Vlad
--
I'll admit I don't understand this all that well, but why not just have x_release() (based on (2)) do free(x), and have del_object do the kobject_put(&x->kobj) as its very last thing? --
We are discussing _synchronous_ delete of x, so need to wait until x->kobj released, hence the completion is needed in both cases. For instance, the sync delete is needed for targets to let the corresponding target driver be safely unloaded after its target unregistered. Vlad --
Ah, well to answer my own question, I guess that's (1). Never mind. --
Thanks for putting up an example we can now speak more specifically.
(And saved me the time to actually look at code)
I'll first comment on your code below and I have some questions, please
see if I understood you correctly. Later below I'll try to explain what I
I don't see the unregister of the object_x.kobj, where do
This put might not be the last put on the object, IOs in flight
and/or open files might have extra reference on the object.
We release our initial ref, and below wait for all operations
DECLARE_COMPLETION_ONSTACK(completion);
x->release_completion = &completion;
One possibility (There are others)
3. struct object_x {
...
struct kref kref;
struct kobject kobj;
struct completion *release_completion;
};
Every body takes kref_put(&object_x.kref) and kref_get(&object_x.kref)
I hope you have x_get/x_put, Yes?
static void x_kref_release(struct kref *kref)
{
struct object_x *x = container_of(kref, struct object_x, kref);
complete_all(&x->release_completion);
}
static void x_obj_release(struct kobject *kobj)
{
struct object_x *x = container_of(kobj, struct object_x, kobj);
kfree(x);
}
int x_put(struct object_x *x)
{
return kref_put(&x->kref, x_kref_release);
}
void del_object(struct object_x *x)
{
DECLARE_COMPLETION_ONSTACK(completion);
...
x->release_completion = &completion;
x_put(x)
wait_for_completion(&completion);
kobject_put(&x->kobj);
}
Or
4. Exactly Like 3 but without the extra kref member
Only x_put() changes and x_kref_release() now receives
an x_object
int x_put(struct object_x *x)
{
if (kobject_put(&x->kobj) == 1)
// Like above [3] x_kref_release()
x_kref_release(x);
}
Note that in 4 you don't actually have a kref member, and that you have
one extra ref on kobj from the beginning. In del_object above the first
x_put(x) makes it possible to reach the "1" count and then the final
kobject_put(&x->kobj); frees the object.
(You need to be carfull with [4] because it must ...This is racy, please never do this. thanks, greg k-h --
The last ref belongs to the core code. 1 means there are no more external clients on the object. So it can not race with decrements. But I guess there is a possibility that it can race with new increments. If it is the case that new increments can only come from, say, sysfs access, then if we call the x_put() == 1 after we are unregistered from sysfs and no new users are allowed then the counter can only go down and we have the last reference. No? Thanks Boaz --
Just don't do this, it's not worth it and will break over time when others mess with the code. Also note that kobject_put() does not even return a value, so the code above will not even compile, let alone work. thanks, greg k-h --
OK Point taken, it is fragile. So there is option [3] then, with the extra Thanks Boaz --
This is the last internal put. All other references are from outsiders. OK, I see. You know, all non-trivial things can be done in >1 correct way ;) (Although (4) is not too correct as Greg already wrote) Vlad --
The question is why do you need to wait here? I presume it is module unloading path, but then it is quite bad - you can easily wedge your subsystem if you make something to take a reference to your kobject while module is ytying to be unloaded. Back when sysfs attributes tied kobjects the easiest thing was to do: rmmod <module> < / sys/devices/..../attribute If you are done with the kobject - just proceed with what you were doing and let it die its own peaceful death some time later. You just need to make sure release code sticks around to free it and your subsystem core can be tasked with this. Use module counter to prevent unloading of the subsystem core until all kobjects belonging to the subsystem are destroyed. -- Dmitry --
On Fri, Nov 12, 2010 at 2:23 AM, Dmitry Torokhov Do you mean keeping a kref object in the kernel module, invoking kref_get() every time a kobject has been created and invoking kref_put() from the kobject/ktype release method ? That would help to reduce the race window but would not eliminate all races: as soon as the last kref_put() has been invoked from the release method, the module can get unloaded. And module unloading involves freeing all module code sections, including the section that contains the implementation of the release method. Which is a race condition. I'm not sure that it is even possible with the current kobject implementation to solve this race. I haven't found any information about this race in Documentation/kobject.txt. And it seems to me that the code in samples/kobject/kobject-example.c is vulnerable to this race: methods like foo_show() and foo_store() can access statically allocated memory ("static int foo") after the module has been unloaded. Although the race window is small, this makes me wonder whether module unloading been overlooked at the time the kobject subsystem has been designed and implemented ? Bart. --
No, you do not add a kref, but rather manipulate module use counter:
static void blah_blah_release(struct kobject *kobj)
{
struct blah_blah *b = to_blah_blah(kobj);
...
kfree(kobj);
module_put(THIS_MODULE);
}
int blah_blah_register(struct blah_blah *blah)
{
...
__module_get(THIS_MODULE);
...
return 0;
}
The above should reside in subsystem _core_ and it will pin the core
module until last kobject belonging to the subsystem is released.
Once all users are gone module counter will go to 0 and rmmod will
allow core to unload. Note that no new kobjects will be created while
module usage count is 0 because there are no users of the core - all of
them have to be unloaded already, otherwise module loader would have
--
Dmitry
--
On Fri, Nov 12, 2010 at 7:44 PM, Dmitry Torokhov Thanks for replying, but sorry, it's still not clear to me. The use counter of which module should be manipulated ? Manipulating the use counter of the module that contains the kobject/ktype callback function implementations would make it impossible to unload that module because rmmod refuses to unload any module whose use counter is above zero. And manipulating the use counter of a parent module would not help in any way. It would help if you could tell us where in the kernel tree we can find a good example. I have tried to find such an example, but all I found are examples of kobject release method implementations in which no attempt is made to prevent the aforementioned race: * iscsi_boot_kobj_release() in drivers/scsi/iscsi_boot_sysfs.c * pkt_kobj_release() in drivers/block/pktcdvd.c Bart. --
This is a very good question. During implementation I spent a lot of time working on it. In fact, the first implementation was asynchronous similarly as you proposing, i.e. it just proceed with what you were doing and let it die its own peaceful death some time later. But soon the implementation was becoming so complicated, so it started getting out of control. For instance, some of the tasks to solve with this approach were: 1. What to do if another SCST object is being created with the same name as supposed to be deleted, but not completely dead yet? 2. What to do if a dieing object is found on some list and reference for is supposed to be taken? If the object deleted from the list before it marked dieing, i.e. the latest internal put() done, it made additional problems during deleting it after the latest external put done. ... So, I decided to reimplement it to be completely synchronous. SYSFS authors did really great job and thanks to the excellent internal SYSFS design and implementation it is absolutely safe. See: [root@tgt ~]# modprobe scst [root@tgt ~]# cd /sys/kernel/scst_tgt/ [root@tgt scst_tgt]# ls -l total 0 drwxr-xr-x 4 root root 0 Nov 13 21:31 devices drwxr-xr-x 2 root root 0 Nov 13 21:31 handlers -r--r--r-- 1 root root 4096 Nov 13 21:30 last_sysfs_mgmt_res -rw-r--r-- 1 root root 4096 Nov 13 21:30 setup_id drwxr-xr-x 5 root root 0 Nov 13 21:31 sgv drwxr-xr-x 2 root root 0 Nov 13 21:31 targets -rw-r--r-- 1 root root 4096 Nov 13 21:30 threads -rw-r--r-- 1 root root 4096 Nov 13 21:30 trace_level -r--r--r-- 1 root root 4096 Nov 13 21:30 version [root@tgt scst_tgt]# cat version 2.1.0-pre1 EXTRACHECKS DEBUG [root@tgt scst_tgt]# rmmod scst </sys/kernel/scst_tgt/version [root@tgt scst_tgt]# ls -l total 0 [root@tgt scst_tgt]# pwd /sys/kernel/scst_tgt [root@tgt scst_tgt]# lsmod Module Size Used by scsi_debug 65188 0 w83627hf 22424 0 hwmon_vid 2207 1 w83627hf adm1021 ...
Sorry, but no, you can't put this in /sys/kernel/ without getting the approval of the sysfs maintainer. I really don't understand why you are using kobjects in the first place, why isn't this in the main device tree in the kernel, using 'struct device'? In the end, I guess it really doesn't matter as this code isn't getting merged so I shouldn't worry about it, right? thanks, greg k-h --
It is my understanding that Vlad is able to reflect the topology by This is quite unfortunate as I still have not seen the public comparison of the 2 implementations and the lists of benefits and shortfalls for both of them. -- Dmitry --
On Mon, Nov 15, 2010 at 7:59 AM, Dmitry Torokhov The reason why we are proposing SCST for upstream inclusion are: - It has a large user base. - The project is known for its high performance, low latency I/O and excellent stability. - A feature-wise overview can be found here: http://scst.sourceforge.net/comparison.html. Bart. --
Correct. As I wrote in the previous e-mail, SCST doesn't deal with Indeed, it is unfortunate :(. Undercover political games continue... Vlad --
This is not nice and is uncharacteristic of you. This project, even though out-of-tree, is an old and mature project that has many users. These are all *Linux* users. The authors and community have come to us for help, and advice on making this code acceptable for mainline and hardening the code the way, only one project on the planet can do, the Linux community. I think it is our courtesy and obligation to the Linux users of this Project to comment where they are doing wrong and where they should do better. It is not of their choice to be out-of-tree. It is ours. The least we can do. Is give then some assistance if we can, and have 5 minutes of our time. All these issues we were discussing are interesting and are real Kernel problems. For instance the last comment you made was that for such a dynamic system and life time problems, and functionality. A better and expected solution might be the device tree and not sysfs. And for such big additions the sysfs maintainer must give his blessings. This is most valuable information regardless of if we accept their code or not at the end. Sincerely yours Boaz --
Why, I'm not allowed to get frustrated at repeated attempts to get the original poster to change their code to something that is acceptable and just give up and walk away? It is also the job of the kernel community to say "No, what you are doing is wrong, please don't do that." Yes, that is what I have been saying for a while now. Again: This code is using kobjects incorrectly. This code should not be using kobjects. this is my last response to this thread now, and I'm sure you can understand why. thanks, greg k-h --
Thank you Greg for your time and most valuable input. I'm sorry for not understanding your position. I needed the clear cut statement: This code should not be using kobjects. i.e not belong in sysfs SCST guys. This sounds pretty clear cut to me. Sysfs was not built in mind for such dynamic systems, and it will cause never ending conflicts with future maintenance of sysfs vs SCST. Perhaps consider a new alternative like the device tree as Greg suggested or maybe finally accept the harsh realities of ConfigFS, and come join us in the LIO project. SCST is a most valuable project and community which we would like to join forces with in making Linux the best. Lets call it Linux-Target and unify all our efforts. Boaz --
I think that Vlad has already explained several times why ConfigFS is not suited for the needs of a storage target: a storage target must not only be able to accept configuration information from userspace but must also be able to create new directories and file nodes itself. See e.g. this message from October 6: http://kerneltrap.org/mailarchive/linux-kernel/2010/10/6/4628664. Bart. --
<Trimming CC list> Sorry, but this post explains nothing but a single misguided and uninformed opinion, with no hard facts on the actual usage of a native configfs control plane within target mode infrastructure. The hard facts are: Using configfs works to drive the generation/release, real time configuration, dependancy relationships because: *) configfs represents the individual target data structures who's creation/deletion are driven from enterily userspace. *) The parent/child relationships of dependant data structures is handled transparently to the configfs consumer (eg: no hard requirement internal reference counting) *) The module reference counting of target core -> fabric module is handled transparently to configfs consumers *and* TCM fabric modules *) The current implementation of TCM/ConfigFS contains no global locks. Each /sys/kernel/config/target/$FABRIC_1 operates independently of /sys/kernel/config/target/$FABRIC_2 *) Expecting target fabric module developers to (by-hand) add struct config_groups, struct kobjects or anything else directly to their fabric module code is really clumsy and stupid. This problem was solved earlier this year in TCM v4.0 with the advent of: http://git.kernel.org/?p=linux/kernel/git/nab/lio-core-2.6.git;a=blob;f=drivers/target... What this code means is that target fabric module developers no longer has to duplicate complex control path code, and all of the interesting port attributes (like ALUA secondary access state for example) are transparent to new fabric modules, and 'just work'. We can even generate *functional* configfs skeletons for new fabric modules using tcm_mod_builder.py discussed here: http://git.kernel.org/?p=linux/kernel/git/nab/lio-core-2.6.git;a=blob;f=Documentation/... For the specific case of the target control path, until someone from SCST can present some hard facts *and* running code about ...
What is "misguided and uninformed opinion"? That you can't with ConfigFS serve real life needs for target driver developers and can't create targets for hardware target ports from withing the kernel and instead enforce users to clumsy workarounds to somehow magically know their names to manually perform "mkdir target_name"? The same problem exists for all other objects where you need to create ConfigFS entries (directories), like for per-session/per-initiator statistics or default ACLs. ConfigFS is just too simple to serve real life needs of a SCSI target subsystem. Vlad --
I'm just an outsider - but maybe my perspective has value - it seems there are two sides to this debate: 1) sysfs is great for scst due to certain stability concerns and code concerns 2) sysfs is bad for scst due to the intended role of sysfs and its namespace Maybe I misunderstand - But if both sides have merit then wouldn't a compromise be appropriate? Maybe the sensical compromise is to use sysfs code to create a new namespace that would fit this purpose? It seems that I am also hearing that the alternatives to sysfs aren't always adequate - so why not use sysfs, but have a place where it's appropriate to use it? Apologies in advance if I'm just way off base here... - Richard Williams--
(Since this discussion goes to a quite fundamental scope, I let myself to add Joel Becker and Linus Torvalds on CC) Your questions are very good, so let's summarize what we need to serve the needs of a SCSI target subsystem (not necessary SCST) and see what can fit them. So, the needs: 1. Be capable to represent to user space internal configuration to let user space be able to see and analyze it, including various statistics. 2. Let user space manage the internal configuration. 3. Desired: possibility to send to user space events about important internal actions, like I/O failures, which may need user space intervention to recover, like switching from active to passive role in a cluster. So, what can we do with ConfigFS: (1): Only partially, because by design ConfigFS isn't supposed to represent internal configuration, it can only manage it. Extending ConfigFS to be capable to do that would be, in my understanding, a strong violation of its purpose and, hence, design and if went this way eventually ConfigFS would become just a duplication of the SYSFS functionality. (2): ConfigFS can do that. This is exactly for what it was designed and implemented. But in this particular application it would have some limitations derived from (1): to manage harware-related entries a user should magically know from somewhen names of those entries to create them by "mkdir" command. For instance, consider a user has a Fibre Channel HBA and want to use it in the target mode. Before he can configure it, he should somehow know its ports names and for each of them run: # mkdir /sys/kernel/config/.../50:50:00:00:00:00:00:11 # mkdir /sys/kernel/config/.../50:50:00:00:00:00:00:12 ... where 50:50:00:00:00:00:00:1x are the ports' names. Only after that those ports appear on the ConfigFS and can be managed. (3): No events at all. Now consider SYSFS: (1): Easily. This is exactly for what it was designed and implemented. (2): Possible without any limitations and side ...
Since nobody objected, Greg, could you consider to ACK SCST SYSFS management interface in /sys/kernel/scst_tgt/, please? Please find the SCST SYSFS ABI documentation file you requested below. We are also preparing a new patch to free our objects in kobjects.release() without explicit x_initialized flags as you requested. Thank you for your effort, Vlad Documentation/ABI/testing/sysfs-scst: What: /sys/kernel/scst_tgt/ Date: November 2010 Contact: Vladislav Bolkhovitin <vst@vlnb.net> Description: Contains SCST management interface entries. What: /sys/kernel/scst_tgt/devices/ Date: November 2010 Contact: Vladislav Bolkhovitin <vst@vlnb.net> Description: Contains subdirectories for all SCST devices What: /sys/kernel/scst_tgt/devices/<device>/exported/exportX Date: November 2010 Contact: Vladislav Bolkhovitin <vst@vlnb.net> Description: Links to LUNs in the LUNs group where <device> exported, e.g. to /sys/kernel/scst_tgt/targets/iscsi/iqn.2006-10.net.vlnb:tgt/luns/11 What: /sys/kernel/scst_tgt/devices/<device>/handler Date: November 2010 Contact: Vladislav Bolkhovitin <vst@vlnb.net> Description: Link to dev handler of this device, if assigned, e.g. to /sys/kernel/scst_tgt/handlers/dev_disk What: /sys/kernel/scst_tgt/devices/<device>/type Date: November 2010 Contact: Vladislav Bolkhovitin <vst@vlnb.net> Description: SCSI type of this device as define by SAM. What: /sys/kernel/scst_tgt/handlers/ Date: November 2010 Contact: Vladislav Bolkhovitin <vst@vlnb.net> Description: Contains all SCST dev handlers. What: /sys/kernel/scst_tgt/handlers/<handler>/<device> Date: November 2010 Contact: Vladislav Bolkhovitin <vst@vlnb.net> Description: Links to <device> managed by this dev handler, e.g. ext3_disk1_4K -> /sys/kernel/scst_tgt/devices/ext3_disk1_4K What: /sys/kernel/scst_tgt/handlers/<handler>/mgmt Date: November 2010 Contact: Vladislav Bolkhovitin <vst@vlnb.net> Description: Management entry, which allows to ...
No, sorry, again, you should not be using kobjects, and do not polute the main /sys/kernel/ namespace with this. Use 'struct device' please, that is what it is there for, and is what the rest of the kernel is using. And use the rest of the driver/bus/device infrastructure as your model will work with it just fine. Yes, I know you said you didn't think you could use it, and that your code was different than everyone elses, but I still do not believe it, sorry. good luck, greg k-h --
Greg, sorry, I don't understand your requirements and, because of this, we can't go ahead implementing them. Could you explain your position, please? None of the SCST objects are Linux devices. None of them has entries in /dev, none of them needs to send any events to udev and none of them sends or receives data from DMA, hence has any DMA parameters or restrictions. So, how can them fit into the driver/bus/device model you are enforcing? For instance: - struct sgv_pool (/sys/kernel/scst_tgt/sgv/<cache>) is an SG vectors cache. Isn't it a nonsense to make it device? - struct scst_acg (/sys/kernel/scst_tgt/targets/<target_driver>/<target>/ini_groups/<group>/) is an access control group to define which initiators see which LUNs from target <target>. Same, isn't it a nonsense to make it device? - struct scst_acn (/sys/kernel/scst_tgt/targets/<target_driver>/<target>/ini_groups/<group>/initiators/<initiator_name>) is an entry in the access control group <group> to define which initiators should use group <group>. Again, isn't it a nonsense to make it device? Etc. How could they fit in the driver/bus/device model? I guess, your confusion comes from word "device" you see in the SCST SYSFS tree and SCST supposed to work with "devices" and "dev handlers"? But those devices are not Linux devices, they are just objects to couple initiators' requests and backstorage which stores data. In other words, SCST devices are just redirection points passing incoming requests to Linux files and devices. SCST devices are the same as NFS exported directories. Do you believe, that struct device must be created for each NFS export? In the cases when we need to present SCST devices as Linux devices we use scst_local driver [1], which performs it exactly as you requesting, i.e. doing: root_device_register() bus_register() driver_register() What can we do to make you believe it? Shall we write a document describing all SCST objects and their relationship between each ...
Note that the entities in /sys/devices/... tree and not necessarily physical devices bit rather interface abstractionss. Consider, for example, /sys/class/input/*. None of the "devices" there talk directly to hardware, do DMA or other things. Some of them don't even talk to usrespace directly but rather through additional interfaces (evdev. mousedev, ect). Still they are represented there and even have suspend and resume methods (because even for logical devices it makes sense to Maybe not all of them are. Some of them could probably be represented by attributes of other devices. And some of them are, fitting into the overall /sys/devices hierarchy and describing physical and logical relations between them. -- Dmitry --
But all of them still place from where to events received and where requests from Linux sent, aren't them? SCST devices are not even logical devices. As I wrote, "devices" word is misleading. SCST devices are converse of what Linux means under this word. SCST devices are like NFS exports: a place where those events generated and those requests received. Think of SCST device as if it sits on the opposite side of the PCI bus of the corresponding SCSI device Linux sees in /sys/class and /sys/bus. So, if we need Linux devices for SCST devices, we create them using scst_local driver. And then, of course, all them have their place in /sys/class/ and /sys/bus. Although more common use of them from remote systems via iSCSI, Fibre Channel, InfiniBand, etc. The remote systems create devices in /sys/class/ and /sys/bus (if they are Linux), we serve them. Vlad --
No, just /sys/bus/ which will cause them to become part of the big device tree in /sys/devices/ Again, use this interface, it is what it is there for, to not use it is just wrong. good luck, greg k-h --
USB gadgets are a good example, but not quite. I'm objecting not the possibility to implement SCST objects as devices. We are creating software, so all what hardware allows can be implemented. I'm arguing usage of devices view for something which fundamentally not devices. USB gadgets are subset of what SCST allows. On the external side they are tightly coupled to hardware, on the backend side they don't need any management interface, because (at least for storage) all devices configuration they have is static via module parameters. The only attributes file_storage has are to change FUA and file path, which can be placed anywhere, including /sys/module/g_file_storage/parameters. So, for USB gadgets the config interface and place in /sys doesn't matter much. I guess, making its LUNs as Linux devices were just part of the ritual to get accepted into the kernel. On the USB gadgets developers place I wouldn't mind too to go this way. But SCST is a _big_, _complete_, _new_ Linux subsystem. Actually, USB storage gadgets should be SCST targets to use full power of all possible virtual and real backends SCST provides, hence stay inside SCST subtree. For SCST it is very important that all its functionality concentrated in one place and not confused with other client side devices Linux has. Important for clearness, easy to use, easy to understand, flexibility, maintainability, etc. It's like as you created rules to keep and train dogs. Then you need to keep and train cats as well. Would it be wise to keep cats in the same place together with dogs and train them the same tricks using the same If you have such wide devices definition, why file systems have separate /sys/fs namespace? Or ksm place in /sys/kernel/mm/ksm? Or hugepages in /sys/kernel/mm/hugepages? If NFS exports are devices, file systems must also be devices, correct? Like /sys/fs/ext4/sda11. Isn't it a full analogy to NFS export Well, if I don't agree with you it doesn't mean I'm not listening to you. ...
That doesn't matter. They are still "devices" that the kernel knows It's the duty of a subsystem's maintainer to enforce the correct model of the kernel, and that is what I am doing. Again, this is the last email I'm writing on this topic, as none of the previous ones seem to be sinking in. good luck, greg k-h --
How about using 'struct device' as follows ? * Move /sys/kernel/scst_tgt/targets to /sys/class/target_driver. * Move /sys/kernel/scst_tgt/handlers to /sys/class/device_driver. * Move /sys/kernel/scst_tgt/devices to /sys/class/target_device. * Move the attributes of /sys/kernel/scst_tgt to /sys/devices/scst. * Move /sys/kernel/scst_tgt/sgv to /sys/devices/scst/sgv Some quickly hacked up code (that needs further polishing) that implements this scheme can be found here: https://scst.svn.sourceforge.net/svnroot/scst/branches/sysfs-tree-changes. An example of an SCST configuration file and the corresponding sysfs hierarchy: # uname -r 2.6.37-rc5-scst+ # cat /etc/scst.conf HANDLER vdisk_blockio { DEVICE disk01 { filename /dev/ram0 nv_cache 1 } DEVICE disk02 { filename /dev/ram1 nv_cache 1 } } HANDLER vdisk_fileio { DEVICE disk03 { filename /dev/vdisk nv_cache 1 } DEVICE disk04 { filename /dev/vdisk nv_cache 1 } } HANDLER vdisk_nullio { DEVICE disk05 DEVICE disk06 } TARGET_DRIVER scst_local { TARGET local { session_name local LUN 0 disk01 LUN 1 disk02 LUN 2 disk03 LUN 3 disk04 LUN 4 disk05 LUN 5 disk06 } } TARGET_DRIVER ib_srpt { TARGET ib_srpt_target_0 { rel_tgt_id 1 enabled 1 LUN 0 disk01 LUN 1 disk02 LUN 2 disk03 LUN 3 disk04 LUN 4 disk05 LUN 5 disk06 } TARGET ib_srpt_target_1 { rel_tgt_id 2 enabled 1 LUN 0 disk01 LUN 1 disk02 LUN 2 disk03 LUN 3 disk04 LUN 4 disk05 LUN 5 disk06 } } TARGET_DRIVER iscsi { enabled 1 TARGET iqn.2005-03.org.open-iscsi:dbc01e1792b:storage { rel_tgt_id 4 LUN 0 disk01 LUN 1 disk02 LUN 2 disk03 LUN 3 disk04 LUN 4 disk05 LUN 5 disk06 enabled 1 } } # find /sys/{class,devices/virtual}/{target_driver,device_driver,target_device} /sys/devices/scst | while read f; do if [ -h $f ]; then echo "$f -> $(readlink $f)"; else echo "$f"; fi; ...
Please never create a new class, use 'struct bus_type' instead and create a bus and have drivers and devices on it. thanks, greg k-h --
How about the hierarchy illustrated below, using the two new bus types scsi_tgt_dev and scsi_tgt_port ? Note: this time I neither have loaded the scst_local kernel module nor iscsi-scst in order to keep the output short. # find /sys/bus/scsi_tgt_* /sys/devices/scst | while read f; do if [ -h $f ]; then echo "$f -> $(readlink $f)"; else echo $f; fi; done /sys/bus/scsi_tgt_dev /sys/bus/scsi_tgt_dev/uevent /sys/bus/scsi_tgt_dev/devices /sys/bus/scsi_tgt_dev/devices/2:0:0:0 -> ../../../devices/2:0:0:0 /sys/bus/scsi_tgt_dev/devices/2:0:1:0 -> ../../../devices/2:0:1:0 /sys/bus/scsi_tgt_dev/devices/3:0:0:0 -> ../../../devices/3:0:0:0 /sys/bus/scsi_tgt_dev/devices/disk01 -> ../../../devices/disk01 /sys/bus/scsi_tgt_dev/devices/disk02 -> ../../../devices/disk02 /sys/bus/scsi_tgt_dev/devices/disk03 -> ../../../devices/disk03 /sys/bus/scsi_tgt_dev/devices/disk04 -> ../../../devices/disk04 /sys/bus/scsi_tgt_dev/devices/disk05 -> ../../../devices/disk05 /sys/bus/scsi_tgt_dev/devices/disk06 -> ../../../devices/disk06 /sys/bus/scsi_tgt_dev/drivers /sys/bus/scsi_tgt_dev/drivers/dev_disk /sys/bus/scsi_tgt_dev/drivers/dev_disk/module -> ../../../../module/scst_disk /sys/bus/scsi_tgt_dev/drivers/dev_disk/uevent /sys/bus/scsi_tgt_dev/drivers/dev_disk/type /sys/bus/scsi_tgt_dev/drivers/dev_disk/trace_level /sys/bus/scsi_tgt_dev/drivers/dev_disk_perf /sys/bus/scsi_tgt_dev/drivers/dev_disk_perf/module -> ../../../../module/scst_disk /sys/bus/scsi_tgt_dev/drivers/dev_disk_perf/uevent /sys/bus/scsi_tgt_dev/drivers/dev_disk_perf/type /sys/bus/scsi_tgt_dev/drivers/dev_disk_perf/trace_level /sys/bus/scsi_tgt_dev/drivers/vdisk_fileio /sys/bus/scsi_tgt_dev/drivers/vdisk_fileio/module -> ../../../../module/scst_vdisk /sys/bus/scsi_tgt_dev/drivers/vdisk_fileio/uevent /sys/bus/scsi_tgt_dev/drivers/vdisk_fileio/type /sys/bus/scsi_tgt_dev/drivers/vdisk_fileio/add_device_parameters /sys/bus/scsi_tgt_dev/drivers/vdisk_fileio/trace_level /sys/bus/scsi_tgt_dev/drivers/vdisk_fileio/disk01 -> ...
Hello Greg, As you can see in recent messages (e.g. http://lkml.org/lkml/2010/11/18/578 or http://lkml.org/lkml/2010/11/15/296), the abstractions represented in SCST are: * Target templates (scst_tgt_template), e.g. scst_local or ib_srpt. These are drivers that implement a storage protocol. * Target ports (scst_tgt), e.g. ib_srpt_target_0, which represent a communication interface controlled by a storage protocol driver. * Sessions (scst_session), e.g. 0x00000000000000000002c9030005f34b. A session corresponds to a single initiator-target nexus. * Device handlers (scst_dev_type), e.g. scst_disk or scst_vdisk, which are drivers that allow SCST to export storage. * Target devices (scst_device), e.g. disk01, which represent exported storage. Each instance is controlled by a single device handler. This concept includes e.g. block devices and files. Some but not all target devices have a corresponding device node in /dev. * Access control groups (scst_acg), which allow e.g. to implement LUN masking. * Device-specific information such as the SCSI LUN (scst_acg_dev). As we all know the driver, bus and device abstractions were invented to model how peripheral devices are connected to a system. What a storage target does is the converse - define devices and make it possible for other systems to use these devices. As a result, unfortunately, the driver, bus and device abstractions do not map one-to-one on all of the concepts used in a storage target. So I'm still not sure whether it is a good idea to use the driver, bus and or/device concepts for all of the above concepts. Also, using the driver, bus or device concept for one or more of the storage target concepts would open up a potential for naming conflicts. There is already e.g. the well-known null device. SCST e.g. defines a vdisk_nullio target template. Having these all as device nodes under /dev might not only confuse users but also creates a huge potential for naming conflicts. That's why we prefer separate namespaces ...
As I explained in the previous e-mail, I believe, SYSFS perfectly suits SCST and SCST perfectly suits SYSFS. If you think it isn't so, let's discuss each showstopper for that, one Looks like a great idea! Thanks, Vlad --
Hmm, frankly, I decided that you agreed with my arguments.. As I wrote, I'm willing to make any changes you requests. I only asked why this should be done. I really don't understand why we and other similar in-kernel developers should treat kobjects in the different way than any other subobjects of our outer objects and make for them _additional code_ to specially treat them as life-time center (http://lkml.org/lkml/2010/11/10/421)? You have not explained it anywhere in any doc I can find. This is just small "why" question. Greg, don't we have a right to ask It is REALLY frustrating you are refusing to explain why. I guess, I'm too stupid to figure out that alone. Don't you want we rise as highly skilled kernel developers? I believe, not only SCST developers are very interested to know background behind particular moves in the kernel. Thanks, Vlad --
We might have missed something, but as far as we know it has not yet been explained in this thread why using 'struct device' would be an advantage over using 'struct kobject'. All I can see are the disadvantages of such a transition: instead of having a single hierarchy that represents all SCST-related information, there would be multiple, and the hierarchical relationship between objects would be lost. Also, during startup, once all SCST-related kernel modules have been loaded, configuration happens by writing values to individual sysfs variables. There is a user-space tool included with SCST that not only can restore a configuration from a file but also can save an existing configuration to file. I'm afraid that saving an existing configuration would be made considerably more difficult by transforming the single SCST sysfs-tree into multiple. Below you can find an example of a sysfs-tree created by SCST: # (cd /sys/kernel/scst_tgt && find | cut ...
It's very simple. You want your device to show up in the global device tree in the kernel, not off to one side, unconnected to anything else. Please use 'struct device', it is what you want to do here. good luck, greg k-h --
But we don't have any device to show up in the global device tree! We don't have any devices in the struct device's understanding at all! Vlad --
Then create them just like you are doing so for your kobject use. The first device would be the root one, and then everything trickles down from there. And use configfs for your configuration stuff, that's what it is there for, and if it doesn't somehow work properly for you, please work with the configfs developers to fix that up. thanks, greg k-h --
I don't have any opinion on the above, but I don't see why sysfs can't be used for configuration as well as its other roles. It seems to me wasteful to require configfs to be used in order to change configuration when sysfs works fine for this. Here are a couple of existing examples where sysfs is used in a role that would seem similar to SCST's usage: 1) scsi_transport_fc already has an sysfs file, fc_host/vport_create, which can be written to create new fc_host and scsi_host instances. 2) fcoe.ko uses a write to the sysfs file /sys/module/fcoe/parameters/create to start the protocol on a particular ethernet interface. There's another file for destroy. I'll bet there are other examples in other subsystems, and I don't think there is anything wrong with the above usages of sysfs. I agree with Vladislav's point that configfs doesn't work instead of sysfs because configs doesn't make it easy for the kernel side to create nodes for dynamic information like connected initiators and statistics. So, if the above examples are considered a misuse of sysfs, SCST would need to use both sysfs and configfs. It would use configfs to do configuration, and sysfs to display and access the dynamic information like connected initiators. That seems a minor role for configfs, which is easily handled in sysfs as SCST currently does. Just my two cents. Joe --
I disagree that conversion of 'demo mode' struct se_node_acls to explict userspace syscall driven configfs registered struct se_node_acl->acl_group is overly complex, or otherwise difficult in context of a configfs consumer for a transition demo mode -> explict NodeACL via: mkdir -p /sys/kernel/config/target/$FABRIC/$WWN/$TPGT/acls/$INITIATOR_WWPN. Note the current v4.0 code supports this for LIO-Target and TCM/FC fabric module code, and will just work 'out of the box' for v4.0 compatible fabric modules using target_core_fabric_configfs.c logic. As for statistics point, this is also currently handled by TCM in struct se_node_acl in target_core_mib.c:scsi_auth_intr_seq_show() for both explict NodeACL and TPG context 'demo-mode' generated struct se_node_acl I think the question is more along the lines of: does sysfs have a hard requirement for target mode using native TCM/ConfigFS logic.?. Early in the development of TCM/LIO v3.0 I thought that driving creation of configfs struct_groups from kernel code was in fact useful for my own development and for future mainline code, but with the help of foresight of jlbec, gregkh and others with our modern v4.0 code I am no longer convinced we need any direct kernel level interaction between native v4.0 target mode configfs code and existing sysfs code. So far using interpreted userspace python/shell code has served this purpose really quite well in terms of interaction between target mode configfs and existing sysfs design, and I do not currently see any inherient limitations from a userspace code driven persective for a full native configfs target implementation. I would be happy to be proven wrong on this point, but so far the original decision to drive target mode completely from userspace with native configfs code has in fact proven to be a better decision than a hybrid kernel level configfs + sysfs implementation from the perspective of the userspace developer driving /sys/kernel/config/target/* code. So on this ...
On Mon, 15 Nov 2010 21:04:19 -0800 Well, I'm not involved here.. but to me it makes sense. It's just a useful principle in software design. You don't take something that is designed for one purpose and mis-use it. It will grow warts and mutate, stealing flexibility. it's one of the reasons many software projects become unmaintainable... nobody can change anything anymore because the Flo --
Thank you, Joe, for those good examples. I may not know for what SYSFS was _originally_ designed, but the current reality is that SYSFS is _widely_ used to both represent the kernel internal configuration as well as to change it. And SCST is just following that. After all, if SYSFS supposed to only represent the kernel internal configuration, why wasn't it from the beginning made read-only? Vlad --
Hello Greg, The reason why we use sysfs instead of configfs is that we do not only want to export kernel objects to user space but because we also want to allow configuration from user space. As far as I can see configfs has been designed to allow configuration from user space only and not for exporting kernel objects. A quote from Documentation/filesystems/configfs/configfs.txt: 12 [What is configfs?] 13 14 configfs is a ram-based filesystem that provides the converse of 15 sysfs's functionality. Where sysfs is a filesystem-based view of 16 kernel objects, configfs is a filesystem-based manager of kernel 17 objects, or config_items. Bart. --
Why? Greg, sorry, you keep writing as if we all are idiots, but keep refusing to explain us, idiots, why we can't do what we did. It isn't very Sorry, I can't understand what you mean. What would be the purpose of this configuration and what benefits it would bring? Anyway, let's forget about SCSI. Do you believe that struct device should be created for all IP addresses on which an NFS server listen and I explained why it isn't a too good option in another e-mail. Thanks, Vlad --
Hmm, I thought we had the approvement in the "[RFC]: SCST sysfs layout" thread (http://osdir.com/ml/linux-kernel/2009-04/msg07822.html). Particularly, in the message http://osdir.com/ml/linux-kernel/2009-04/msg08557.html. But, looks like it isn't so and I should have asked you, the SYSFS maintainer. Sorry. Could you consider it, please? I'll try to explain. It's a bit long story involving deep SCSI target specific knowledge, but I'll try to make it as simple and short as possible. SCST is a SCSI _target_ side subsystem, i.e. it is a _server_ side sybsystem. It exports SCSI-speaking devices, not using them. You can consider it as an NFS server. What usually meant by "SCSI subsystem" is SCSI _initiator_ subsystem, i.e. client side subsystem (like NFS client), which is using SCSI-speaking devices by sending SCSI commands to them. Any SCSI-speaking protocol can be used with SCST: parallel (wide) SCSI, Fibre Channel, iSCSI, SAS, SRP, iSER, etc. (Also, non-SCSI speaking protocols, like AoE and NBD can be used, but that's another story.) Strictly as required by SCSI Architecture Model (SAM [1]), SCST doesn't deal with hardware devices. The closest to hardware things SCST deals with are SCSI target ports and SCSI target devices. SCSI target port is an abstract concept reflecting path through which SCST exports devices. You can consider it as an IP network (network, not interface!) through which an NFS server's exports can be used. For instance, for iSCSI such ports are iSCSI targets. For Fibre Channel - virtual or hardware Fibre Channel ports. SCSI target device is an abstract concept, which provides a way to reach real storage (files, block devices, etc.) and contains internal state information (reservations, configuration parameters, etc.). You can consider it as an NFS export. Please don't confuse it with SCSI _initiator_ device, which is a place to generate SCSI commands and send them via one or more SCSI initiator ports (MPIO). On the target side they will be ...
The same rules as with files - the object disappears from the "directories" so no new users can get it but is not destroyed till last You delete the object from the list, then mark it as dead, notify users, drop refcount. No new users will get it (as it is not on the list anymore) and existing ones should notice that it is dead and stop using Right, Tejun plugged this particular (and very annoying) attributes behavior, but that does not mean that this is the only way kobject's reference might be pinned. -- Dmitry --
Those are good in theory, but on practice, you know, devil is in the This behavior isn't annoying, it's GREAT, because it allows to use SYSFS Could you be more specific and point out on exact ways for that? From my quite deep SYSFS source code study I see such cases should not exist. Thanks, Vlad --
Right, I mean that _before_ Tejun plugged that hole the behavior _was_ While I do not know offhand I am sure there are such scenarios. Isn't there any way for the users that you are waiting on descend back into your module that is waiting for kobject removal and get stuck on some resource? Even if it isn't possible now the scheme is quite fragile. Kobjects are refcounted so work with them appropriately (rely on refcount, do not wait, etc). -- Dmitry --
No, I don't see any, because SYSFS implements atomic "all or nothing" The same is true for other SCST objects. For instance, a target can't be destroyed until there are commands from it being processed. So, kobjects are only one of the objects we wait for all their ref counters reach zero, hence addition of kobjects to SCST objects changed nothing in this area. Vlad --
This patch contains SCST SGV cache. SCST SGV cache is a memory management subsystem in SCST. One can call it a "memory pool", but Linux kernel already have a mempool interface, which serves different purposes. SGV cache provides to SCST core, target drivers and backend dev handlers facilities to allocate, build and cache SG vectors for data buffers. The main advantage of it is the caching facility, when it doesn't free to the system each vector, which is not used anymore, but keeps it for a while (possibly indefinitely) to let it be reused by the next consecutive command. This allows to: - Reduce commands processing latencies and, hence, improve performance; - Make commands processing latencies predictable, which is essential for RT applications. The freed SG vectors are kept by the SGV cache either for some (possibly indefinite) time, or, optionally, until the system needs more memory and asks to free some using the set_shrinker() interface. Also the SGV cache allows to: - Cluster pages together. "Cluster" means merging adjacent pages in a single SG entry. It allows to have less SG entries in the resulting SG vector, hence improve performance handling it as well as allow to work with bigger buffers on hardware with limited SG capabilities. - Set custom page allocator functions. For instance, scst_user device handler uses this facility to eliminate unneeded mapping/unmapping of user space pages and avoid unneeded IOCTL calls for buffers allocations. In fileio_tgt application, which uses a regular malloc() function to allocate data buffers, this facility allows ~30% less CPU load and considerable performance increase. - Prevent each initiator or all initiators altogether to allocate too much memory and DoS the target. Consider 10 initiators, which can have access to 10 devices each. Any of them can queue up to 64 commands, each can transfer up to 1MB of data. So, all of them in a peak can allocate up to 10*10*64 = ~6.5GB of memory for data buffers. This ...
This patch contains SCST core's docs. Signed-off-by: Vladislav Bolkhovitin <vst@vlnb.net> --- README.scst | 1473 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ SysfsRules | 942 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 2415 insertions(+) diff -uprN orig/linux-2.6.35/Documentation/scst/README.scst linux-2.6.35/Documentation/scst/README.scst --- orig/linux-2.6.35/Documentation/scst/README.scst +++ linux-2.6.35/Documentation/scst/README.scst @@ -0,0 +1,1473 @@ +Generic SCSI target mid-level for Linux (SCST) +============================================== + +SCST is designed to provide unified, consistent interface between SCSI +target drivers and Linux kernel and simplify target drivers development +as much as possible. Detail description of SCST's features and internals +could be found on its Internet page http://scst.sourceforge.net. + +SCST supports the following I/O modes: + + * Pass-through mode with one to many relationship, i.e. when multiple + initiators can connect to the exported pass-through devices, for + the following SCSI devices types: disks (type 0), tapes (type 1), + processors (type 3), CDROMs (type 5), MO disks (type 7), medium + changers (type 8) and RAID controllers (type 0xC). + + * FILEIO mode, which allows to use files on file systems or block + devices as virtual remotely available SCSI disks or CDROMs with + benefits of the Linux page cache. + + * BLOCKIO mode, which performs direct block IO with a block device, + bypassing page-cache for all operations. This mode works ideally with + high-end storage HBAs and for applications that either do not need + caching between application and disk or need the large block + throughput. + + * "Performance" device handlers, which provide in pseudo pass-through + mode a way for direct performance measurements without overhead of + actual data transferring from/to underlying SCSI device. + +In addition, SCST supports advanced per-initiator access and ...
This patch contains SCST dev handlers' Makefile. Signed-off-by: Vladislav Bolkhovitin <vst@vlnb.net> --- Makefile | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff -uprN orig/linux-2.6.35/drivers/scst/dev_handlers/Makefile linux-2.6.35/drivers/scst/dev_handlers/Makefile --- orig/linux-2.6.35/drivers/scst/dev_handlers/Makefile +++ linux-2.6.35/drivers/scst/dev_handlers/Makefile @@ -0,0 +1,13 @@ +ccflags-y += -Wno-unused-parameter + +obj-m := scst_cdrom.o scst_changer.o scst_disk.o scst_modisk.o scst_tape.o \ + scst_vdisk.o scst_raid.o scst_processor.o + +obj-$(CONFIG_SCST_DISK) += scst_disk.o +obj-$(CONFIG_SCST_TAPE) += scst_tape.o +obj-$(CONFIG_SCST_CDROM) += scst_cdrom.o +obj-$(CONFIG_SCST_MODISK) += scst_modisk.o +obj-$(CONFIG_SCST_CHANGER) += scst_changer.o +obj-$(CONFIG_SCST_RAID) += scst_raid.o +obj-$(CONFIG_SCST_PROCESSOR) += scst_processor.o +obj-$(CONFIG_SCST_VDISK) += scst_vdisk.o --
This patch contains SCST vdisk dev handler. This dev handler allows to create virtual disks and CDROMs from files on file system. It supports the following modes: - FILEIO mode, which allows to use files on file systems or block devices as virtual remotely available SCSI disks or CDROMs with benefits of the Linux page cache. - BLOCKIO mode, which performs direct block IO with a block device, bypassing page-cache for all operations. This mode works ideally with high-end storage HBAs and for applications that either do not need caching between application and disk or need the large block throughput. - NULLIO mode, in which all the commands immediately completed. This mode is intended for performance measurements without overhead of actual data transfers from/to underlying SCSI device. Signed-off-by: Vladislav Bolkhovitin <vst@vlnb.net> --- scst_vdisk.c | 4186 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 4186 insertions(+) diff -uprN orig/linux-2.6.35/drivers/scst/dev_handlers/scst_vdisk.c linux-2.6.35/drivers/scst/dev_handlers/scst_vdisk.c --- orig/linux-2.6.35/drivers/scst/dev_handlers/scst_vdisk.c +++ linux-2.6.35/drivers/scst/dev_handlers/scst_vdisk.c @@ -0,0 +1,4186 @@ +/* + * scst_vdisk.c + * + * Copyright (C) 2004 - 2010 Vladislav Bolkhovitin <vst@vlnb.net> + * Copyright (C) 2004 - 2005 Leonid Stoljar + * Copyright (C) 2007 Ming Zhang <blackmagic02881 at gmail dot com> + * Copyright (C) 2007 Ross Walker <rswwalker at hotmail dot com> + * Copyright (C) 2007 - 2010 ID7 Ltd. + * + * SCSI disk (type 0) and CDROM (type 5) dev handler using files + * on file systems or block devices (VDISK) + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation, version 2 + * of the License. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ...
This patch contains SCST pass-through dev handlers. Those handlers allows to export local SCSI-capable devices to remote initiators. 1:many relationship with remote initiators is supported, i.e. many initiators can connect to single exported local SCSI-capable device at the same time. It is possible, because SCST core emulates necessary functionality of SCSI host adapter. Such emulation is needed, because from remote initiators' point of view a SCSI target acts as a SCSI host with its own devices. You can find more deep elaboration why it is needed in http://www.mail-archive.com/linux-scsi@vger.kernel.org/msg06911.html. Some of the emulated functions are: * Generation of necessary UNIT ATTENTIONs, their storage and delivery to all connected remote initiators. * RESERVE/RELEASE functionality. * All types of RESETs and other task management functions. * REPORT LUNS command as well as SCSI address space management in order to have consistent address space on all remote initiators, since local SCSI devices could not know about each other to report via REPORT LUNS command. Signed-off-by: Vladislav Bolkhovitin <vst@vlnb.net> --- scst_cdrom.c | 245 ++++++++++++++++++++ scst_changer.c | 167 +++++++++++++ scst_dev_handler.h | 27 ++ scst_disk.c | 647 +++++++++++++++++++++++++++++++++++++++++++++++++++++ scst_modisk.c | 328 ++++++++++++++++++++++++++ scst_processor.c | 167 +++++++++++++ scst_raid.c | 168 +++++++++++++ scst_tape.c | 361 +++++++++++++++++++++++++++++ 8 files changed, 2110 insertions(+) diff -uprN orig/linux-2.6.35/drivers/scst/dev_handlers/scst_cdrom.c linux-2.6.35/drivers/scst/dev_handlers/scst_cdrom.c --- orig/linux-2.6.35/drivers/scst/dev_handlers/scst_cdrom.c +++ linux-2.6.35/drivers/scst/dev_handlers/scst_cdrom.c @@ -0,0 +1,245 @@ +/* + * scst_cdrom.c + * + * Copyright (C) 2004 - 2010 Vladislav Bolkhovitin <vst@vlnb.net> + * Copyright (C) 2004 - 2005 Leonid Stoljar + * ...
This patch implements function blk_rq_map_kern_sg(), which allows to map a kernel-originated SG vector to a block request. It is necessary to execute SCSI commands with from the kernel going SG buffer. SCST needs this functionality, because its target drivers, which are, basically, SCSI drivers, can deal only with SGs, not with BIOs. Highlight of this implementations: - It uses BIOs chaining instead of kmalloc()'ing the whole bio, which is more reliable. - It uses SGs chaining instead of kmalloc()'ing one big SG in case if direct mapping failed (e.g. because of DMA alignment or padding), which is also more reliable. - When needed, copy_page() is used instead of memcpy() to copy the whole pages faster. Also this patch adds and exports function sg_copy(), which copies one SG vector to another. This patch has already been reviewed in LKML/linux-scsi in http://lkml.org/lkml/2009/8/12/304. Since then it was only been updated for 2.6.35. All this functionality can be a part of the internal SCST library, but looks like a better place will be in the common place near similar routines. Signed-off-by: Vladislav Bolkhovitin <vst@vlnb.net> --- block/blk-map.c | 334 ++++++++++++++++++++++++++++++++++++++++++++ include/linux/blkdev.h | 3 include/linux/scatterlist.h | 5 lib/scatterlist.c | 129 ++++++++++++++++ 4 files changed, 471 insertions(+) diff -upkr linux-2.6.35/block/blk-map.c linux-2.6.35/block/blk-map.c --- linux-2.6.35/block/blk-map.c 2010-05-17 01:17:36.000000000 +0400 +++ linux-2.6.35/block/blk-map.c 2010-05-24 15:19:49.000000000 +0400 @@ -5,6 +5,8 @@ #include <linux/module.h> #include <linux/bio.h> #include <linux/blkdev.h> +#include <linux/scatterlist.h> +#include <linux/slab.h> #include <scsi/sg.h> /* for struct sg_iovec */ #include "blk.h" @@ -271,6 +273,338 @@ int blk_rq_unmap_user(struct bio *bio) } EXPORT_SYMBOL(blk_rq_unmap_user); +struct blk_kern_sg_work { + atomic_t ...
This patchset contains scst_local target driver with documentation. This driver allows to access devices that are exported via SCST directly on the same Linux system that they are exported from. Some highlights: 1. Fully zero-copy implementation with in-kernel dev handlers 2. Support for BIDI commands and long CDBs 3. Easy way to locate sg/bsg devices (LUNs) for each scst_local's session (Scsi_Host) 4. Allows to create several targets with several sessions (Scsi_Hosts) each 5. Allows for each target to set transport type (SCSI and physical transport version descriptors) 6. Allows for each session set TransportID, used, e.g., to identify the 'I' part of I_T nexus in Persistent Reservation commands (3) means that each scst_local's session's SYSFS entry contains a link to the corresponding SCSI host. Using it one can find local sg/bsg/sd/etc. devices of this session. For instance, this links points out to host12, so you can find your sg devices by: $ lsscsi -g|grep "\[12:" [12:0:0:0] disk SCST_FIO rd1 200 /dev/sdc /dev/sg2 [12:0:0:1] disk SCST_FIO nullio 200 /dev/sdd /dev/sg3 They are /dev/sg2 and /dev/sg3. (4)-(6) allows creation of full features multi-transport/multi-initiator (I_T nexus) SCST target drivers in user space, which can share devices with in-kernel target drivers, i.e. all the reservations, including Persistent Reservations, and other similar per-I_T nexus functionality will be done correctly. The recommended workflow for a user space target driver: 1. For each SCSI target a user space target driver should create an scst_local's target using "add_target" SYSFS command. 2. Then the user space target driver should, if needed, set its SCSI and physical transport version descriptors (by default it is SAS) using SYSFS attributes scsi_transport_version and phys_transport_version correspondingly in /sys/kernel/scst_tgt/targets/scst_local/target_name directory. 3. For incoming session ...
This patch contains SCST InfiniBand SRP target driver. This driver works directly on top of InfiniBand stack and SCST. It is a high performance driver capable of handling 600K+ 4K random write IOPS by a single target as well as 2.5+GB/s sequential throughput from a single QDR IB port. It was originally developed by Vu Pham/Mellanox, currently Bart Van Assche is maintaining and improving it. Signed-off-by: Vu Pham <vu@mellanox.com> Signed-off-by: Bart Van Assche <bart.vanassche@gmail.com> Signed-off-by: Vladislav Bolkhovitin <vst@vlnb.net> --- Documentation/scst/README.srpt | 112 + drivers/scst/srpt/Kconfig | 12 drivers/scst/srpt/Makefile | 1 drivers/scst/srpt/ib_dm_mad.h | 139 + drivers/scst/srpt/ib_srpt.c | 3809 +++++++++++++++++++++++++++++++++++++++++ drivers/scst/srpt/ib_srpt.h | 370 +++ 6 files changed, 4443 insertions(+) diff -uprN orig/linux-2.6.35/drivers/scst/srpt/Kconfig linux-2.6.35/drivers/scst/srpt/Kconfig --- orig/linux-2.6.35/drivers/scst/srpt/Kconfig +++ linux-2.6.35/drivers/scst/srpt/Kconfig @@ -0,0 +1,12 @@ +config SCST_SRPT + tristate "InfiniBand SCSI RDMA Protocol target support" + depends on INFINIBAND && SCST + ---help--- + + Support for the SCSI RDMA Protocol (SRP) Target driver. The + SRP protocol is a protocol that allows an initiator to access + a block storage device on another host (target) over a network + that supports the RDMA protocol. Currently the RDMA protocol is + supported by InfiniBand and by iWarp network hardware. More + information about the SRP protocol can be found on the website + of the INCITS T10 technical committee (http://www.t10.org/). diff -uprN orig/linux-2.6.35/drivers/scst/srpt/Makefile linux-2.6.35/drivers/scst/srpt/Makefile --- orig/linux-2.6.35/drivers/scst/srpt/Makefile +++ linux-2.6.35/drivers/scst/srpt/Makefile @@ -0,0 +1,1 @@ +obj-$(CONFIG_SCST_SRPT) += ib_srpt.o diff -uprN orig/linux-2.6.35/drivers/scst/srpt/ib_dm_mad.h ...
The current ibmvstgt and libsrp kernel modules are based on the tgt infrastructure. Both modules need the scsi_tgt kernel module and the tgtd user space process in order to function properly. This patch modifies the ibmvstgt and libsrp kernel modules such that both use the SCST storage target framework instead of tgt. This patch introduces one backwards-incompatible change, namely that the path of the ibmvstgt sysfs attributes is modified. This change is unavoidable because this patch dissociates ibmvstgt SRP sessions from a SCSI host instance. Notes: - ibmvstgt is the only user of libsrp. - A 2.6.35 kernel tree with this patch applied does compile cleanly on the systems supported by the ibmvstgt kernel module, the patch itself is checkpatch clean and does not introduce any new sparse warnings. This patch has not been tested in any other way however. The primary purpose of this patch is to invite feedback about the chosen approach. <IMPORTANT> We are looking for hardware to complete this driver. Any help will be greatly appreciated! </IMPORTANT> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Vladislav Bolkhovitin <vst@vlnb.net> --- Documentation/powerpc/ibmvstgt.txt | 2 drivers/scsi/ibmvscsi/ibmvstgt.c | 609 +++++++++++++++++++++++++------------ drivers/scsi/libsrp.c | 87 ++--- include/scsi/libsrp.h | 16 4 files changed, 474 insertions(+), 240 deletions(-) --- orig/linux-2.6.35/drivers/scsi/ibmvscsi/ibmvstgt.c 16:47:55.220115813 +0400 +++ linux-2.6.35/drivers/scsi/ibmvscsi/ibmvstgt.c 15:50:36.616855875 +0400 @@ -5,6 +5,7 @@ * Linda Xie (lxie@us.ibm.com) IBM Corp. * * Copyright (C) 2005-2006 FUJITA Tomonori <tomof@acm.org> + * Copyright (C) 2010 Bart Van Assche <bvanassche@acm.org> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -24,15 +25,13 @@ #include <linux/interrupt.h> #include ...
Because of the conversion of the ibmvstgt driver from tgt to SCST, and because the ibmvstgt driver was the only user of scsi_tgt, the scsi_tgt kernel module, the CONFIG_SCSI_TGT, CONFIG_SCSI_SRP_TGT_ATTRS and CONFIG_SCSI_FC_TGT_ATTRS kbuild variable, the scsi_host_template member variables transfer_response, supportedmode and active_mode and the constants MODE_UNKNOWN, MODE_INITIATOR and MODE_TARGET are no longer needed. Note: this patch applies cleanly on a 2.6.35 kernel tree. The patch tool however complains about the defconfig changes when trying to apply this patch on a 2.6.36 kernel tree. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Vladislav Bolkhovitin <vst@vlnb.net> --- arch/arm/configs/at572d940hfek_defconfig | 1 arch/arm/configs/cam60_defconfig | 1 arch/arm/configs/s3c2410_defconfig | 1 arch/m68k/configs/amiga_defconfig | 2 arch/m68k/configs/apollo_defconfig | 2 arch/m68k/configs/atari_defconfig | 2 arch/m68k/configs/bvme6000_defconfig | 2 arch/m68k/configs/hp300_defconfig | 2 arch/m68k/configs/mac_defconfig | 2 arch/m68k/configs/multi_defconfig | 2 arch/m68k/configs/mvme147_defconfig | 2 arch/m68k/configs/mvme16x_defconfig | 2 arch/m68k/configs/q40_defconfig | 2 arch/m68k/configs/sun3_defconfig | 2 arch/m68k/configs/sun3x_defconfig | 2 arch/mips/configs/bcm47xx_defconfig | 1 arch/mips/configs/decstation_defconfig | 1 arch/mips/configs/ip22_defconfig | 1 arch/mips/configs/ip27_defconfig | 1 arch/mips/configs/ip32_defconfig | 1 arch/mips/configs/jazz_defconfig | 1 arch/mips/configs/malta_defconfig | 1 ...
Note: as far as I know currently SCST is the first and only in-kernel target storage framework that satisfies both criteria (1) and (2) as defined during the LSF 2010 summit storage track (see also http://lwn.net/Articles/400589/): 1. Being a drop in replacement for STGT (the current in-kernel target mode driver). 2. Using a modern sysfs based control and configuration plane. Regarding criterion (3), that the code was reviewed as clean enough for inclusion: any feedback is welcome. Bart. --
Hi All, Small Tree would very much like to see the SCST framework included in the mainline kernel. It's used in one of the products we've been working on as well as in some products we sell. It seems reliable and the performance seems good. Having it in the mainline kernel would save us a great deal of headache. Steve -- Steve Modica CTO - Small Tree Communications www.small-tree.com phone: 651-209-6509 ext 301 mobile: 651-261-3201 --
