Ok, round 2 of an attempt at making a format agnostic structured output library. The idea being that the command doing the output doesn't have to care what the actual output format is, it uses one set of output functions and the user gets to choose their preferred output style. The current backend/frontend interface probably needs expanding so that a less noddy XML output can be used, but it's a start. Rather than building an in-memory structure and then writing it out I've gone for the approach of writing out immediately. The thought behind this was that I didn't really want to force commands like log to have to wait 'til the end to start outputting information (though I still haven't got around to working on converting log). Probably the biggest change from v1 is an expanded aim. Now the output library is aimed at controlling _all_ plubming output. This series includes a patch for ls-tree that has all it's output going through the library, and a patch for status that has all the --porcelain output going through the library. The XML patch still needs a lot of work, as I've been busy playing with the library API and the NORMAL/ZERO backends ... Julian Phillips (4): output: Add a new library for plumbing output ls-tree: complete conversion to using output library status: use output library for porcelain output output: WIP: Add XML backend Documentation/technical/api-output.txt | 116 ++++++++++++++ Makefile | 5 + builtin/commit.c | 21 +++- builtin/ls-tree.c | 51 ++++-- output-json.c | 127 +++++++++++++++ output-normal.c | 95 +++++++++++ output-xml.c | 68 ++++++++ output-zero.c | 74 +++++++++ output.c | 270 ++++++++++++++++++++++++++++++++ output.h | 93 +++++++++++ wt-status.c ...
All output from ls-tree now goes through the output library - even the
regular output.
Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk>
---
builtin/ls-tree.c | 51 +++++++++++++++++++++++++++++++++------------------
1 files changed, 33 insertions(+), 18 deletions(-)
diff --git a/builtin/ls-tree.c b/builtin/ls-tree.c
index dc86b0d..7e19d19 100644
--- a/builtin/ls-tree.c
+++ b/builtin/ls-tree.c
@@ -10,8 +10,8 @@
#include "quote.h"
#include "builtin.h"
#include "parse-options.h"
+#include "output.h"
-static int line_termination = '\n';
#define LS_RECURSIVE 1
#define LS_TREE_ONLY 2
#define LS_SHOW_TREES 4
@@ -22,6 +22,7 @@ static int ls_options;
static const char **pathspec;
static int chomp_prefix;
static const char *ls_tree_prefix;
+static struct output_context *oc;
static const char * const ls_tree_usage[] = {
"git ls-tree [<options>] <tree-ish> [path...]",
@@ -90,27 +91,32 @@ static int show_tree(const unsigned char *sha1, const char *base, int baselen,
(baselen < chomp_prefix || memcmp(ls_tree_prefix, base, chomp_prefix)))
return 0;
+ output_start_object(oc, "entry");
if (!(ls_options & LS_NAME_ONLY)) {
+ output_strf(oc, "mode", "%06o", mode);
+ output_token(oc, " ");
+ output_str(oc, "type", type);
+ output_token(oc, " ");
+ output_str(oc, "hash", find_unique_abbrev(sha1, abbrev));
if (ls_options & LS_SHOW_SIZE) {
- char size_text[24];
- if (!strcmp(type, blob_type)) {
+ if (strcmp(type, blob_type)) {
+ output_token(oc, " -");
+ } else {
unsigned long size;
+ output_token(oc, " ");
+ output_next_directive(oc, "7");
if (sha1_object_info(sha1, &size) == OBJ_BAD)
- strcpy(size_text, "BAD");
+ output_str(oc, "size", "BAD");
else
- snprintf(size_text, sizeof(size_text),
- "%lu", size);
- } else
- strcpy(size_text, "-");
- printf("%06o %s %s %7s\t", mode, type,
- find_unique_abbrev(sha1, abbrev),
- size_text);
- } ...Use the new output library for output when in porcelain mode.
Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk>
---
builtin/commit.c | 21 +++++++++++-
wt-status.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++---
wt-status.h | 3 +-
3 files changed, 104 insertions(+), 8 deletions(-)
diff --git a/builtin/commit.c b/builtin/commit.c
index c5ab683..4e506b1 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -25,6 +25,7 @@
#include "rerere.h"
#include "unpack-trees.h"
#include "quote.h"
+#include "output.h"
static const char * const builtin_commit_usage[] = {
"git commit [options] [--] <filepattern>...",
@@ -68,6 +69,8 @@ static int all, edit_flag, also, interactive, only, amend, signoff;
static int quiet, verbose, no_verify, allow_empty, dry_run, renew_authorship;
static int no_post_rewrite;
static char *untracked_files_arg, *force_date;
+static char *structured_output_arg;
+static enum output_style output_style;
/*
* The default commit message cleanup mode will remove the lines
* beginning with # (shell comments) and leading and trailing
@@ -137,6 +140,7 @@ static struct option builtin_commit_options[] = {
"show porcelain output format", STATUS_FORMAT_PORCELAIN),
OPT_BOOLEAN('z', "null", &null_termination,
"terminate entries with NUL"),
+ OPT_OUTPUT(0, "output", &structured_output_arg),
OPT_BOOLEAN(0, "amend", &amend, "amend previous commit"),
OPT_BOOLEAN(0, "no-post-rewrite", &no_post_rewrite, "bypass post-rewrite hook"),
{ OPTION_STRING, 'u', "untracked-files", &untracked_files_arg, "mode", "show untracked files, optional modes: all, normal, no. (Default: all)", PARSE_OPT_OPTARG, NULL, (intptr_t)"all" },
@@ -420,7 +424,7 @@ static int run_status(FILE *fp, const char *index_file, const char *prefix, int
wt_shortstatus_print(s, null_termination);
break;
case STATUS_FORMAT_PORCELAIN:
- wt_porcelain_print(s, null_termination);
+ wt_porcelain_print(s, output_style);
break;
...Add a library that allows commands to produce output in any of a range of formats using a single API. The idea being that by running all plumbing command output through this library the output can easily be switched to an alternative output style (e.g. JSON), while still supporting the current output formats. The API includes an OPT_OUTPUT and handle_output_arg so that the option handling for different commands will be as similar as possible. Documentation for the API is included in Documentation/technical/api-output.txt. At the moment the only new output format is JSON. Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk> --- Documentation/technical/api-output.txt | 116 ++++++++++++++ Makefile | 4 + output-json.c | 127 +++++++++++++++ output-normal.c | 95 +++++++++++ output-zero.c | 74 +++++++++ output.c | 266 ++++++++++++++++++++++++++++++++ output.h | 92 +++++++++++ 7 files changed, 774 insertions(+), 0 deletions(-) create mode 100644 Documentation/technical/api-output.txt create mode 100644 output-json.c create mode 100644 output-normal.c create mode 100644 output-zero.c create mode 100644 output.c create mode 100644 output.h diff --git a/Documentation/technical/api-output.txt b/Documentation/technical/api-output.txt new file mode 100644 index 0000000..a811cbe --- /dev/null +++ b/Documentation/technical/api-output.txt @@ -0,0 +1,116 @@ +structured output API +===================== + +The structured output API is provided by output.h and consists of a set of +functions for outputting data in a structured manner in one of a number of +formats (referred to as output styles). + +The output consists of objects, arrays and the actual values, the term item is +used where any of these may be used, and container when either an object or +array may be used. Objects are ...
On Mon, Apr 12, 2010 at 12:21:14AM +0100, Julian Phillips wrote: I'm writing S-Expression output backend as experiment (not yet even sendable as WIP) and hit an issue in general framework... List? Above says types are 'object', 'array' and 'value'. Then it defines Maybe add extra note about these. When one sees output_token used in code outputting stuff, one can get puzzled until one realizes that token output This is AFAIK really inapporiate for canonical S-Expression output. Point of canonical S-Expressions is to have only one way to serialize given tree (bit for bit identicality) and linefeeds are not allowed except as serialization of linefeed in string. Perhaps one could add method/flag to output backend to tell wheither to -Ilari --
On Tue, 13 Apr 2010 12:43:51 +0300, Ilari Liusvaara Yes. I intend to revist the header and documentation, as they were mostly I think that it probably makes sense to have explicit calls into the backend for start and end rather than assuming that wrapping everything in an object is appropriate, an XML backend for example could then use those callbacks to do XML headers and the outmost element tags etc. -- Julian --
Heya, I like where this is going, a lot, especially since we don't have to convert everything in one go, but we can do it as desired, similar to optparsification. I still think more commands than just these two should be converted to validate the design though, perhaps something like 'git blame', or 'git for-each-ref'? -- Cheers, Sverre Rabbelier --
Speaking as a major customer for the new capabilities it will enable, I strongly concur. -- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a> --
I don't think it is needed for either command. 'git blame' has --porcelain and --incremental output, which is line-based and pretty much self-describing (with "header-name value" syntax for most of it), and well documented. JSON output would only add unnecessary chatter and different quoting rules. 'git for-each-ref' has both --format=<format> to allow to get data what one needs, and in the format one wants (with e.g. %00 to reresent NUL), and [--shell|--perl|--python|--tcl] for placeholders in <format> to be quoted as string literals suitable for specified host language. Although I am not sure if this option, meant to produce scriptlets, is used that much/ note that there is not support for --json quoting, nor --xml escaping. -- Jakub Narebski Poland --
Heya, Those were just the first two plumbing commands that came to mind that I use myself, do you have any suggestions for others that would be more appropriate? -- Cheers, Sverre Rabbelier --
"git config [<type>] --get <name>" and friends... especially that even
in the case of '{ name: "<name>"; value: <value> }' (and without e.g.
"type: <type>" field) we can have <value> of correct JSON type.
"git ls-files" (especially when mixing information about files in
working area and those in index), "git diff-tree" (especially for merges,
as ordinary columnar output do not include all possible information,
like pre-image filename in case of renames), perhaps "git branch" so
people stop trying to parse it in scripts, perhaps "git describe"
(you need "git describe --long" to unambiguously parse its output),
perhaps "git remote show" (I am not sure about this case), "git show-ref"
if "git for-each-ref" didn't exists...
--
Jakub Narebski
Poland
--
Wouldn't the exact same argument apply equally well to the output format of "status --porcelain", by the way? It is line-based and pretty much self-describing (once you know the mnemonic but you can make an educated guess from previous SCM experience). --
No, current "git status --porcelain" output is record-based (tabular); the meaning is not described by header but depends on field in record, i.e. position in line. Self describing output of "git status --porcelain" would be filename <maybe-quoted filename> renamed-from <maybe-quoted filename> similarity 95% worktree ... index ... or something like that... -- Jakub Narebski Poland --
Now, what's wrong about that? For that matter, would you say "diff --raw" output should be JSON/XMLified because it is columnar? --
Well, this whole idea started with the fact, that "git status --short"
was hard (or impossible) to parse unambigously by scripts[1], and even
"git status --porcelain -z"[2] is not that easy to parse[3].
With JSON output format one can use existing JSON parsers, which are
available in any language.
[1] And it was woefully underdocumented
[2] I wonder why git-config and git-grep have '--null' as long version
of '-z' option... and only those.
[3] I rather liked the idea of -Z output format, the form that uses
NUL as field separator for each field (and not only filenames),
and NUL NUL as record terminator; it makes parsing much easier
because you don't need to take a look at other field to know
It would be nice to have raw diff format JSONified, or have --porcelain
(like "git blame --porcelain" output format) version of it. Especially
for "diff -c --raw" i.e. raw output format for merges, which lacks some
information, namely filename and similarity score for n-th pre-image,
if rename or copy was detected.
--
Jakub Narebski
Poland
--
And you apparently seem to agree with that claim, but I don't. I think Jeff (who did the --porcelain stuff; by the way, why did we lose him from Cc list?) has already said that he is open to an update. --
I haven't seen any evidence that status --porcelain (or its -z form) is impossible to parse unambiguously. I don't even think it's that hard, but it certainly could be easier. But more importantly, from looking at the output it's not necessarily _obvious_ how to parse it correctly (e.g., whitespace as value and as field separator, syntax of "-z" depends on semantics of field contents). The approach I proposed was to leave it be and document it a bit better. Adding some format that is close but subtly different is just going to lead to more confusion. But since Julian was willing to do the JSON work, I think that is a much nicer approach. It's not subtly different; it's very different and way easier to read and parse. And I'm really happy with the way he has structured the code to handle multiple output formats. It keeps the code much cleaner, and it should silence any "but YAML is better than JSON is better than XML" debates. Even with Julian's patches, we should still better document the regular and "-z" forms. Eric promised to send some patches this week; I'm hoping -Peff --
Well, IMVHO output of "git status --short" / "git status --porcelain"
(without '-z') is very hard to parse. Even assuming that in the case
of ambiguity filenames are quoted (which also means that in the case of
ambiguity whether they are quoted they must be quoted), the fact that
separator between source and destination filename in the case of rename
detection is " -> " (if I understand it correctly), and neither of ' '
(SPC), '-' nor '>' is replaced by escape sequence means that one needs
to detect where quoted filename begins and where ends. This means
either parsing character by character, taking into account quoting and
escaping (e.g. '\\', '\"' etc.), or using 'balanced quote' regexp like
the one from Text::Balanced, e.g.: (?:\"(?:[^\\\"]*(?:\\.[^\\\"]*)*)\")
What was the reason behind choosing " -> " as separator between pair[1]
of filenames in rename, instead of using default "git diff --stat" format
i.e. 'arch/{i386 => x86}/Makefile' for "git status --short" which is
meant for end user, and for "git status --porcelain" the same format
that raw diff format, i.e. with TAB as separator between filenames,
and filename quited if it contains TAB (then TAB is relaced by '\t',
and does not appear in filename, therefore you can split on TAB)?
IMVHO "git status --porcelain -z" format is not easy to parse either.
(The same can be said for "git diff --raw -z" output format.) You
can't just split on record separator; you have to take into account
status to check if there are two filenames or one.
[1] A question: we have working area version, index version, and HEAD
version of file. Isn't it possible for *each* of them to have
different filename? What about the case of rename/rename merge
Well, the proposed '-Z' output format, in the OFS="\0", ORS="\0\0"
variant, would be very easy to parse. If I understand it correctly
I really like this outputification ;-) too.
Although if possible I'd like to have it wrapped in utility macros,
like parseopt, so one does ...Yeah, I don't disagree with your reasons (which are largely the same as Eric's). I just don't think it's "oh no this is useless and we have to I don't know Junio's reason for using " -> " in --short; probably because it was the format used in non-short status. For --porcelain, it was simply because I used exactly --short. I assumed that --short was suitable for parsing (which it _is_, it just has some rough edges), and wanted to provide an option right away that would keep the output stable, so we didn't run into the usual problem of people wanting to enhance the human-readable interface, but being blocked by script Yep, I agree. I think the JSON approach is the best solution, as it is Good question. The answer is no, the three different versions can't have three filenames on the same line, because we don't do rename detection between the working tree and the index. Which makes sense. Consider something like this: mkdir repo && cd repo && git init echo content >one git add one && git commit -m one mv one two && git add -A mv two three git status We will see the movement of "one -> two" between the index and HEAD. In theory we could see the movement of "three -> two" between the index and working tree. But "three" isn't tracked, so instead we see "two" deleted and "three" untracked. We can mark "three" with intent-to-add to note that we are interested in it, but then it is not a new file any more (since it has an index entry), and is therefore not eligible for rename detection. As for a rename/rename conflict, it gets represented in the index as both deleting the source and then each side adding its new version with a conflict. So: mkdir repo && cd repo && git init echo content >one git add one && git commit -m base git mv one two && git commit -m two git checkout -b other HEAD^ git mv one three && git commit -m three git merge master git status generates: # On branch other # Unmerged paths: # both deleted: ...
[cut]
Something like that (please remember that it is still in vague beginnings
of an idea stage:
OUT_OBJECT(
OUT_FIELD("mode", OUT_MODE, tree.mode), SP,
OUT_FIELD("type", "%s", tree.object.type), SP,
OUT_FIELD("object", OUT_SHA1, tree.object.sha1), TAB,
OUT_FIELD("file", OUT_FILE(sep), tree.filename),
sep
);
--
Jakub Narebski
Poland
--
Doing that would require variadic macros, which are a C99-ism. So you
would have to do:
OUT_OBJECT_START();
OUT_FIELD("mode", OUT_MODE, tree.mode); OUT_SP;
...
OUT_OBJECT_END();
which is not all that different from what Julian has now. I do think
some type-specific conversions might be handy. They don't even need to
be macros. E.g.,:
void output_mode(struct output_context *oc, int mode)
{
output_strf(oc, "mode", "%06o", mode);
}
OTOH, looking over Julian's last patch series, there really aren't that
many that would be generally applicable, and as you can see they only
save a few characters, not even a line. A few bigger objects could be
factored out, but he has already done that (e.g., see
wt_porcelain_unmerged in his v2 3/4).
-Peff
--
Also, backends such as JSON want to know which things are strings, and which are numbers - as they print differently. An XML backend may want to It might help standardise the output between commands if there were helper functions for some of the larger structures - e.g. commits. Though I don't think that those functions would be able to do legacy output, due to the current lack of cross-command output compatibility. I'm starting to see this with blame and diff-tree (and family), where they both want to output information about commits. I think that maybe I need to design and document the output structure for common concepts - so that it would be possible to pass the output from any command to a common parser, with matching utility functions in the code. Though, I'm not sure if there actually are any common concepts that need outputting apart from commits. Current Status -------------- I had been planning to post an updated series this weekend, but I'm too tired to attempt tidying things up for posting at the moment ... If you want to see the current state then my current mess is available at http://git.q42.co.uk/w/output.git. A quick summary of main changes since v2: - backends are now in a subdirectory - blame, diff-tree, have --ouptut=... support for plumbing output - log has some support for --ouput=... - output library has extended API, including quoted strings and is_structured_output function - backend API includes explicit functions for top-level items -- Julian --
Yeah, that was what I saw on looking at the code. And we have to support those old formats, obviously. For the most part, I found the level of verbosity in the patches you posted (and I just peeked at your repo) to be fine. Sure, it's more lines, but they're IMHO very easy to read. If we have to tradeoff between either duplicating output entirely (for both the output form and traditional form) or having a more flexible but slightly more verbose output library, I think I would rather go with the latter. It will be more maintainable in the long run. -Peff --
On Wed, 14 Apr 2010 21:10:35 +0200, Jakub Narebski <jnareb@gmail.com> I think that the ability to say that all plumbing output is available in a variety of standard outputs is potentially useful. In particular the ability to be able to parse the output of all plumbing commands directly into whatever native language the high-level tool is in using an already That depends really. If you are writing something to parse the output, and you already have a JSON parser available then it's the current output that has different quoting rules. ;) Anyway, I have already converted blame to use the library for both --porcelain and --incremental output, so it'll be in the next version of -- Julian --
Nice. How did you managed to work with a bit non-standard rules of --porcelain format, namely maybe-quoting of filenames, and that not all lines conform to "<header> SP <value> LF" syntax: group definition begins with SHA-1, and contents is indented with TAB? -- Jakub Narebski Poland --
On Wed, 14 Apr 2010 23:16:06 +0200, Jakub Narebski <jnareb@gmail.com> I had to extend the output API for quoting rules. There are now qstr/qstrf output functions that call different backend functions. The normal output then applies git quoting rules to the q versions only, whereas the JSON output treats them both the same. For the rest of it, only the actual data is going through the output_str etc functions the rest is using output_token that is ignored by the JSON backend. I actually started converting blame twice - the first time I added an API function that allowed telling the normal output how to format values, but I decided that I wanted to apply grouping to the structured output, so that e.g. the author and committer information were two objects with name, mail, date and tz members so I never finished the first attempt. -- Julian --
Every once in a while, I have some crazy idea for a short script that is
built around blame output (e.g., counting contributors by line count).
Something that I might do in a little one-off perl script. And my
experience has been that 90% of the script ends up parsing and managing
commit blocks, and not the computation of interest.
Not that it's a lot of lines, mind you, but having to write 20 lines of
parser to do a perl one-liner on the result is annoying. I would be very
happy to have some 1 or 2 line solution where one of the lines is "use
I'll be curious to see it. I hope you will (at least optionally) wrap
the _whole_ output and not just the commit blocks. It would be nice to
just suck it in all at once and walk the data structure. But it may be
tricky because the output suppresses the commit description for commits
that have already been output. You would probably want a list of lines
and a map of commits, like:
{
"lines": [
{ sha1 and line info }
{ sha1 and line info }
...
],
"commits": {
"$sha1": { commit info },
...
}
which is close to what I would parse to in a script, except I would
actually drop the "commits" map and point directly to the commit info
from each line.
Is there a way in JSON to refer to the contents of a previous item
without just outputting the same data again? I assume not, and even if
there is, other output formats like XML wouldn't handle it.
-Peff
--
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown |
