Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GCC now supports SARIF as an output format #531

Open
davidmalcolm opened this issue Jun 3, 2022 · 22 comments
Open

GCC now supports SARIF as an output format #531

davidmalcolm opened this issue Jun 3, 2022 · 22 comments
Assignees
Labels
statements-of-use Formal and informal statements of use

Comments

@davidmalcolm
Copy link

I'm not sure how best to contact the SARIF community (the "Ask a Question" link takes me to this issue tracker), but here's a heads-up that I've implemented SARIF output support for GCC trunk, for GCC 13 onwards (with this 3000 line patch):
https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596138.html

As noted above, my generated .sarif files seem to pass the online validator, are viewable via the online React-based viewer, and seem to work OK in the VS Code extension for SARIF. I hope I'm correctly implementing everything.

You can see it running live on Compiler Explorer; here's an example of it emitting SARIF reporting a path-sensitive double-free:
https://godbolt.org/z/nYWMM8Wx7

FWIW I'm now experimenting with GCC accepting SARIF as input (i.e. as a consumer), so that it can "replay" diagnostics serialized in SARIF form.

@dmk42
Copy link
Contributor

dmk42 commented Jun 3, 2022

Awesome, @davidmalcolm ! Thanks for doing this and for letting us know about it!

@sthagen sthagen added the statements-of-use Formal and informal statements of use label Jun 16, 2022
@sthagen sthagen self-assigned this Jun 16, 2022
@davidmalcolm
Copy link
Author

FWIW I've now posted an initial set of patches that let GCC consume SARIF, replaying a .sarif file:
https://gcc.gnu.org/pipermail/gcc-patches/2022-June/597051.html

(very much just a prototype for now, though)

@davidmalcolm
Copy link
Author

The GCC RFE for tracking being able to replay SARIF is: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96032

@michaelcfanning
Copy link
Contributor

@davidmalcolm, this really great! if you'd like, we'd be very glad to take a detailed look at your SARIF and see if we have any suggestions to refine.

i'm very glad to hear that some of the assets we put out on the web were useful crafting this content.

I have looked at the Compiler Explorer example, is that definitive/comprehensive? Do you have other substantive examples available?

[Looks like you are utilizing some very interesting SARIF domains, taxa, logical locations, etc.]

@davidmalcolm
Copy link
Author

Example of GCC SARIF output showing LTO report of a cross-TU type-mismatch in a use of a variadic API:
https://gist.github.com/davidmalcolm/26a546a0cd3ed253648e24510bd17964
(without reformatting, so it's all on one line)

@davidmalcolm
Copy link
Author

davidmalcolm commented Jul 11, 2022

same example (of LTO), reformatted for ease of reading:
https://gist.github.com/davidmalcolm/b0fa6b14909315f45f6f5fe16f623c0f

Note how this example spans two different source files; both have their full content quoted in the "artifacts" section (my producer code adds an artifact with content for any files referenced by any results mentioned in the file)

For reference, the regular (non-SARIF) GCC output for this is:

../../src/gcc/testsuite/gcc.dg/analyzer/stdarg-lto-1-a.c: In function ‘called_by_test_type_mismatch_1’:
../../src/gcc/testsuite/gcc.dg/analyzer/stdarg-lto-1-a.c:19:7: warning: ‘va_arg’ expected ‘const char *’ but received ‘int’ for variadic argument 1 of ‘ap’ [CWE-686] [-Wanalyzer-va-arg-type-mismatch]
   19 |   str = va_arg (ap, const char *); /* { dg-warning "'va_arg' expected '\[^\n\r\]*' but received 'int' for variadic argument 1 of 'ap'" } */
      |       ^
  ‘test_type_mismatch_1’: events 1-2
    |
    |../../src/gcc/testsuite/gcc.dg/analyzer/stdarg-lto-1-b.c:3:6:
    |    3 | void test_type_mismatch_1 (void)
    |      |      ^
    |      |      |
    |      |      (1) entry to ‘test_type_mismatch_1’
    |    4 | {
    |    5 |   called_by_test_type_mismatch_1 (42, 1066);
    |      |   ~   
    |      |   |
    |      |   (2) calling ‘called_by_test_type_mismatch_1’ from ‘test_type_mismatch_1’ with 1 variadic argument
    |
    +--> ‘called_by_test_type_mismatch_1’: events 3-4
           |
           |../../src/gcc/testsuite/gcc.dg/analyzer/stdarg-lto-1-a.c:12:1:
           |   12 | called_by_test_type_mismatch_1 (int placeholder, ...)
           |      | ^
           |      | |
           |      | (3) entry to ‘called_by_test_type_mismatch_1’
           |......
           |   19 |   str = va_arg (ap, const char *); /* { dg-warning "'va_arg' expected '\[^\n\r\]*' but received 'int' for variadic argument 1 of 'ap'" } */
           |      |       ~
           |      |       |
           |      |       (4) ‘va_arg’ expected ‘const char *’ but received ‘int’ for variadic argument 1 of ‘ap’
           |

@davidmalcolm
Copy link
Author

davidmalcolm commented Jul 11, 2022

Another example (reformatted so it's not all on one line, for ease of reading by humans):
https://gist.github.com/davidmalcolm/5771af86039bf57f144e935179bfe20e
This is a reproducer for detecting CVE-2011-2210, which was a flaw in the Linux kernel.

For reference, the non-SARIF GCC output for this one looks like:

../../src/gcc/testsuite/gcc.dg/analyzer/taint-CVE-2011-2210-1.c: In function ‘sys_osf_getsysinfo’:
../../src/gcc/testsuite/gcc.dg/analyzer/taint-CVE-2011-2210-1.c:58:9: warning: state: ‘tainted’
   58 |         __analyzer_dump_state ("taint", nbytes);  /* { dg-warning "tainted" } */
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../src/gcc/testsuite/gcc.dg/analyzer/taint-CVE-2011-2210-1.c:67:17: warning: state: ‘has_lb’
   67 |                 __analyzer_dump_state ("taint", nbytes);  /* { dg-warning "has_lb" } */
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../src/gcc/testsuite/gcc.dg/analyzer/taint-CVE-2011-2210-1.c:69:21: warning: use of attacker-controlled value ‘nbytes’ as size without upper-bounds checking [CWE-129] [-Wanalyzer-tainted-size]
   69 |                 if (copy_to_user(buffer, hwrpb, nbytes) != 0) /* { dg-warning "use of attacker-controlled value 'nbytes' as size without upper-bounds checking" } */
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  ‘sys_osf_getsysinfo’: event 1
    |
    |   35 |         long sys##name(__SC_DECL##x(__VA_ARGS__))
    |      |              ^~~
    |      |              |
    |      |              (1) function ‘sys_osf_getsysinfo’ marked with ‘__attribute__((tainted_args))’
../../src/gcc/testsuite/gcc.dg/analyzer/taint-CVE-2011-2210-1.c:30:9: note: in expansion of macro ‘__SYSCALL_DEFINEx’
    |   30 |         __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
    |      |         ^~~~~~~~~~~~~~~~~
../../src/gcc/testsuite/gcc.dg/analyzer/taint-CVE-2011-2210-1.c:37:36: note: in expansion of macro ‘SYSCALL_DEFINEx’
    |   37 | #define SYSCALL_DEFINE5(name, ...) SYSCALL_DEFINEx(5, _##name, __VA_ARGS__)
    |      |                                    ^~~~~~~~~~~~~~~
../../src/gcc/testsuite/gcc.dg/analyzer/taint-CVE-2011-2210-1.c:53:1: note: in expansion of macro ‘SYSCALL_DEFINE5’
    |   53 | SYSCALL_DEFINE5(osf_getsysinfo, unsigned long, op, void __user *, buffer,
    |      | ^~~~~~~~~~~~~~~
    |
    +--> ‘sys_osf_getsysinfo’: event 2
           |
           |   35 |         long sys##name(__SC_DECL##x(__VA_ARGS__))
           |      |              ^~~
           |      |              |
           |      |              (2) entry to ‘sys_osf_getsysinfo’
../../src/gcc/testsuite/gcc.dg/analyzer/taint-CVE-2011-2210-1.c:30:9: note: in expansion of macro ‘__SYSCALL_DEFINEx’
           |   30 |         __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
           |      |         ^~~~~~~~~~~~~~~~~
../../src/gcc/testsuite/gcc.dg/analyzer/taint-CVE-2011-2210-1.c:37:36: note: in expansion of macro ‘SYSCALL_DEFINEx’
           |   37 | #define SYSCALL_DEFINE5(name, ...) SYSCALL_DEFINEx(5, _##name, __VA_ARGS__)
           |      |                                    ^~~~~~~~~~~~~~~
../../src/gcc/testsuite/gcc.dg/analyzer/taint-CVE-2011-2210-1.c:53:1: note: in expansion of macro ‘SYSCALL_DEFINE5’
           |   53 | SYSCALL_DEFINE5(osf_getsysinfo, unsigned long, op, void __user *, buffer,
           |      | ^~~~~~~~~~~~~~~
           |
         ‘sys_osf_getsysinfo’: events 3-6
           |
           |   64 |                 if (nbytes < sizeof(*hwrpb))
           |      |                    ^
           |      |                    |
           |      |                    (3) ‘nbytes’ has its lower bound checked here
           |      |                    (4) following ‘false’ branch (when ‘nbytes > 31’)...
           |......
           |   67 |                 __analyzer_dump_state ("taint", nbytes);  /* { dg-warning "has_lb" } */
           |      |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           |      |                 |
           |      |                 (5) ...to here
           |   68 | 
           |   69 |                 if (copy_to_user(buffer, hwrpb, nbytes) != 0) /* { dg-warning "use of attacker-controlled value 'nbytes' as size without upper-bounds checking" } */
           |      |                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           |      |                     |
           |      |                     (6) use of attacker-controlled value ‘nbytes’ as size without upper-bounds checking
           |
In file included from ../../src/gcc/testsuite/gcc.dg/analyzer/taint-CVE-2011-2210-1.c:14:
../../src/gcc/testsuite/gcc.dg/analyzer/test-uaccess.h:13:13: note: parameter 3 of ‘copy_to_user’ marked as a size via attribute ‘access (write_only, 1, 3)’
   13 | extern long copy_to_user(void __user *to, const void *from, unsigned long n)
      |             ^~~~~~~~~~~~

@davidmalcolm
Copy link
Author

Let me know if there are other examples you'd like to see.

@davidmalcolm
Copy link
Author

davidmalcolm commented Jul 11, 2022

Note that I'm not expressing the macro expansion information in the SARIF output (is there a way to do that?)

@davidmalcolm
Copy link
Author

Also: I've tried to express metadata about events via threadFlowLocation "kinds" property values, but I have some kinds of events that don't fit well with the existing examples; see #530 for more info.

@michaelcfanning
Copy link
Contributor

Hello! Took a look and there's lots to talk about. This work is really great and we'd love to help keep building it out. Here some starter thoughts:

  • If possible, it'd be best if you could emit your "version" property at the beginning of the file, not the end. The SARIF spec recommends this.
  • I notice that one of your logs contains ../ path segments and SARIF requires that these be removed for security reasons. The spec also advises that consumers reject paths that contain these segment (the language on this will be stronger in the current errata spec revision). See Explain why ".." path segments can't be normalized out #460 and Discourage ".." in file scheme URIs #461 for more. I hope that this change isn't burdensome for you, this requirement has proven tricky for some other compiler tooling.
  • Embedded contents can be very helpful to allow people to troubleshoot logs without having a live enlistment. I can see how this content would be helpful for the 'replay' scenario, allowing that to occur in an arbitrary environment. VS and VS Code SARIF viewers have a feature where they can expand this content into a temporary file, though this isn't working for me. Part of the issue may be that you aren't embedding the index to the artifact in your result locations. You should definitely do this, otherwise consumers will need to build their own map of all URIs to the artifact table (you must be doing this for your scenario?).
  • I notice that you have the beginning of expressing relationships between your results and the CWE taxonomy. Nice! We have started authoring many SARIF taxonomies for various categorization schemes. I notice we don't have CWE 4.6, 4.7 or 4.8 for some reason (and sent mail to some people to get this authored). Also, no one has yet driven a utilization scenario that would lead us to actually publish this data in some stable place (such as various providers publish their JSON schema files). But we could do that with/for you.
  • Minor thing related to the SARIF JSON schema, try updating your reference to https://raw.githubusercontent.com/oasis-tcs/sarif-spec/main/Schemata/sarif-schema-2.1.0.json, we just populated a version at main branch rather than master. The master reference should continue to work, as GitHub itself provides this compatibility, I think
  • Can you explain your question around macro expansion? SARIF defines a originalUriBaseId property for defining absolute paths for non-deterministic source roots, which you're already populating. Would you like other macro/var definitions to be expressed in the format?
  • Another small thing, you might think about the compiler's default behavior populating that absolute base uri, it's a common source of information disclosure. Your specific example with home/david/coding doesn't give much away but aliases/etc. can be problematic to package here. Tools like CodeQL choose to emit everything with relative URLs and a consistent uriBaseId but will omit the actual definition for the base URL, you could do that as well. This may not feel urgent if the compiler already emits lots of other path data in its console output/other logging, however. If you elide this data, then it becomes incrementally harder to automatically replay or simply view the log file results on the very client the file was produced on.
  • So, I notice you're not populating the log-level columnKind property. You stated you ran the validator, and so this means we have a gap in that analysis, which I will communicate to the team. I assume the right value for you to provide here is unicodeCodePoints, yes?

Right now, a couple of people are looking at your log files in various viewers and I can provide them back to you with some updates (though you may be able to just glean what you need from the above).

Looking forward to continuing the discussion!

@davidmalcolm
Copy link
Author

Sorry for the delay in getting back to you, and thanks for taking the time to look at it.

Addressing one specific point:

* Can you explain your question around macro expansion? SARIF defines a `originalUriBaseId` property for defining absolute paths for non-deterministic source roots, which you're already populating. Would you like other macro/var definitions to be expressed in the format?

...when I spoke of macro expansion, I was referring to languages with a preprocessor, such as C/C++, where the question of "where in the source-code-under-analysis are we?" can involve a nested series of macro expansions, potentially involving multiple files (e.g. use of a macro declared in one header, which refers to a macro in another header, etc).

Consider e.g.:


#include <stdlib.h>

#define FREE(X) free(X)
#define REALLY_FREE(X) FREE(X)
#define MAYBE_FREE(X,F) do { if (F) REALLY_FREE(X); } while (0)

void test (void *p, int flag)
{
  MAYBE_FREE(p, flag);
  MAYBE_FREE(p, flag);
}

GCC output: https://godbolt.org/z/87vf1cGKK

where GCC's textual output can emit the chain of macro expansions:

<source>: In function 'test':
<source>:3:17: warning: double-'free' of 'p' [CWE-415] [-Wanalyzer-double-free]
    3 | #define FREE(X) free(X)
      |                 ^~~~~~~
<source>:4:24: note: in expansion of macro 'FREE'
    4 | #define REALLY_FREE(X) FREE(X)
      |                        ^~~~
<source>:5:37: note: in expansion of macro 'REALLY_FREE'
    5 | #define MAYBE_FREE(X,F) do { if (F) REALLY_FREE(X); } while (0)
      |                                     ^~~~~~~~~~~
<source>:10:3: note: in expansion of macro 'MAYBE_FREE'
   10 |   MAYBE_FREE(p, flag);
      |   ^~~~~~~~~~

There didn't seem to be a way to express this within SARIF. Is there one, or did I miss it? Thanks!

@davidmalcolm
Copy link
Author

clang's textual output can express similar macro expansion information, such as:
https://godbolt.org/z/4ab3EEs3T

<source>:9:3: error: call to undeclared function 'misspelled_free'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
  MAYBE_FREE(p, flag);
  ^
<source>:5:37: note: expanded from macro 'MAYBE_FREE'
#define MAYBE_FREE(X,F) do { if (F) REALLY_FREE(X); } while (0)
                                    ^
<source>:4:24: note: expanded from macro 'REALLY_FREE'
#define REALLY_FREE(X) FREE(X)
                       ^
<source>:3:17: note: expanded from macro 'FREE'
#define FREE(X) misspelled_free(X)
                ^

...though it emits the expansion in the opposite order to GCC.

@davidmalcolm
Copy link
Author

FWIW I've extended the GCC SARIF output support so that it now captures crashes of GCC:
https://gcc.gnu.org/pipermail/gcc-patches/2023-March/614032.html
treating them as an "error"-level "notification" relating to the operation of the tool (SARIF v2.1.0 section 3.58), rather than a "result" about the code being analyzed by the tool.

@davidmalcolm
Copy link
Author

davidmalcolm commented Mar 24, 2023

Hello! Took a look and there's lots to talk about. This work is really great and we'd love to help keep building it out. Here some starter thoughts:

* If possible, it'd be best if you could emit your "version" property at the beginning of the file, not the end. The SARIF spec [recommends this](https://docs.oasis-open.org/sarif/sarif/v2.1.0/os/sarif-v2.1.0-os.html#_Toc34317480).

FWIW it turned out that my rather naive JSON code was traversing keys in objects in non-deterministic order when serializing, so that the ordering would vary from run to run.

I've fixed that for GCC 13 here:
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=7f1e15f743357e037d7c4f6f6000863c26f3dfc3

Should the "version" be the very first thing, or should the "$schema" come first?

@davidmalcolm
Copy link
Author

FWIW I've fixed a bug in GCC's SARIF output where it was naively assuming that source code was UTF-8, leading to invalid UTF-8 in the .sarif files when quoting artifacts (either fully, or via snippets). I've fixed this for GCC 13 in https://gcc.gnu.org/pipermail/gcc-patches/2023-March/614600.html which has some "fun" test cases e.g. for reporting invalidly encoded artifacts via a validly encoded .sarif log file.

@sthagen
Copy link
Contributor

sthagen commented Mar 25, 2023

FWIW it turned out that my rather naive JSON code was traversing keys in objects in non-deterministic order when serializing, so that the ordering would vary from run to run.

I've fixed that for GCC 13 here: https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=7f1e15f743357e037d7c4f6f6000863c26f3dfc3

Should the "version" be the very first thing, or should the "$schema" come first?

Well, JSON is catering many host languages. Most of them do still not understand how to implement stable insert order and maintain that or are insisting on not doing it ...

Our toy-like languages Python (dicts) and JavaScript (for string keys in objects) offer insert ordering these days.

But, if there is a JSON library chosen at the consumer side, that libraries authors may have decided to garble the order of keys as they will still comply to the JSON spec.

Long story short, this may be an optimization which is not as important as enforcing correct character encoding.

Nice work, thanks a lot for providing your resources to create the tool.
Much appreciated.

PS: In CSAF v2.0 we therefore provided the schema "properties" in sorted order as this should be trivial for insert order preserving producers and consumers and is trivial to re-establish per sorting in case one link in the processing chain messed with the order.

@davidmalcolm
Copy link
Author

Should the "version" be the very first thing, or should the "$schema" come first?
I've filed #571 about this

@davidmalcolm
Copy link
Author

davidmalcolm commented Jul 31, 2023

I've now added support to gcc trunk (for gcc 14) for capturing timing/profile information about the compile/analysis in SARIF form (via a custom property in the property bag on the invocation object).

@davidmalcolm
Copy link
Author

I've now add support to GCC trunk (for GCC 14) for capturing multithreaded code flows in GCC diagnostics, including in SARIF output (although nothing in GCC currently makes use of this).

@davidmalcolm
Copy link
Author

Also changed in the upcoming GCC 14:

  • the generated SARIF is now formatted by default to reflect the underlying JSON structure (rather than all being on one line)
  • the GCC analyzer generates various property bags within its diagnostics and code flows (containing analyzer-specific data for debugging)

@davidmalcolm
Copy link
Author

FWIW I mentioned SARIF in the GCC 14 release notes in a few places:
https://gcc.gnu.org/gcc-14/changes.html#sarif

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
statements-of-use Formal and informal statements of use
Projects
None yet
Development

No branches or pull requests

4 participants