Using Threat Intelligence in your Jupyter Notebooks

ianhelle · Sep 30 2019

Introduction

Using threat intelligence (TI) is a vital part of most hunts and investigations. You see a suspicious IP address in your logs and you want to check its reputation: is it a known C2 (command and control) address? Is it associated with malware drops? You want to check a file hash to see if it is known malware. In many cases, getting confirmation that your suspected IoC (indicator of compromise) is known to be bad can save hours of investigation time.

Since its early days the msticpy package has had a module to perform look-ups at the popular VirusTotal service. We've recently released an update that gives you a generic TI lookup capability. You can configure multiple TI providers and submit one or more IoCs to them in a single function call.

One of the first providers is for Azure Sentinel TI. If you have set up a TI feed from the Microsoft Graph Security API, the indicators are made available as a table in the Azure Sentinel Log Analytics store (see more about this below). This table can be queried using the msticpy TILookup class alongside other providers such as VirusTotal, AlienVault OTX, and IBM XForce.

Here's an example, looking up a single observable. In the remainder of the article we'll take you through setting up your TI providers, querying IoCs and interpreting the results.

A Note on Terminology

The terms observable, indicator and indicator of compromise (IoC) are almost interchangeable. Strictly speaking, an observable is an identifier such as a domain name, IP address or file hash before you have made any determination on whether it is bad or not. Indicators/IoCs are observables that have a known bad determination - i.e. what you get back from a TI provider along with data describing their context.

Why have we built TILookup?

Many TI providers offer sophisticated libraries to perform look-ups against their APIs. We are not trying to compete with these. In fact, for detailed examination of the results from a specific provider, you should consider using their client libraries. TILookup is specifically designed to allow look-ups against multiple providers, minimizing your need to worry about the specifics of each. It was built to quickly answer the question "is there anything suspicious about this item?". As such, we try to return raw data and a quick verdict rather than attempt to render the details of the provider results in high fidelity.

Azure Sentinel and Threat Indicators from the Microsoft Graph Security API

As mentioned above, Microsoft recently announced the ability to pipe TI from the MS Graph Security API into Azure Sentinel. You can stream logs from Threat intelligence providers into Azure Sentinel using the new Threat Intelligence data connector. Read more about how to set this up here.

The msticpy TILookup class is provider-independent but works really well with TI data stored in Azure Sentinel. This is especially true for bulk look-ups, because of the high-performance Log Analytics data store that backs the underlying searches.

Sample Notebook and Documentation

If you want to follow along with real code examples, there is a sample notebook showing more extended usage of TILookup here.

Documentation based on this notebook is also available here.

The full API documents can be found on ReadTheDocs.

Setting up Your Providers

To check which providers are currently supported by msticpy (we intend to add more as time goes on) you need to create an instance of the TILookup class. The available_providers property will return a list of the providers that you can use.
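As a minimal sketch of that first step: the helper below creates a TILookup instance and reads its available_providers property. The helper name and the optional lookup_cls argument are just illustrative, and the import path is the one we believe applied to msticpy releases at the time of writing, so check it against your installed version.

```python
def list_available_providers(lookup_cls=None):
    """Create a lookup object and return it with its available providers."""
    if lookup_cls is None:
        # Assumption: this import path matches your msticpy version.
        # Imported lazily so the cell can be defined without msticpy installed.
        from msticpy.sectools.tilookup import TILookup as lookup_cls

    ti_lookup = lookup_cls()
    # available_providers is a property, so there are no parentheses.
    return ti_lookup, list(ti_lookup.available_providers)


# Typical notebook usage (requires msticpy):
# ti_lookup, providers = list_available_providers()
# print(providers)
```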

None of these providers is configured at this point, so you won't be able to do anything with them just yet. For most providers, you will need to create an account at the TI provider site and obtain an API key that authorizes you to query the data. Once you have your keys, you can enter the details in a configuration file to be read by the respective provider. The Azure Sentinel provider uses your existing workspace credentials, so you do not need additional credentials for it.

TI Providers - Where to sign up

If you do not have accounts with these TI providers, go here to create them. Be sure to save your API keys somewhere safe. Please read and abide by the Terms of Use of each provider.

VirusTotal - https://developers.virustotal.com/reference

AlienVault Open Threat Exchange - https://otx.alienvault.com/api

IBM XForce - https://api.xforce.ibmcloud.com/doc/

Configuring your API keys - msticpyconfig.yaml

The msticpyconfig.yaml file is read from the current directory or its location can be specified with an environment variable MSTICPYCONFIG. The former method is useful if you want to vary the configuration for different investigation patterns (e.g. use a different set of providers). The latter is more useful if you want to keep a single configuration and use it everywhere.
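If you are not sure which file will be picked up, a small check like the following can help. This is only a sketch of the two locations described above, not msticpy's own code; in particular, checking the environment variable before the current directory is our assumption about the order.

```python
import os
from pathlib import Path

CONFIG_ENV_VAR = "MSTICPYCONFIG"
CONFIG_FILE = "msticpyconfig.yaml"


def find_config_file():
    """Return the path of the config file that would be used, or None."""
    # Assumed order: an explicit environment variable wins over the
    # file in the current directory.
    env_path = os.environ.get(CONFIG_ENV_VAR)
    if env_path and Path(env_path).is_file():
        return Path(env_path)
    local_path = Path.cwd() / CONFIG_FILE
    if local_path.is_file():
        return local_path
    return None


print(find_config_file())
```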

[Image: msticpyconfig.yaml.png - example msticpyconfig.yaml file]

Although other things can be specified in the file, you only need the TIProviders section. You can put the values directly into the file or give the name of an environment variable to fetch the value from (this is shown in the XForce example above).

Note that the GUIDs shown here are all dummy values, so don't try to use them.

Each provider section has a name (you can use whatever you like for this) and a number of configuration options:

Args
- For HTTP/RESTful providers the values here are usually AuthKey (often referred to as ApiKey). Some providers, such as XForce, also require an account ID value - ApiID. Do not confuse these. If you are only given one key when you create the TI provider account it is almost certainly the AuthKey.
- The AzureSentinel provider requires the workspace and tenant IDs for the workspace where your TI table is stored.
Provider - normally you should not alter this value since it tells TILookup which class to load.
Primary - if this is True the provider will be used by default on all look-ups. You can override this setting by explicitly naming the provider when you do a lookup or passing the prov_scope parameter (valid values are "primary", "secondary" or "all").
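To make the options above concrete, here is a rough sketch of a TIProviders section, held as text in Python and written out to a file. The key names EnvironmentVar, WorkspaceID and TenantID and the provider class name "AzSTI" are written from our reading of the msticpy documentation, so treat them as assumptions and verify them there. The helper name is hypothetical and every GUID is a dummy.

```python
from pathlib import Path

# Dummy values throughout: replace them with your own keys and IDs.
EXAMPLE_CONFIG = """\
TIProviders:
  OTX:
    Args:
      AuthKey: "00000000-0000-0000-0000-000000000000"
    Primary: True
    Provider: "OTX"
  XForce:
    Args:
      ApiID: "00000000-0000-0000-0000-000000000000"
      AuthKey:
        EnvironmentVar: "XFORCE_KEY"  # key is read from this variable
    Primary: False
    Provider: "XForce"
  AzureSentinel:
    Args:
      WorkspaceID: "00000000-0000-0000-0000-000000000000"
      TenantID: "00000000-0000-0000-0000-000000000000"
    Primary: True
    Provider: "AzSTI"
"""


def write_example_config(folder="."):
    """Write the example file, refusing to overwrite an existing one."""
    path = Path(folder) / "msticpyconfig.yaml"
    if path.exists():
        raise FileExistsError(f"{path} already exists")
    path.write_text(EXAMPLE_CONFIG, encoding="utf-8")
    return path
```

Here OTX and Azure Sentinel would be queried by default, while XForce would only be used when you name it explicitly or widen the provider scope.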

Once you have your configuration set up, you are ready to use the library. First, reload the providers so that they pick up the new settings.

The provider_status property will list the currently loaded providers.
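A sketch of what this might look like in a notebook cell follows. We are assuming that the method to call is reload_providers(), so check the API documentation for your msticpy version. The wrapper function itself is only there for illustration.

```python
def reload_and_show_status(ti_lookup):
    """Reload provider settings and return the loaded providers.

    ti_lookup is expected to be a msticpy TILookup instance.
    """
    # Assumption: reload_providers() re-reads msticpyconfig.yaml and
    # re-creates the configured providers.
    ti_lookup.reload_providers()
    # provider_status is a property listing the loaded providers.
    return list(ti_lookup.provider_status)
```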

Looking up a Single IoC

The method to look up a single IoC is lookup_ioc(). Let's have a look at the parameters (you can type ti_lookup.lookup_ioc? in a Jupyter cell and execute it to get help for the method).

The observable parameter is the only required parameter - this is the suspected IoC that you want to look up. If you do not specify an ioc_type the type will be inferred. This inference uses the msticpy IoCExtract class to try to determine the IoC type using regular expressions. This is not foolproof though, so if you know the IoC type, it is always better and more efficient to specify it explicitly.

The raw output from lookup_ioc is a little terse so it's better to transform it into a more readable format. There is a built-in method called result_to_df that will convert the output into a pandas DataFrame. The number of rows in the DataFrame will depend on how many providers you have configured. In the example below, we've converted to a DataFrame and also flipped the frame using the DataFrame.T (transpose) property to show the results from each provider in columns.
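Put together, the lookup and conversion could be wrapped like this. It is a sketch that uses only the names mentioned above (lookup_ioc, its observable and ioc_type parameters, and result_to_df); the wrapper name and the sample address in the comment are placeholders.

```python
def lookup_as_columns(ti_lookup, observable, ioc_type=None):
    """Look up one observable and return one column per provider.

    ti_lookup is expected to be a msticpy TILookup instance.
    """
    # Leave ioc_type as None to let the type be inferred.
    result = ti_lookup.lookup_ioc(observable=observable, ioc_type=ioc_type)
    # result_to_df gives one row per provider; .T transposes the frame
    # so that each provider becomes a column.
    return ti_lookup.result_to_df(result).T


# Typical notebook usage (requires msticpy and configured providers):
# lookup_as_columns(ti_lookup, "198.51.100.7", ioc_type="ipv4")
```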

The fields returned are:

Ioc - the original observable
IocType - the supplied or inferred type
QuerySubtype - see more about this below
Result - True if we got a response of any kind
Severity - an assessment of the severity of the TI data.
Details - some extracted details from the report
RawResult - the full response from the provider.
Reference - this is either the original URL requested (in the case of HTTP providers) or the KQL query used (in the case of LogAnalytics providers) to retrieve data for this IoC.
Status -
- if this is an HTTP/RESTful provider:
  - 200 means success
  - 404 means not found - many providers use this to indicate that no information was found for the observable
  - Any other code means that some error occurred. 403 probably means that you are using the wrong key, and 401 probably means that you have exceeded your lookup quota. However, this varies by provider.
- if it is a LogAnalytics provider a status of 0 means success and -1 means failure.
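If you want to turn the Status value into something more readable when reviewing results, a small helper along these lines would do. The function name and the wording of the messages are ours; the code meanings are just the ones listed above.

```python
def describe_status(status, http_provider=True):
    """Translate a Status value into a short description."""
    if not http_provider:
        # LogAnalytics providers: 0 is success, -1 is failure.
        if status == 0:
            return "success"
        if status == -1:
            return "failure"
        return f"unexpected status {status}"
    if status == 200:
        return "success"
    if status == 404:
        return "not found (often just means no data for this observable)"
    # Meanings of other codes vary by provider.
    return f"error (HTTP {status}), check the RawResult for details"


print(describe_status(404))
print(describe_status(0, http_provider=False))
```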

Note: due to the complexities of parsing data across different providers the accuracy of the Severity and Details may vary. You will normally want to check the full RawResult field for details about the IoC.

For the curious the raw output from lookup_ioc has the following format:

tuple(overall_success, list[prov_results]).
list[prov_results] is a list of responses from each provider.
prov_results is itself a tuple(provider_id, LookupResult).
Finally, LookupResult is a class that contains all of the fields described in the previous paragraph.
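If you do want to work with the raw output directly, unpacking it might look like the sketch below. The stand-in data is made up and uses SimpleNamespace in place of the real LookupResult class; the lowercase attribute names severity and status are an assumption, which is why the code falls back to None if they are missing.

```python
from types import SimpleNamespace


def summarize_raw_result(raw_result):
    """Flatten raw lookup_ioc output into (provider, severity, status) rows."""
    overall_success, prov_results = raw_result
    rows = []
    for provider_id, lookup_result in prov_results:
        # Assumed attribute names; None is returned if they differ.
        severity = getattr(lookup_result, "severity", None)
        status = getattr(lookup_result, "status", None)
        rows.append((provider_id, severity, status))
    return overall_success, rows


# Made-up data in the shape described above (not real provider output).
example = (
    True,
    [
        ("OTX", SimpleNamespace(severity=2, status=200)),
        ("XForce", SimpleNamespace(severity=0, status=404)),
    ],
)
print(summarize_raw_result(example))
```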

Querying a subset of providers

You can use the providers parameter of lookup_ioc to supply a list of providers to use.

What Types of IoCs does a Provider Support?

Most providers support multiple IoC types. Many providers also offer different types of sub-query associated with some of the types. For example, GeoIP, Whois and Passive DNS are commonly provided for IP addresses, even though they are not strictly TI data.

To view the supported types for a single provider, you first need to know the provider short name - you can get this from ti_lookup.provider_status. Using the short name, type the following:

You can also just list the usage for all loaded providers more simply using provider_usage().

In this last example you can see some IoC types that have ioc_query_type entries. Where sub-query types are supported, you can choose to look up just the data of this type for an IoC. The string shown here (e.g. "passivedns", "geo", etc.) is the value that you supply for the optional ioc_query_type parameter of lookup_ioc.

Here's an example with a query type requesting passive DNS data.
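As a sketch, a passive DNS request for an IP address combines the ioc_query_type and providers parameters like this. The wrapper name is hypothetical, and whether a given provider supports "passivedns" depends on its usage listing.

```python
def passive_dns_lookup(ti_lookup, ip_address, providers=None):
    """Request only passive DNS data for an IPv4 address.

    ti_lookup is expected to be a msticpy TILookup instance.
    """
    return ti_lookup.lookup_ioc(
        observable=ip_address,
        ioc_type="ipv4",
        ioc_query_type="passivedns",  # sub-query type from the usage listing
        providers=providers,  # e.g. ["XForce"]; None uses the default scope
    )
```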

Some Defensive Optimizations

TILookup contains a couple of optimizations to prevent needlessly performing expensive lookups:

A least-recently-used (LRU) cache is kept per-provider. This avoids looking up the same observable multiple times - it will just return the first result for subsequent queries. This is a memory cache only, so disappears when you reset the Jupyter kernel. It is not shared between multiple instances of TILookup.
IoC sanitization tries to weed out observables that are never going to appear in TI. Examples include loopback and private IP addresses and URLs with an unqualified domain name.
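To give a feel for the kind of sanitization described in the second point, here is a simplified illustration using only the Python standard library. It is not the msticpy implementation, and the function name is made up.

```python
import ipaddress
from urllib.parse import urlparse


def worth_looking_up(observable):
    """Return False for observables that will never appear in TI data."""
    try:
        ip_addr = ipaddress.ip_address(observable)
    except ValueError:
        ip_addr = None
    if ip_addr is not None:
        # Loopback and private addresses are only meaningful locally.
        return not (ip_addr.is_loopback or ip_addr.is_private)
    if "://" in observable:
        # A URL whose host has no dot is an unqualified (local) name.
        host = urlparse(observable).hostname or ""
        return "." in host
    return True


for item in ["127.0.0.1", "10.1.2.3", "8.8.8.8", "http://intranet/login"]:
    print(item, worth_looking_up(item))
```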

Looking up Multiple IoCs

Sometimes you will want to look up sets of IoCs. You can do this with the lookup_iocs() method. This is similar to the method to look up a single IoC but takes a data parameter which holds the collection of IoCs to search for.

The input can be a pandas DataFrame, a python dict or a python iterable (such as a list).

In the case of a DataFrame you must also supply the obs_col parameter - the DataFrame column containing the observable to look up - and, optionally, the ioc_type_col - the column holding the IoC type.
For a dict, the format is {ioc: ioc_type}
An Iterable is just a collection of IoCs. This could be a list, tuple or any data type that supports the python iterable interface. IoC Type is always inferred in this case.
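The three input forms could be exercised as in the following sketch. The addresses and URL are placeholders, the column names "Observable" and "IoCType" are hypothetical, and the DataFrame call only runs if you pass a frame in.

```python
def bulk_lookups(ti_lookup, frame=None):
    """Call lookup_iocs with each supported input form.

    ti_lookup is expected to be a msticpy TILookup instance.
    """
    results = {}
    # dict: keys are the observables, values are their IoC types.
    results["dict"] = ti_lookup.lookup_iocs(
        data={"198.51.100.7": "ipv4", "http://example.com/payload": "url"}
    )
    # Any iterable: IoC types are always inferred.
    results["list"] = ti_lookup.lookup_iocs(
        data=["198.51.100.7", "example.com"]
    )
    if frame is not None:
        # DataFrame: obs_col is required, ioc_type_col is optional.
        results["frame"] = ti_lookup.lookup_iocs(
            data=frame, obs_col="Observable", ioc_type_col="IoCType"
        )
    return results
```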

You can specify a sub-query type (this will apply to all look-ups), a provider list and a provider scope.

IMPORTANT - Multiple Look-ups Caveats

Most HTTP TI providers have throttle limits for data queries. If you are using a free tier account from these providers, these limits can be easily exceeded if you submit large collections of IoCs. There is currently no throttling mechanism in TILookup. As long as there are things in the queue it will keep submitting them. So please, do yourself and your TI provider a favor and be mindful of these limits when performing bulk look-ups.

A good strategy may be restricting the provider list you use to providers where you have a paid tier supporting high lookup rates when querying large data sets. You can do this using the providers=["prov1", "prov2"] parameter. Once you have done the initial lookup, you can submit smaller numbers to other providers with lower limits for a second opinion.

This is the primary use case for the Primary configuration parameter in the msticpyconfig.yaml file (see Setting up Your Providers earlier in the article). Set your high-bandwidth providers as Primary=True and the remainder as Primary=False. You can then perform secondary look-ups on a subset of your suspected IoCs to get more detail.
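Because TILookup does not throttle for you, one simple workaround is to submit a large collection in batches and pause between them. The helper below is a hypothetical sketch, not part of msticpy, and the batch size and pause are arbitrary defaults that you should tune to your provider's limits.

```python
import time


def lookup_in_batches(ti_lookup, iocs, batch_size=50, pause_seconds=15.0,
                      **lookup_kwargs):
    """Submit IoCs to lookup_iocs in batches, pausing between batches.

    Extra keyword arguments (e.g. providers=["prov1"] or
    prov_scope="primary") are passed through to lookup_iocs.
    """
    iocs = list(iocs)
    results = []
    for start in range(0, len(iocs), batch_size):
        batch = iocs[start:start + batch_size]
        results.append(ti_lookup.lookup_iocs(data=batch, **lookup_kwargs))
        # Skip the pause after the final batch.
        if start + batch_size < len(iocs):
            time.sleep(pause_seconds)
    return results
```

Since lookup_iocs returns a DataFrame for each call, the list returned here could be combined into a single frame with pandas.concat.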

The second caveat is performance-related. The HTTP requests are currently not implemented as asynchronous calls (although we plan to change this). All requests are submitted in sequence, blocking until each one returns. While this is not a huge problem for a few tens of look-ups and small numbers of providers, it will be an issue if you are searching for very large sets (provider throttling limits notwithstanding).

One performance optimization that we have built for the HTTP provider lookup is to add a least-recently-used (LRU) cache to the look-ups. This means that you can freely re-run the same query and it will return the result without hitting the provider site. The cache size is 1024 entries and is maintained per-provider. It is a memory-only cache, so it will disappear if you reset your kernel and is not shared across notebooks.
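The general idea of a per-provider LRU cache can be illustrated with Python's functools.lru_cache. This is a stand-alone sketch of the technique rather than the code used inside msticpy; slow_query is a stand-in for a real HTTP request.

```python
from functools import lru_cache


def make_cached_lookup(query_func, maxsize=1024):
    """Wrap one provider's query function in its own LRU cache."""
    return lru_cache(maxsize=maxsize)(query_func)


calls = []


def slow_query(observable):
    """Stand-in for an HTTP request to a TI provider."""
    calls.append(observable)
    return f"report for {observable}"


otx_lookup = make_cached_lookup(slow_query)
otx_lookup("198.51.100.7")  # goes to the "provider"
otx_lookup("198.51.100.7")  # answered from the in-memory cache
print(len(calls), otx_lookup.cache_info().hits)
```

Each wrapped function keeps its own cache, which mirrors the per-provider behavior: a second provider wrapped separately would still issue its own first request for the same observable.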

Running a Query for Multiple IoCs

lookup_iocs always returns a DataFrame. This has the same schema as the individual IoC lookups (see the section Looking up a Single IoC earlier in the article).

The following are a set of examples of using the multi IoC lookup. Although I'm disobeying my earlier rule about specifying the IoC types explicitly, the type inference usually works and makes the examples a bit easier to read.

Here is an example using multiple IP addresses with a single provider - Azure Sentinel TI.

This example shows multiple URLs using all configured providers.

The final example shows that you can send a mixed set of IoCs at once.

Conclusion

Over the next few months we plan to add additional providers. These include some simple, non-standard data sources such as Tor exit nodes and Open Page Rank domain popularity. Feel free to request others that you think would be useful.

For now, I hope you have been able to see how to make use of the TILookup class in your own notebooks (or actually, in any Python code). Although the library is not that refined, it is effective at being able to pull in broad sweeps of data from multiple TI providers with minimal effort.

Please visit the msticpy GitHub to submit issues and requests, and read the documentation on ReadTheDocs to find out more about the package and its other features. Follow me on Twitter at @ianhellen for news about updates to our Notebooks, Python packages and related items.

Happy hunting!
