Descriptive data formats, and security event formats in particular, are based on key:value models, where the key can be a simple token (in simple variants) or a path in a directory tree.
Here we use the terms key, attribute and field interchangeably.
Also, when we want to emphasize that a particular key is not a leaf, we use the terms namespace, tree or node (also interchangeably).
We want the format to be easily represented in the data types of common programming languages. A basic format that meets this requirement already exists: JSON. Its data model comprises dictionaries, arrays, strings, numbers, booleans and nulls, and as such it represents a subset of the data types found in the majority of languages in the wild today. There is no need to reinvent the wheel, so we will define IDEA in terms of JSON. That does not necessarily prevent other implementations that use the same data model but another serialization (XML, YAML or even binary BSON), especially in the NoSQL or document database area.
The format will be defined as a tree of keys and values at most two levels deep. That allows for just one basic level of indirection when represented in relational models (save for arrays), and avoids the lack of predictability and discoverability of multi-level or recursive schemes.
Where it makes sense, we will allow references to top-level nodes, for example a node describing a contact. A self-contained node can then be reused instead of being repeated in several other nodes. This also partially substitutes for recursion.
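As a rough illustration of the two-level limit and of references to top-level nodes, the following Python sketch builds a message as plain dictionaries and lists. All key names except "Format" and "Source" ("Contact", "Handle", "ContactRef", "IP4", "Email") are hypothetical and chosen only for this example; the helper functions are likewise assumptions, not part of the format.

```python
import json

# Hypothetical message; key names are illustrative only.
message = {
    "Format": "IDEA0",
    "Contact": {"Handle": "c1", "Email": "csirt@example.org"},
    "Source": [{"IP4": ["192.0.2.1"], "ContactRef": "c1"}],
}

def key_depth(value):
    """Count nested levels of keys; arrays do not add a level."""
    if isinstance(value, dict):
        return 1 + max((key_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return max((key_depth(v) for v in value), default=0)
    return 0

def resolve_contact(msg, node):
    """Follow a node's (hypothetical) ContactRef to the top-level Contact node."""
    contact = msg.get("Contact", {})
    if contact.get("Handle") == node.get("ContactRef"):
        return contact
    return None

# The same structure survives a JSON round trip unchanged.
restored = json.loads(json.dumps(message))
```

Here the Source node refers to the contact by handle instead of repeating it, so the tree stays two levels deep.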
Known keys are defined in the specification, and parsers will know their structure, semantics, types, cardinality and partiality. However, producers of messages are free to extend the format with arbitrary keys of their own, as long as these do not collide with keys already defined in the specification. It is recommended to add a key to the main namespace if it has the potential to be globally useful; such an attribute may then appear in the next revision of the format. On the other hand, if a key is specific to a user, security team or organization, producers should add their own namespace and insert their own keys into it.
In case of a name collision, priority is on the side of the receiver and the risk is on the side of the sender: the name can get lost or rewritten. The sender should choose a sufficiently unique name to avoid that.
All this means that validation should check only the known keys defined in the specification; anything else passes through and remains untouched.
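The pass-through rule can be sketched as a validator that inspects only keys it knows about. The table of known keys below is a hypothetical two-entry stand-in for the specification ("ID" is an assumed key name), and the "_MyOrg" namespace in the test message is an invented producer extension.

```python
# Hypothetical subset of known keys and the Python types expected for them.
KNOWN_KEYS = {"Format": str, "ID": str}

def validate(message):
    """Return a list of problems found in known keys; unknown keys are ignored."""
    problems = []
    for key, expected in KNOWN_KEYS.items():
        if key in message and not isinstance(message[key], expected):
            problems.append(f"{key}: expected {expected.__name__}")
    return problems

# An organization-specific namespace passes through untouched.
extended = {"Format": "IDEA0", "_MyOrg": {"Ticket": 5}}
problems = validate(extended)
```

The validator never iterates over the message's own keys, so producer extensions can neither fail validation nor be altered by it.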
The “Format” key bears a constant value that identifies the format. In this draft version we use “IDEA0” to recognize proof-of-concept clients; after sufficient review and discussion we will switch to the string “IDEA1”.
By the source of the attack we understand the source of the problem: the side which tries to do harm, or the side which is under the control of the attacker. The target is the impaired side. This can be ambiguous, especially in cases where the impaired side may also be the one under the attacker's control, so in the security message we have to consider the course of the attack. When describing a security event with two clearly opposing sides, we mark the attacked party as the target even if we know that the attack has been successful and the attacked party is compromised (and therefore also potentially harmful). Note, though, that there may be two (or more) Source nodes concerning one event, for example when a C&C server and some of its drones are detected at the same time.
Sometimes, however, the danger posed by a machine is discovered without detecting any malicious behaviour, so there is only one actor (no two sides): for example a botnet drone discovered by a host-based IDS, malicious code found by rootkit detection, or malware identified by a local antivirus. In that case the potential to be harmful is confirmed and the machine should be marked as the source.
There is also the case of vulnerabilities. If the vulnerable machine is usable for direct abuse in attacking a third party (DDoS reflection attacks, open DNS resolvers, etc.), it also qualifies as a source. However, if there is only a potential vulnerability and the machine has not been compromised, it can still be considered a (possible) target.
Most existing formats define at least a timestamp of detection.
However, this value does not necessarily correspond to the timeframe of the attack. The detector might also be able to deduce when the attack started and ceased (if it was not a solitary, distinct event), so we have to allow for these fields.
Sometimes there may be a lag between detection of the event and its processing, so a timestamp of message creation may also be useful.
Also, longer-running types of events, such as portscans and other reconnaissance attacks, might get aggregated by the detector within some arbitrary timeframe. A 1000-probe portscan within ten seconds might carry very different importance than a 1000-probe portscan within one day. We will thus also need a description of the particular aggregation window.
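To make the portscan comparison concrete, the sketch below computes the average probe rate for both aggregation windows. The function name `probe_rate` is an assumption made for this example; the format itself only needs to carry the window, not the rate.

```python
from datetime import timedelta

def probe_rate(count, window):
    """Average number of probes per second over the aggregation window."""
    return count / window.total_seconds()

fast = probe_rate(1000, timedelta(seconds=10))  # 100 probes per second
slow = probe_rate(1000, timedelta(days=1))      # roughly 0.0116 probes per second
```

The same probe count differs in intensity by a factor of 8640, which a receiver cannot tell without knowing the window.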
We cannot stress enough the importance of common time, so all timestamps must be represented in UTC: detectors must know their time zone and convert times accordingly. Receivers can then recalculate times into their local time zone if necessary.
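A detector-side conversion might look like the following sketch. The textual timestamp layout used here (ISO 8601 style with a trailing "Z") is an assumption made for illustration, as is the function name `to_utc_string`; the text above only requires that the value be in UTC.

```python
from datetime import datetime, timedelta, timezone

def to_utc_string(local_time):
    """Convert a time-zone-aware datetime to a UTC timestamp string."""
    if local_time.tzinfo is None:
        # A naive datetime means the detector does not know its time zone.
        raise ValueError("detector must know its time zone")
    return local_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

# Example: a detector running one hour ahead of UTC.
detector_zone = timezone(timedelta(hours=1))
detected = datetime(2014, 3, 1, 14, 30, 0, tzinfo=detector_zone)
stamp = to_utc_string(detected)  # "2014-03-01T13:30:00Z"
```

Rejecting naive datetimes outright is a deliberate choice in this sketch: silently assuming a zone would defeat the purpose of a common time base.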
Binary data are often part of a security event. The format will allow that, but in the case of text serialization we have to take care of safe transfer, so we will allow Base64-encoded binary strings. However, where the underlying transport protocol (for example email, HTTP or a plain socket) supports it, we will allow binary data to be marked as an external attachment, and the binary blob may be sent along with the event data as another part of the same conversation (another MIME part, the next part in POST data, etc.).
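The inline variant can be sketched with the standard library as follows. The key names "Content" and "ContentEncoding" are hypothetical and stand in for whatever the specification eventually defines.

```python
import base64
import json

# Arbitrary bytes that would not survive a text serialization unencoded.
payload = b"\x00\xffMZ\x90"

# Hypothetical attachment node carrying the Base64 text.
attachment = {
    "Content": base64.b64encode(payload).decode("ascii"),
    "ContentEncoding": "base64",
}

# The node is now safe to serialize as JSON text and to decode on receipt.
wire = json.dumps(attachment)
decoded = base64.b64decode(json.loads(wire)["Content"])
```

Base64 enlarges the data by about one third, which is one practical reason to prefer the external attachment where the transport supports it.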
Sometimes, especially in the case of networks behind various kinds of address translation, we are able to identify the aim only imprecisely. That does not mean we should disregard this information; even incomplete data can be of great help to security analysts and investigators. We thus allow the specification of ranges, which will cover IP address usage scenarios. We will also allow a boolean property in the source and destination specification which explicitly marks the node as incomplete: “the aim is not specified, it just belongs to the specified range”.
Another important aspect is privacy, especially when the targets of attacks are honeypots or other types of traps, or when, in the case of attack sources, the reporter does not want to explicitly reveal an affected critical part of its infrastructure. In textual reports, addresses are usually obscured by replacing a sufficient number of the least significant digits with letters, e.g. “192.168.X.X”, and likewise in the IPv6 world. Looked at another way, this is just a means of saying “the address is somewhere in the range 192.168.0.0/16”. Since ranges have already been defined in the previous paragraph, it is sufficient to add another boolean property describing the intention to be imprecise: “we know the aim, but we are only saying that it belongs to the specified range”.
One more specific situation arises when we are able to identify that the addresses of a detected event are not valid, possibly intentionally spoofed by the attacker. It is again sufficient to add another boolean property to point out this situation.
In cases where the reporter wants to specify different statuses (anonymised, incomplete, spoofed) for different sets of addresses, they can always create several distinct nodes.
Note that the presence or absence of these properties shows the confidence of the reporter: if a property is not used, the reporter simply does not know.
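The equivalence between an obscured address and a range can be sketched with Python's `ipaddress` module. The helper `obscured_to_network` and the key names "IP4" and "Anonymised" are assumptions for this example; only whole trailing IPv4 octets written as "X" are handled.

```python
import ipaddress

def obscured_to_network(text):
    """Turn an obscured IPv4 address such as '192.168.X.X' into a CIDR range."""
    octets = text.split(".")
    if len(octets) != 4:
        raise ValueError("expected four dot-separated octets")
    known = []
    for octet in octets:
        if octet.upper() == "X":
            break
        known.append(octet)
    if any(o.upper() != "X" for o in octets[len(known):]):
        raise ValueError("only trailing octets may be obscured")
    padded = known + ["0"] * (4 - len(known))
    return ipaddress.ip_network(".".join(padded) + "/" + str(8 * len(known)))

network = obscured_to_network("192.168.X.X")

# Hypothetical node: the reporter knows the address but chooses to blur it.
node = {"IP4": [str(network)], "Anonymised": True}

# An absent property means "reporter does not know", not "false".
spoofed = node.get("Spoofed")  # None
```

Treating a missing flag as `None` rather than `False` keeps the three states (yes, no, unknown) distinguishable on the receiving side.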
A common practice among security teams is to label events with a reasonably unique identifier. The purpose is to be able to reference them in other events, reports and communication between participating parties, and also to identify possible duplicates. As with the email Message-Id, there is no need to go to great technological lengths to attempt global uniqueness; however, using established solutions which minimise the risk of duplicates very well seems wise. We therefore recommend UUID version 4 (random) or version 5 (SHA-1, with careful selection of at least locally unique source data for hashing).
However, to make it simpler for the originator to refer back, we also allow a key for storing a locally significant identifier (such as a related ticket number). This information may not have any meaning for receiving parties.
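Both recommended UUID variants are available in Python's standard `uuid` module. In the sketch below, the detector host name and the string being hashed are invented examples of "locally unique source data"; a real producer would pick its own.

```python
import uuid

# Version 4: purely random, no input data needed.
random_id = str(uuid.uuid4())

# Version 5: SHA-1 over a namespace plus a name.
# The namespace is derived from a hypothetical detector host name.
namespace = uuid.uuid5(uuid.NAMESPACE_DNS, "detector.example.org")

# The name must be at least locally unique; this combination is illustrative.
local_data = "2014-03-01T13:30:00Z/ticket-42"
derived_id = str(uuid.uuid5(namespace, local_data))
```

Version 5 is deterministic, so the same source data always yields the same identifier, which helps with duplicate detection but makes the choice of source data critical.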
We also need identifiers for the particular detector nodes. In contrast to message identifiers, these are stable and long-term, and could carry some kind of information. A variation on Java identifiers has been chosen, as these are well known, often used for similar purposes, and allow the real-world hierarchy to be at least partially reflected. Only a restricted character set (Latin alphabet, numbers and underscore) is allowed for compatibility.
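A check for such identifiers might be sketched as below. Two things are assumptions rather than statements from the text: that hierarchy levels are separated by dots, as in Java package names, and that a component may not start with a digit. The example identifier is invented.

```python
import re

# One component: Latin letters, digits and underscore, not starting with a digit (assumed).
_COMPONENT = r"[A-Za-z_][A-Za-z0-9_]*"

# Components joined by dots to express hierarchy (assumed separator).
_DETECTOR_ID = re.compile(rf"{_COMPONENT}(?:\.{_COMPONENT})*")

def is_valid_detector_id(name):
    """Return True if the whole string matches the assumed identifier grammar."""
    return _DETECTOR_ID.fullmatch(name) is not None

ok = is_valid_detector_id("org.example.honeypot_1")  # True
```

Using `fullmatch` rather than `match` ensures that trailing characters outside the allowed set are rejected.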
Another kind of identifier is the message-local handle. Situations may arise where the originator wants to reference a particular node, for example to state that one attachment is related to one particular attack source. These handles need to be unique only among the handle identifiers within one message.
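The uniqueness requirement for handles could be checked along these lines. The key names "Handle", "Attach" and "Ref" are hypothetical, and the function assumes the two-level layout in which nodes sit directly under top-level keys, either singly or in arrays.

```python
from collections import Counter

def duplicate_handles(message):
    """Return the handles used by more than one node within a single message."""
    handles = []
    for value in message.values():
        nodes = value if isinstance(value, list) else [value]
        for node in nodes:
            if isinstance(node, dict) and "Handle" in node:
                handles.append(node["Handle"])
    return sorted(h for h, n in Counter(handles).items() if n > 1)

# Hypothetical message: the attachment refers to the first source by handle.
message = {
    "Source": [{"Handle": "s1"}, {"Handle": "s2"}],
    "Attach": [{"Handle": "a1", "Ref": ["s1"]}],
}
duplicates = duplicate_handles(message)  # []
```

Because handles are scoped to one message, the same handle may safely reappear in a different message.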