Our requirements are based mostly on our experience with other formats and are formulated to describe our needs. Your mileage may vary; however, our starting assumptions may explain later design choices.
We need to gather and exchange data between automated detection systems and information aggregators of various types. This kind of data varies widely, so the representation should be extensible enough to accommodate unexpected or new types of security events.
The format should be simple; complexity and intricacy bring ambiguity. We would also like to be able to generate security messages as close to the point of detection as possible, ideally at the detection probe itself. A probe may run on a resource-restricted platform, and we do not want to impose heavy software or hardware requirements or a need for complex tools or libraries.
One message should be able to describe all detected facets of one security event at the time of detection, not the whole timeframe or modus operandi of a security case. For example, one phishing email message may yield information about the sender's mail exchanger(s), the reply-to address, and the URL of a potentially malicious web page; these should all be contained in one detection message. If these pieces of information are gathered and aggregated from several sources or probes that are not directly related (email message, HTTP harvester), separate messages should be used to describe them.
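As a rough illustration of the phishing example, the sketch below builds one message that carries every facet detected in a single email, and a separate message for an unrelated HTTP harvester. All field names (`detect_time`, `category`, `mail_exchangers`, `reply_to`, `urls`, `detector`) and all values are hypothetical placeholders chosen for this example, not part of any defined format.

```python
import json

# Hypothetical field names and example values, for illustration only.
phishing_message = {
    "detector": "mail-probe",
    "detect_time": "2024-01-01T12:00:00Z",
    "category": "phishing",
    "mail_exchangers": ["mx.example.org"],
    "reply_to": "attacker@example.net",
    "urls": ["http://phish.example.com/login"],
}

# An unrelated probe (an HTTP harvester) reports its own finding separately,
# even though it happens to concern the same URL.
harvester_message = {
    "detector": "http-harvester",
    "detect_time": "2024-01-01T12:05:00Z",
    "category": "malicious-page",
    "urls": ["http://phish.example.com/login"],
}

messages = [phishing_message, harvester_message]
serialized = [json.dumps(m, sort_keys=True) for m in messages]
for line in serialized:
    print(line)
```

Correlating the two messages into a broader security case is left to whoever consumes them; the messages themselves only state what each probe saw.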
The format should be searchable. Field names should therefore be short and descriptive.
That also means that the format should avoid recursion: a reasonably flat structure greatly simplifies processing, interpretation, and storage in structured repositories (for example, relational database systems).
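To make the storage argument concrete, here is a minimal sketch that assumes a message made only of scalar fields and stores it in an in-memory SQLite table. The table name `events` and the field names are hypothetical; the point is only that a flat message maps onto columns without any tree traversal.

```python
import sqlite3

# Hypothetical flat message: scalar values only, no nested objects.
message = {
    "id": "msg-0001",
    "detect_time": "2024-01-01T12:00:00Z",
    "category": "phishing",
    "source_ip": "192.0.2.10",
}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "id TEXT PRIMARY KEY, detect_time TEXT, category TEXT, source_ip TEXT)"
)

# Field names map one-to-one onto column names via named placeholders.
conn.execute(
    "INSERT INTO events (id, detect_time, category, source_ip) "
    "VALUES (:id, :detect_time, :category, :source_ip)",
    message,
)

# Searching by a field is then a plain column lookup.
rows = conn.execute(
    "SELECT id FROM events WHERE category = ?", ("phishing",)
).fetchall()
print(rows)
```

A recursive structure would instead need either a variable number of joined tables or an opaque serialized column, both of which make searching harder.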
Similar things should be represented in similar ways at unambiguous places in the structure. Users must clearly know where to put a particular piece of information and where to read it.
The data model should have a straightforward representation in common programming languages and data models. This would also make it independent of the serialization used (JSON, XML, binary, …).
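The independence from serialization can be sketched as follows, assuming the message is held as a plain dictionary of scalar values. The field names and the XML element name `message` are hypothetical; only Python standard-library modules are used.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical message held in a native data structure.
message = {"id": "msg-0001", "category": "phishing", "source_port": 25}

# JSON serialization of the same data.
as_json = json.dumps(message, sort_keys=True)

# XML serialization of the same data: one child element per field.
root = ET.Element("message")
for name, value in message.items():
    child = ET.SubElement(root, name)
    child.text = str(value)
as_xml = ET.tostring(root, encoding="unicode")

print(as_json)
print(as_xml)
```

Note that this naive XML mapping loses the distinction between numbers and strings, so a real mapping would need the type information to come from the data model rather than from the serialized text.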
Data types and semantics of data fields should be unambiguous: one attribute should have just one type, and its type should not depend on the value of other attributes.
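One consequence of this rule is that type checking needs nothing more than a fixed table from field name to type. The sketch below assumes such a table; the `SCHEMA` contents and the helper `type_errors` are hypothetical names introduced for this example.

```python
# Hypothetical schema: each field name maps to exactly one Python type.
SCHEMA = {
    "id": str,
    "detect_time": str,
    "category": str,
    "source_port": int,
}


def type_errors(message):
    """Return the names of fields whose value does not match the schema type."""
    errors = []
    for field, value in message.items():
        expected = SCHEMA.get(field)
        if expected is None:
            continue  # unknown fields are skipped here, leaving room for extensions
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        if not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        ):
            errors.append(field)
    return errors


good = {"id": "msg-0001", "category": "scan", "source_port": 22}
bad = {"id": "msg-0002", "category": "scan", "source_port": "22"}

print(type_errors(good))
print(type_errors(bad))
```

Because the check for one field never looks at another field, validation stays a single pass over the message, which also suits resource-restricted probes.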
The format should be able to interoperate reasonably with other existing formats. By reasonably we mean the minimum necessary loss of information in translation, other than loss caused by fundamental design differences.
The format should define structure only for information that can potentially be analysed by machine; information meant only for human review does not need to be structured in detail.
We want to allow for explicit anonymization: in the security field, privacy plays a great role in establishing and keeping trust. The format should also support explicit data incompleteness: information about an attack can sometimes be incomplete or imprecise, but we want to know about it anyway.
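As a hedged sketch of what "explicit" could mean in practice, the example below truncates a source address to its network and records that fact in the message, and marks a message with unobserved details as incomplete. The flag names `anonymised` and `incomplete`, the other field names, and the helper `anonymise_source` are all assumptions made for this example.

```python
import ipaddress


def anonymise_source(message, prefix_len=24):
    """Return a copy with the source address truncated to a network and flagged."""
    network = ipaddress.ip_network(
        f"{message['source_ip']}/{prefix_len}", strict=False
    )
    result = dict(message)
    result["source_ip"] = str(network)
    result["anonymised"] = True  # explicit marker, so consumers are not misled
    return result


original = {
    "category": "scan",
    "source_ip": "192.0.2.57",
    "target_port": None,  # this detail was not observed
    "incomplete": True,   # explicit marker that some details are missing
}

masked = anonymise_source(original)
print(masked)
```

With the markers present, a consumer can tell a deliberately truncated address from a detected network range, and a missing value from a value that was never looked for.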