Introduction DepGraphAPI is a multi-platform library for creating and analyzing dependence graph representations of binary code. These graph representations include a data dependence graph (DDG), control dependence graph (CDG), flow dependence graph (FDG), and program dependence graph (PDG). The current beta of the DepGraphAPI depends on the InstructionAPI library and the DyninstAPI; future versions will depend only on the InstructionAPI and ParsingAPI libraries released as part of the DyninstAPI. Currently we support the IA-32 and AMD-64 architectures as these are the only architectures supported by the InstructionAPI. Future architecture support includes PPC, IA-64, and SPARC. The main goal of this API is to provide the user with abstractions representing the logical dependencies between instructions in a program. An abstract interface provides two benefits: it simplifies the development of a tool since the complexity of a particular architecture is hidden, and it allows tools to easily be ported between platforms. We use graphs to represent fundamental must-happen-before and must-happen-after relationships between instructions. These graphs represent: * DDG: the set of instructions that define a value used by a particular target instruction, or the set of instructions that use the value defined by a particular source instruction. * CDG: The instruction whose execution determines whether a particular target instruction is executed, or the set of instructions whose execution is determined by a particular source instruction. * FDG: ??? * PDG: The union of the DDG, CDG, and FDG. A future goal of this library is to allow users to improve these graph representations through the use of additional analyses. The included analyses used to generate these graph representations are conservative, and may over-represent the actual dependences between instructions. A future release will provide an API for updating these graph structures, either with information known to the user directly or with the results of more sophisticated analysis. This document describes DepGraphAPI, an Application Program Interface (API) for analyzing and representing dependence graphs from binary code. Abstractions DepGraphAPI provides a simple set of abstractions over complicated data structures to make it easy to use. The DepGraphAPI consists of four graph variants: the data dependence graph (DDG), control dependence graph (CDG), flow dependence graph (FDG), and program dependence graph (PDG). The fundamental representation used by this library is the graph. A graph is a set of nodes connected by directed edges; each edge has a distinct (and unique) source and target. A node may be physical or virtual. Physical nodes represent objects within the code; either an instruction or a combination of an instruction and the register or memory location that instruction defines (referred to collectively as abstract locations). Virtual nodes are added to the graph to represent dependences between the graph and code external to the graph (e.g., parameters of a called function). Figure 1 shows the ownership hierarchy for the DepGraphAPI classes. Ownership here is a "contains" relationship; if one class owns another, then instances of the owner class contain an exclusive instance of the other. All references to DepGraphAPI classes are internally reference counted; we do not require the user to perform any manual memory allocation or deletion. * Graphs own nodes * Nodes own edges * Analyzers own graphs(?) Shared Abstractions * Graph: the Graph represents a dependence graph for a particular function. * Node: the Node represents an element within the graph. Nodes are connected by edges. All physical nodes have both in and out edges; virtual nodes may lack one or both. * Edge: Edges connect Nodes. Edges are directed and have a source and target. * Absloc: An Absloc represents a register, memory location, or set of memory locations. Data Dependence Graph The data dependence graph adds several specialized node types. * OperNode: Each physical DDG node represents an operation (a definition of a particular abstract location by a particular instruction instance). We describe operations and our justification of this abstraction below. * FormalParameterNode: These represent input parameters to a function. An input parameter is a definition of an abstract location before the function is executed. * FormalReturnNode: These represent definitions that persist after the function returns. * ActualParameterNode: These represent arguments to a called function. * ActualReturnNode: These represent definitions made by a called function. * ImmediateNode: These represent uses of constant (immediate) values, and exist to ensure that all physical nodes have at least one in edge. Control Dependence Graph *??? Flow Dependence Graph *??? Program Dependence Graph The Program Dependence Graph is the union of these three graphs and is constructed from the same abstractions. Analyzer The analyzer class is responsible for deriving the desired dependence graph from the underlying code. The analyzer is a temporary object that contains intermediate analysis results, and can safely be discarded as soon as graph construction is complete. Annotations The DepGraphAPI makes use of the Annotation infrastructure in the DyninstAPI classes. First, all DepGraphAPI abstractions are annotatable. The internal implementation of the annotation infrastructure is optimized for sparse annotation; that is, when the majority of abstractions are not annotated. Second, the function abstractions are annotated with the completed graphs. Therefore, the user does not need to keep a reference to an analyzed graph. This annotation can be discarded by using the appropriate methods on the Graph class. Examples TODO Definitions and Basic Types The DepGraphAPI supplies four types of dependence graphs. We define these forms of dependence here, along with definitions of the underlying concepts. The following definitions and basic types are referenced throughout the rest of the document. Definitions Instruction Instance: an instruction instance represents a single machine instruction with a unique starting offset. Instruction instances are identified by this offset. For the remainder of this document an instruction represents an instruction instance. Abstract Location: An abstract location represents a machine register, memory location, or set of memory locations. Registers are referred to by their InstructionAPI representation. Memory locations consist of a region and an optional offset within that region. Regions include the stack, the heap, and global memory. Stack locations are assumed to be relative from the top of the stack at the beginning of the function. Our current implementation assumes a single heap location; this may change in future releases. Finally, offsets into global memory represent absolute addresses. Operation: an operation is a pair of an instruction and an abstract location defined by that instruction. An instruction may define more than one abstract location, particularly on CISC architectures. In this case the operation allows us to specify which abstract location is defined. Data Dependence: In general instruction j is data dependent on i if i defines an abstract that j uses and there is an execution path from i to j along which a is not redefined. We use a more precise definition that uses operations as nodes. Let m = (i,a) be an operation representing the definition of a by i, and similarly for n = (j, b). Then n is data dependent on m if i defines a, j uses a to define b, and there exists a path as above. We discuss this more precise formulation below. Control Dependence: An instruction j is control dependent on i if i has multiple successors and j is executed along at least one, but not all, possible execution paths from i. Flow Dependence: not sure what this one means... Transitive Dependence: An instruction j is transitively dependent on i if there is a path (i, ..., j) in the corresponding dependence graph. Basic Types typedef unsigned long Offset An integer value that represents the offset from the base address of the file. typedef unsigned long Address An integer value that represents a unique location in memory used or defined by an instruction. Smart/shared pointers All objects returned to users are transparently wrapped with a reference counted pointer implementation. This smart pointer automatically handles deallocation and garbage collection. These pointers are referred to by the ::Ptr suffix (e.g., Graph::Ptr, Node::Ptr, etc.). Our implementation is derived from the Boost shared_ptr implementation; for more information, please visit www.boost.org. Collections We use STL structures to contain collections of objects. For ease of use each class contains typedefs for these types: ::Set, ::Vector, ... respectively. Each of these collections is defined over shared pointers (see above). Operation-Level Data Dependence Include from paper with example figure. Namespace DepGraphAPI The classes described in Section (API) are under the C++ namespace Dyninst::DepGraphAPI. To access them a user should refer to them using the Dyninst:: and SymtabAPI:: prefixes (e.g., Dyninst::DepGraphAPI::Graph). Alternatively, a user can add the C++ using keyword above any reference to DepGraphAPI objects (e.g., using namespace Dyninst and using namespace DepGraphAPI). Uses of other DyninstAPI libraries by this library handle namespaces appropriately. API Reference This section describes the interface of the DepGraphAPI. Each of the subsections represents a different interface. Shared Classes Each graph type shares certain functionality. To reduce redundancy we describe these shared types here. All derived types can be assumed to provide the same functions. Graph entryNodes(..., bool includeVirtual) - returns a minimal set of nodes such that all nodes are reachable from this set. If useVirtual is false then only physical nodes will be returned. allNodes - return the set of all nodes in the graph. printDOT - generates a representation of the graph in DOT format. getNodes (Offset off) - returns the set of physical nodes that have an offset off. removeAnnotation() - removes the graph from internal storage. Once this is called, if all user handles to the graph are removed the graph will be destroyed and must be re-analyzed. Node Edge::Vector ins() - returns the set of in edges to the node (edges that have the node as a target). Edge::Vector outs() - returns the set of out edges from the node (edges that have the node as a source). std::string format() - returns a textual representation of the node. This representation is guaranteed to be unique. bool isVirtual() - returns whether a node is virtual or physical. PhysicalNode : Node Offset offset() - returns the offset of the instruction (or operation) the node represents. bool isVirtual() - returns false VirtualNode : Node bool isVirtual() - returns true Edge Node::Ptr source() - returns the source node of an edge Node::Ptr target() - returns the target node of an edge Absloc std:string name() - returns a textual representation of the abstract location. This representation is guaranteed to be unique. getAliases() - If more than one Absloc may refer to the same abstract location (e.g., a particular stack slot and the representation of the entire stack) return any such aliases. isPrecise() - Returns true if the Absloc does not contain any others. Analyzer Analyzer createAnalyzer(BPatch_function *func): creates an Analyzer for the specified function. This function will continue to exist in future versions of the DepGraphAPI; in addition, we will add additional Analyzer creation methods. Graph::Ptr createDDG() Graph::Ptr createCDG() Graph::Ptr createFDG() Graph::Ptr createPDG() : Creates and returns the specified graph. In addition, the provided function is annotated with the graph so the user does not have to keep a handle; to remove this annotation use Graph::removeAnnotation(). If an annotation is present the user is returned a handle to the previously analyzed graph. Data Dependence Graph DDG Node::Vector formalParameterNodes() - returns the vector of all formal parameters to the function. Node::Vector formalReturnNodes() - returns the vector of all formal returns from the function. OperationNode : PhysicalNode Absloc::Ptr absloc() - returns the abstract location defined by the operation represented by this node. FormalParameterNode : VirtualNode FormalReturnNode : VirtualNode ActualParameterNode : VirtualNode ActualReturnNode : VirtualNode ImmediateNode : VirtualNode Control Dependence Graph CDG InstructionNode : PhysicalNode