
reading-papers-19-21

2019 Runtime performance evaluation and optimization of type-2 hypervisor for MIPS64 architecture

However, there are few open-source virtualization solutions available for embedded systems.

HTTM is a MIPS64-based type-2 hypervisor running on a Linux-based host system that provides full system virtualization.

HTTM is a multithreaded application. Each virtual core/device implementation creates a separate thread. Cores, UART, Central Interrupt Unit (CIU), Virtio, and network devices are maintained as separate threads on the host system.

The execution cycle can simply be described in three steps, which are performed by each core.

  1. Block Fetching: A set of instructions starting at the current guest PC and ending with a control-transfer instruction (called a block) is fetched from the guest code.
  2. Instruction Translation: The fetched block is converted into a translated block by translating each instruction one by one. These translated assembly instructions are executed on the real hardware on behalf of the guest.
  3. Block Execution: The current host PC is then shifted to the translated block, which is executed directly on the host.
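
As a rough illustration, here is a minimal sketch of this fetch/translate/execute cycle; the types and helpers (GuestBlock, fetch_block, translate_block) are hypothetical placeholders, not HTTM's actual code.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

struct GuestBlock { uint64_t start_pc; std::vector<uint32_t> insns; };
using HostCode = std::function<uint64_t()>;      // executes the block, returns the next guest PC

// Step 1: fetch guest instructions from start_pc up to the next control-transfer instruction.
GuestBlock fetch_block(uint64_t start_pc) { return {start_pc, {}}; }   // stubbed

// Step 2: translate each fetched instruction; stubbed here as a fall-through block.
HostCode translate_block(const GuestBlock &b) {
    return [pc = b.start_pc]() -> uint64_t { return pc + 4; };
}

void run_core(uint64_t guest_pc, uint64_t stop_pc) {
    std::unordered_map<uint64_t, HostCode> block_cache;   // translated blocks keyed by guest PC
    while (guest_pc != stop_pc) {
        auto it = block_cache.find(guest_pc);
        if (it == block_cache.end())
            it = block_cache.emplace(guest_pc, translate_block(fetch_block(guest_pc))).first;
        guest_pc = it->second();                          // Step 3: run the translated block on the host
    }
}
```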

4.2 Proposed optimization - reduction in context switching

HTTM keeps three main types of caches: 1) Block Cache: contains the translated block corresponding to the starting PC of the block, 2) Address Translation Cache: contains the translated address, i.e., the host virtual address (HVA) corresponding to the guest virtual address (GVA), and 3) I/O Address Cache: translated guest I/O addresses are kept in a separate cache.

Context switching is avoided by executing handwritten assembly instructions that check and retrieve the content from the cache.

HTTM uses a Dynamic Binary Translation (DBT) based technique for ISA virtualization.

As pointed out by the block analysis, it is intuitive that changing the scope of DBT from one instruction to the whole block will help optimize the overall performance of the hypervisor. The main idea is to reduce redundant load/store instructions in the generated translated block.

2019 x86-64 Instruction Usage among C/C++ Applications [ACM SYSTOR]

This paper presents a study of x86-64 instruction usage across 9,337 C/C++ applications and libraries in the Ubuntu 16.04 GNU/Linux distribution

:star: :star: 2019 cross-ISA Machine Instrumentation using Fast and Scalable Dynamic Binary Translation [VEE]

In this paper we improve cross-ISA emulation and instrumentation performance through three novel techniques.

First, we increase floating point (FP) emulation performance by observing that most FP operations can be correctly emulated by surrounding the use of the host FP unit with a minimal amount of non-FP code.

Our approach is thus to leverage the host FPU but only for a subset of all possible FP operations

We thus accelerate the common case, i.e., the rounding is round-to-nearest-even, inexact is already set, and the operands are checked to limit the flags that the operation can raise. Otherwise, we defer to a soft-float implementation.
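
A minimal sketch of that fast-path check, assuming a hypothetical guest FP state and a stand-in soft_float64_add(); it only illustrates the idea of guarding the host FPU with a cheap non-FP check, not Qelt's actual implementation.

```cpp
#include <cfenv>
#include <cmath>

struct GuestFPState {
    int  rounding_mode;    // guest rounding mode, e.g. FE_TONEAREST
    bool inexact_set;      // guest "inexact" sticky flag already raised
};

// Stand-in for the fully emulated path; a real soft-float add would also update guest flags.
double soft_float64_add(double a, double b, GuestFPState &st) { (void)st; return a + b; }

double emulate_fadd(double a, double b, GuestFPState &st) {
    // Fast path: round-to-nearest-even, inexact already sticky, and operands that
    // cannot raise flags the host would report differently (no NaN/Inf/denormal/zero).
    if (st.rounding_mode == FE_TONEAREST && st.inexact_set &&
        std::isnormal(a) && std::isnormal(b)) {
        return a + b;                      // let the host FPU do the work
    }
    return soft_float64_add(a, b, st);     // otherwise defer to soft-float
}
```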

Profiling shows that on average Qelt accelerates (i.e., does not defer to softfloat) 99.18% of FP instructions in SPECfp06 benchmarks.

Second, we introduce the design of a translator with a shared code cache that scales for multi-core guests, even when they generate translated code in parallel at a high rate.

In Qelt, we modify this baseline design, in which a single lock protects both code translation/invalidation as well as code chaining, to make each vCPU thread work—in the fast path—on uncontended cache lines.

We keep the baseline’s single, contiguous (in virtual memory) buffer for the code cache. However, we partition this buffer so that each vCPU generates code into a separate region.

To serve these lookups we maintain a per-region binary search tree that tracks the beginning and end host addresses of the region’s TBs.
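
A minimal sketch of such a per-region structure, using an ordered map as the binary search tree; the TB fields and region layout are simplified placeholders rather than Qelt/QEMU's real data structures.

```cpp
#include <cstdint>
#include <map>

struct TB { uintptr_t host_start, host_end; uint64_t guest_pc; };

struct CodeCacheRegion {
    std::map<uintptr_t, TB> tbs;   // ordered by host start address (a balanced BST)

    void insert(const TB &tb) { tbs[tb.host_start] = tb; }

    // Map a host PC back to its TB, e.g. when a fault or interrupt lands in translated code.
    const TB *find(uintptr_t host_pc) const {
        auto it = tbs.upper_bound(host_pc);
        if (it == tbs.begin()) return nullptr;
        --it;
        return host_pc < it->second.host_end ? &it->second : nullptr;
    }
};
```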

Descriptors of guest memory pages are kept in the memory map, which is implemented as a radix tree. We modify this tree to support lock-free insertions, and rely on the fact that the tree already supports RCU for lookups.

Third, we present an ISA-agnostic instrumentation layer that can instrument guest operations that occur outside of the DBT’s intermediate representation (IR), which are common in full-system emulators.

Cross-ISA TLB Emulation. Guest TLB emulation in cross-ISA full-system emulators is a large contributor to performance overhead. The “softMMU”, as it is called in QEMU [6], is a table kept in software that maps guest to host virtual addresses.
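
A minimal sketch of the softMMU idea, with a hypothetical entry layout and a stubbed slow path; QEMU's real TLB differs in detail.

```cpp
#include <cstdint>

constexpr unsigned TLB_BITS  = 8;
constexpr uint64_t PAGE_MASK = ~0xfffULL;          // 4 KiB guest pages assumed

struct SoftTlbEntry {
    uint64_t  guest_page = ~0ULL;    // guest virtual page address (invalid marker by default)
    uintptr_t host_delta = 0;        // added to the guest VA to obtain the host VA
};

struct SoftTlb {
    SoftTlbEntry entries[1 << TLB_BITS];

    // Slow path: walk the guest page tables and refill the entry (stubbed as identity here).
    uintptr_t slow_path_fill(uint64_t guest_va) {
        SoftTlbEntry &e = entries[(guest_va >> 12) & ((1 << TLB_BITS) - 1)];
        e.guest_page = guest_va & PAGE_MASK;
        e.host_delta = 0;
        return e.host_delta + guest_va;
    }

    // Fast path; the real thing inlines this lookup into the translated code.
    uintptr_t translate(uint64_t guest_va) {
        SoftTlbEntry &e = entries[(guest_va >> 12) & ((1 << TLB_BITS) - 1)];
        if (e.guest_page == (guest_va & PAGE_MASK))
            return e.host_delta + guest_va;        // hit
        return slow_path_fill(guest_va);           // miss
    }
};
```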

Indirect branches in DBT.

An approach better suited for full-system emulators is the use of caching [44], which is demonstrated on QEMU by Hong et al. [26]. They add a small cache to each vCPU thread to track cross-page and indirect branch targets, and then modify the target code to inline cache lookups and avoid most costly exits to the dispatcher.
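
A minimal sketch of such a per-vCPU target cache, with invented sizes and field names; the real scheme of Hong et al. inlines the equivalent lookup directly into the translated code.

```cpp
#include <cstdint>

constexpr unsigned IBTC_BITS = 10;

struct IbtcEntry { uint64_t guest_pc = ~0ULL; void *host_code = nullptr; };

struct VcpuIbtc {
    IbtcEntry slots[1 << IBTC_BITS];

    // Consulted before an indirect branch: a hit jumps straight to translated code,
    // a miss (nullptr) exits to the dispatcher, which then calls update().
    void *lookup(uint64_t guest_pc) const {
        const IbtcEntry &e = slots[(guest_pc >> 2) & ((1 << IBTC_BITS) - 1)];
        return e.guest_pc == guest_pc ? e.host_code : nullptr;
    }
    void update(uint64_t guest_pc, void *host_code) {
        slots[(guest_pc >> 2) & ((1 << IBTC_BITS) - 1)] = {guest_pc, host_code};
    }
};
```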

:star: :star: 2019 Scalable Emulation of Heterogeneous Systems [PhD]

The two main contributions of this chapter are:

we present a study that compares different accelerator coupling models

The study is enabled by Cargo, a machine simulator that we developed to model systems that integrate accelerators.

Our experiments on these accelerators lead to the following observations.

we present a novel technique to expand the last-level cache of the heterogeneous system by reusing memory from otherwise unused on-chip accelerators

2019 SoK: Using Dynamic Binary Instrumentation for Security [AsiaCCS’19]

A code is said to evade DBI-based analysis when its instruction and data flows eventually start diverging from those that would accompany it in a native execution.

A code is said to escape DBI-based analysis when parts of its instruction and data flows get out of the DBI engine’s control and execute natively on the CPU, in the address space of the analysis process or of a different one

:star: 2019 Optimizing data permutations in structured loads/stores translation and SIMD register maping for a cross-ISA dynamic binary translator [JSA]

Hsu

More and more modern processors have been supporting non-contiguous SIMD data accesses. However, translating such instructions has been overlooked in the Dynamic Binary Translation (DBT) area.

Structured loads/stores, such as VLDn/VSTn in ARM NEON, are one type of strided SIMD data access instructions.

The goal of SIMD register mapping is to determine a configuration which maps guest registers to host registers and minimizes data reorganization overhead.

To simplify SIMD hardware design, microprocessor vendors usually provide efficient data reorganization among a few specific SIMD lanes. Data movement among the other SIMD lanes usually requires multiple hops of data reorganization, thus incurring higher cost. Hence, the register mapper should consider not only the usage of the guest SIMD registers, but also the required data reorganization cost.

GuestMemoryFusion is implemented as a LLVM optimization pass.

It tries to fuse multiple load/store instructions which write/read data to/from consecutive sub-registers and guarantees no possible memory alias

It operates on basic blocks.

2019 Non-Intrusive Hardware Acceleration for Dynamic Binary Translation in Embedded Systems [IEEE ICIT]

This article describes a non-intrusive hardware acceleration approach for Dynamic Binary Translation (DBT) in modern resource-constrained embedded systems, detailing its motivation, design decisions and overall architecture.

This article explains the functionality extension of the DBTOR engine using external hardware support integrated as a bus sniffer, the HWSniffer, which results in a hardware-based and non-intrusive execution flow technique never tried before in the state of the art.

The address and access signals are monitored through the HWSniffer logic, in order to identify accesses to sensitive memory addresses and trigger the respective actions, when needed.

The bus HWSniffer can be used to trigger software or hardware components based on the source architecture memory accesses, providing great flexibility on the type of application to serve without disturbing the base DBT program flow.

Three types of application to the proposed bus sniffer were presented: CC flags handling, source peripherals remapping and interrupt support.

2019 DBTOR: A Dynamic Binary Translation Architecture for Modern Embedded Systems [IEEE ICIT]

This article describes a dynamic binary translation (DBT) system specially tailored to fit resource-constrained embedded systems, detailing its design decisions and architectural components.

:star: 2019 Enhancing Transactional Memory Execution via Dynamic Binary Translation [ACM SIGAPP]

Hsu

Transactional Synchronization Extensions (TSX) have been introduced for hardware transactional memory since the 4th generation Intel Core processors. TSX provides two software programming interfaces: Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM).

HLE is easy to use and maintains backward compatibility with processors without TSX support, while RTM is more flexible and scalable.

Previous research has shown that critical sections protected by RTM with a well-designed retry mechanism as its fallback code path can often achieve better performance than HLE.

More parallel programs may be written with HLE; however, using RTM may yield greater performance.

we present a framework built on QEMU that can dynamically transform HLE instructions in an application binary to fragments of RTM codes with adaptive tuning on the fly.

When the guest and host ISAs are the same, e.g., the guest is x86 + HLE and the host is x86 + RTM in our case, DBT is used for dynamic binary optimization.
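
As a rough illustration, here is a minimal sketch of the kind of RTM fragment (transaction begin, lock-word check, retry, and lock fallback) that an HLE-elided critical section could be rewritten into; the single global fallback lock and the fixed retry count are assumptions, not the paper's adaptive scheme.

```cpp
#include <immintrin.h>   // compile with -mrtm
#include <atomic>

static std::atomic<int> fallback_lock{0};   // 0 = free, 1 = held (hypothetical single lock)
static const int kMaxRetries = 4;           // placeholder for the adaptive retry tuning

static void elided_lock_enter() {
    for (int attempt = 0; attempt < kMaxRetries; ++attempt) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            // Put the lock word into the read set; abort if a non-transactional owner holds it.
            if (fallback_lock.load(std::memory_order_relaxed) == 0)
                return;                     // run the critical section transactionally
            _xabort(0xff);
        }
        // Transaction aborted: retry, or eventually take the real lock below.
    }
    int expected = 0;                       // fallback path: acquire the lock for real
    while (!fallback_lock.compare_exchange_weak(expected, 1, std::memory_order_acquire))
        expected = 0;
}

static void elided_lock_exit() {
    if (fallback_lock.load(std::memory_order_relaxed) == 0)
        _xend();                            // commit the transaction
    else
        fallback_lock.store(0, std::memory_order_release);
}
```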

Compared to HLE execution, our prototype achieves 1.56× speedup with 8 threads on average.

:star: :star: 2019 Processor-Tracing Guided Region Formation in Dynamic Binary Translation

Hsu

We leverage the branch history information stored in the processor to reconstruct the program execution profile and effectively form high-quality regions with low cost.

Furthermore, we present the designs of lightweight hardware performance monitoring sampling and the branch instruction decode cache to minimize region formation overhead

we first provide an example that demonstrates the process of software decoding and then present the proposed branch instruction decode cache, which can effectively reduce the overhead of software decoding.

2019 Exploiting Vector Processing in Dynamic Binary Translation [ICPP]

Hsu

However, as processor architectures keep adding new features to improve vector/SIMD performance, legacy application binaries fail to fully exploit the new vector/SIMD capabilities of modern architectures.

we study the fundamental issues involved in cross-ISA Dynamic Binary Translation (DBT) to convert non-vectorized loops to vector/SIMD forms to achieve greater computation throughput available in newer processor architectures

Scalar variables in binaries may not always be in the form of scalars, but in the form of memory references because of register spilling

Furthermore, arbitrary variables, such as an array base address or a constant stride, may also be spilled from registers to memory in binaries.

We adopt virtual register promotion, which promotes spilled variables to virtual registers, to simplify loop analysis and enable vectorization.

There are two types of spilled variables

2019 Exploiting SIMD Asymmetry in ARM-to-x86 Dynamic Binary Translation [ACM Transactions on Architecture and Code Optimization]

Hsu

In this article, we present a novel binary translation technique called spill-aware superword level parallelism (saSLP)

We propose a novel spill-aware cost model that can determine whether combining guest registers would be profitable by weighing the extra unpacking overheads and the benefits of conserving host registers.

We present the vectorization of data-irregular loops by exploiting SIMD gather instructions and propose a new algorithm to vectorize loop reductions in DBTs, where the typical methods used in static compilers do not work.

2019 Improving Startup Performance in Dynamic Binary Translators [PDP]

We propose and assess the potential of a new technique that parallelizes program translations on multi-core machines to reduce its evident run-time costs

our first challenge to parallelize block translations is to find guest binary code blocks before they are reached during execution.

We call this speculative translation of blocks eager translation.

First, we run each benchmark program (with the test input data-set) with a DynamoRIO configuration that outputs the list of block addresses that are reached during execution.

The later measured runs employ a DynamoRIO setup that uses this list of block addresses to populate the compiler queue during initialization.

2019 Multi-objective Exploration for Practical Optimization Decisions in Binary Translation

we investigate an opportunity to build practical optimization decision models for DBT by using machine learning techniques.

we identify the best classification algorithm for our infrastructure with consideration for both prediction accuracy and cost

74.5% prediction accuracy for the optimal unroll factor, and an average 20.9% reduction in dynamic instruction count during steady-state translated code execution

We show how machine learning techniques can improve loop unrolling decisions for dynamic binary translation on the mobile processor.

2019 Hybrid-DBT: Hardware/Software Dynamic Binary Translation Targeting VLIW

In this paper, we present Hybrid-DBT, an open-source, hardware accelerated DBT system targeting VLIW cores.

Three different hardware accelerators have been designed to speed-up critical steps of the translation process

The Hybrid-DBT system is a hardware/software co-designed DBT framework. It translates RISC-V binaries into VLIW binaries and uses dedicated hardware accelerators to reduce the cost of certain steps of the translation/optimization process.

In the system, the accelerator called First-Pass Translator is used in the first optimization level to perform a first translation that needs to be as cheap as possible.

The IR Generator is then used at the next optimization level to generate a higher level representation of the translated binaries, which is used by the IR Scheduler to generate VLIW binaries by performing instruction scheduling and register allocation.

Hybrid-DBT is organized as a three-step translation and optimization flow.

  1. Translation of instructions is optimization level 0. When executing cold code, the DBT framework tries to spend as little time/energy as possible to translate binaries. In the framework, we first perform a naive translation of each source instruction into one or several instructions from the VLIW ISA. This optimization step does not try to exploit ILP at all.
  2. Block building and scheduling is optimization level 1. When a block has more than a certain number of instructions, it is considered as a candidate for optimization level 1: an IR of the block is built and used for performing instruction scheduling. This optimization level is triggered aggressively, without any profiling information, because of the use of hardware accelerators that reduces its cost. Scheduled blocks are then profiled to trigger further optimizations.
  3. Interblock optimizations are performed at optimization level 2. When a given block has been executed enough, it triggers the interblock optimization. This optimization level analyzes the control flow (place and destination of each jump instruction) to build a control-flow graph of the function (i.e., all blocks that are accessed through direct branches except calls). Then interblock optimizations are performed on this function to extract more ILP.

2019 Implementation of a binary translation for improving ECU performance

3 pages ……

2019 Dynamic Translation Optimization Method Based on Static Pre-Translation [IEEE Access]

State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou

This paper proposes an optimization method for dynamic binary translation based on static pre-translation. By pre-translating the source program and applying more in-depth code optimizations to the translated code, most of the code translation time can be cut and the translated code becomes more efficient.

There are two key problems that need to be solved to realize the mechanism of static pre-translation.

The first is the usability of statically translated code: the code produced by the static translator must behave the same as the code produced by dynamic translation.

The second is making use of the static translation results by extending the query and translation module. Since not all code can be translated by the static translator, the query and execution process needs to distinguish whether or not the source code has been pre-translated.

The result of the transformation is independent of the translation method, whether dynamic or static binary translation, as long as the matching pattern of the instruction mapping and the state mapping are the same.

The correctness of the instruction transformation is thus ensured, so the results of both dynamic and static translation can simply be executed with the original order maintained.

The instruction’s loading address can be used as the unique identifier of the instruction, since the value in the instruction pointer register is the loading address of the following instruction in a deterministic execution process. (In plain terms, this is just the PC.)
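
A minimal sketch of the lookup order this implies, keyed by the loading address (guest PC): statically pre-translated code is used when available, otherwise the normal dynamic translation path fills the cache. All names are illustrative, not SDQEMU's actual structures.

```cpp
#include <cstdint>
#include <unordered_map>

using HostEntry = void (*)();

std::unordered_map<uint64_t, HostEntry> pretranslated;   // filled from the static pre-translation results
std::unordered_map<uint64_t, HostEntry> dynamic_cache;   // filled at run time

HostEntry translate_dynamically(uint64_t guest_pc);      // the normal DBT path (omitted)

HostEntry lookup(uint64_t guest_pc) {
    if (auto it = pretranslated.find(guest_pc); it != pretranslated.end())
        return it->second;                                // code was pre-translated statically
    if (auto it = dynamic_cache.find(guest_pc); it != dynamic_cache.end())
        return it->second;                                // already translated dynamically
    return dynamic_cache[guest_pc] = translate_dynamically(guest_pc);
}
```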

The sequence of source instructions for static translation is usually in a code segment that is loaded in read-only mode, so it does not take self-modifying code into account. (Self-modifying code is ignored.)

The optimization method of using the static pre-translation in dynamic binary translation is mainly to improve the migration of the binary executable program from x86/Linux platform to the Sunway/Linux platform.

The translator, SDQEMU (static pre-translation dynamic quick emulator), runs on the Sunway processor.

2019 HyperBench: A Benchmark Suite for Virtualization Capabilities [ACM Meas. Anal. Comput. Syst]

In this paper, we present HyperBench, a benchmark suite that focuses on the capabilities of different virtualization platforms. Currently, we design 15 hypervisor benchmarks covering CPU, memory, and I/O.

We propose a cost model which formulates the time spent in handling hypervisor-level events.

The cost model uses the benchmark’s native execution time on the host as a baseline.

The execution time of a guest application includes the necessary time taken on the host machine and additional time consumed by hypervisor-level events.

T_direct represents the time spent on instructions that can be executed directly on the host machine.

T_virt represents the runtime of instructions that need to be specially handled in a virtualized environment.

The breakdown of T_virt is shown in equation 3. T_cpu, T_memory, and T_io represent the time for handling CPU, memory, and I/O virtualization, respectively.

T_virt = T_cpu + T_memory + T_io + η

T_cpu represents the CPU virtualization overheads. The cost of CPU virtualization consists of the sensitive instruction cost (T_sen) plus the cost caused by virtualization extension instructions (T_ext).

T_cpu = C_sen ⊛ T_sen + C_ext ⊛ T_ext

T_memory represents the cost of memory virtualization.

T_memory = C_switch ⊛ T_switch + C_sync ⊛ T_sync + C_cons ⊛ T_cons + C_two ⊛ T_two

T_io represents the cost of I/O virtualization. The latency added to virtual I/O is attributed to notifications in two directions: from the I/O driver in the VM to the I/O device in the hypervisor (T_out), and the opposite direction (T_in).

T_io = C_in ⊛ T_in + C_out ⊛ T_out

The cost model described in section 3 requires HyperBench benchmarks to run in both native and virtualized environments.
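
Reading ⊛ as "per-event count times per-event cost" (my interpretation, not spelled out in these notes), the model can be evaluated mechanically; the sketch below just plugs placeholder numbers into the equations above.

```cpp
#include <cstdio>

struct Event { double count, cost; double total() const { return count * cost; } };

int main() {
    // Placeholder counts (events) and per-event costs (seconds); real values come from measurement.
    Event sen{1e6, 2e-7}, ext{5e5, 1e-7};                                   // CPU events
    Event sw{1e4, 5e-6}, sync{2e4, 3e-6}, cons{1e3, 4e-6}, two{8e3, 6e-6};  // memory events
    Event in{2e3, 1e-5}, out{2e3, 1.2e-5};                                  // I/O notifications

    double t_cpu    = sen.total() + ext.total();
    double t_memory = sw.total() + sync.total() + cons.total() + two.total();
    double t_io     = in.total() + out.total();
    double eta      = 0.0;                            // residual not captured by the model

    std::printf("T_virt = %.6f s\n", t_cpu + t_memory + t_io + eta);
    return 0;
}
```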

HyperBench kernel is an ELF file in multiboot format, which allows grub to boot it directly.

In the native environment, HyperBench kernel runs directly on the hardware.

In the virtualized environment, HyperBench kernel runs as a test VM.

2019 Raising Binaries to LLVM IR with MCTOLL [ACM SIGPLAN]

This paper presents a static binary raiser that translates binaries to LLVM IR.

Native binaries for a new ISA are generated from the raised LLVM IR using the LLVM compiler backend.

This paper describes an LLVM tool that raises legacy binaries to LLVM IR.

2019 Translating AArch64 Floating-Point Instruction Set to the x86-64 Platform [ICPP]

We extend mc2llvm, which is an LLVM-based retargetable 32-bit binary translator developed in our lab in the past several years, to support 64-bit ARM instruction set.

In this paper, we report the translation of AArch64 floating-point instructions in our mc2llvm.

Our work focuses on translating AArch64 ISA [3] with floating-point instructions statically and dynamically to x86-64 machine code.

For floating-point instructions, due to the lack of floating-point support in LLVM [13, 14], we add support for the flush-to-zero mode, not-a-number processing, floating-point exceptions, and various rounding modes.

We implement software emulation to provide a complete floating-point environment that supports all floating-point exceptions as well as the flush-to-zero mode, NaN processing, and various rounding modes.

In order to improve the performance of the binary translator, we implement three optimization techniques: block chaining, direct returns from function calls, and memory alias analysis.

:star: 2019 The Janus Triad: Exploiting Parallelism through Dynamic Binary Modification [VEE’19]

We present a unified approach for exploiting thread-level, data-level, and memory-level parallelism through a same-ISA dynamic binary modifier guided by static binary analysis.

We demonstrate this framework by exploiting three different kinds of parallelism to perform :

A static binary analyser first examines an executable and determines the operations required to extract parallelism at runtime, encoding them as a series of rewrite rules that a dynamic binary modifier uses to perform binary transformation.

Rules / Rewrite Schedule: {TB-A, RULE1}, {TB-A, RULE2}, {TB-C, RULE1}

Original Executable: TB-A -> TB-B -> TB-C

Rewrite Schedule Interpreter: TB-A => RULE1, RULE2 => Modified TB-A; TB-C => RULE1 => Modified TB-C

Modified Executable: Modified TB-A -> Copied TB-B -> Modified TB-C

Janus is a same-ISA binary modification framework that was initially developed for automatic parallelisation. It combines static binary analysis and dynamic binary modification, controlled through a domain-specific rewrite schedule.
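
A minimal sketch of how such a rewrite schedule might be interpreted at runtime: a map from basic-block address to an ordered list of rule IDs, applied before the block is emitted. This only mirrors the diagram above; it is not Janus's actual data structure.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

enum class Rule { RULE1, RULE2 };

// Rewrite schedule: basic-block address -> ordered rules to apply to that block.
using RewriteSchedule = std::unordered_map<uint64_t, std::vector<Rule>>;

struct BasicBlock { uint64_t addr; /* decoded instructions ... */ };

void apply_rule(BasicBlock &bb, Rule r);   // performs the actual transformation (omitted)

void modify_block(BasicBlock &bb, const RewriteSchedule &sched) {
    auto it = sched.find(bb.addr);
    if (it == sched.end()) return;         // e.g. TB-B above: copied unmodified
    for (Rule r : it->second)              // e.g. TB-A above: RULE1 then RULE2
        apply_rule(bb, r);
}
```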

The prefetched address is a prediction of the dynamic address that will be accessed n iterations in the future, where n is the prefetch distance.
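
As a rough illustration of what the inserted code computes, the sketch below prefetches the element that is `distance` iterations ahead of the current access, using the GCC/Clang `__builtin_prefetch` intrinsic; Janus instead emits the equivalent instructions directly into the binary.

```cpp
#include <cstddef>

double sum_with_prefetch(const double *a, std::size_t len, std::size_t distance) {
    double s = 0.0;
    for (std::size_t i = 0; i < len; ++i) {
        if (i + distance < len)
            __builtin_prefetch(&a[i + distance]);   // predicted address, `distance` iterations ahead
        s += a[i];
    }
    return s;
}
```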

We use the Janus static analyser to detect regions that can be optimised through prefetching within the input binary. We implement the detection analysis in Janus’ static analyser based on the algorithm by Ainsworth and Jones [1] and implemented in their LLVM pass.

Scalar to SIMD Translation The VECT_CONVERT rule is associated with every scalar machine instruction that needs to be translated to a SIMD counterpart.

Loop Unrolling and Peeling Loop unrolling is directed by rule VECT_LOOP_UNROLL, which is associated with instructions that update the induction variable in a loop.

Initialisation and Reduction The VECT_BROADCAST rule tags a scalar register, memory operand, or existing SIMD operand to be expanded to a full SIMD register

Runtime Bound Check Once again, the unmodified loop body is required, but this time to provide the original loop in the case that it is deemed unsafe to vectorise at runtime. A VECT_BOUND_CHECK check can be performed on a register’s value using the condition of it being equal to a provided value, or on it being a positive value.

Runtime Symbolic Resolution The VECT_DEP_CHECK is inserted whenever two memory accesses are identified as “may-alias” due to ambiguities caused by inputs or function arguments.

2020 Reducing Overheads of Binary Instrumentation on X86_64: Techniques and Applications [PhD]

I present two approaches which can reduce the binary instrumentation overheads and thus, can help bring some of these tooling online and more widely adopted.

The first approach is on-demand instrumentation, which is the ability to enable and disable, aka toggle, instrumentation at run-time. Current instrumentation frameworks lack the primitives for cheap run-time instrumentation toggling. I present several novel instruction patching primitives, Wordpatch , Callpatch and Instruction Punning , which in combination, can be used to implement cheap, rapidly togglable instrumentation. I demonstrate the application of these primitives by developing a lightweight latency profiler.

The second approach is instrumentation elision, which skips inserting instrumentation at certain program locations using a particular instrumentation policy. Using shadow stacks as a case study, I demonstrate how we can elide shadow stack instrumentation on certain safe code regions as determined by binary static analysis. Finally, I present how this application of instrumentation elision leads to significant overhead reductions in shadow stacks.

:star: 2020 Robust Practical Binary Optimization at Run-time using LLVM [Workshop on LLVM-HPC]

Alexis Engelke, Martin Schulz

we describe BinOpt, a novel and robust library for performing application-driven binary optimization and specialization using LLVM.

A machine code function is lifted directly to LLVM-IR, optimized in LLVM while making use of application-specified information, used to generate a new specialization for a function, and integrated back to the original code.

This approach of application guided optimization has been previously explored, e.g., by DBrew [1], which is a library for binary optimization at run-time. It provides a C API for applications, where functions can be specialized for known parameters and associated memory regions.

Based on these previous experiences, we propose a new approach that directly transforms native machine code to LLVM-IR without the need for an intermediate step, as was necessary in DBrew.

For the actual lifting of machine code semantics, we use Rellume [3], a library designed for producing performant LLVM-IR from x86-64 machine code, providing almost complete coverage of architectural x86-64 instructions including SSE/SSE2.

After optimization, the LLVM-IR code is compiled to a new machine-executable function using the MCJIT [6] compiler provided by LLVM.

:star: :star: 2020 More with Less – Deriving More Translation Rules with Less Training Data for DBTs Using Parameterization

Wenwen Wang

In this paper, we propose a novel parameterization approach to take advantage of the regularity and the well-structured format in most modern ISAs. It allows us to extend the learned translation rules to include instructions or instruction sequences of similar structures or characteristics that are not covered in the training set.

Assume we have learned a translation rule for the add instruction after the verification process, as shown in the left box. We can derive a similar translation rule for the eor instruction with the same addressing mode as shown on the right, even if it is not included in the training set.

We thus classify instructions first into different subsets according to their data types.

For instructions of the same data type, we further classify/divide them according to two guidelines.

After the instructions are classified into subgroups, a parameterization process is used to derive more translation rules from the learned rules. As mentioned earlier, our parameterization is done along two dimensions, namely, opcodes and addressing modes.
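
A minimal sketch of that parameterization idea: store the learned rule with the opcode abstracted out, then re-instantiate it for another opcode of the same class. The rule format and the add/eor-to-addl/xorl mapping here are invented purely for illustration.

```cpp
#include <map>
#include <string>
#include <vector>

struct RuleTemplate {
    std::string addressing_mode;            // e.g. "reg, reg, reg"
    std::vector<std::string> host_pattern;  // host instructions with {OP} left as a hole
};

std::string instantiate(const RuleTemplate &t, const std::string &host_opcode) {
    std::string out;
    for (std::string line : t.host_pattern) {
        if (auto pos = line.find("{OP}"); pos != std::string::npos)
            line.replace(pos, 4, host_opcode);
        out += line + "\n";
    }
    return out;
}

int main() {
    // Rule learned for guest "add" (host "addl"); re-instantiate it for guest "eor" (host "xorl").
    RuleTemplate learned{"reg, reg, reg", {"movl %src1, %dst", "{OP} %src2, %dst"}};
    std::map<std::string, std::string> opcode_map{{"add", "addl"}, {"eor", "xorl"}};
    std::string derived = instantiate(learned, opcode_map["eor"]);
    (void)derived;
    return 0;
}
```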

2020 Unlocking the Full Potential of Heterogeneous Accelerators by Using a Hybrid Multi-Target Binary Translator

Our HMTBT is capable of transparently translating code to different accelerators: a CGRA and a NEON engine, and automatically dispatching the translation to the most well-suited one, according to the type of the available parallelism at the moment.

The HMTBT supports code optimization to a CGRA, which exploits instruction-level parallelism, and to ARM NEON engine, which is responsible for exploiting data-level parallelism.

A translation built from HMTBT to the NEON is a sequence of SIMD Instructions, while the translation built to the CGRA is named as a Configuration.

A Translation Cache is responsible for storing SIMD instructions and CGRA configurations built by the HMTBT;

2020 Executing ARMv8 Loop Traces on Reconfigurable Accelerator via Binary Translation Framework [FPL]

Nuno Paulino; there is another 2021 paper by him introducing automatic hardware generation.

1 page ……

Performance and power efficiency in edge and embedded systems can benefit from specialized hardware.

To avoid the effort of manual hardware design, we explore the generation of accelerator circuits from binary instruction traces for several Instruction Set Architectures

Using a redesigned binary translation framework, we are currently expanding the applicability of our approach to other Instruction Set Architectures (ISAs), and exploring additional accelerator architecture, such as custom instruction units and nested loop accelerators.

:star: :star: 2020 A Retargetable System-level DBT Hypervisor [ACM Trans Comput Syst]

In this article, we present Captive, our novel system-level DBT hypervisor, where users are relieved of low-level implementation effort for retargeting.

Instead, users provide high-level architecture specifications similar to those provided by processor vendors in their ISA manuals.

In this article, we develop a novel, retargetable DBT hypervisor, which includes guest-specific modules generated from high-level guest machine specifications.

We achieve this by combining offline and online optimizations and exploiting the freedom of a Just-in-time (JIT) compiler operating in a bare-metal environment provided by a Virtual Machine (VM) hypervisor.

Captive applies aggressive optimizations: It combines the offline optimizations of the architecture model with online optimizations performed within the generated JIT compiler, thus reducing the compilation overheads while providing high code quality.

Furthermore, Captive operates in a virtual bare-metal environment provided by a VM hypervisor, which enables us to fully exploit the underlying host architecture, especially its system-related and -privileged features not accessible to other DBT systems operating as user processes.

We generate a DBT hypervisor outperforming Qemu by a factor of 2.21× for SPEC CPU2006 integer applications, and up to 6.49× for floating-point workloads.

2020 PerfDBT: Efficient Performance Regression Testing of Dynamic Binary Translation [ICCD]

Wenwen Wang

Due to the large scale and complexity of a DBT system, a minor code change may lead to unexpected impact on the performance. Therefore, it is necessary to conduct performance regression testing for DBT systems.

we propose PerfDBT, which employs a novel approach to automatically generate test programs from existing long-running benchmarks.

Profiling Block Hotness

Analyzing Blocks for Data Preparation

Test Program Generation

2020 Memory Error Detection Based on Dynamic Binary Translation [IEEE ICCT]

State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou

We proposed a dynamic memory error detection method based on dynamic binary translation.

It realized the detection by combining a call stack tracing method and an IR language level memory error detection method.

This method takes advantage of the flexibility of the IR (Intermediate Representation) language in dynamic binary translation technology. The flexibility of the IR language makes instrumentation convenient, and instrumentation at the IR level can prevent the program under test from detecting abnormalities in its running environment.

Stack Frame Memory Error Detection

In short: esp is saved at call time and compared at return time; if the values do not match, a stack error may have occurred.

When allocating heap memory, a left red zone is set adjacent to the lower end of the allocated heap chunk, and a right red zone adjacent to the higher end.

Red zones are placed on both sides of the allocated heap region.

we insert instrumentations into memory addressing IR language instructions.

Instrumentation code is inserted at the IR level to perform dynamic checks.
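
A minimal sketch of the red-zone bookkeeping and the check that such IR-level instrumentation could call for every memory access; a real detector would use shadow memory rather than a map, and the names below are illustrative.

```cpp
#include <cstdint>
#include <cstdlib>
#include <map>
#include <utility>

constexpr std::size_t REDZONE = 16;   // bytes on each side of the user chunk

// user chunk start -> [left red zone start, right red zone end)
std::map<uintptr_t, std::pair<uintptr_t, uintptr_t>> redzones;

void *guarded_malloc(std::size_t n) {
    auto *base = static_cast<unsigned char *>(std::malloc(n + 2 * REDZONE));
    if (!base) return nullptr;
    uintptr_t user = reinterpret_cast<uintptr_t>(base) + REDZONE;
    redzones[user] = {reinterpret_cast<uintptr_t>(base), user + n + REDZONE};
    return reinterpret_cast<void *>(user);
}

// Called by the check inserted at the IR level for every memory access.
bool access_hits_redzone(uintptr_t addr) {
    for (const auto &[user, zone] : redzones) {
        if (addr >= zone.first && addr < user) return true;                    // left red zone
        if (addr >= zone.second - REDZONE && addr < zone.second) return true;  // right red zone
    }
    return false;
}
```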

:star: 2019 DQEMU: A Scalable Emulator with Retargetable DBT on Distributed Platforms

Wenwen Wang

In this paper, we present a distributed DBT framework, called DQEMU, that goes beyond a single-node multicore processor and can be scaled up to a cluster of multi-node servers.

One way to continue scaling DBT performance is to go beyond one node and allow more cores to be available.

In this paper, we discussed the design issues in building such a distributed DBT system.

In such a distributed DBT system, we integrate

  1. a page-level directory-based data coherence protocol,
  2. a hierarchical locking mechanism,
  3. a delegation scheme for system calls, and
  4. a remote thread migration approach, which together are effective in reducing the overheads.

We also proposed several performance optimization strategies that include

  1. page splitting to mitigate false data sharing among nodes,
  2. data forwarding for latency hiding, and
  3. a hint-based locality-aware scheduling scheme.

In all of our experiments, we take ARM as the guest ISA and x86_64 as the host. The benchmark programs are compiled to ARM binaries with all libraries statically linked.

In general, a guest parallel program can be written in two different programming models:

  1. the shared-memory model

  2. the distributed-memory model

    • In the distributed-memory programming model, a parallel program is partitioned explicitly by the programmer into multiple independent processes.

    • Each process has its own address space and is not visible to other processes.

However, for guest parallel programs using the shared-memory model, we need to maintain a virtual shared-memory multiprocessor system across multiple physical nodes

To emulate a shared-memory multiprocessor on such a distributed-memory system, a data coherence protocol is maintained across all nodes.

Those protocols can generally be classified into

  1. centralized protocols and
  2. decentralized protocols.

In our implementation of the distributed DBT, a centralized page-level directory-based MSI protocol is employed.

(This paper: a centralized, directory-based, page-level MSI protocol.)

Because of the potential false data sharing when we enforce data coherence at the page level, we propose an adaptive scheme to adjust the granularity of the coherence enforcement at runtime.

(To address the false-sharing problem, a scheme that dynamically adjusts the granularity is proposed.)

We use a two-level approach in DQEMU.

  1. At the first level, the current QEMU in each node incorporates a memory consistency enforcement scheme similar to the one used in PICO (within a single node).
  2. At the second level, to enforce memory consistency across nodes, we use sequential consistency, because our data coherence scheme is enforced at the page level through explicit network communication (sequential consistency is maintained across nodes).

We have also avoided the ABA problem by emulating the LL/SC across nodes with a global LL/SC hash table.
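
A minimal sketch of LL/SC emulation through one global, lock-protected hash table of reservations, which is also what rules out the ABA hazard of emulating LL/SC with a plain compare-and-swap; DQEMU's actual table layout and locking are more refined.

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

struct Reservation { uint64_t owner_tid; };

std::mutex llsc_mu;
std::unordered_map<uint64_t, Reservation> llsc_table;   // guest address -> active reservation

void emulate_ll(uint64_t tid, uint64_t addr) {
    std::lock_guard<std::mutex> g(llsc_mu);
    llsc_table[addr] = {tid};                 // record the reservation for this thread
}

bool emulate_sc(uint64_t tid, uint64_t addr) {
    std::lock_guard<std::mutex> g(llsc_mu);
    auto it = llsc_table.find(addr);
    if (it == llsc_table.end() || it->second.owner_tid != tid)
        return false;                         // reservation lost: the SC fails
    llsc_table.erase(it);                     // consume it; the caller then performs the store
    return true;
}

// Every emulated store (or competing SC) to addr must clear outstanding reservations.
void invalidate_reservation(uint64_t addr) {
    std::lock_guard<std::mutex> g(llsc_mu);
    llsc_table.erase(addr);
}
```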

When a guest thread needs to be created by an existing guest thread in DQEMU, the creation event is trapped by instrumenting fork(), clone() and vfork() in Linux.

[Question] With fork() we would have multiple processes; how is memory shared across multiple nodes? Presumably the master node sends a given page to ...

As mentioned earlier, the system state in DQEMU is maintained in the master node. Each thread accesses the system state with the help of a delegation mechanism on the master node.

The master node maintains the state of the whole system; when slave nodes need access, they must request it from the master node.

The guest thread on the slave node traps its system calls and sends the necessary information to the master node, including the guest CPU context, the syscall number, and the parameters.

:star: 2021 Arbitrary and Variable Precision Floating-Point Arithmetic Support in Dynamic Binary Translation [ASPDAC]

Floating-point hardware support was more or less settled 35 years ago by the adoption of the IEEE 754 standard. However, many scientific applications require higher accuracy than what can be represented on 64 bits, and to that end make use of dedicated arbitrary precision software libraries.

To reach a good performance/accuracy trade-off, developers use variable precision, requiring e.g. more accuracy as the computation progresses.

Hardware accelerators for this kind of computations do not exist yet, and independently of the actual quality of the underlying arithmetic computations, defining the right instruction set architecture, memory representations, etc, for them is a challenging task.

We investigate in this paper the support for arbitrary and variable precision arithmetic in a dynamic binary translator, to help gain insight into what such an accelerator could provide as an interface to compilers, and thus programmers. We detail our design and present an implementation in QEMU using the MPFR library for the RISC-V processor.

On the low end side, it is driven by inference in neural network for which representations on a small number of bits are mandatory [8]. On the high end side, scientific applications in nuclear physics, astronomy, geoscience, etc, are seeking kernels able to solve their equations deterministically, with high precision and numerical stability.

We base our work on the QEMU [1] dynamic binary translator, using the RISC-V [24] target architecture, and demonstrate how to handle large number and variable precision by using as backend the MPFR [9] library.

To demonstrate the practicality of the approach, we first define an arbitrary precision ISA that extends the existing RISC-V ISA with new AP registers and AP instructions.

2021 Efficient LLVM-Based Dynamic Binary Translation [VEE]

Alexis Engelke, Dominik Okwieka, Martin Schulz

In this short paper, we generalize Instrew to support different guest and host architectures by refactoring the lifter and by implementing target-independent optimizations to re-use host hardware features for emulated code.

A generalized performance-oriented library for lifting machine code to LLVM-IR that can be easily extended for other architectures.

An approach to exploit the hardware return address stack for the emulated program by re-using host instructions for function calls and returns.

  1. function calls in the original code are lifted to an LLVM-IR function call to the dispatcher
  2. After that LLVM-IR call, the new program counter has to be verified to have the expected value so that the code after the call can be safely executed, otherwise the dispatcher needs to be called again.
  3. Function returns are lifted to LLVM-IR return instructions
  4. indirect jumps are realized as tail calls to the dispatcher

The difference to other approaches utilizing host call/return instructions [11, 15] lies in the increased translation granularity by continuing decoding after call instructions.

:star: 2021 BTMMU: An Efficient and Versatile Cross-ISA Memory Virtualization

:star: 2021 Enhancing Dynamic Binary Translation in Mobile Computing by Leveraging Polyhedral Optimization [WCMC - Hindawi]

State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou

In this work, we focus on leveraging ubiquitous multicore processors to improve DBT performance by parallelizing sequential applications during translation.

For that, we propose LLPEMU, a DBT framework that combines binary translation with polyhedral optimization.

Our goal is to enable a DBT system to parallelize loop nests in the sequential guest code without incurring high runtime overhead.

To parallelize the binaries, we must perform complex analysis to recover the CFG and extract the loops.

It is crucial to design the system architecture such that it can extract loops and perform polyhedral optimization without introducing high runtime overhead.

The second issue is how to parallelize loops extracted from guest binaries.

  1. Many polyhedral optimizers have been developed, such as PLUTO [8] and Polly [5].
  2. On the basis of these considerations, we choose Polly as our parallelizer. Polly, which has been developed on the basis of LLVM infrastructure [9], analyzes and parallelizes loops at the IR level.

In the static stage, loops are extracted from the guest application and then translated into the parallelized host machine code.

Related information files are also generated to enable the dynamic binary translator to leverage the static analysis results and load the parallelized code.

When translating the target loops, DBT will redirect the control flow from the original translation-execution loop to the execution of the parallelized code generated in the static stage.

Code fragments in the code cache are not the translated code from the basic block but a prologue to redirect the execution from the code cache to the parallelized code.

To solve this problem, we insert a jump instruction to the succeeding instruction as the end of the basic block when we find that the next instruction is the entry of another basic block based on CFG.

The optimizer is developed on the basis of prior knowledge of DBT and implemented as LLVM passes to bridge the gap between the translator and Polly. Then, IR optimized by the optimizer will be taken into the parallelizer

Polly is a polyhedral optimization tool based on LLVM infrastructure.

  1. It is implemented as a set of LLVM passes, to perform analysis and transformations on IR
  2. Polly is designed to optimize IR generated by Clang, and some passes in Polly rely on the results of general passes such as the loop pass and the SCEV pass.

However, translated IR is far more complicated than such unoptimized IR; hence, Polly always fails by directly taking the translated IR as input.

The key strategy is to recover the loop induction variables and memory access symbolic description. Based on recovered loop structure information, the translated IR is then transformed into an equivalent version with GEP instructions, which is suitable for polyhedral optimization on Polly.

2021 A Binary Translation Framework for Automated Hardware Generation

Nuno Paulino

As applications move to the edge, efficiency in computing power and power/energy consumption is required.

Heterogeneous computing promises to meet these requirements through application-specific hardware accelerators.

Runtime adaptivity might be of paramount importance to realize the potential of hardware specialization, but further study is required on workload retargeting and offloading to reconfigurable hardware.

This article presents our framework for the exploration of both offloading and hardware generation techniques.

The framework is currently able to process instruction sequences from MicroBlaze, ARMv8, and riscv32imaf binaries, and to represent them as Control and Dataflow Graphs for transformation to implementations of hardware modules.

We summarize the technology trends underlying the emergence of heterogeneous platforms and provide an overview of a framework for exploring binary translation-based acceleration techniques applicable to future self-adaptive systems that shift work to available reconfigurable resources at runtime.

These are the instruction stream extraction, segment detection, control and dataflow graph (CDFG) generation, and hardware generation stages.

2021 Helper Function Inlining in Dynamic Binary Translation [ACM SIGPLAN Compiler Construction CC‘21]

Wenwen Wang

To mitigate the engineering effort, helper functions are frequently employed during the development of a DBT system.

To solve this problem, this paper presents a novel approach to inline helper functions in DBT systems.

As a result, the performance overhead introduced by helper function calls can be reduced, and meanwhile, the benefits of helper functions for DBT development are not lost.

The approach automatically transforms the natively-compiled host code of a helper function into a form that is feasible to be smoothly inlined into the translated host binary code and executed safely in the code cache.

To solve this issue, our approach automatically scans the host assembly code to identify potentially problematic instructions and transforms them into instructions that can be correctly executed in the code cache. (For example, PC-relative instructions: scan through them once and rewrite them... essentially relocation!)

we need to insert extra assembly instructions before and after each call instruction to fulfill the duty of context switches. (For call instructions inside the helper, context-switch code is inserted around the call.)

we transform each return instruction into a direct branch instruction (return instructions are simply changed into direct branches).

There are two steps to inline a helper function. First, we copy the host binary code of the helper function from the loaded hfi file to the code cache. Second, we relocate the copied binary code based on the relocation records specified in the section of this helper function.
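
A minimal sketch of those two steps, assuming a simplified record format with absolute 64-bit relocations only; the actual hfi layout is the paper's and is not reproduced here.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct Reloc { std::size_t offset; uint64_t target; };   // patch site within the copy, final address

std::size_t inline_helper(const std::vector<unsigned char> &helper_code,
                          const std::vector<Reloc> &relocs,
                          std::vector<unsigned char> &code_cache) {
    // Step 1: copy the helper's host binary code into the code cache.
    std::size_t base = code_cache.size();
    code_cache.insert(code_cache.end(), helper_code.begin(), helper_code.end());

    // Step 2: relocate the copy using the helper's relocation records.
    for (const Reloc &r : relocs)
        std::memcpy(&code_cache[base + r.offset], &r.target, sizeof r.target);

    return base;   // offset of the inlined helper inside the code cache
}
```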

:star: :star: 2021 Enhancing Atomic Instruction Emulation for Cross-ISA Dynamic Binary Translation

Wenwen Wang

The hash table-based store test scheme (HST) is a software-based scheme. In order to correctly emulate the LL and SC instructions, the violation of their atomicity must be detected.

HST can be further optimized if there is additional hardware support, such as hardware transactional memory (HTM), to implement the critical section for the SC emulation.

Based on the analysis, the shared data can be modified by multiple threads using atomic instructions such as LL/SC during lock contention, and only be updated by the lock-owner thread with normal stores.

To monitor the changes to the atomic variable of LL/SC by the store instructions in non-transactional code regions, we can also employ the page protection mechanism in OS.

When emulating the LL instruction, in addition to accessing the hash table entry, it will set the page that contains the atomic variable to "read-only".
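
A minimal sketch of that page-protection idea on POSIX, assuming the SIGSEGV handler is installed elsewhere with sigaction(SA_SIGINFO); reservation bookkeeping and error handling are omitted.

```cpp
#include <csignal>
#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>

static void *page_of(void *addr) {
    uintptr_t p = reinterpret_cast<uintptr_t>(addr);
    return reinterpret_cast<void *>(p & ~static_cast<uintptr_t>(sysconf(_SC_PAGESIZE) - 1));
}

void on_ll(void *atomic_var) {
    // Besides filling the hash table entry (not shown), make the page read-only so that
    // any plain store to it by another thread traps.
    mprotect(page_of(atomic_var), sysconf(_SC_PAGESIZE), PROT_READ);
}

void segv_handler(int, siginfo_t *si, void *) {
    // A write hit the protected page: invalidate the LL/SC reservation here (not shown),
    // then restore write access so the faulting store can be retried.
    mprotect(page_of(si->si_addr), sysconf(_SC_PAGESIZE), PROT_READ | PROT_WRITE);
}
```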

2019 Heuristics to predict and eagerly translate code in DBTs [PhD]

University of Kansas School of Engineering

We explore the possibility of improving the performance of a DBT named DynamoRIO.

The basic idea is to improve its performance by speeding up the process of guest code translation through multiple threads translating multiple pieces of code concurrently.

The test-inputs with shorter run-times are more significantly impacted by the DynamoRIO overhead.

Based on that reasoning and the supporting experimental data, we decided to focus on test-inputs and to run further experiments without traces.

The idea is to translate basic blocks ahead of time, such that the required target code is ready to run when needed. For that to happen, we opted for parallelization of the compilation process through multiple compiler threads.

We call it eager translation.

The design to implement our eager translation entails capturing direct addresses and populating them in a list. That list of addresses would be processed by multiple compiler threads ahead of time, with appropriate synchronization among all threads. (This is not clearly explained: is the list (1) generated dynamically and consumed by multiple translation threads, or (2) discovered by scanning the program ahead of time and then handed to multiple threads at runtime?)

Based on the collected data, we found that the percentage of useful eager translations is quite low. (The fraction of useful eager translations turned out to be very low...)

We opted to explore data-mining techniques to generate heuristics/rule-sets that could be applied at run-time while an application runs under DynamoRIO.

The idea is to have a set of simple rules/heuristics as they would have to be applied at run-time without incurring much overhead.

Once the input training data was available from earlier benchmark runs, the next thing to do was to generate the rule-sets.

2019 Runtime Software Monitoring Based on Binary Code Translation for Real-Time Software [JIPS Journal of Information Processing Systems]

Software testing: static testing / runtime testing → automation: automatically generating test cases / automatically performing runtime tests.

Traditional runtime software testing based on manual work is very inefficient and time consuming. Hence, runtime test automation methodologies are in demand.

In this paper, we introduce a software testing system that translates a real-time software into a monitorable real-time software.

Monitorable real-time software means software that provides monitoring information at runtime.

The inputs of the system are the executable linkable format (ELF) binary code and the symbol of target function to be monitored.

The output of the system is the monitorable binary code.

2019 An Efficient Feedback Static Binary Translator for Solving Indirect Branch [Journal of Computer Research and Development]

State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou

2019 A Dynamic and Static Combined Register Mapping Method in Binary Translation [Journal of Computer Research and Development]

State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou

2019 Formal Model of Correctness and Optimization on Binary Translation

State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou

2019 Research on the Application of Block-Chaining Optimization in Dynamic Binary Translation

2020 An LLVM-Based Multi-Target High-Performance Dynamic Binary Translation Framework

2019 Dynamic Binary Instrumentation Based on QEMU [Journal of Computer Research and Development]

2019 An Optimization Method for SIMD Instruction Translation in the QEMU Translation System [Journal of Information Engineering University]

Optimizes QEMU's SIMD instruction translation with the Sunway processor as the host platform:

  1. Modify the existing helper functions
  2. Introduce new vector IR to improve translation efficiency

2021 An Intermediate-Code Optimization Method for Binary Translation Based on IR Rule Replacement [Journal of National University of Defense Technology]

Again targets the Sunway processor.

2019 Function Call Tracing Based on Dynamic Binary Translation and Instrumentation [Journal of Computer Research and Development]

State Key Laboratory of Information System Security Technology

(The second sentence of the abstract has no subject...)

Dynamic function call tracing is an important means of debugging the Linux kernel.

To address the limited platform support and low runtime efficiency of existing dynamic tracing tools, a dynamic function call tracing tool supporting multiple instruction sets is designed and implemented based on binary translation.

gprof only supports launching user-space executables and cannot be used for kernel analysis;

ftrace requires build options to be enabled when compiling the kernel, which prevents several compiler optimizations;

systemtap and dtrace require interfaces provided by the running operating system and cannot be used during the OS boot phase.

2019 Research on Dynamic Binary Translation Optimization Based on Decode-Guided Techniques [Computer Knowledge and Technology]

(What is this title even supposed to mean... it doesn't say anything in plain language.)

2021 A Lightweight Dynamic Binary Instrumentation Algorithm for Embedded Software [Theoretical Research]

To track the dynamic execution process and obtain runtime information, a remote monitoring process is set up to monitor breakpoint triggers.

To determine where breakpoints should be inserted, a static analysis algorithm is designed to compute the control-flow graph and the set of instrumentation points.

2021 Research on Real-Time Simulation of the RISC-V Instruction Set on Sunway CPUs [UESTC master's thesis]

Design and implementation of a binary translator from RISC-V to Sunway (it can run simple programs):

  1. Study the RISC-V and Sunway instruction sets
  2. Implement the binary translator
  3. Write tests