Commit Graph

196 Commits

Author SHA1 Message Date
Nolan Leake
130d2c7589 Error if function using indirect jmp touches redzone
The indirect jmp mitigation clobbers the redzone, so
verify that that is harmless.
2020-07-07 14:29:31 -07:00
Nolan Leake
fe8082af97 Comment fix. 2020-05-14 16:23:55 -07:00
Nolan Leake
faef3fcd5b Detect rep; movs as a load from memory.
For some reason, LLVM only sees a non-rep'd movs as a load,
so we special case them.
2020-05-14 15:15:50 -07:00
Nolan Leake
c1312e2956 Switch from not; not; lfence; ret to shl; lfence; ret.
Intel updated their guidance. The new mitigation is shorter,
faster, and easier to verify.
2020-05-14 15:09:45 -07:00
Nolan Leake
f6d8aed1a9 Verifier caught a bug in two of the mitigations.
Missed an lfence after a push reads from memory.
2020-05-14 14:31:19 -07:00
Nolan Leake
f40c55f512 Fix detection of loads.
BOLT has an isLoad() function, but it seems to intentionally ignore
some loads from memory. Add an isActualLoad() and use that instead.
2020-05-14 11:20:12 -07:00
Nolan Leake
061fb7d1e7 Extend LFence insertion pass to mitigate LVI. 2020-04-30 16:56:16 -07:00
Nolan Leake
0655e9a71f add opt-in LFenceInsertion pass for spectre mitigation 2019-08-26 15:30:53 -07:00
Jeffrey Griffin
9d4a9c0e82 follow PIC table jump-on register across reg2reg moves 2019-08-26 11:19:26 -07:00
Rafael Auler
7e63dc16ee Fix aggregator w.r.t. split functions
Summary:
We should not rely on split function detection while aggregating
data, but only look up the original function names in the symbol table.
Split function detection should be done by BOLT and not perf2bolt while
writing the profile. Then, BOLT, when reading it, will take care of
combining functions if necessary.

This caused a bug in bolted data collection where samples in cold parts
of a function were being falsely attributed to the hot part of a function
instead of being attributed to the cold part, causing incorrect translation of
addresses.

Reviewed By: maksfb

Differential Revision: D16993065

fbshipit-source-id: 71022fe1184
2019-08-23 14:50:00 -07:00
Maksim Panchenko
7fd4544586 Tighter control of jump table detection
Summary:
We were too permissive by allowing more jump tables during the
preliminary scan of memory. This allowed for jump tables to be
falsely detected. And since we didn't have a way to backtrack
the jump table creation, we had to assert.

This diff refactors the code that analyzes jump table contents.
Preliminary and final passes share the same code. The only difference
should be the detection of instruction boundaries that are available
during the final pass.

This should affect strict relocation mode only.

Reviewed By: rafaelauler

Differential Revision: D16923335

fbshipit-source-id: 9399fa97f57
2019-08-22 16:07:13 -07:00
Maksim Panchenko
753c1e1ee4 Fix misleading output
Summary:
BOLT prints "spawning thread to pre-process profile" message even when
it is not running in the aggregation mode. Fix that.

Reviewed By: WenleiHe

Differential Revision: D16908596

fbshipit-source-id: f788ed59bfa
2019-08-20 16:47:12 -07:00
Rafael Auler
d36e6fc435 Encode instrumentation tables in file
Summary:
Avoid directly allocating string and description tables in
binary's static data region, since they are not needed during runtime
except when writing the profile at exit. Change the runtime library to
open the tables on disk and read only when necessary.

Reviewed By: maksfb

Differential Revision: D16626030

fbshipit-source-id: 16664b1fc03
2019-08-14 14:04:09 -07:00
Rafael Auler
c6fa8fb91d Support instrumentation via runtime library
Summary:
To allow the development of future instrumentation work, this
patch adds support in BOLT for linking arbitrary libraries into the
binary processed by BOLT. We use orc relocation handling mechanism for
that. With this support, this patch also moves code programmatically
generated in X86 assembly language by X86MCPlusBuilder to C code written
in a new library called bolt_rt. Change CMake to support this library as
an external project in the same way as clang does with compiler_rt. This
library is installed in the lib/ folder relative to BOLT root
installation and by default instrumentation will look for the library
at that location to finish processing the binary with instrumentation.

Reviewed By: maksfb

Differential Revision: D16572013

fbshipit-source-id: ed9ae63969f
2019-08-14 12:10:20 -07:00
laith sakka
6a339b9949 update parallel_bolt_hhvm.test
Summary: update parallel_bolt_hhvm.test

Reviewed By: maksfb

Differential Revision: D16655093

fbshipit-source-id: 1a305543a2f
2019-08-07 08:39:21 -07:00
laith sakka
f353064d08 Add test for parallel mode
Summary:
Add a flag that disables writing the bolt-info section,
and add a test that runs BOLT in two modes, parallel
and sequential, and asserts that the resulting binaries
are the same.

Reviewed By: maksfb

Differential Revision: D16575440

fbshipit-source-id: d0fa7c94bdd
2019-08-02 13:09:24 -07:00
laith sakka
5a485bd2d8 Rewrite frame analysis using parallel utilities
Summary: Rewrite frame analysis using parallel utilities

Reviewed By: maksfb

Differential Revision: D16499130

fbshipit-source-id: c89b033be47
2019-08-02 13:09:23 -07:00
laith sakka
1895790cab Rewrite ICF using parallel utilities
Summary: Rewrite ICF using parallel utilities

Reviewed By: maksfb

Differential Revision: D16472975

fbshipit-source-id: 122e0363447
2019-08-02 13:09:23 -07:00
Maksim Panchenko
b782793df5 Add option to verify instruction encoder/decoder
Summary:
Add option `-check-encoding` to verify if the input to LLVM disassembler
matches the output of the assembler. When set, the verification runs on
every instruction in processed functions.

I'm not enabling the option by default as it could be quite noisy on x86
where instruction encoding is ambiguous and can include redundant
prefixes.

Reviewed By: rafaelauler

Differential Revision: D16595415

fbshipit-source-id: efee735d9ac
2019-08-01 14:10:36 -07:00
Maksim Panchenko
490fedfb4b Enforce strict mode for perf2bolt
Summary:
In strict relocation mode, we get better function coverage. However, if
the profile used for optimization was converted using non-strict mode,
then it wouldn't match functions exclusive to strict mode. Hence,
we have to enforce strict relocation mode for profile conversion, so it
can be used for either mode.

I'm also adding parallel profile pre-processing unless `--no-threads` is
specified. This masks the runtime overhead of function disassembly on
multi-core machines.

Reviewed By: rafaelauler

Differential Revision: D16587855

fbshipit-source-id: cc2e352b95f
2019-07-31 16:21:26 -07:00
laith sakka
b2258a5314 Fix race condition in buildCFG
Summary:
Switch to sequential execution when print-all is passed,
since the function getDynoStats has an unsafe access
to the annotation allocators.

Reviewed By: maksfb

Differential Revision: D16503502

fbshipit-source-id: 684b0ebde1f
2019-07-30 16:58:20 -07:00
laith sakka
f069640bf3 Run hfsort+ in parallel
Summary:
hfsort+ performs an expensive analysis to determine the
new order of the functions. 99% of the time during hfsort+
is spent in the function runPassTwo. This diff runs the body
of the hot loop in runPassTwo in parallel, speeding up the
total runtime of the reorder-functions pass by up to 4x.

Reviewed By: maksfb

Differential Revision: D16450780

fbshipit-source-id: f80963655c6
2019-07-30 16:11:16 -07:00
Maksim Panchenko
75a18d4aad Add code padding verification
Summary:
In non-relocation mode, we allow data objects to be embedded in the
code. Such objects could be unmarked, and could occupy an area between
functions, the area which is considered to be code padding.

When we disassemble code, we detect references into the padding area
and adjust it, so that it is not overwritten during the code emission.
We assume the reference to be pointing to the beginning of the object.

However, assembly-written functions may reference the middle of an
object and use negative offsets to reference data fields. Thus,
conservatively, we reduce the possibly-overwritten padding area to
a minimum if the object reference was detected.

Since we also allow functions with unknown code in non-relocation mode,
it is possible that we miss references to some objects in code.
To cover such cases, we need to verify the padding area before we
allow it to be overwritten.

Differential Revision: D16477787

fbshipit-source-id: 4dc45a023ee
2019-07-25 11:42:38 -07:00
Maksim Panchenko
9e322d2a6d Fix processing PLT without relocs
Summary:
Some binaries may not have a relocation section corresponding to PLT.
Handle them properly.

Differential Revision: D16477841

fbshipit-source-id: 8349ede6bd0
2019-07-24 23:55:26 -07:00
Maksim Panchenko
e8144c0d6d Fix white space
Reviewed By: modocache

Differential Revision: D16473918

fbshipit-source-id: 16fd805a85a
2019-07-24 18:12:33 -07:00
laith sakka
7c8ea6450e Run findSubprograms in preprocessDebugInfo in parallel
Summary:
While reading debug info the function findSubprograms
runs on each compilation unit. This diff parallelizes that loop,
reducing its runtime by 70%.

Reviewed By: rafaelauler, maksfb

Differential Revision: D16362867

fbshipit-source-id: 65868efaf3a
2019-07-24 17:27:21 -07:00
laith sakka
b965f62407 Lock-based parallelization for updateDebugInfo
Summary:
BOLT spends a decent amount of time creating patches to update
debug information when -update-debug-sections is passed.
In updateDebugInfo patches are created to update .debug_info
and .debug_abbrev sections while .debug_loc and .debug_ranges
contents are populated. This diff uses a lock-based approach to
parallelize updateDebugInfo functions and reduces the runtime of the
function by around 30%.

Reviewed By: maksfb

Differential Revision: D16352261

fbshipit-source-id: 58779864b73
2019-07-24 16:47:18 -07:00
Pierre RAMOIN
495f6c1737 Target compilation based on LLVM CMake configuration (#60)
Summary:
Minimalist implementation of target configurable compilation.

Fixes https://github.com/facebookincubator/BOLT/issues/59
Pull Request resolved: https://github.com/facebookincubator/BOLT/pull/60

Reviewed By: maksfb

Differential Revision: D16461879

Pulled By: maksfb

fbshipit-source-id: b7888f8dbd4
2019-07-24 14:11:28 -07:00
Maksim Panchenko
108b67d892 Fix issue printing CTCs without annotations
Summary:
After stripping annotations, conditional tail calls no longer can be
identified by their corresponding tag. We can check the number of basic
block successors instead.

Fixes #58.

Reviewed By: rafaelauler

Differential Revision: D16444718

fbshipit-source-id: cd5f3ddf046
2019-07-23 14:49:50 -07:00
laith sakka
7d5b72ea49 Run shrink wrapping in parallel
Summary:
Shrink wrapping is an expensive part of frame optimizations if
performed on all functions. This diff makes it run in parallel,
reducing wall time.

Reviewed By: rafaelauler

Differential Revision: D16092651

fbshipit-source-id: d8c278dbf3d
2019-07-19 22:37:41 -07:00
laith sakka
b8a5ae9471 Run buildCFG in disassembly in parallel
Summary:
This diff parallelizes the construction of the call graph during disassembly.
It also adds another interface to parallel-utilities that supports running
tasks on binaryFunctions that involve adding instruction annotations.
This pattern is common in different places, e.g. frame optimizations,
and so it justifies creating an interface that abstracts out all the
messy details.

Reviewed By: rafaelauler

Differential Revision: D16232809

fbshipit-source-id: bf9261747ce
2019-07-19 22:37:41 -07:00
laith sakka
5fa33f3084 run finalize functions in parallel
Summary:
Differential Revision: D16188733

fbshipit-source-id: 26dfcbe623c
2019-07-16 14:21:22 -07:00
laith sakka
a1f4793b70 run aligner pass in parallel
Summary: This diff parallelizes the aligner pass.

Reviewed By: rafaelauler

Differential Revision: D16176327

fbshipit-source-id: 6a0a21178ba
2019-07-16 14:21:22 -07:00
laith sakka
227b921898 Run reorder blocks in parallel
Summary:
This diff changes the reorderBasicBlocks pass to run in parallel.
It does so by adding locks to the fix branches function
and creating temporary MCCodeEmitters when estimating basic block code size.

Differential Revision: D16161149

fbshipit-source-id: ef3774c51cd
2019-07-15 19:31:24 -07:00
Rafael Auler
acb8749ccf Support duplicating jump tables
Summary:
If two indirect branches use the same jump table, we need to
detect this and duplicate jump tables so we can modify this CFG
correctly. This is necessary for instrumentation and shrink wrapping.
For the latter, we only detect this and bail, fixing this old known
issue with shrink wrapping.

Other minor changes to support better instrumentation: add an option
to instrument only hot functions, add LOCK prefix to instrumentation
increment instruction, speed up splitting critical edges by avoiding
calling recomputeLandingPads() unnecessarily.

Differential Revision: D16101312

fbshipit-source-id: afc10555b22
2019-07-11 16:12:15 -07:00
Rafael Auler
e0fe32729a Restrict creation of jump tables
Summary:
Heuristic that creates a jump table for every memory access,
including those we do not match against a pattern in an indirect jump,
is too permissive and has false positives. Guard this logic under
strict mode until we figure out a better strategy.

Differential Revision: D16192205

fbshipit-source-id: 4046f985290
2019-07-11 15:06:44 -07:00
laith sakka
5de692014f Create a general interface to implement parallel tasks easily and apply it to run EliminateUnreachableBlocks in parallel.
Summary:
Each time we run some work in parallel over the list of functions in bolt, we manage a thread pool, task scheduling and perform some work to manage the granularity of the tasks based on the type of the work we do.

In this task, I am creating an interface where all those details are abstracted out: the user provides the function that will run on each function, and some policy parameters that set up the scheduling and granularity configurations.

This will make it easier to implement parallel tasks, and eliminate redundant coding efforts.
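
As a rough illustration of the kind of interface described above, here is a minimal Python sketch (BOLT itself is C++). The name `run_on_each_function` and the parameters `skip`, `num_threads`, and `chunk_size` are hypothetical and are not taken from BOLT's sources; they only stand in for "the function that will run on each function" and the "policy parameters" mentioned in the summary.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_each_function(functions, work, *, skip=None, num_threads=4, chunk_size=1):
    """Apply `work` to every function not rejected by `skip`, in parallel.

    All names here are hypothetical; they are not BOLT's actual API.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    todo = [f for f in functions if skip is None or not skip(f)]
    # Group functions into chunks so each scheduled task is coarse enough
    # to amortize scheduling overhead (the "granularity" policy).
    chunks = [todo[i:i + chunk_size] for i in range(0, len(todo), chunk_size)]

    def run_chunk(chunk):
        return [work(f) for f in chunk]

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() yields results in submission order, so the output is deterministic.
        return [r for part in pool.map(run_chunk, chunks) for r in part]


# Usage: count the "instructions" (list length) of each non-empty function.
functions = [["push", "ret"], [], ["nop"], ["mov", "add", "ret"]]
sizes = run_on_each_function(functions, len, skip=lambda f: not f, chunk_size=2)
print(sizes)  # [2, 1, 3]
```

The caller only supplies the per-function work and a few policy knobs; pool management and chunking stay inside the helper, which is the point the summary makes.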

Reviewed By: rafaelauler

Differential Revision: D16116077

fbshipit-source-id: cb91acd8481
2019-07-10 13:53:53 -07:00
laith sakka
bce6479435 Run cleanAnnotations within frame analysis in parallel
Summary: This diff parallelizes the function FrameAnalysis::cleanAnnotations().

Reviewed By: rafaelauler

Differential Revision: D16096711

fbshipit-source-id: 957784396ac
2019-07-03 19:05:14 -07:00
laith sakka
01e75a8c9f Clean SPTMap in frame analysis in parallel
Summary:
This diff parallelizes the STPClean() function, reducing its runtime from 5 seconds to 0.4 seconds on HHVM
and bringing the runtime of the frame optimizer down to 33 seconds on HHVM.

Reviewed By: rafaelauler

Differential Revision: D15914371

fbshipit-source-id: 0b5916e09d4
2019-07-03 18:19:10 -07:00
laith sakka
c8cda8a4be run SPT in parallel, and split annotation allocator
Summary:
This diff includes two main changes:
1) When creating an annotation, a dedicated annotation allocator can be used instead of the default allocator. This allows some annotations to be deallocated completely right after the end of their usage. Furthermore, having the ability to use dedicated allocators allows running SPT in parallel where each task uses a different allocator.

2) SPT is parallelized.

Differential Revision: D15913492

fbshipit-source-id: 8c8c06ec2f7
2019-07-03 18:19:10 -07:00
Wenlei He
8dfe2267b8 Prioritize Jump Table ICP target by frequency and index count
Summary:
We select the top hot targets for indirect call promotion. But since we only have frequency for targets, not for actual jump table indices, we have to merge indices that share the same actual target. In order to do that, we sort targets by pointer of target symbol before merging, which introduces instability. Later we stable sort merged targets by frequency. Due to the instability of sorting pointers, and depending on how many indices each merged target has, we could end up with unstable ICP.

This commit changes the second-pass sorting to prioritize targets with fewer indices and higher mispredicts, in addition to higher frequency. It improves the stability of ICP and also exposes more ICP, because targets with fewer indices have a better chance of getting promoted.
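
The second-pass ordering can be pictured with a small Python sketch (BOLT itself is C++). The `MergedTarget` type and its fields are hypothetical, and the exact precedence of the tie-breakers (frequency first, then fewer indices, then more mispredicts) is an assumption; the summary only says that all three criteria are used.

```python
from dataclasses import dataclass, field


@dataclass
class MergedTarget:
    # Hypothetical record for one merged jump-table target.
    name: str
    frequency: int
    mispredicts: int
    indices: list = field(default_factory=list)


def rank_targets(targets):
    # Assumed precedence: higher frequency, then fewer jump-table indices,
    # then more mispredicts. Every component of the key is a plain number,
    # so the order does not depend on pointer values or on input order.
    return sorted(
        targets,
        key=lambda t: (-t.frequency, len(t.indices), -t.mispredicts),
    )


targets = [
    MergedTarget("A", frequency=100, mispredicts=5, indices=[0, 1, 2]),
    MergedTarget("B", frequency=100, mispredicts=5, indices=[3]),
    MergedTarget("C", frequency=100, mispredicts=9, indices=[4]),
    MergedTarget("D", frequency=200, mispredicts=1, indices=[5, 6]),
]
ranked = [t.name for t in rank_targets(targets)]
print(ranked)  # ['D', 'C', 'B', 'A']
```

Because ties on frequency are now broken by deterministic fields rather than by the order left over from sorting pointers, two runs over the same profile should produce the same ranking.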

Reviewed By: rafaelauler

Differential Revision: D16099701

fbshipit-source-id: 04ade87ff82
2019-07-02 23:26:28 -07:00
Maksim Panchenko
5bdb9857c2 Fix out-of-bounds entry points
Summary:
Check that a symbol address is less than the next function
address before considering it for a secondary entry.
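
A minimal Python sketch of the bounds check (BOLT itself is C++). The function and parameter names are hypothetical, and the lower bound (the symbol must lie past the start of the containing function) is an assumption added for completeness; the summary only states the upper bound.

```python
def is_secondary_entry_candidate(symbol_addr, func_addr, next_func_addr):
    """Return True if the symbol may be a secondary entry into the function.

    Hypothetical helper: the symbol has to fall strictly inside the range
    (func_addr, next_func_addr), i.e. before the next function begins.
    """
    return func_addr < symbol_addr < next_func_addr


inside = is_secondary_entry_candidate(0x1010, 0x1000, 0x1040)
out_of_bounds = is_secondary_entry_candidate(0x1040, 0x1000, 0x1040)
print(inside, out_of_bounds)  # True False
```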

Differential Revision: D16056468

fbshipit-source-id: 533417088e3
2019-07-01 09:22:02 -07:00
Maksim Panchenko
303346b57c Introduce strict relocation mode
Summary:
In strict relocation mode we rely on relocations to represent all
possible entry points into a function. Most of the code generated by
tested compilers (gcc and clang) will result in relocations against
any internal labels for jump tables and for computed goto tables.

In situations where we cannot properly reconstruct a jump table, or when
we cannot determine a table that guides an indirect jump, e.g. when
multiple computed goto tables are used, we conservatively assume that
the indirect jump can end up at any possible basic block referenced by
relocations.

In strict mode, simple functions may include the aforementioned
instructions with unknown control flow with a conservative list of
destinations added to the containing basic block. This allows us to
expand coverage of simple functions and to enable code reordering
optimizations for more functions.

The strict mode is recommended when BOLT is used with well-formed
code generated by a compiler.

To use the strict mode, add "-strict" on the command line.

Another effect of this diff is that with relocations, we will always
replace the immediate operand of an instruction with a symbol if the
relocation exists against this operand.

Also this diff fixes issues with Clang compiled with -fpic.

Reviewed By: rafaelauler

Differential Revision: D15872849

fbshipit-source-id: e49b1a67f05
2019-06-28 10:22:05 -07:00
Maksim Panchenko
86de981e90 Ignore false function references
Summary:
A relocation can have an addend that makes it look as if the relocated
value is in a different section from the symbol being relocated.
E.g., a relocation against a variable in .rodata could have a negative
offset that will make it look like it is against a symbol in .text
(a section that typically precedes .rodata).

Unless the relocation is against a section symbol, we know
exactly the symbol that is being relocated and there is no issue.
However, when the linker leaves only a section relocation (i.e. a
relocation against a section symbol when a temporary original symbol
gets deleted), we have to guess the relocated symbol, and can falsely
detect a function reference in the case described above.

The fix is to keep a section relocation if the corresponding
relocated value falls into a different section, and to detect and
ignore false function references.
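
The situation described above can be illustrated with a short Python sketch (BOLT itself is C++). The section layout, addresses, and helper names are made up for the example; the only idea taken from the summary is comparing the section of the relocated symbol with the section that the symbol-plus-addend value falls into.

```python
# Made-up layout: half-open [start, end) address ranges, .text before .rodata.
SECTIONS = {".text": (0x1000, 0x2000), ".rodata": (0x2000, 0x3000)}


def section_of(addr, sections=SECTIONS):
    """Return the name of the section containing addr, or None."""
    for name, (start, end) in sections.items():
        if start <= addr < end:
            return name
    return None


def crosses_section(symbol_addr, addend, sections=SECTIONS):
    """True if symbol + addend lands in a different section than the symbol."""
    return section_of(symbol_addr, sections) != section_of(symbol_addr + addend, sections)


# A variable near the start of .rodata referenced with a negative addend:
# the relocated value 0x1ff8 looks like an address in .text.
looks_like_text = crosses_section(0x2008, -0x10)
stays_in_rodata = crosses_section(0x2008, 0x10)
print(looks_like_text, stays_in_rodata)  # True False
```

When only a section relocation is left, a value that crosses sections like this is the case where guessing the symbol from the value alone would report a function reference that does not exist.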

Reviewed By: rafaelauler

Differential Revision: D16030791

fbshipit-source-id: fbe5bca9453
2019-06-27 15:26:30 -07:00
wenlei
9c0b72b76c Force non-relocation mode for heatmap generation
Summary: BOLT operates in relocation mode by default when .reloc is in the binary. This change disables relocation mode for heatmap generation so we can use that for more cases. There's a small separate change that ignores a zero-sized symbol in a zero-sized code section during function discovery.

Reviewed By: maksfb

Differential Revision: D16009610

fbshipit-source-id: 4c896321a1a
2019-06-27 15:05:39 -07:00
Rafael Auler
c8aea9f568 Initial experimental instrumentation pass
Summary:
An instrumentation pass that modifies the input binary to
generate a profile after execution finishes. It modifies branches to
increment counters stored in the process memory and injects a new
function that dumps this data to an fdata file, readable by BOLT.

This instrumentation is experimental and currently uses a naive
approach where every branch is instrumented. This is not ideal for
runtime performance, but should be good enough for us to
evaluate/debug LBR profile quality against instrumentation.

Does not support instrumenting indirect calls yet, only direct
calls, direct branches and indirect local branches.

Differential Revision: D15998096

fbshipit-source-id: d79bbb5fb0d
2019-06-27 11:44:56 -07:00
Rafael Auler
e67d72ca13 Ignore empty funcs in relocation mode
Summary:
Make BOLT ignore empty functions (those containing no instructions,
despite having some space allocated to them filled with zeroes).

Reviewed By: ricklavoie

Differential Revision: D15981683

fbshipit-source-id: beaa5e33644
2019-06-24 21:08:18 -07:00
Rafael Auler
9058415d17 Add option to print profile bias stats
Summary:
Profile bias may happen depending on the hardware counter used
to trigger LBR sampling, on the hardware implementation and as an
intrinsic characteristic of relying on LBRs. Since we infer fall-through
execution and these non-taken branches take zero hardware resources to
be represented, LBR-based profile likely overrepresents paths with fall
throughs and underrepresents paths with many taken branches. This patch
adds an option to print statistics about profile bias so we can better
understand these biases.

The goal is to analyze differences in the sum of the frequency of all
incoming edges in a basic block versus the sum of all outgoing. In an
ideally sampled profile, these differences should be close to zero. With
this option, the user gets the mean of these differences in flow as a
percentage of the input flow. For example, if this number is 15%, it
means, on average, a block observed 15% more or less flow going out of
it in comparison with the flow going in. We also print the standard
deviation so we can have an idea of how spread apart are different
measurements of flow differences. If variance is low, it means the
average bias is happening across all blocks, which is compatible with
using LBRs. If the variance is high, it means some blocks in the profile
have a much higher bias than others, which is compatible with using a
biased event such as cycles to sample LBRs because it overrepresents
paths that end in an expensive instruction.
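
A worked sketch of the statistic in Python (BOLT itself is C++). Several details are assumptions, not taken from the patch: the per-block difference is the absolute value of outgoing minus incoming flow divided by incoming flow, blocks with zero incoming flow are skipped, and the spread is the population standard deviation.

```python
from statistics import mean, pstdev


def flow_bias(blocks):
    """Return (mean, std dev) of relative flow imbalance over basic blocks.

    blocks: iterable of (incoming, outgoing) edge-frequency sums per block.
    Assumptions: imbalance is |out - in| / in, and blocks without incoming
    flow are ignored because the ratio is undefined for them.
    """
    diffs = [abs(out - inc) / inc for inc, out in blocks if inc > 0]
    if not diffs:
        return 0.0, 0.0
    return mean(diffs), pstdev(diffs)


# Three blocks: 15% more flow out, 15% less flow out, perfectly balanced.
blocks = [(100, 115), (200, 170), (50, 50)]
avg, spread = flow_bias(blocks)
print(f"mean bias: {avg:.1%}, std dev: {spread:.1%}")
```

A low standard deviation next to a non-zero mean would match the "bias spread across all blocks" reading in the summary, while a high one points to a few blocks carrying most of the imbalance.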

Reviewed By: maksfb

Differential Revision: D15790517

fbshipit-source-id: b304a4f07b9
2019-06-21 14:35:49 -07:00
laith sakka
8c428d62d7 Parallelize ICF Pass
Summary:
ICF consumes 10-15% of BOLT runtime; for HHVM that is around 45 seconds.
This diff parallelizes parts of the pass to make it faster.
A 60% reduction in the ICF runtime is measured on the parallel version for HHVM.

Reviewed By: maksfb

Differential Revision: D15589515

fbshipit-source-id: 412861f510a
2019-06-17 16:26:29 -07:00
Maksim Panchenko
86ed045912 Check instruction boundaries while populating jump tables
Summary:
Now that we populate jump tables after all functions are disassembled,
we can check for instruction boundaries corresponding to jump table
entries. No need to delegate this task to postProcessJumpTables().

Reviewed By: rafaelauler

Differential Revision: D15814762

fbshipit-source-id: 418c58b33e5
2019-06-13 16:29:04 -07:00