
Launching Ignition and TurboFan


Today we are excited to announce the launch of a new JavaScript execution pipeline for V8 5.9 that will reach Chrome Stable in M59. With the new pipeline, we achieve big performance improvements and significant memory savings on real-world JavaScript applications. We’ll discuss the numbers in more detail at the end of this post, but first let’s take a look at the pipeline itself.

The new pipeline is built upon Ignition, V8’s interpreter, and TurboFan, V8’s newest optimizing compiler. These technologies should be familiar to those of you who have followed the V8 blog over the last few years, but the switch to the new pipeline marks a big new milestone for both.

For the first time, Ignition and TurboFan are used universally and exclusively for JavaScript execution in V8 5.9. Furthermore, starting with 5.9, Full-codegen and Crankshaft, the technologies that served V8 well since 2010, are no longer used in V8 for JavaScript execution, since they no longer are able to keep pace with new JavaScript language features and the optimizations those features require. We plan to remove them completely very soon. That means that V8 will have an overall much simpler and more maintainable architecture going forward.

A Long Journey


The combined Ignition and TurboFan pipeline has been in development for almost 3½ years. It represents the culmination of the collective insight that the V8 team has gleaned from measuring real-world JavaScript performance and carefully considering the shortcomings of Full-codegen and Crankshaft. It is a foundation with which we will be able to continue to optimize the entirety of the JavaScript language for years to come.

The TurboFan project originally started in late 2013 to address the shortcomings of Crankshaft. Crankshaft can only optimize a subset of the JavaScript language. For example, it was not designed to optimize JavaScript code using structured exception handling, i.e. code blocks demarcated by JavaScript’s try, catch, and finally keywords. It is difficult to add support for new language features in Crankshaft, since these features almost always require writing architecture-specific code for nine supported platforms. Furthermore, Crankshaft’s architecture is limited in the extent that it can generate optimal machine code. It can only squeeze so much performance out of JavaScript, despite requiring the V8 team to maintain more than ten thousand lines of code per chip architecture.

TurboFan was designed from the beginning not only to optimize all of the language features found in the JavaScript standard at the time, ES5, but also all the future features planned for ES2015 and beyond. It introduces a layered compiler design that enables a clean separation between high-level and low-level compiler optimizations, making it easy to add new language features without modifying architecture-specific code. TurboFan adds an explicit instruction selection compilation phase that makes it possible to write far less architecture-specific code for each supported platform in the first place. With this new phase, architecture-specific code is written once and it rarely needs to be changed. These and other decisions lead to a more maintainable and extensible optimizing compiler for all of the architectures that V8 supports.

The original motivation behind V8’s Ignition interpreter was to reduce memory consumption on mobile devices. Before Ignition, the code generated by V8’s Full-codegen baseline compiler typically occupied almost one third of the overall JavaScript heap in Chrome. That left less space for a web application’s actual data. When Ignition was enabled for Chrome M53 on Android devices with limited RAM, the memory footprint required for baseline, non-optimized JavaScript code shrank by a factor of nine on ARM64-based mobile devices.

Later the V8 team took advantage of the fact that Ignition’s bytecode can be used to generate optimized machine code with TurboFan directly rather than having to re-compile from source code as Crankshaft did. Ignition’s bytecode provides a cleaner and less error-prone baseline execution model in V8, simplifying the deoptimization mechanism that is a key feature of V8’s adaptive optimization. Finally, since generating bytecode is faster than generating Full-codegen’s baseline compiled code, activating Ignition generally improves script startup times and in turn, web page loads.

By coupling the design of Ignition and TurboFan closely, there are even more benefits to the overall architecture. For example, rather than writing Ignition’s high-performance bytecode handlers in hand-coded assembly, the V8 team instead uses TurboFan’s intermediate representation to express the handlers’ functionality and lets TurboFan do the optimization and final code generation for V8’s numerous supported platforms. This ensures Ignition performs well on all of V8’s supported chip architectures while simultaneously eliminating the burden of maintaining nine separate platform ports.

Running the Numbers


History aside, now let’s take a look at the new pipeline’s real-world performance and memory consumption.

The V8 team continually monitors the performance of real-world use cases using the Telemetry - Catapult framework. Previously in this blog we’ve discussed why it’s so important to use the data from real-world tests to drive our performance optimization work and how we use WebPageReplay together with Telemetry to do so. The switch to Ignition and TurboFan shows performance improvements in those real-world test cases. Specifically, the new pipeline results in significant speed-ups on user interaction story tests for well-known websites:

Reduction in time spent in V8 for user interaction benchmarks

Although Speedometer is a synthetic benchmark, we’ve previously uncovered that it does a better job of approximating the real-world workloads of modern JavaScript than other synthetic benchmarks. The switch to Ignition and TurboFan improves V8’s Speedometer score by 5%-10%, depending on platform and device.

The new pipeline also speeds up server-side JavaScript. AcmeAir, a benchmark for Node.js that simulates the server backend implementation of a fictitious airline, runs more than 10% faster using V8 5.9.

Improvements on Web and Node.js benchmarks

Ignition and TurboFan also reduce V8’s overall memory footprint. In Chrome M59, the new pipeline slims V8’s memory footprint on desktop and high-end mobile devices by 5-10%. This reduction is a result of bringing the Ignition memory savings that have been previously covered in this blog to all devices and platforms supported by V8.

These improvements are just the start. The new Ignition and TurboFan pipeline paves the way for further optimizations that will boost JavaScript performance and shrink V8’s footprint in both Chrome and in Node.js for years to come. We look forward to sharing those improvements with you as we roll them out to developers and users. Stay tuned.

Posted by the V8 team

V8 Release 6.0

Every six weeks, we create a new branch of V8 as part of our release process. Each version is branched from V8’s git master immediately before a Chrome Beta milestone. Today we’re pleased to announce our newest branch, V8 version 6.0, which will be in beta until it is released in coordination with Chrome 60 Stable in several weeks. V8 6.0 is filled with all sorts of developer-facing goodies. We’d like to give you a preview of some of the highlights in anticipation of the release.

SharedArrayBuffers

V8 6.0 introduces support for SharedArrayBuffer, a low-level mechanism to share memory between JavaScript workers and synchronize control flow across workers. SharedArrayBuffers give JavaScript access to shared memory, atomics, and futexes. SharedArrayBuffers also unlock the ability to port threaded applications to the web via asm.js or WebAssembly.
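Below is a minimal sketch of what the API shape looks like; the worker.js file and its message handling are assumptions for illustration, not part of this release note:

// main.js
const sab = new SharedArrayBuffer(Int32Array.BYTES_PER_ELEMENT);
const counter = new Int32Array(sab);
const worker = new Worker('worker.js');
worker.postMessage(sab); // the worker receives a handle to the *same* memory

// worker.js (hypothetical)
// onmessage = (event) => {
//   const counter = new Int32Array(event.data);
//   Atomics.add(counter, 0, 1); // atomic read-modify-write, visible to all agents
// };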

For a brief, low-level tutorial, see the spec tutorial page or consult the Emscripten documentation for porting pthreads.

Object rest/spread properties

This release introduces rest properties for object destructuring assignment and spread properties for object literals. Object rest/spread properties are Stage 3 ES.next features.

Spread properties also offer a terse alternative to Object.assign() in many situations.

// Rest properties for object destructuring assignment:
const person = {
  firstName: 'Sebastian',
  lastName: 'Markbåge',
  country: 'USA',
  state: 'CA',
};
const { firstName, lastName, ...rest } = person;
console.log(firstName); // Sebastian
console.log(lastName); // Markbåge
console.log(rest); // { country: 'USA', state: 'CA' }

// Spread properties for object literals:
const personCopy = { firstName, lastName, ...rest };
console.log(personCopy);
// { firstName: 'Sebastian', lastName: 'Markbåge', country: 'USA', state: 'CA' }
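To illustrate the Object.assign() comparison mentioned above, here is a small sketch (the defaults object and its property names are made up for the example):

// Spread properties as a terse alternative to Object.assign():
const defaults = { country: 'USA', state: 'CA' };
const viaAssign = Object.assign({}, defaults, { firstName: 'Sebastian' });
const viaSpread = { ...defaults, firstName: 'Sebastian' };
// Both produce { country: 'USA', state: 'CA', firstName: 'Sebastian' }.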

For more information, see the usage post on Web Fundamentals.

ES6 Performance

V8 6.0 continues to improve performance of ES2015 features. This release contains optimizations to language feature implementations that overall result in a roughly 10% improvement in V8’s ARES-6 score.

V8 API

Please check out our summary of API changes. This document is regularly updated a few weeks after each major release.

Developers with an active V8 checkout can use 'git checkout -b 6.0 -t branch-heads/6.0' to experiment with the new features in V8 6.0. Alternatively you can subscribe to Chrome's Beta channel and try the new features out yourself soon.

Posted by the V8 team

Upcoming RegExp Features

Regular expressions, or RegExps, are an important part of the JavaScript language. When used properly, they can greatly simplify string processing.

ES2015 introduced many new features to the JavaScript language, including significant improvements to the regular expression syntax with the Unicode (/u) and sticky (/y) flags. But development has not stopped since then — in tight collaboration with other members at TC39 (the ECMAScript standards body), the V8 team has proposed and co-designed several new features to make RegExps even more powerful.

These features are currently being proposed for inclusion in the JavaScript specification. Even though the proposals have not been fully accepted, they are already at Stage 3 in the TC39 process. We have implemented these features behind flags (see below) in order to be able to provide timely design and implementation feedback before the specification is finalized.

In this blog post we want to give you a preview of this exciting future. If you'd like to follow along with the upcoming examples, enable experimental JavaScript features at chrome://flags/#enable-javascript-harmony.

Named Captures


Regular expressions can contain so-called captures (or groups), which can capture a portion of the matched text. So far, developers could only refer to these captures by their numeric index, which is determined by the position of the capture within the pattern.
const pattern = /(\d{4})-(\d{2})-(\d{2})/u;
const result = pattern.exec('2017-07-03');
// result[0] === '2017-07-03'
// result[1] === '2017'
// result[2] === '07'
// result[3] === '03'

But regular expressions are already notoriously difficult to read, write, and maintain, and numeric references can add further complications. For instance, in longer patterns it can be tricky to determine the index of a particular capture:
/(?:(.)(.(?<=[^(])(.)))/  // Index of the last capture?

And even worse, changes to a pattern can potentially shift the indices of all existing captures:
/(a)(b)(c)\3\2\1/     // A few simple numbered backreferences.
/(.)(a)(b)(c)\4\3\2/ // All need to be updated.

Named captures are an upcoming feature that helps mitigate these issues by allowing developers to assign names to captures. The syntax is similar to Perl, Java, .Net, and Ruby:
const pattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u;
const result = pattern.exec('2017-07-03');
// result.groups.year === '2017'
// result.groups.month === '07'
// result.groups.day === '03'

Named captures can also be referenced by named backreferences and through String.prototype.replace:
// Named backreferences.
/(?<lowercaseX>x)y\k<lowercaseX>/.test('xyx'); // true

// String replacement.
const pattern = /(?<fst>a)(?<snd>b)/;
'ab'.replace(pattern, '$<snd>$<fst>'); // 'ba'
'ab'.replace(pattern, (m, p1, p2, o, s, {fst, snd}) => snd + fst); // 'ba'

Full details of this new feature are available in the specification proposal.

dotAll Flag


By default, the . atom in regular expressions matches any character except for line terminators:
/foo.bar/u.test('foo\nbar');  // false

A proposal introduces dotAll mode, enabled through the /s flag. In dotAll mode, . matches line terminators as well.
/foo.bar/su.test('foo\nbar');  // true

Full details of this new feature are available in the specification proposal.

Unicode Property Escapes


Regular expression syntax has always included shorthands for certain character classes. \d represents digits and is really just [0-9]; \w is short for word characters, or [A-Za-z0-9_].

With Unicode awareness introduced in ES2015, there are suddenly many more characters that could be considered numbers, for example the circled digit one: ①; or considered word characters, for example the Chinese character for snow: 雪.

Neither of these can be matched with \d or \w. Changing the meaning of these shorthands would break existing regular expression patterns.

Instead, new character classes are being introduced. Note that they are only available for Unicode-aware RegExps denoted by the /u flag.
/\p{Number}/u.test('①');     // true
/\p{Alphabetic}/u.test('雪'); // true

The inverse can be matched with \P.
/\P{Number}/u.test('①');     // false
/\P{Alphabetic}/u.test('雪'); // false

The Unicode consortium defines many more ways to classify code points, for example math symbols or Japanese Hiragana characters:
/^\p{Math}+$/u.test('∛∞∉');                            // true
/^\p{Script_Extensions=Hiragana}+$/u.test('ひらがな'); // true

The full list of supported Unicode property classes can be found in the current specification proposal. For more examples, take a look at this informative article.

Lookbehind Assertions


Lookahead assertions have been part of JavaScript’s regular expression syntax from the start. Their counterpart, lookbehind assertions, are finally being introduced. Some of you may remember that this has been part of V8 for quite some time already. We even use lookbehind assertions under the hood to implement the Unicode flag specified in ES2015.

The name already describes their meaning pretty well: a lookbehind assertion restricts a pattern to only match if it is preceded by the pattern in the lookbehind group. Lookbehind comes in both matching and non-matching flavors:
/(?<=\$)\d+/.exec('$1 is worth about ¥123');  // ['1']
/(?<!\$)\d+/.exec('$1 is worth about ¥123'); // ['123']

For more details, check out our previous blog post dedicated to lookbehind assertions, and examples in related V8 test cases.

Acknowledgements


This blog post wouldn’t be complete without mentioning some of the people that have worked hard to make this happen: especially language champions Mathias Bynens, Dan Ehrenberg, Claude Pache, Brian Terlson, Thomas Wood, Gorkem Yakin, and Irregexp guru Erik Corry; but also everyone else who has contributed to the language specification and V8’s implementation of these features.

We hope you’re as excited about these new RegExp features as we are!

V8 Release 6.1

Every six weeks, we create a new branch of V8 as part of our release process. Each version is branched from V8’s git master immediately before a Chrome Beta milestone. Today we’re pleased to announce our newest branch, V8 version 6.1, which is in beta until its release in coordination with Chrome 61 Stable in several weeks. V8 v6.1 is filled with all sorts of developer-facing goodies. We’d like to give you a preview of some of the highlights in anticipation of the release.

Performance improvements

Visiting all the elements of Maps and Sets — either via iteration or the Map.prototype.forEach / Set.prototype.forEach methods — became significantly faster, with a raw performance improvement of up to 11× since V8 version 6.0. Check the dedicated blog post for additional information.
In addition to that, work continued on the performance of other language features. For example, the Object.prototype.isPrototypeOf method, which is important for constructor-less code using mostly object literals and Object.create instead of classes and constructor functions, is now always as fast and often faster than using the instanceof operator.
Function calls and constructor invocations with a variable number of arguments also got significantly faster. Calls made with Reflect.apply and Reflect.construct received an up to 17× performance boost in the latest version.
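A rough sketch of the patterns these improvements target (the object and property names are made up for illustration):

const Point = {
  init(x, y) { this.x = x; this.y = y; return this; },
};
const p = Object.create(Point).init(1, 2);

// Constructor-less type check via Object.prototype.isPrototypeOf:
Point.isPrototypeOf(p); // true

// Variadic call and construction via Reflect:
Reflect.apply(Math.max, undefined, [1, 2, 3]); // 3
const d = Reflect.construct(Date, [2017, 6, 26]);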


Array.prototype.forEach is now inlined in TurboFan and optimized for all major non-holey elements kinds.

Binary size reduction

The V8 team has completely removed the deprecated Crankshaft compiler, giving a significant reduction in binary size. Alongside the removal of the builtins generator, this reduces the deployed binary size of V8 by over 700 KB, depending on the exact platform.

asm.js is now validated and compiled to WebAssembly

If V8 encounters asm.js code it now tries to validate it. Valid asm.js code is then transpiled to WebAssembly. According to V8’s performance evaluations, this generally boosts throughput performance. Due to the added validation step, isolated regressions in startup performance might happen.

Please note that this feature was switched on by default on the Chromium side only. If you are an embedder and want to leverage the asm.js validator, enable the flag --validate-asm.
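For reference, here is a minimal sketch of the kind of asm.js-style module the validator described above looks for; whether a given module actually validates depends on the full asm.js rules:

function AsmAdd() {
  'use asm';
  function add(a, b) {
    a = a | 0;       // parameter type annotations required by asm.js
    b = b | 0;
    return (a + b) | 0;
  }
  return { add: add };
}
console.log(AsmAdd().add(2, 3)); // 5; valid modules are compiled to WebAssembly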

WebAssembly

When debugging WebAssembly, it is now possible to display local variables in DevTools when a breakpoint in WebAssembly code is hit.


V8 API
Please check out our summary of API changes. This document is regularly updated a few weeks after each major release.


Developers with an active V8 checkout can use git checkout -b 6.1 -t branch-heads/6.1 to experiment with the new features in V8 v6.1. Alternatively you can subscribe to Chrome’s Beta channel and try the new features out yourself soon.

Posted by the V8 team

About that hash flooding vulnerability in Node.js…


Early July this year, Node.js released a security update for all currently maintained branches to address a hash flooding vulnerability. This intermediate fix comes at the cost of a significant startup performance regression. In the meantime, V8 has implemented a solution which avoids the performance penalty.

In this post, we want to give some background and history on the vulnerability and the eventual solution.

Hash flooding attack

Hash tables are one of the most important data structures in computer science. They are widely used in V8, for example to store an object’s properties. On average, inserting a new entry is very efficient at O(1). However, hash collisions could lead to a worst case of O(n). That means that inserting n entries can take up to O(n²).
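As a toy sketch of why degenerate collisions are quadratic overall (this is not V8’s implementation, just an illustration):

// If every key lands in the same bucket, each insert scans all existing
// entries, so inserting n entries does roughly 1 + 2 + … + n ≈ n²/2 steps.
const bucket = [];
function insert(key, value) {
  for (const entry of bucket) {
    if (entry.key === key) { entry.value = value; return; }
  }
  bucket.push({ key, value });
}
for (let i = 0; i < 10000; i++) insert('key' + i, i); // quadratic in this toy model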

In Node.js, HTTP headers are represented as JavaScript objects. Pairs of header name and values are stored as object properties. With cleverly prepared HTTP requests, an attacker could perform a denial-of-service attack. A Node.js process would become unresponsive, being busy with worst-case hash table insertions.

This attack was disclosed as early as December 2011, and shown to affect a wide range of programming languages. How come it took this long for V8 and Node.js to finally address this issue?

In fact, very soon after the disclosure, V8 engineers worked with the Node.js community on a mitigation. From Node.js v0.11.8 onwards, this issue was addressed. The fix introduced a so-called hash seed value. The hash seed is randomly chosen at startup and used to seed every hash value in a particular V8 instance. Without knowledge of the hash seed, an attacker has a hard time hitting the worst case, let alone coming up with an attack that targets all Node.js instances.

This is part of the commit message of the fix:

This version only solves the issue for those that compile V8 themselves or those that do not use snapshots. A snapshot-based precompiled V8 will still have predictable string hash codes.

Startup snapshot

Startup snapshots are a mechanism in V8 to dramatically speed up both engine startup and creating new contexts (i.e. via the vm module in Node.js). Instead of setting up initial objects and internal data structures from scratch, V8 deserializes from an existing snapshot. An up-to-date build of V8 with snapshot starts up in less than 3ms, and requires a fraction of a millisecond to create a new context. Without the snapshot, startup takes more than 200ms, and a new context more than 10ms. This is a difference of two orders of magnitude.
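For reference, this is the kind of context creation the vm module exposes in Node.js; each call benefits from the snapshot (a minimal, illustrative sketch):

const vm = require('vm');
const sandbox = { result: 0 };
vm.createContext(sandbox);                   // creates a fresh V8 context
vm.runInContext('result = 6 * 7;', sandbox); // runs code inside that context
console.log(sandbox.result);                 // 42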

We covered how any V8 embedder can take advantage of startup snapshots in previous posts.

A pre-built snapshot contains hash tables and other hash-value-based data structures. Once initialized from snapshot, the hash seed can no longer be changed without corrupting these data structures. A Node.js release that bundles the snapshot has a fixed hash seed, making the mitigation ineffective.

That is what the explicit warning in the commit message was about.

Almost fixed, but not quite

Fast-forward to 2015, a Node.js issue reports that creating a new context has regressed in performance. Unsurprisingly, this is because the startup snapshot has been disabled as part of the mitigation. But by that time not everyone participating in the discussion was aware of the reason.

As explained in this post, V8 uses a pseudo-random number generator to generate Math.random results. Every V8 context has its own copy of the random number generator state. This is to prevent Math.random results from being predictable across contexts.

The random number generator state is seeded from an external source right after the context is created. It does not matter whether the context is created from scratch, or deserialized from snapshot.

Somehow, the random number generator state was confused with the hash seed. As a result, a pre-built snapshot became part of the official release starting with io.js v2.0.2.

Second attempt

It was not until May 2017, during some internal discussions between V8, Google’s Project Zero, and Google’s Cloud Platform, when we realized that Node.js was still vulnerable to hash flooding attacks.

The initial response came from our colleagues Ali and Myles from the team behind Google Cloud Platform's Node.js offerings. They worked with the Node.js community to disable startup snapshot by default, again. This time around, they also added a test case.

But we did not want to leave it at that. Disabling startup snapshot has significant performance impacts. Over the years, we have added many new language features and sophisticated optimizations to V8. Some of these additions made starting up from scratch even more expensive. Immediately after the security release, we started working on a long-term solution. The goal is to be able to re-enable startup snapshot without becoming vulnerable to hash flooding.

From proposed solutions, we chose and implemented the most pragmatic one. After deserializing from snapshot, we would choose a new hash seed. Affected data structures are then rehashed to ensure consistency.

As it turns out, in an ordinary startup snapshot few data structures are actually affected. And to our delight, rehashing hash tables has been made easy in V8 in the meantime. The overhead this adds is insignificant.

The patch to re-enable startup snapshot has been merged into Node.js. It is part of the recent Node.js v8.3.0 release.

Posted by Yang Guo, aka @hashseed

Fast Properties in V8

In this blog post we would like to explain how V8 handles JavaScript properties internally. From a JavaScript point of view there are only a few distinctions necessary for properties. JavaScript objects mostly behave like dictionaries, with string keys and arbitrary objects as values. The specification does however treat integer-indexed properties and other properties differently during iteration. Other than that, the different properties behave mostly the same, independent of whether they are integer indexed or not.

However, under the hood V8 does rely on several different representations of properties for performance and memory reasons. In this blog post we are going to explain how V8 can provide fast property access while handling dynamically-added properties. Understanding how properties work is essential for explaining how optimizations such as inline caches work in V8.

This post explains the difference in handling integer-indexed and named properties. After that we show how V8 maintains HiddenClasses when adding named properties in order to provide a fast way to identify the shape of an object. We'll then continue giving insights into how named properties are optimized for fast accesses or fast modification depending on the usage. In the final section we provide details on how V8 handles integer-indexed properties or array indices.

Named Properties vs. Elements

Let's start by analysing a very simple object such as {a: "foo", b: "bar"}. This object has two named properties, "a" and "b". It does not have any integer indices for property names. Array-indexed properties, more commonly known as elements, are most prominent on arrays. For instance the array ["foo", "bar"] has two array-indexed properties: 0, with the value "foo", and 1, with the value "bar". This is the first major distinction in how V8 handles properties in general.

The following diagram shows what a basic JavaScript object looks like in memory.



Elements and properties are stored in two separate data structures which makes adding and accessing properties or elements more efficient for different usage patterns.

Elements are mainly used for the various Array.prototype methods such as pop or slice. Given that these functions access properties in consecutive ranges, V8 also represents them as simple arrays internally — most of the time. Later in this post we will explain how we sometimes switch to a sparse dictionary-based representation to save memory.

Named properties are stored in a similar way in a separate array. However, unlike elements, we cannot simply use the key to deduce their position within the properties array; we need some additional metadata. In V8 every JavaScript object has a HiddenClass associated. The HiddenClass stores information about the shape of an object, and among other things, a mapping from property names to indices into the properties. To complicate things we sometimes use a dictionary for the properties instead of a simple array. We will explain this in more detail in a dedicated section.

Takeaway from this section:
  • Array-indexed properties are stored in a separate elements store.
  • Named properties are stored in the properties store.
  • Elements and properties can either be arrays or dictionaries.
  • Each JavaScript object has a HiddenClass associated that keeps information about the object shape.

HiddenClasses and DescriptorArrays

After explaining the general distinction of elements and named properties we need to have a look at how HiddenClasses work in V8. This HiddenClass stores meta information about an object, including the number of properties on the object and a reference to the object’s prototype. HiddenClasses are conceptually similar to classes in typical object-oriented programming languages. However, in a prototype-based language such as JavaScript it is generally not possible to know classes upfront. Hence, in V8, HiddenClasses are created on the fly and updated dynamically as objects change. HiddenClasses serve as an identifier for the shape of an object and as such are a very important ingredient for V8's optimizing compiler and inline caches. The optimizing compiler, for instance, can directly inline property accesses if it can ensure a compatible object structure through the HiddenClass.

Let's have a look at the important parts of a HiddenClass.


In V8 the first field of a JavaScript object points to a HiddenClass. (In fact, this is the case for any object that is on the V8 heap and managed by the garbage collector.) In terms of properties, the most important information is the third bit field, which stores the number of properties, and a pointer to the descriptor array. The descriptor array contains information about named properties like the name itself and the position where the value is stored. Note that we do not keep track of integer indexed properties here, hence there is no entry in the descriptor array.

The basic assumption about HiddenClasses is that objects with the same structure — e.g. the same named properties in the same order — share the same HiddenClass. To achieve that we use a different HiddenClass when a property gets added to an object. In the following example we start from an empty object and add three named properties.


Every time a new property is added, the object's HiddenClass is changed. In the background V8 creates a transition tree that links the HiddenClasses together. V8 knows which HiddenClass to take when you add, for instance, the property "a" to an empty object. This transition tree makes sure you end up with the same final HiddenClass if you add the same properties in the same order. The following example shows that we would follow the same transition tree even if we add simple indexed properties in between.
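A small sketch of what that means in practice (the HiddenClass sharing itself is not observable from JavaScript):

// Both objects add "a" and then "b", so they follow the same transition
// path and end up with the same HiddenClass.
const obj1 = {};
obj1.a = 1;
obj1.b = 2;

const obj2 = {};
obj2.a = 3;
obj2[0] = 'x'; // array-indexed property: no new HiddenClass
obj2.b = 4;
// obj1 and obj2 now share the same final HiddenClass.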


However, if we create a new object that gets a different property added, in this case property "d," V8 creates a separate branch for the new HiddenClasses.


Takeaway from this section:
  • Objects with the same structure (same properties in the same order) have the same HiddenClass
  • By default every new named property added causes a new HiddenClass to be created.
  • Adding array-indexed properties does not create new HiddenClasses.

The Three Different Kinds of Named Properties

After giving an overview on how V8 uses HiddenClasses to track the shape of objects let’s dive into how these properties are actually stored. As explained in the introduction above, there are two fundamental kind of properties: named and indexed. The following section covers named properties.

A simple object such as {a: 1, b: 2} can have various internal representations in V8. While JavaScript objects behave more or less like simple dictionaries from the outside, V8 tries to avoid dictionaries because they hamper certain optimizations such as inline caches which we will explain in a separate post. 

In-object vs. Normal Properties: V8 supports so-called in-object properties which are stored directly on the object itself. These are the fastest properties available in V8 as they are accessible without any indirection. The number of in-object properties is predetermined by the initial size of the object. If more properties get added than there is space in the object, they are stored in the properties store. The properties store adds one level of indirection but can be grown independently.



Fast vs. Slow Properties: The next important distinction is between fast and slow properties. Typically we define the properties stored in the linear properties store as "fast".  Fast properties are simply accessed by index in the properties store. To get from the name of the property to the actual position in the properties store, we have to consult the descriptor array on the HiddenClass, as we've outlined before.


However, if many properties get added to and deleted from an object, it can generate a lot of time and memory overhead to maintain the descriptor array and HiddenClasses. Hence, V8 also supports so-called slow properties. An object with slow properties has a self-contained dictionary as a properties store. All the properties meta information is no longer stored in the descriptor array on the HiddenClass but directly in the properties dictionary. Hence, properties can be added and removed without updating the HiddenClass. Since inline caches don’t work with dictionary properties, the latter are typically slower than fast properties.
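A sketch of the kind of add/delete churn that can move an object into dictionary mode (the switch itself is not observable from JavaScript, and the exact heuristics are internal to V8):

const o = {};
o.x = 1;
o.y = 2;    // fast properties, tracked via HiddenClass transitions
delete o.x; // repeated add/delete churn like this can push the object
o.z = 3;    // into dictionary ("slow") mode internally, after which
delete o.y; // property accesses on it may be slower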

Takeaway from this section:
  • There are three different named property types: in-object, fast and slow/dictionary.
  1. In-object properties are stored directly on the object itself and provide the fastest access.
  2. Fast properties live in the properties store, all the meta information is stored in the descriptor array on the HiddenClass.
  3. Slow properties live in a self-contained properties dictionary, meta information is no longer shared through the HiddenClass.
  • Slow properties allow for efficient property removal and addition but are slower to access than the other two types.

Elements or Array-Indexed Properties

    So far we have looked at named properties and ignored integer indexed properties commonly used with arrays. Handling of integer indexed properties is no less complex than named properties. Even though all indexed properties are always kept separately in the elements store, there are 20 different types of elements!

    Packed or Holey Elements: The first major distinction V8 makes is whether the elements backing store is packed or has holes in it. You get holes in a backing store if you delete an indexed element, or for instance, you don't define it. A simple example is [1,,3] where the second entry is a hole. The following example illustrates this issue:
    const o = ["a", "b", "c"];
    console.log(o[1]); // Prints "b".

    delete o[1]; // Introduces a hole in the elements store.
    console.log(o[1]); // Prints "undefined"; property 1 does not exist.
    o.__proto__ = {1: "B"}; // Define property 1 on the prototype.

    console.log(o[0]); // Prints "a".
    console.log(o[1]); // Prints "B".
    console.log(o[2]); // Prints "c".
    console.log(o[3]); // Prints undefined

    In short, if a property is not present on the receiver we have to keep on looking on the prototype chain. Given that elements are self-contained, e.g. we don't store information about present indexed properties on the HiddenClass, we need a special value, called the_hole, to mark properties that are not present. This is crucial for the performance of Array functions. If we know that there are no holes, i.e. the elements store is packed, we can perform local operations without expensive lookups on the prototype chain.

    Fast or Dictionary Elements: The second major distinction made on elements is whether they are fast or dictionary-mode. Fast elements are simple VM-internal arrays where the property index maps to the index in the elements store. However, this simple representation is rather wasteful for very large sparse/holey arrays where only a few entries are occupied. In this case we use a dictionary-based representation to save memory at the cost of slightly slower access:
    const sparseArray = [];
    sparseArray[10000] = "foo"; // Creates an array with dictionary elements.
    In this example, allocating a full array with 10k entries would be rather wasteful. What happens instead is that V8 creates a dictionary where we store key-value-descriptor triplets. The key in this case would be 10000, the value "foo", and the default descriptor is used. Given that we don't have a way to store descriptor details on the HiddenClass, V8 resorts to slow elements whenever you define an indexed property with a custom descriptor:
    const array = [];
    Object.defineProperty(array, 0, {value: "fixed", configurable: false});
    console.log(array[0]); // Prints "fixed".
    array[0] = "other value"; // Cannot override index 0.
    console.log(array[0]); // Still prints "fixed".
    In this example we added a non-configurable property on the array. This information is stored in the descriptor part of a slow elements dictionary triplet. It is important to note that Array functions perform considerably slower on objects with slow elements.

    Smi and Double Elements: For fast elements there is another important distinction made in V8. For instance, if you only store integers in an Array, a common use-case, the GC does not have to look at the array, as integers are directly encoded as so-called small integers (Smis) in place. Another special case are Arrays that only contain doubles. Unlike Smis, floating point numbers are usually represented as full objects occupying several words. However, V8 stores raw doubles for pure double arrays to avoid memory and performance overhead. The following listing shows four examples of Smi and double elements:
    const a1 = [1,   2, 3];  // Smi Packed
    const a2 = [1, , 3]; // Smi Holey, a2[1] reads from the prototype
    const b1 = [1.1, 2, 3]; // Double Packed
    const b2 = [1.1, , 3]; // Double Holey, b2[1] reads from the prototype

    Special Elements: With the information so far we covered 7 out of the 20 different element kinds. For simplicity we excluded 9 element kinds for TypedArrays, two more for String wrappers and last but not least, two more special element kinds for arguments objects.

    The ElementsAccessor: As you can imagine we are not exactly keen on writing Array functions 20 times in C++, once for every elements kind. That's where some C++ magic comes into play. Instead of implementing Array functions over and over again, we built the ElementsAccessor where we mostly have to implement only simple functions that access elements from the backing store. The ElementsAccessor relies on CRTP to create specialized versions of each Array function. So if you call something like slice on an array, V8 internally calls a builtin written in C++ and dispatches through the ElementsAccessor to the specialized version of the function:


    Takeaway from this section:
    • There are fast and dictionary-mode elements (indexed properties).
    • Fast elements can be packed or they can contain holes, which indicate that an indexed property has been deleted.
    • Elements are specialized on their content to speed up Array functions and reduce GC overhead.


    Understanding how properties work is key to many optimizations in V8. For JavaScript developers many of these internal decisions are not visible directly, but they explain why certain code patterns are faster than others. Changing the property or element type typically causes V8 to create a different HiddenClass which can lead to type pollution which prevents V8 from generating optimal code. Stay tuned for further posts on how the VM-internals of V8 work.

    Posted by Camillo Bruni (@camillobruni), also author of "fast for-in"

    V8 Release 6.2

    Every six weeks, we create a new branch of V8 as part of our release process. Each version is branched from V8’s git master immediately before a Chrome Beta milestone. Today we’re pleased to announce our newest branch, V8 version 6.2, which is in beta until its release in coordination with Chrome 62 Stable in several weeks. V8 v6.2 is filled with all sorts of developer-facing goodies. This post provides a preview of some of the highlights in anticipation of the release.

    Performance improvements


    The performance of Object#toString() was previously already identified as a potential bottleneck, since it’s often used by popular libraries like lodash and underscore.js, and frameworks like AngularJS. Various helper functions like _.isPlainObject, _.isDate, angular.isArrayBuffer or angular.isRegExp are often used throughout application and library code to perform runtime type checks.

    With the advent of ES2015, Object#toString() became monkey-patchable via the new Symbol.toStringTag symbol, which also made Object#toString() more heavy-weight and more challenging to speed up. In this release we ported an optimization initially implemented in the SpiderMonkey JavaScript engine to V8, speeding up throughput of Object#toString() by a factor of 6.5×.
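    For context, this is the kind of customization Symbol.toStringTag enables (the class name here is made up for illustration):

    class ValidatorResult {
      get [Symbol.toStringTag]() {
        return 'ValidatorResult';
      }
    }
    Object.prototype.toString.call(new ValidatorResult());
    // '[object ValidatorResult]'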



    It also impacts the Speedometer browser benchmark, specifically the AngularJS subtest, where we measured a solid 3% improvement. Read the detailed blog post for additional information.



    We’ve also significantly improved the performance of ES2015 proxies, speeding up calling a proxy object via someProxy(params) or new SomeOtherProxy(params).



    And similarly, the performance of accessing a property on a proxy object via someProxy.property improved by almost 6.5×.
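    Both numbers refer to plain Proxy usage along these lines (a minimal, illustrative sketch):

    const target = (x) => x * 2;
    const someProxy = new Proxy(target, {});
    someProxy(21);    // calling through the proxy: 42
    someProxy.length; // property access through the proxy: 1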



    This is part of an ongoing internship. Stay tuned for a more detailed blog post and final results.

    We’re also excited to announce that thanks to contributions from Peter Wong, the performance of the String#includes() built-in improved significantly since the previous release.

    Hashcode lookups for internal hash tables got much faster, resulting in improved performance for Map, Set, WeakMap, and WeakSet. An upcoming blog post will explain this optimization in detail.



    The garbage collector now uses a Parallel Scavenger for collecting the so-called young generation of the heap.

    Enhanced low-memory mode


    Over the last few releases V8’s low-memory mode was enhanced (e.g. by setting initial semi-space size to 512K). Low-memory devices now hit fewer out-of-memory situations. This low-memory behavior might have a negative impact on runtime performance though.

    More regular expressions features


    Support for dotAll mode for regular expressions, selected via the s flag, is now enabled by default. In dotAll mode, the . atom in regular expressions matches any character, including line terminators.

    /foo.bar/su.test('foo\nbar'); // true

    Lookbehind assertions, another new regular expression feature, are now available by default. The name already describes their meaning pretty well: lookbehind assertions offer a way to restrict a pattern to only match if preceded by the pattern in the lookbehind group. They come in both matching and non-matching flavors:

    /(?<=\$)\d+/.exec('$1 is worth about ¥123'); // ['1']
    /(?<!\$)\d+/.exec('$1 is worth about ¥123'); // ['123']


    More details about these features are available in our blog post titled Upcoming regular expression features.

    Template literal revision


    The restriction on escape sequences in template literals has been loosened per the relevant proposal. This enables new use cases for template tags, such as writing a LaTeX processor.

    const latex = (strings) => {
      // …
    };

    const document = latex`
    \newcommand{\fun}{\textbf{Fun!}}
    \newcommand{\unicode}{\textbf{Unicode!}}
    \newcommand{\xerxes}{\textbf{King!}}
    Breve over the h goes \u{h}ere // Illegal token!
    `;

    Increased max string length


    The maximum string length on 64-bit platforms increased from 2**28 - 16 to 2**30 - 25 characters.

    FullCodeGen is gone


    In 6.2 the final major pieces of the old pipeline are gone. More than 30k lines of code were deleted in this release — a clear win for reducing code complexity.

    V8 API


    Please check out our summary of API changes. This document is regularly updated a few weeks after each major release.

    Developers with an active V8 checkout can use git checkout -b 6.2 -t branch-heads/6.2 to experiment with the new features in V8 6.2. Alternatively you can subscribe to Chrome’s Beta channel and try the new features out yourself soon.

    Posted by the V8 team

    “Elements kinds” in V8


    JavaScript objects can have arbitrary properties associated with them. The names of object properties can contain any character. One of the interesting cases that a JavaScript engine can choose to optimize for are properties whose names are purely numeric, most specifically array indices.

    In V8, properties with integer names — the most common form of which are objects generated by the Array constructor — are handled specially. Although in many circumstances these numerically-indexed properties behave just like other properties, V8 chooses to store them separately from non-numeric properties for optimization purposes. Internally, V8 even gives these properties a special name: elements. Objects have properties that map to values, whereas arrays have indices that map to elements.

    Although these internals are never directly exposed to JavaScript developers, they explain why certain code patterns are faster than others.

    Common elements kinds

    While running JavaScript code, V8 keeps track of what kind of elements each array contains. This information allows V8 to optimize any operations on the array specifically for this type of element. For example, when you call reduce, map, or forEach on an array, V8 can optimize those operations based on what kind of elements the array contains.

    Take this array, for example:

    const array = [1, 2, 3];

    What kinds of elements does it contain? If you’d ask the typeof operator, it would tell you the array contains numbers. At the language-level, that’s all you get: JavaScript doesn’t distinguish between integers, floats, and doubles — they’re all just numbers. However, at the engine level, we can make more precise distinctions. The elements kind for this array is PACKED_SMI_ELEMENTS. In V8, the term Smi refers to the particular format used to store small integers. (We’ll get to the PACKED part in a minute.)

    Later adding a floating-point number to the same array transitions it to a more generic elements kind:

    const array = [1, 2, 3];
    // elements kind: PACKED_SMI_ELEMENTS
    array.push(4.56);
    // elements kind: PACKED_DOUBLE_ELEMENTS

    Adding a string literal to the array changes its elements kind once again.

    const array = [1, 2, 3];
    // elements kind: PACKED_SMI_ELEMENTS
    array.push(4.56);
    // elements kind: PACKED_DOUBLE_ELEMENTS
    array.push('x');
    // elements kind: PACKED_ELEMENTS

    We’ve seen three distinct elements kinds so far, with the following basic types:

    • Small integers, also known as Smi.
    • Doubles, for floating-point numbers and integers that cannot be represented as a Smi.
    • Regular elements, for values that cannot be represented as Smi or doubles.

    Note that doubles form a more general variant of Smi, and regular elements are another generalization on top of doubles. The set of numbers that can be represented as a Smi is a subset of the numbers that can be represented as a double.

    What’s important here is that elements kind transitions only go in one direction: from specific (e.g. PACKED_SMI_ELEMENTS) to more general (e.g. PACKED_ELEMENTS). Once an array is marked as PACKED_ELEMENTS, it cannot go back to PACKED_DOUBLE_ELEMENTS, for example.

    So far, we’ve learned the following:

    • V8 assigns an elements kind to each array.
    • The elements kind of an array is not set in stone — it can change at runtime. In the earlier example, we transitioned from PACKED_SMI_ELEMENTS to PACKED_ELEMENTS.
    • Elements kind transitions can only go from specific kinds to more general kinds.

    PACKED vs. HOLEY kinds

    So far, we’ve only been dealing with dense or packed arrays. Creating holes in the array (i.e. making the array sparse) downgrades the elements kind to its “holey” variant:

    const array = [1, 2, 3, 4.56, 'x'];
    // elements kind: PACKED_ELEMENTS
    array.length; // 5
    array[9] = 1; // array[5] until array[8] are now holes
    // elements kind: HOLEY_ELEMENTS

    V8 makes this distinction because operations on packed arrays can be optimized more aggressively than operations on holey arrays. For packed arrays, most operations can be performed efficiently. In comparison, operations on holey arrays require additional checks and expensive lookups on the prototype chain.

    Each of the basic elements kinds we’ve seen so far (i.e. Smis, doubles, and regular elements) comes in two flavors: the packed and the holey version. Not only can we transition from, say, PACKED_SMI_ELEMENTS to PACKED_DOUBLE_ELEMENTS, we can also transition from any PACKED kind to its HOLEY counterpart.

    To recap:

    • The most common elements kinds come in PACKED and HOLEY flavors.
    • Operations on packed arrays are more efficient than operations on holey arrays.
    • Elements kinds can transition from PACKED to HOLEY flavors.

    The elements kind lattice

    V8 implements this tag transitioning system as a lattice. Here’s a simplified visualization of that featuring only the most common elements kinds:

    It’s only possible to transition downwards through the lattice. Once a single floating-point number is added to an array of Smis, it is marked as DOUBLE, even if you later overwrite the float with a Smi. Similarly, once a hole is created in an array, it’s marked as holey forever, even when you fill it later.
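    For instance, a minimal sketch of the one-way transitions described above (the kinds themselves are not observable from JavaScript):

    const array = [1, 2, 3]; // PACKED_SMI_ELEMENTS
    array[0] = 4.56;         // transitions to PACKED_DOUBLE_ELEMENTS
    array[0] = 1;            // stays PACKED_DOUBLE_ELEMENTS; there is no way back up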

    V8 currently distinguishes 21 different elements kinds, each of which comes with its own set of possible optimizations.

    In general, more specific elements kinds enable more fine-grained optimizations. The further down the elements kind is in the lattice, the slower manipulations of that object might be. For optimal performance, avoid needlessly transitioning to less specific types — stick to the most specific one that’s applicable to your situation.

    Performance tips

    In most cases, elements kind tracking works invisibly under the hood and you don’t need to worry about it. But here are a few things you can do to get the greatest possible benefit from the system.

    Avoid creating holes

    Let’s say we’re trying to create an array, for example:

    const array = new Array(3);
    // The array is sparse at this point, so it gets marked as
    // `HOLEY_SMI_ELEMENTS`, i.e. the most specific possibility given
    // the current information.
    array[0] = 'a';
    // Hold up, that’s a string instead of a small integer… So the kind
    // transitions to `HOLEY_ELEMENTS`.
    array[1] = 'b';
    array[2] = 'c';
    // At this point, all three positions in the array are filled, so
    // the array is packed (i.e. no longer sparse). However, we cannot
    // transition to a more specific kind such as `PACKED_ELEMENTS`. The
    // elements kind remains `HOLEY_ELEMENTS`.

    Once the array is marked as holey, it’s holey forever — even if it’s packed later! Any operation on the array from then on is potentially slower than it could be. If you plan on performing lots of operations on the array, and you’d like to optimize those operations, avoid creating holes in the array. V8 can deal with packed arrays more efficiently.

    A better way of creating an array is to use a literal instead:

    const array = ['a', 'b', 'c'];
    // elements kind: PACKED_ELEMENTS

    If you don’t know all the values ahead of time, create an array, and later push the values to it.

    const array = [];
    // …
    array.push(someValue);
    // …
    array.push(someOtherValue);

    This approach ensures that the array never transitions to a holey elements kind. As a result, V8 can optimize any future operations on the array more efficiently.

    Avoid reading beyond the length of the array

    A similar situation to hitting a hole occurs when reading beyond the length of the array, e.g. reading array[42] when array.length === 5. In this case, the array index 42 is out of bounds, the property is not present on the array itself, and so the JavaScript engine has to perform the same expensive prototype chain lookups.

    Don’t write your loops like this:

    // Don’t do this!
    for (let i = 0, item; (item = items[i]) != null; i++) {
      doSomething(item);
    }

    This code reads all the elements in the array, and then one more. It only ends once it finds an undefined or null element. (jQuery uses this pattern in a few places.)

    Instead, write your loops the old-fashioned way, and just keep iterating until you hit the last element.

    for (let index = 0; index < items.length; index++) {
      const item = items[index];
      doSomething(item);
    }

    When the collection you’re looping over is iterable (as is the case for arrays and NodeLists), that’s even better: just use for-of.

    for (const item of items) {
      doSomething(item);
    }

    For arrays specifically, you could use the forEach built-in:

    items.forEach((item) => {
      doSomething(item);
    });

    Nowadays, the performance of both for-of and forEach is on par with the old-fashioned for loop.

    Avoid reading beyond the array’s length! Doing so is just as bad as hitting a hole in an array. In this case, V8’s bounds check fails, the check to see if the property is present fails, and then we need to look up the prototype chain.

    Avoid elements kind transitions

    In general, if you need to perform lots of operations on an array, try sticking to an elements kind that’s as specific as possible, so that V8 can optimize those operations as much as possible.

    This is harder than it seems. For example, just adding -0 to an array of small integers is enough to transition it to PACKED_DOUBLE_ELEMENTS.

    const array = [3, 2, 1, +0];
    // PACKED_SMI_ELEMENTS
    array.push(-0);
    // PACKED_DOUBLE_ELEMENTS

    As a result, any future operations on this array are optimized in a completely different way than they would be for Smis.

    Avoid -0, unless you explicitly need to differentiate -0 and +0 in your code. (You probably don’t.)

    The same thing goes for NaN and Infinity. They are represented as doubles, so adding a single NaN or Infinity to an array of SMI_ELEMENTS transitions it to DOUBLE_ELEMENTS.

    const array = [3, 2, 1];
    // PACKED_SMI_ELEMENTS
    array.push(NaN, Infinity);
    // PACKED_DOUBLE_ELEMENTS

    If you’re planning on performing lots of operations on an array of integers, consider normalizing -0 and blocking NaN and Infinity when initializing the values. That way, the array sticks to the PACKED_SMI_ELEMENTS kind. This one-time normalization cost can be worth the later optimizations.
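    One possible normalization pass, as a hedged sketch (replacing non-finite values with 0 is just one policy choice; pick whatever suits your data):

    const input = [3, -0, 7, NaN];
    const normalized = input.map((x) => (Number.isFinite(x) ? x + 0 : 0));
    // `-0 + 0` is `+0`, and NaN/Infinity become 0 here, so `normalized`
    // is [3, 0, 7, 0] and stays PACKED_SMI_ELEMENTS.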

    In fact, if you’re doing mathematical operations on an array of numbers, consider using a TypedArray. We have specialized elements kinds for those, too.

    Prefer arrays over array-like objects

    Some objects in JavaScript — especially in the DOM — look like arrays although they aren’t proper arrays. It’s possible to create array-like objects yourself:

    const arrayLike = {};
    arrayLike[0] = 'a';
    arrayLike[1] = 'b';
    arrayLike[2] = 'c';
    arrayLike.length = 3;

    This object has a length and supports indexed element access (just like an array!) but it lacks array methods such as forEach on its prototype. It’s still possible to call array generics on it, though:

    Array.prototype.forEach.call(arrayLike, (value, index) => {
      console.log(`${ index }: ${ value }`);
    });
    // This logs '0: a', then '1: b', and finally '2: c'.

    This code calls the Array.prototype.forEach built-in on the array-like object, and it works as expected. However, this is slower than calling forEach on a proper array, which is highly optimized in V8. If you plan on using array built-ins on this object more than once, consider turning it into an actual array beforehand:

    const actualArray = Array.prototype.slice.call(arrayLike, 0);
    actualArray.forEach((value, index) => {
      console.log(`${ index }: ${ value }`);
    });
    // This logs '0: a', then '1: b', and finally '2: c'.

    The one-time conversion cost can be worth the later optimizations, especially if you plan on performing lots of operations on the array.

    The arguments object, for example, is an array-like object. It’s possible to call array builtins on it, but such operations won’t be fully optimized the way they could be for a proper array.

    const logArgs = function() {
      Array.prototype.forEach.call(arguments, (value, index) => {
        console.log(`${ index }: ${ value }`);
      });
    };
    logArgs('a', 'b', 'c');
    // This logs '0: a', then '1: b', and finally '2: c'.

    ES2015 rest parameters can help here. They produce proper arrays that can be used instead of the array-like arguments objects in an elegant way.

    const logArgs = (...args) => {
      args.forEach((value, index) => {
        console.log(`${ index }: ${ value }`);
      });
    };
    logArgs('a', 'b', 'c');
    // This logs '0: a', then '1: b', and finally '2: c'.

    Nowadays, there’s no good reason to use the arguments object directly.

    In general, avoid array-like objects whenever possible and use proper arrays instead.

    Avoid polymorphism

    If you have code that handles arrays of many different elements kinds, it can lead to polymorphic operations that are slower than a version of the code that only operates on a single elements kind.

    Consider the following example, where a library function is called with various elements kinds. (Note that this is not the native Array.prototype.forEach, which has its own set of optimizations on top of the elements kinds-specific optimizations discussed in this article.)

    const each = (array, callback) => {
      for (let index = 0; index < array.length; ++index) {
        const item = array[index];
        callback(item);
      }
    };
    const doSomething = (item) => console.log(item);

    each([], () => {});

    each(['a', 'b', 'c'], doSomething);
    // `each` is called with `PACKED_ELEMENTS`. V8 uses an inline cache
    // (or “IC”) to remember that `each` is called with this particular
    // elements kind. V8 is optimistic and assumes that the
    // `array.length` and `array[index]` accesses inside the `each`
    // function are monomorphic (i.e. only ever receive a single kind
    // of elements) until proven otherwise. For every future call to
    // `each`, V8 checks if the elements kind is `PACKED_ELEMENTS`. If
    // so, V8 can re-use the previously-generated code. If not, more work
    // is needed.

    each([1.1, 2.2, 3.3], doSomething);
    // `each` is called with `PACKED_DOUBLE_ELEMENTS`. Because V8 has
    // now seen different elements kinds passed to `each` in its IC, the
    // `array.length` and `array[index]` accesses inside the `each`
    // function get marked as polymorphic. V8 now needs an additional
    // check every time `each` gets called: one for `PACKED_ELEMENTS`
    // (like before), a new one for `PACKED_DOUBLE_ELEMENTS`, and one for
    // any other elements kinds (like before). This incurs a performance
    // hit.

    each([1, 2, 3], doSomething);
    // `each` is called with `PACKED_SMI_ELEMENTS`. This triggers another
    // degree of polymorphism. There are now three different elements
    // kinds in the IC for `each`. For every `each` call from now on, yet
    // another elements kind check is needed to re-use the generated code
    // for `PACKED_SMI_ELEMENTS`. This comes at a performance cost.

    Built-in methods (such as Array.prototype.forEach) can deal with this kind of polymorphism much more efficiently, so consider using them instead of userland library functions in performance-sensitive situations.
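    In the example above, that could be as simple as calling the built-in directly:

    // Array.prototype.forEach copes with the mixed elements kinds more
    // gracefully than the userland `each` helper from the example above.
    [1, 2, 3].forEach(doSomething);
    [1.1, 2.2, 3.3].forEach(doSomething);
    ['a', 'b', 'c'].forEach(doSomething);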

    Another example of monomorphism vs. polymorphism in V8 involves object shapes, also known as the hidden class of an object. To learn about that case, check out Vyacheslav’s article.

    Debugging elements kinds

    To figure out a given object’s “elements kind”, get a debug build of d8 (see “Building from source”), and run:

    $ out.gn/x64.debug/d8 --allow-natives-syntax

    This opens a d8 REPL in which special functions such as %DebugPrint(object) are available. The “elements” field in its output reveals the “elements kind” of any object you pass to it.

    d8> const array = [1, 2, 3]; %DebugPrint(array);
    DebugPrint: 0x1fbbad30fd71: [JSArray]
     - map = 0x10a6f8a038b1 [FastProperties]
     - prototype = 0x1212bb687ec1
     - elements = 0x1fbbad30fd19 <FixedArray[3]> [PACKED_SMI_ELEMENTS (COW)]
     - length = 3
     - properties = 0x219eb0702241 <FixedArray[0]> {
        #length: 0x219eb0764ac9 <AccessorInfo> (const accessor descriptor)
     }
     - elements = 0x1fbbad30fd19 <FixedArray[3]> {
               0: 1
               1: 2
               2: 3
     }
    […]

    Note that “COW” stands for copy-on-write, which is yet another internal optimization. Don’t worry about that for now — that’s a topic for another blog post!

    Another useful flag that’s available in debug builds is --trace-elements-transitions. Enable it to let V8 inform you whenever any elements kind transition takes place.

    $ cat my-script.js
    const array = [1, 2, 3];
    array[3] = 4.56;

    $ out.gn/x64.debug/d8 --trace-elements-transitions my-script.js
    elements transition [PACKED_SMI_ELEMENTS -> PACKED_DOUBLE_ELEMENTS] in ~+34 at x.js:2 for 0x1df87228c911 <JSArray[3]> from 0x1df87228c889 <FixedArray[3]> to 0x1df87228c941 <FixedDoubleArray[22]>

    Temporarily disabling escape analysis


    In JavaScript, an allocated object escapes if it is accessible from outside the current function. Normally V8 allocates new objects on the JavaScript heap, but using escape analysis, an optimizing compiler can figure out when an object can be treated specially because its lifetime is provably bound to the function’s activation. When the reference to a newly allocated object does not escape the function that creates it, JavaScript engines don’t need to explicitly allocate that object on the heap. They can instead effectively treat the values of the object as local variables to the function. That in turn enables all kinds of optimizations like storing these values on the stack or in registers, or in some cases, optimizing the values away completely. Objects that escape (more accurately, objects that can’t be proven to not escape) must be heap-allocated.

    For example, escape analysis enables V8 to effectively rewrite the following code:

    function foo(a, b) {
      const object = { a, b };
      return object.a + object.b;
      // Note: `object` does not escape.
    }

    …into this code, which enables several under-the-hood optimizations:

    function foo(a, b) {
      const object_a = a;
      const object_b = b;
      return object_a + object_b;
    }

    V8 v6.1 and older used an escape analysis implementation that was complex and had generated many bugs since its introduction. This implementation has since been removed, and a brand new escape analysis codebase is available in V8 v6.2.

    However, a Chrome security vulnerability involving the old escape analysis implementation in V8 v6.1 has been discovered and responsibly disclosed to us. To protect our users, we’ve turned off escape analysis in Chrome 61. Node.js should not be affected as the exploit depends on execution of untrusted JavaScript.

    Turning off escape analysis negatively impacts performance because it disables the abovementioned optimizations. Specifically, the following ES2015 features might suffer temporary slowdowns:

    • destructuring
    • for-of iteration
    • array spread
    • rest parameters
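    To give a rough idea of why: these features tend to allocate short-lived helper objects under the hood, and escape analysis is normally what lets V8 avoid materializing such objects on the heap. A minimal sketch of an affected pattern (the exact allocations are an engine implementation detail):

    // `coords` never escapes `distance`, so with escape analysis enabled its
    // allocation can typically be elided; with escape analysis disabled it is
    // allocated on the heap on every call.
    function distance(x, y) {
      const coords = { x, y };
      const { x: dx, y: dy } = coords;
      return Math.sqrt(dx * dx + dy * dy);
    }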

    Note that disabling escape analysis is only a temporary measure. With Chrome 62, we’ll ship the brand new — and most importantly, enabled — implementation of escape analysis as seen in V8 v6.2.

    An Internship on Laziness: Lazy Unlinking of Deoptimized Functions


    Roughly three months ago, I joined the V8 team (Google Munich) as an intern and since then I’ve been working on the VM’s Deoptimizer — something completely new to me which proved to be an interesting and challenging project. The first part of my internship focused on improving the VM security-wise. The second part focused on performance improvements: namely, on the removal of a data structure used for the unlinking of previously deoptimized functions, which was a performance bottleneck during garbage collection. This blog post describes this second part of my internship. I’ll explain how V8 used to unlink deoptimized functions, how we changed this, and what performance improvements were obtained.

    Let’s (very) briefly recap the V8 pipeline for a JavaScript function: V8’s interpreter, Ignition, collects profiling information about that function while interpreting it. Once the function becomes hot, this information is passed to V8’s compiler, TurboFan, which generates optimized machine code. When the profiling information is no longer valid — for example because one of the profiled objects gets a different type during runtime — the optimized machine code might become invalid. In that case, V8 needs to deoptimize it.

    Source: JavaScript Start-up Performance

    Upon optimization, TurboFan generates a code object, i.e. the optimized machine code, for the function under optimization. When this function is invoked the next time, V8 follows the link to optimized code for that function and executes it. Upon deoptimization of this function, we need to unlink the code object in order to make sure that it won’t be executed again. How does that happen?

    For example, in the following code, the function f1 will be invoked many times (always passing an integer as argument). TurboFan then generates machine code for that specific case.

    function g() {
      return (i) => i;
    }

    // Create a closure.
    const f1 = g();
    // Optimize f1.
    for (var i = 0; i < 1000; i++) f1(0);

    Each function also has a trampoline to the interpreter — more details in these slides — and keeps a pointer to this trampoline in its SharedFunctionInfo (SFI). This trampoline is used whenever V8 needs to go back to unoptimized code. Thus, upon deoptimization (triggered, for example, by passing an argument of a different type), the Deoptimizer can simply set the code field of the JavaScript function to this trampoline.

    Although this seems simple, it forces V8 to keep weak lists of optimized JavaScript functions. This is because it is possible to have different functions pointing to the same optimized code object. We can extend our example as follows, and the functions f1 and f2 both point to the same optimized code.

    const f2 = g();
    f2(0);

    If the function f1 is deoptimized (for example by invoking it with an object of different type {x: 0}) we need to make sure that the invalidated code will not be executed again by invoking f2.

    Thus, upon deoptimization, V8 used to iterate over all the optimized JavaScript functions, and would unlink those that pointed to the code object being deoptimized. This iteration in applications with many optimized JavaScript functions became a performance bottleneck. Moreover, other than slowing down deoptimization, V8 used to iterate over these lists upon stop-the-world cycles of garbage collection, making it even worse.

    To get an idea of the impact of this data structure on the performance of V8, we wrote a micro-benchmark that stresses its usage by triggering many scavenge cycles after creating many JavaScript functions.

    function g() {
      return (i) => i + 1;
    }

    // Create an initial closure and optimize.
    var f = g();

    f(0);
    f(0);
    %OptimizeFunctionOnNextCall(f);
    f(0);

    // Create 2M closures, those will get the previously optimized code.
    var a = [];
    for (var i = 0; i < 2000000; i++) {
      var h = g();
      h();
      a.push(h);
    }

    // Now cause scavenges, all of them are slow.
    for (var i = 0; i < 1000; i++) {
      new Array(50000);
    }

    When running this benchmark, we observed that V8 spent around 98% of its execution time on garbage collection. We then removed this data structure and instead used an approach of lazy unlinking; this is what we observed on x64:

    Although this is just a micro-benchmark that creates many JavaScript functions and triggers many garbage collection cycles, it gives us an idea of the overhead introduced by this data structure. Other, more realistic applications where we saw some overhead, and which motivated this work, were the router benchmark implemented in Node.js and the ARES-6 benchmark suite.

    Lazy unlinking

    Rather than unlinking optimized code from JavaScript functions upon deoptimization, V8 postpones it until the next invocation of such functions. When such functions are invoked, V8 checks whether they have been deoptimized, unlinks them, and then continues with their lazy compilation. If these functions are never invoked again, they are never unlinked and the deoptimized code objects are not collected. However, given that during deoptimization we invalidate all the embedded fields of the code object, only the code object itself is kept alive.

    The commit that removed this list of optimized JavaScript functions required changes in several parts of the VM, but the basic idea is as follows. When assembling the optimized code object, we check if this is the code of a JavaScript function. If so, in its prologue, we assemble machine code to bail out if the code object has been deoptimized. Upon deoptimization we don’t modify the deoptimized code — code patching is gone. Thus, its bit marked_for_deoptimization is still set when invoking the function again. TurboFan generates code to check it, and if it is set, then V8 jumps to a new builtin, CompileLazyDeoptimizedCode, that unlinks the deoptimized code from the JavaScript function and then continues with lazy compilation.

    In more detail, the first step is to generate instructions that load the address of the code being currently assembled. We can do that in x64, with the following code:

    Label current;
    // Load effective address of current instruction into rcx.
    __ leaq(rcx, Operand(&current));
    __ bind(&current);

    After that we need to obtain where in the code object the marked_for_deoptimization bit lives.

    int pc = __ pc_offset();
    int offset = Code::kKindSpecificFlags1Offset - (Code::kHeaderSize + pc);

    We can then test the bit, and if it is set, we jump to the CompileLazyDeoptimizedCode builtin.

    // Test if the bit is set, that is, if the code is marked for deoptimization.
    __ testl(Operand(rcx, offset),
    Immediate(1 << Code::kMarkedForDeoptimizationBit));
    // Jump to builtin if it is.
    __ j(not_zero, /* handle to builtin code here */, RelocInfo::CODE_TARGET);

    On the side of this CompileLazyDeoptimizedCode builtin, all that’s left to do is to unlink the code field from the JavaScript function and set it to the trampoline to the Interpreter entry. So, considering that the address of the JavaScript function is in the register rdi, we can obtain the pointer to the SharedFunctionInfo with:

    // Field read to obtain the SharedFunctionInfo.
    __ movq(rcx, FieldOperand(rdi, JSFunction::kSharedFunctionInfoOffset));

    …and similarly the trampoline with:

    // Field read to obtain the code object.
    __ movq(rcx, FieldOperand(rcx, SharedFunctionInfo::kCodeOffset));

    Then we can use it to update the function slot for the code pointer:

    // Update the code field of the function with the trampoline.
    __ movq(FieldOperand(rdi, JSFunction::kCodeOffset), rcx);
    // Write barrier to protect the field.
    __ RecordWriteField(rdi, JSFunction::kCodeOffset, rcx, r15,
    kDontSaveFPRegs, OMIT_REMEMBERED_SET, OMIT_SMI_CHECK);

    This produces the same result as before. However, rather than taking care of the unlinking in the Deoptimizer, we need to worry about it during code generation. Hence the handwritten assembly.

    The above is how it works in the x64 architecture. We have implemented it for ia32, arm, arm64, mips, and mips64 as well.

    This new technique is already integrated in V8 and, as we’ll discuss later, allows for performance improvements. However, it comes with a minor disadvantage: before, V8 would consider unlinking only upon deoptimization; now, it has to do so upon the activation of all optimized functions. Moreover, the approach of checking the marked_for_deoptimization bit is not as efficient as it could be, given that we need to do some work to obtain the address of the code object. Note that this happens when entering every optimized function. A possible solution for this issue is to keep, in the code object, a pointer to itself. Rather than doing work to find the address of the code object whenever the function is invoked, V8 would do it only once, after its construction.

    Results

    We now look at the performance gains and regressions obtained with this project.

    General Improvements on x64

    The following plot shows us some improvements and regressions, relative to the previous commit. Note that the higher, the better.

    The promises benchmarks are the ones where we see the greatest improvements, observing almost a 33% gain for the bluebird-parallel benchmark and 22.40% for wikipedia. We also observed a few regressions in some benchmarks. They are related to the issue explained above, on checking whether the code is marked for deoptimization.

    We also see improvements in the ARES-6 benchmark suite. Note that in this chart too, the higher the better. These programs used to spend a considerable amount of time in GC-related activities. With lazy unlinking we improve performance by 1.9% overall. The most notable case is Air steadyState, where we get an improvement of around 5.36%.


    AreWeFastYet results

    The performance results for the Octane and ARES-6 benchmark suites also showed up on the AreWeFastYet tracker. We looked at these performance results on September 5th, 2017, using the provided default machine (macOS 10.10 64-bit, Mac Pro, shell).


    Impact on Node.js

    We can also see performance improvements in the router-benchmark. The following two plots show the number of operations per second of each tested router; thus, the higher the better. We performed two kinds of experiments with this benchmark suite. First, we ran each test in isolation, so that we could see the performance improvement independently from the remaining tests. Second, we ran all tests at once, without restarting the VM, thus simulating an environment where each test is integrated with other functionality.

    For the first experiment, we saw that the router and express tests perform about twice as many operations as before, in the same amount of time. For the second experiment, we saw an even greater improvement. In some of the cases, such as routr, server-router, and router, the benchmark performs approximately 3.80×, 3×, and 2× more operations, respectively. This happens because V8 accumulates more optimized JavaScript functions, test after test. Thus, whenever executing a given test, if a garbage collection cycle is triggered, V8 has to visit the optimized functions from the current test and from the previous ones.


    Further Optimization

    Now that V8 does not keep the linked list of JavaScript functions in the context, we can remove the field next from the JSFunction class. Although this is a simple modification, it allows us to save the size of a pointer per function, which represents significant savings on several web pages:

    Benchmark      Kind                               Memory savings (absolute)   Memory savings (relative)
    facebook.com   Average effective size             170 KB                      3.7%
    twitter.com    Average size of allocated objects  284 KB                      1.2%
    cnn.com        Average size of allocated objects  788 KB                      1.53%
    youtube.com    Average size of allocated objects  129 KB                      0.79%

    Acknowledgments

    Throughout my internship, I had lots of help from several people, who were always available to answer my many questions. Thus I would like to thank the following people: Benedikt Meurer, Jaroslav Sevcik, and Michael Starzinger for discussions on how the Compiler and the Deoptimizer work, Ulan Degenbaev for helping with the Garbage Collector whenever I broke it, and Mathias Bynens, Peter Marshall, Camillo Bruni, and Maya Lekova for proofreading this article.

    Finally, this article is my last contribution as a Google intern and I would like to take the opportunity to thank everyone in the V8 team, and especially my host, Benedikt Meurer, for hosting me and for giving me the opportunity to work on such an interesting project — I definitely learned a lot and enjoyed my time at Google!

    Juliana Franco, @jupvfranco, Laziness Expert

    Optimizing ES2015 proxies in V8


    Introduction

    Proxies have been an integral part of JavaScript since ES2015. They allow intercepting fundamental operations on objects and customizing their behavior. Proxies form a core part of projects like jsdom and the Comlink RPC library. Recently, we put a lot of effort into improving the performance of proxies in V8. This article sheds some light on general performance improvement patterns in V8 and for proxies in particular.

    Proxies are “objects used to define custom behavior for fundamental operations (e.g. property lookup, assignment, enumeration, function invocation, etc.)” (definition by MDN). More info can be found in the full specification. For example, the following code snippet adds logging to every property access on the object:

    const target = {};
    const callTracer = new Proxy(target, {
      get: (target, name, receiver) => {
        console.log(`get was called for: ${name}`);
        return target[name];
      }
    });

    callTracer.property = 'value';
    console.log(callTracer.property);
    // get was called for: property
    // value

    Constructing proxies

    The first feature we'll focus on is the construction of proxies. Our original C++ implementation here followed the ECMAScript specification step-by-step, resulting in at least 4 jumps between the C++ and JS runtimes as shown in the following figure. We wanted to port this implementation into the platform-agnostic CodeStubAssembler (CSA), which is executed in the JS runtime as opposed to the C++ runtime. This porting minimizes the number of jumps between the language runtimes. CEntryStub and JSEntryStub represent the runtimes in the figure below. The dotted lines represent the borders between the JS and C++ runtimes. Luckily, lots of helper predicates were already implemented in the assembler, which made the initial version concise and readable.

    The figure below shows the execution flow for calling a Proxy with any proxy trap (in this example apply, which is being called when the proxy is used as a function) generated by the following sample code:

    function foo(...) {...}
    g = new Proxy({...}, {
      apply: foo
    });
    g(1, 2);

    After porting the trap execution to CSA all of the execution happens in the JS runtime, reducing the number of jumps between languages from 4 to 0.

    This change resulted in the following performance improvements:

    Our JS performance score shows an improvement between 49% and 74%. This score roughly measures how many times the given microbenchmark can be executed in 1000ms. For some tests the code is run multiple times in order to get an accurate enough measurement given the timer resolution. The code for all of the following benchmarks can be found in our js-perf-test directory.

    Call and construct traps

    The next section shows the results from optimizing call and construct traps (a.k.a. "apply" and "construct").

    The performance improvements when calling proxies are significant — up to 500% faster! Still, the improvement for proxy construction is quite modest, especially in cases where no actual trap is defined — only about 25% gain. We investigated this by running the following command with the d8 shell:

    $ out/x64.release/d8 --runtime-call-stats test.js
    > run: 120.104000

    Runtime Function/C++ Builtin                        Time             Count
    ========================================================================================
    NewObject                                        59.16ms  48.47%    100000  24.94%
    JS_Execution                                     23.83ms  19.53%         1   0.00%
    RecompileSynchronous                             11.68ms   9.57%        20   0.00%
    AccessorNameGetterCallback                       10.86ms   8.90%    100000  24.94%
    AccessorNameGetterCallback_FunctionPrototype      5.79ms   4.74%    100000  24.94%
    Map_SetPrototype                                  4.46ms   3.65%    100203  25.00%

    … SNIPPET …

    Where test.js's source is:

    function MyClass() {}
    MyClass.prototype = {};
    const P = new Proxy(MyClass, {});
    function run() {
      return new P();
    }
    const N = 1e5;
    console.time('run');
    for (let i = 0; i < N; ++i) {
      run();
    }
    console.timeEnd('run');

    It turned out most of the time is spent in NewObject and the functions called by it, so we started planning how to speed this up in future releases.

    Get trap

    The next section describes how we optimized the other most common operations — getting and setting properties through proxies. It turned out the get trap is more involved than the previous cases, due to the specific behavior of V8's inline cache. For a detailed explanation of inline caches, you can watch this talk.

    Eventually we managed to get a working port to CSA with the following results:

    After landing the change, we noticed the size of the Android .apk for Chrome had grown by ~160 KB, which is more than expected for a helper function of roughly 20 lines, but fortunately we track such statistics. It turned out this function is called twice from another function, which is called 3 times, from another called 4 times. The cause of the problem turned out to be the aggressive inlining. Eventually we solved the issue by turning the inline function into a separate code stub, thus saving precious kilobytes; the final version increased the .apk size by only ~19 KB.

    Has trap

    The next section shows the results from optimizing the has trap. Although at first we thought it would be easier (and reuse most of the code of the get trap), it turned out to have its own peculiarities. A particularly hard-to-track-down problem was the prototype chain walking when calling the in operator. The improvement results achieved vary between 71% and 428%. Again, the gain is more prominent in cases where the trap is present.

    Set trap

    The next section talks about porting the set trap. This time we had to differentiate between named and indexed properties (elements). These two main types are not part of the JS language, but are essential for V8's efficient property storage. The initial implementation still bailed out to the runtime for elements, which meant crossing the language boundary again. Nevertheless, we achieved improvements between 27% and 438% for cases when the trap is set, at the cost of a decrease of up to 23% when it's not. This performance regression is due to the overhead of the additional check for differentiating between indexed and named properties. For indexed properties, there is no improvement yet. Here are the complete results:
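    To illustrate the named vs. indexed distinction mentioned above: both of the following writes go through the same set trap, but V8 stores them internally as a named property and as an element, respectively (a small sketch, not taken from the benchmark suite):

    const proxy = new Proxy({}, {
      set(target, property, value) {
        console.log(`set was called for: ${property}`);
        target[property] = value;
        return true;
      }
    });

    proxy.name = 'named'; // a named property
    proxy[0] = 'indexed'; // an element (indexed property)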

    Real-world usage

    Results from jsdom-proxy-benchmark

    The jsdom-proxy-benchmark project compiles the ECMAScript specification using the Ecmarkup tool. As of v11.2.0, the jsdom project (which underlies Ecmarkup) uses proxies to implement the common data structures NodeList and HTMLCollection. We used this benchmark to get an overview of some more realistic usage than the synthetic micro-benchmarks, and achieved the following results, average of 100 runs:

    • Node v8.4.0 (without Proxy optimizations): 14277 ± 159 ms
    • Node v9.0.0-v8-canary-20170924 (with only half of the traps ported): 11789 ± 308 ms
    • Gain in speed of around 2.4 seconds, which is ~17% better

    (Thanks to TimothyGu for providing these results.)

    Results from Chai.js

    Chai.js is a popular assertion library which makes heavy use of proxies. We've created a kind of real-world benchmark by running its tests with different versions of V8, yielding an improvement of roughly 1 second out of more than 4 seconds (average of 100 runs):

    Optimization approach

    We often tackle performance issues using a generic optimization scheme. The main approach that we followed for this particular work included the following steps:

    • Implement performance tests for the particular sub-feature
    • Add more specification conformance tests (or write them from scratch)
    • Investigate the original C++ implementation
    • Port the sub-feature to the platform-agnostic CodeStubAssembler
    • Optimize the code even further by hand-crafting a TurboFan implementation
    • Measure the performance improvement.

    This approach can be applied to any general optimization task that you may have.

    Written by Maya Lekova (@MayaLekova), Optimizer of Proxies

    V8 Release 6.3

    Every six weeks, we create a new branch of V8 as part of our release process. Each version is branched from V8’s git master immediately before a Chrome Beta milestone. Today we’re pleased to announce our newest branch, V8 version 6.3, which is in beta until its release in coordination with Chrome 63 Stable in several weeks. V8 v6.3 is filled with all sorts of developer-facing goodies. This post provides a preview of some of the highlights in anticipation of the release.


    Speed

    Jank Busters III hit the shelves as part of the Orinoco project. Concurrent marking (70-80% of marking is done on a non-blocking thread) is shipped.

    The parser now does not need to preparse a function a second time. This translates to a 14% median improvement in parse time on our internal startup top25 benchmark.

    string.js has been completely ported to CodeStubAssembler. Thanks a lot to @peterwmwong for his awesome contributions! As a developer this means that builtin string functions like String#trim are a lot faster starting with 6.3.

    Object.is()'s performance is now roughly on par with alternatives. In general, 6.3 continues the path towards better ES2015+ performance. Among other items, we boosted the speed of polymorphic access to symbols, polymorphic inlining of constructor calls, and (tagged) template literals.

     V8's performance over the past six releases

    The weak optimized function list is gone. More information can be found in the dedicated blog post.

    The mentioned items are a non-exhaustive list of speed improvements; lots of other performance-related work has happened as well.

    Memory consumption

    Write barriers are switched over to using the CodeStubAssembler. This saves around 100 KB of memory per isolate.

    ECMAScript language features

    V8 shipped the following stage 3 features: Dynamic module import via import(), Promise.prototype.finally and async iterators/generators.

    With dynamic module import it is very straightforward to import modules based on runtime conditions. This comes in handy when an application should lazy load certain code modules.
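    For example, a module can be loaded only when the user actually needs it (the module path and the openShareDialog export below are made up for illustration):

    const button = document.querySelector('#share-button');
    button.addEventListener('click', async () => {
      // The module is fetched and evaluated only when the button is clicked.
      const shareDialog = await import('./share-dialog.mjs');
      shareDialog.openShareDialog();
    });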

    Promise.prototype.finally introduces a way to easily clean up after a promise is settled.
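    For example, a loading spinner can be hidden regardless of whether the request succeeded or failed (fetchResults and the spinner helpers here are hypothetical):

    showLoadingSpinner();
    fetchResults()
      .then((results) => renderResults(results))
      .catch((error) => renderError(error))
      .finally(() => {
        // Runs after the promise settles, whether it was fulfilled or rejected.
        hideLoadingSpinner();
      });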

    Iterating with async functions got more ergonomic with the introduction of async iterators/generators.
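    For example, an async generator can be consumed with for await…of (a small sketch; fetchPage is a hypothetical helper returning a promise):

    async function* fetchPages(urls) {
      for (const url of urls) {
        // Each iteration waits for the page before yielding it.
        yield await fetchPage(url);
      }
    }

    (async () => {
      for await (const page of fetchPages(['/a', '/b', '/c'])) {
        console.log(page);
      }
    })();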

    Inspector/Debugging

    In Chrome 63 block coverage is also supported in the DevTools UI. Please note that the inspector protocol already supports block coverage since V8 6.2.

    V8 API

    Please check out our summary of API changes. This document is regularly updated a few weeks after each major release.

    Developers with an active V8 checkout can use git checkout -b 6.3 -t branch-heads/6.3 to experiment with the new features in V8 6.3. Alternatively you can subscribe to Chrome’s Beta channel and try the new features out yourself soon.

    Posted by the V8 team

    V8 💚 developers and their tools


    JavaScript performance has always been important to the V8 team, and in this post we would like to discuss a new JavaScript Web Tooling Benchmark that we have been using recently to identify and fix some performance bottlenecks in V8. You may already be aware of V8’s strong commitment to Node.js and this benchmark extends that commitment by specifically running performance tests based on common developer tools built upon Node.js. The tools in the Web Tooling Benchmark are the same ones used by developers and designers today to build modern web sites and cloud-based applications. In continuation of our ongoing efforts to focus on real-world performance rather than artificial benchmarks, we created the benchmark using actual code that developers run every day.

    The Web Tooling Benchmark suite was designed from the beginning to cover important developer tooling use cases for Node.js. Because the V8 team focuses on core JavaScript performance, we built the benchmark in a way that focuses on the JavaScript workloads and excludes measurement of Node.js-specific I/O or external interactions. This makes it possible to run the benchmark in Node.js, in all browsers, and in all major JavaScript engine shells, including ch (ChakraCore), d8 (V8), jsc (JavaScriptCore) and jsshell (SpiderMonkey). Even though the benchmark is not limited to Node.js, we are excited that the Node.js benchmarking working group is considering using the tooling benchmark as a standard for Node performance as well (nodejs/benchmarking#138).

    The individual tests in the tooling benchmark cover a variety of tools that developers commonly use to build JavaScript-based applications.

    See the in-depth analysis for details on all the included tests.

    Based on past experience with other benchmarks like Speedometer, where tests quickly become outdated as new versions of frameworks become available, we made sure it is straight-forward to update each of the tools in the benchmarks to more recent versions as they are released. By basing the benchmark suite on npm infrastructure, we can easily update it to ensure that it is always testing the state of the art in JavaScript development tools. Updating a test case is just a matter of bumping the version in the package.json manifest.

    We created a tracking bug and a spreadsheet to contain all the relevant information that we have collected about V8’s performance on the new benchmark up to this point. Our investigations have already yielded some interesting results. For example, we discovered that V8 was often hitting the slow path for instanceof (v8:6971), incurring a 3–4× slowdown. We also found and fixed performance bottlenecks in certain cases of property assignments of the form of obj[name] = val where obj was created via Object.create(null). In these cases, V8 would fall off the fast-path despite being able to utilize the fact that obj has a null prototype (v8:6985). These and other discoveries made with the help of this benchmark improve V8, not only in Node.js, but also in Chrome.

    We not only looked into making V8 faster, but also fixed and upstreamed performance bugs in the benchmark’s tools and libraries whenever we found them. For example, we discovered a number of performance bugs in Babel where code patterns like

    value = items[items.length - 1];
    lead to accesses of the property "-1", because the code didn’t check whether items is empty beforehand. This code pattern causes V8 to go through a slow-path due to the "-1" lookup, even though a slightly modified, equivalent version of the JavaScript is much faster. We helped to fix these issues in Babel (babel/babel#6582, babel/babel#6581 and babel/babel#6580). We also discovered and fixed a bug where Babel would access beyond the length of a string (babel/babel#6589), which triggered another slow-path in V8. Additionally we optimized out-of-bounds reads of arrays and strings in V8. We’re looking forward to continue working with the community on improving the performance of this important use case, not only when run on top of V8, but also when run on other JavaScript engines like ChakraCore.
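    Going back to the items[items.length - 1] pattern: a hedged sketch of the kind of guard that avoids the out-of-bounds "-1" lookup (assuming undefined is an acceptable result for an empty array):

    // Before: reads the property "-1" when `items` is empty, hitting a slow path.
    // const value = items[items.length - 1];

    // After: guard against the empty case so the index is always in bounds.
    const value = items.length > 0 ? items[items.length - 1] : undefined;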

    Our strong focus on real-world performance and especially on improving popular Node.js workloads is shown by the constant improvements in V8’s score on the benchmark over the last couple of releases:


    Since V8 5.8, which is the last V8 release before switching to the Ignition+TurboFan architecture, V8’s score on the tooling benchmark has improved by around 60%.

    Over the last several years, the V8 team has come to recognize that no one JavaScript benchmark — even a well-intentioned, carefully crafted one — should be used as a single proxy for a JavaScript engine’s overall performance. However, we do believe that the new Web Tooling Benchmark highlights areas of JavaScript performance that are worth focusing on. Despite the name and the initial motivation, we have found that the Web Tooling Benchmark suite is not only representative of tooling workloads, but is representative of a large range of more sophisticated JavaScript applications that are not tested well by front end-focused benchmarks like Speedometer. It is by no means a replacement for Speedometer, but rather a complementary set of tests.

    The best news of all is that given how the Web Tooling Benchmark is constructed around real workloads, we expect that our recent improvements in benchmark scores will translate directly into improved developer productivity through less time waiting for things to build. Many of these improvements are already available in Node.js: at the time of writing, Node 8 LTS is at V8 6.1 and Node 9 is at V8 6.2.

    The latest version of the benchmark is hosted at https://v8.github.io/web-tooling-benchmark/.

    Benedikt Meurer, @bmeurer, JavaScript Performance Juggler

    Taming architecture complexity in V8 — the CodeStubAssembler


    In this post we’d like to introduce the CodeStubAssembler (CSA), a component in V8 that has been a very useful tool in achieving some big performance wins over the last several V8 releases. The CSA also significantly improved the V8 team’s ability to quickly optimize JavaScript features at a low level with a high degree of reliability, which improved the team’s development velocity.

    A brief history of builtins and hand-written assembly in V8

    To understand the CSA’s role in V8, it’s important to understand a little bit of the context and history that led to its development.

    V8 squeezes performance out of JavaScript using a combination of techniques. For JavaScript code that runs a long time, V8’s TurboFan optimizing compiler does a great job of speeding up the entire spectrum of ES2015+ functionality for peak performance. However, V8 also needs to execute short-running JavaScript efficiently for good baseline performance. This is especially the case for the so-called builtin functions on the pre-defined objects that are available to all JavaScript programs as defined by the ECMAScript specification.

    Historically, many of these builtin functions were self-hosted, that is, they were authored by a V8 developer in JavaScript—albeit a special V8-internal dialect. To achieve good performance, these self-hosted builtins rely on the same mechanisms V8 uses to optimize user-supplied JavaScript. As with user-supplied code, the self-hosted builtins require a warm-up phase in which type feedback is gathered and they need to be compiled by the optimizing compiler.

    Although this technique provides good builtin performance in some situations, it’s possible to do better. The exact semantics of the pre-defined functions on the Array.prototype are specified in exquisite detail in the spec. For important and common special cases, V8’s implementers know in advance exactly how these builtin functions should work by understanding the specification, and they use this knowledge to carefully craft custom, hand-tuned versions up front. These optimized builtins handle common cases without warm-up or the need to invoke the optimizing compiler, since by construction baseline performance is already optimal upon first invocation.

    To squeeze the best performance out of hand-written built-in JavaScript functions (and from other fast-path V8 code that are also somewhat confusingly called builtins), V8 developers traditionally wrote optimized builtins in assembly language. By using assembly, the hand-written builtin functions were especially fast by, among other things, avoiding expensive calls to V8’s C++ code via trampolines and by taking advantage of V8’s custom register-based ABI that it uses internally to call JavaScript functions.

    Because of the advantages of hand-written assembly, V8 accumulated literally tens of thousands of lines of hand-written assembly code for builtins over the years… per platform. All of these hand-written assembly builtins were great for improving performance, but new language features are always being standardized, and maintaining and extending this hand-written assembly was laborious and error-prone.

    Enter the CodeStubAssembler

    V8 developers wrestled with a dilemma for many years: is it possible to create builtins that have the advantage of hand-written assembly without also being fragile and difficult to maintain?

    With the advent of TurboFan the answer to this question is finally “yes”. TurboFan’s backend uses a cross-platform intermediate representation (IR) for low-level machine operations. This low-level machine IR is input to an instruction selector, register allocator, instruction scheduler and code generator that produce very good code on all platforms. The backend also knows about many of the tricks that are used in V8’s hand-written assembly builtins—e.g. how to use and call a custom register-based ABI, how to support machine-level tail calls, and how to elide the construction of stack frames in leaf functions. That knowledge makes the TurboFan backend especially well-suited for generating fast code that integrates well with the rest of V8.

    This combination of functionality made a robust and maintainable alternative to hand-written assembly builtins feasible for the first time. The team built a new V8 component—dubbed the CodeStubAssembler or CSA—that defines a portable assembly language built on top of TurboFan’s backend. The CSA adds an API to generate TurboFan machine-level IR directly without having to write and parse JavaScript or apply TurboFan’s JavaScript-specific optimizations. Although this fast-path to code generation is something that only V8 developers can use to speed up the V8 engine internally, this efficient path for generating optimized assembly code in a cross-platform way directly benefits all developers’ JavaScript code in the builtins constructed with the CSA, including the performance-critical bytecode handlers for V8’s interpreter, Ignition.

    The CSA and JavaScript compilation pipelines


    The CSA interface includes operations that are very low-level and familiar to anybody who has ever written assembly code. For example, it includes functionality like “load this object pointer from a given address” and “multiply these two 32-bit numbers”. The CSA has type verification at the IR level to catch many correctness bugs at compile time rather than runtime. For example, it can ensure that a V8 developer doesn’t accidentally use an object pointer that is loaded from memory as the input for a 32-bit multiplication. This kind of type verification is simply not possible with hand-written assembly stubs.

    A CSA test-drive

    To get a better idea of what the CSA offers, let’s go through a quick example. We’ll add a new internal builtin to V8 that returns the string length from an object if it is a String. If the input object is not a String, the builtin will return undefined.

    First, we add a line to the BUILTIN_LIST_BASE macro in V8’s builtin-definitions.h file that declares the new builtin called GetStringLength and specifies that it has a single input parameter that is identified with the constant kInputObject:

    TFS(GetStringLength, kInputObject)

    The TFS macro declares the builtin as a TurboFan builtin using standard CodeStub linkage, which simply means that it uses the CSA to generate its code and expects parameters to be passed via registers.

    We can then define the contents of the builtin in builtins-string-gen.cc:

    TF_BUILTIN(GetStringLength, CodeStubAssembler) {
      Label not_string(this);

      // Fetch the incoming object using the constant we defined for
      // the first parameter.
      Node* const maybe_string = Parameter(Descriptor::kInputObject);

      // Check to see if input is a Smi (a special representation
      // of small numbers). This needs to be done before the IsString
      // check below, since IsString assumes its argument is an
      // object pointer and not a Smi. If the argument is indeed a
      // Smi, jump to the label |not_string|.
      GotoIf(TaggedIsSmi(maybe_string), &not_string);

      // Check to see if the input object is a string. If not, jump to
      // the label |not_string|.
      GotoIfNot(IsString(maybe_string), &not_string);

      // Load the length of the string (having ended up in this code
      // path because we verified it was string above) and return it
      // using a CSA "macro" LoadStringLength.
      Return(LoadStringLength(maybe_string));

      // Define the location of label that is the target of the failed
      // IsString check above.
      BIND(&not_string);

      // Input object isn't a string. Return the JavaScript undefined
      // constant.
      Return(UndefinedConstant());
    }

    Note that in the example above, there are two types of instructions used. There are primitive CSA instructions that translate directly into one or two assembly instructions like GotoIf and Return. There is a fixed set of pre-defined CSA primitive instructions roughly corresponding to the most commonly used assembly instructions you would find on one of V8’s supported chip architectures. Other instructions in the example are macro instructions, like LoadStringLength, TaggedIsSmi, and IsString, that are convenience functions to output one or more primitive or macro instructions inline. Macro instructions are used to encapsulate commonly used V8 implementation idioms for easy reuse. They can be arbitrarily long and new macro instructions can be easily defined by V8 developers whenever needed.

    After compiling V8 with the above changes, we can run mksnapshot, the tool that compiles builtins to prepare them for V8’s snapshot, with the --print-code command-line option. This option prints the generated assembly code for each builtin. If we grep for GetStringLength in the output, we get the following result on x64 (the code output is cleaned up a bit to make it more readable):

        test al,0x1
    jz not_string
    movq rbx,[rax-0x1]
    cmpb [rbx+0xb],0x80
    jnc not_string
    movq rax,[rax+0xf]
    retl
    not_string:
    movq rax,[r13-0x60]
    retl

    On 32-bit ARM platforms, the following code is generated by mksnapshot:

        tst r0, #1
    beq +28 -> not_string
    ldr r1, [r0, #-1]
    ldrb r1, [r1, #+7]
    cmp r1, #128
    bge +12 -> not_string
    ldr r0, [r0, #+7]
    bx lr
    not_string:
    ldr r0, [r10, #+16]
    bx lr

    Even though our new builtin uses a non-standard (at least non-C++) calling convention, it’s possible to write test cases for it. The following code can be added to test-run-stubs.cc to test the builtin on all platforms:

    TEST(GetStringLength) {
      HandleAndZoneScope scope;
      Isolate* isolate = scope.main_isolate();
      Heap* heap = isolate->heap();
      Zone* zone = scope.main_zone();

      // Test the case where input is a string
      StubTester tester(isolate, zone, Builtins::kGetStringLength);
      Handle<String> input_string(
          isolate->factory()->NewStringFromAsciiChecked("Oktoberfest"));
      Handle<Object> result1 = tester.Call(input_string);
      CHECK_EQ(11, Handle<Smi>::cast(result1)->value());

      // Test the case where input is not a string (e.g. undefined)
      Handle<Object> result2 =
          tester.Call(isolate->factory()->undefined_value());
      CHECK(result2->IsUndefined(isolate));
    }

    For more details about using the CSA for different kinds of builtins and for further examples, see this wiki page.

    A V8 developer velocity multiplier

    The CSA is more than just a universal assembly language that targets multiple platforms. It enables much quicker turnaround when implementing new features compared to hand-writing code for each architecture as we used to do. It does this by providing all of the benefits of hand-written assembly while protecting developers against its most treacherous pitfalls:

    • With the CSA, developers can write builtin code with a cross-platform set of low-level primitives that translate directly to assembly instructions. The CSA’s instruction selector ensures that this code is optimal on all of the platforms that V8 targets without requiring V8 developers to be experts in each of those platform’s assembly languages.
    • The CSA’s interface has optional types to ensure that the values manipulated by the low-level generated assembly are of the type that the code author expects.
    • Register allocation between assembly instructions is done by the CSA automatically rather than explicitly by hand, including building stack frames and spilling values to the stack if a builtin uses more registers than are available or makes calls. This eliminates a whole class of subtle, hard-to-find bugs that plagued hand-written assembly builtins. By making the generated code less fragile, the CSA drastically reduces the time required to write correct low-level builtins.
    • The CSA understands ABI calling conventions—both standard C++ and internal V8 register-based ones—making it possible to easily interoperate between CSA-generated code and other parts of V8.
    • Since CSA code is C++, it’s easy to encapsulate common code generation patterns in macros that can be easily reused in many builtins.
    • Because V8 uses the CSA to generate the bytecode handlers for Ignition, it is very easy to inline the functionality of CSA-based builtins directly into the handlers to improve the interpreter’s performance.
    • V8’s testing framework supports testing CSA functionality and CSA-generated builtins from C++ without having to write assembly adapters.

    All in all, the CSA has been a game changer for V8 development. It has significantly improved the team’s ability to optimize V8. That means we are able to optimize more of the JavaScript language faster for V8’s embedders.

    Posted by Daniel Clifford, CodeStubAssembler assembler

    Orinoco: young generation garbage collection


    JavaScript objects in V8 are allocated on a heap managed by V8’s garbage collector. In previous blog posts we have already talked about how we reduce garbage collection pause times (more than once) and memory consumption. In this blog post we introduce the parallel Scavenger, one of the latest features of Orinoco, V8’s mostly concurrent and parallel garbage collector, and discuss design decisions and alternative approaches we implemented on the way.

    V8 partitions its managed heap into generations where objects are initially allocated in the “nursery” of the young generation. Upon surviving a garbage collection, objects are copied into the intermediate generation, which is still part of the young generation. After surviving another garbage collection, these objects are moved into the old generation (see Figure 1). V8 implements two garbage collectors: one that frequently collects the young generation, and one that collects the full heap including both the young and old generation. Old-to-young generation references are roots for the young generation garbage collection. These references are recorded to provide efficient root identification and reference updates when objects are moved.

    Figure 1: Generational garbage collection

    Since the young generation is relatively small (up to 16MiB in V8) it fills up quickly with objects and requires frequent collections. Until M62, V8 used a Cheney semispace copying garbage collector (see below) that divides the young generation into two halves. During JavaScript execution only one half of the young generation is available for allocating objects, while the other half remains empty. During a young garbage collection, live objects are copied from one half to the other half, compacting the memory on the fly. Live objects that have already been copied once are considered part of the intermediate generation and are promoted to the old generation.

    Starting with M62, V8 switched the default algorithm for collecting the young generation to a parallel Scavenger, similar to Halstead’s semispace copying collector with the difference that V8 makes use of dynamic instead of static work stealing across multiple threads. In the following we explain three algorithms: a) the single-threaded Cheney semispace copying collector, b) a parallel Mark-Evacuate scheme, and c) the parallel Scavenger.

    Single-threaded Cheney’s Semispace Copy

    Until M62, V8 used Cheney’s semispace copying algorithm which is well-suited for both single-core execution and a generational scheme. Before a young generation collection, both semispace halves of memory are committed and assigned proper labels: the pages containing the current set of objects are called from-space while the pages that objects are copied to are called to-space.

    The Scavenger considers references in the call stack and references from the old to the young generation as roots. Figure 2 illustrates the algorithm where initially the Scavenger scans these roots and copies objects reachable in the from-space that have not yet been copied to the to-space. Objects that have already survived a garbage collection are promoted (moved) to the old generation. After root scanning and the first round of copying, the objects in the newly allocated to-space are scanned for references. Similarly, all promoted objects are scanned for new references to from-space. These three phases are interleaved on the main thread. The algorithm continues until no more new objects are reachable from either to-space or the old generation. At this point the from-space only contains unreachable objects, i.e., it only contains garbage.

    Figure 2: Cheney’s semispace copying algorithm used for young generation garbage collections in V8

    Parallel Mark-Evacuate

    We experimented with a parallel Mark-Evacuate algorithm based on V8’s full Mark-Sweep-Compact collector. The main advantage is leveraging the already existing garbage collection infrastructure from the full Mark-Sweep-Compact collector. The algorithm consists of three phases: marking, copying, and updating pointers, as shown in Figure 3. To avoid sweeping pages in the young generation to maintain free lists, the young generation is still maintained using a semispace that is always kept compact by copying live objects into to-space during garbage collection. The young generation is initially marked in parallel. After marking, live objects are copied in parallel to their corresponding spaces. Work is distributed based on logical pages. Threads participating in copying keep their own local allocation buffers (LABs), which are merged upon finishing copying. After copying, the same parallelization scheme is applied for updating inter-object pointers. These three phases are performed in lockstep, i.e., while the phases themselves are performed in parallel, threads have to synchronize before continuing to the next phase.

    Figure 3: Young Generation Parallel Mark-Evacuate garbage collection in V8

    Parallel Scavenge

    The parallel Mark-Evacuate collector separates the phases of computing liveness, copying live objects, and updating pointers. An obvious optimization is to merge these phases, resulting in an algorithm that marks, copies, and updates pointers at the same time. By merging those phases we actually get the parallel Scavenger used by V8, which is a version similar to Halstead’s semispace collector with the difference that V8 uses dynamic work stealing and a simple load balancing mechanism for scanning the roots (see Figure 4). Like the single-threaded Cheney algorithm, the phases are: scanning for roots, copying within the young generation, promoting to the old generation, and updating pointers. We found that the majority of the root set is usually the references from the old generation to the young generation. In our implementation, remembered sets are maintained per-page, which naturally distributes the roots set among garbage collection threads. Objects are then processed in parallel. Newly found objects are added to a global work list from which garbage collection threads can steal. This work list provides fast task local storage as well as global storage for sharing work. A barrier makes sure that tasks do not prematurely terminate when the sub graph currently processed is not suitable for work stealing (e.g. a linear chain of objects). All phases are performed in parallel and interleaved on each task, maximizing the utilization of worker tasks.

    Figure 4: Young generation parallel Scavenger in V8

    Results and outcome

    The Scavenger algorithm was initially designed having optimal single-core performance in mind. The world has changed since then. CPU cores are often plentiful, even on low-end mobile devices. More importantly, often these cores are actually up and running. To fully utilize these cores, one of the last sequential components of V8’s garbage collector, the Scavenger, had to be modernized.

    The big advantage of a parallel Mark-Evacuate collector is that exact liveness information is available. This information can e.g. be used to avoid copying at all by just moving and relinking pages that contain mostly live objects which is also performed by the full Mark-Sweep-Compact collector. In practice, however, this was mostly observable on synthetic benchmarks and rarely showed up on real websites. The downside of the parallel Mark-Evacuate collector is the overhead of performing three separate lockstep phases. This overhead is especially noticeable when the garbage collector is invoked on a heap with mostly dead objects, which is the case on many real-world webpages. Note that invoking garbage collections on heaps with mostly dead objects is actually the ideal scenario, as garbage collection is usually bounded by the size of live objects.

    The parallel Scavenger closes this performance gap by providing performance that is close to the optimized Cheney algorithm on small or almost empty heaps while still providing a high throughput in case the heaps get larger with lots of live objects.

    V8 supports, among many other platforms, Arm big.LITTLE. While offloading work onto little cores benefits battery lifetime, it can lead to stalls on the main thread when work packages for the little cores are too big. We observed that page-level parallelism does not necessarily load balance work on big.LITTLE for a young generation garbage collection due to the limited number of pages. The Scavenger naturally solves this issue by providing medium-grained synchronization using explicit work lists and work stealing.

    Figure 5: Total young generation garbage collection time (in ms) across various websites

    V8 now ships with the parallel Scavenger, which reduces the main thread young generation garbage collection total time by about 20%–50% across a large set of benchmarks (details on our perf waterfalls). Figure 5 shows a comparison of the implementations across various real-world websites, showing improvements around 55% (2×). Similar improvements can be observed in maximum and average pause time while maintaining minimum pause time. The parallel Mark-Evacuate collector scheme still has potential for optimization. Stay tuned if you want to find out what happens next.

    Posted by friends of TSAN: Ulan Degenbaev, Michael Lippautz, and Hannes Payer


    JavaScript code coverage


    What is it?

    Code coverage provides information about whether, and optionally how often, certain parts of an application have been executed. It’s commonly used to determine how thoroughly a test suite exercises a particular codebase.

    Why is it useful?

    As a JavaScript developer, you may often find yourself in a situation in which code coverage could be useful. For instance:

    • Interested in the quality of your test suite? Refactoring a large legacy project? Code coverage can show you exactly which parts of your codebase are covered.
    • Want to quickly know if a particular part of the codebase is reached? Instead of instrumenting with console.log for printf-style debugging or manually stepping through the code, code coverage can display live information about which parts of your applications have been executed.
    • Or maybe you’re optimizing for speed and would like to know which spots to focus on? Execution counts can point out hot functions and loops.

    JavaScript code coverage in V8

    Earlier this year, we added native support for JavaScript code coverage to V8. The initial release in version 5.9 provided coverage at function granularity (showing which functions have been executed), which was later extended to support coverage at block granularity in 6.2 (likewise, but for individual expressions).

    Function granularity (left) and block granularity (right)

    For JavaScript developers

    There are currently two primary ways to access coverage information. For JavaScript developers, Chrome DevTools’ Coverage tab exposes JS (and CSS) coverage ratios and highlights dead code in the Sources panel.

    Block coverage in the DevTools Coverage pane. Covered lines are highlighted in green, uncovered in red.

    Thanks to Benjamin Coe, there is also ongoing work to integrate V8’s code coverage information into the popular Istanbul.js code coverage tool.

    An Istanbul.js report based on V8 coverage data.

    For embedders

    Embedders and framework authors can hook directly into the Inspector API for more flexibility. V8 offers two different coverage modes:

    1. Best-effort coverage collects coverage information with minimal impact on runtime performance, but might lose data on garbage-collected (GC) functions.

    2. Precise coverage ensures that no data is lost to the GC, and users can choose to receive execution counts instead of binary coverage information; but performance might be impacted by increased overhead (see the next section for more details). Precise coverage can be collected either at function or block granularity.

    The Inspector API for precise coverage consists, roughly, of the following methods on the Profiler domain (a sketch inferred from the calls used in the conversation below):
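
    Profiler.startPreciseCoverage({ callCount, detailed })  // begin collecting precise coverage
    Profiler.takePreciseCoverage()                          // return coverage collected since the last request
    Profiler.stopPreciseCoverage()                          // end collection and free related data structures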

    A conversation through the Inspector protocol might look like this:

    // The embedder directs V8 to begin collecting precise coverage.
    { "id": 26, "method": "Profiler.startPreciseCoverage",
      "params": { "callCount": false, "detailed": true } }

    // Embedder requests coverage data (delta since last request).
    { "id": 32, "method": "Profiler.takePreciseCoverage" }

    // The reply contains a collection of nested source ranges.
    { "id": 32, "result": { "result": [
      {
        "functions": [
          {
            "functionName": "fib",
            "isBlockCoverage": true,  // Block granularity.
            "ranges": [               // An array of nested ranges.
              {
                "startOffset": 50,    // Byte offset, inclusive.
                "endOffset": 224,     // Byte offset, exclusive.
                "count": 1
              },
              {
                "startOffset": 97,
                "endOffset": 107,
                "count": 0
              },
              {
                "startOffset": 134,
                "endOffset": 144,
                "count": 0
              },
              {
                "startOffset": 192,
                "endOffset": 223,
                "count": 0
              }
            ]
          }
        ],
        "scriptId": "199",
        "url": "file:///coverage-fib.html"
      }
    ] } }

    // Finally, the embedder directs V8 to end collection and
    // free related data structures.
    { "id": 37, "method": "Profiler.stopPreciseCoverage" }

    Similarly, best-effort coverage can be retrieved using Profiler.getBestEffortCoverage().

    Behind the scenes

    As stated in the previous section, V8 supports two main modes of code coverage: best-effort and precise coverage. Read on for an overview of their implementation.

    Best-effort coverage

    Both best-effort and precise coverage modes heavily reuse other V8 mechanisms, the first of which is called the invocation counter. Each time a function is called through V8’s Ignition interpreter, we increment an invocation counter on the function’s feedback vector. As the function later becomes hot and tiers up through the optimizing compiler, this counter is used to help decide which functions to inline; and now, we also rely on it to report code coverage.

    The second reused mechanism determines the source range of functions. When reporting code coverage, invocation counts need to be tied to an associated range within the source file. For example, in the snippet below, we not only need to report that function f has been executed exactly once, but also that f’s source range begins at line 1 and ends at line 3.

    function f() {
      console.log('Hello World');
    }

    f();

    Again we got lucky and were able to reuse existing information within V8. Functions already knew their start- and end positions within source code due to Function.prototype.toString, which needs to know the function’s location within the source file to extract the appropriate substring.
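
    For instance, calling toString() on the f from the snippet above simply returns the substring of the script spanning those positions (whitespace exactly as written in the source):

    f.toString();
    // → "function f() {\n  console.log('Hello World');\n}"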

    When collecting best-effort coverage, these two mechanisms are simply tied together: first we find all live functions by traversing the entire heap. For each function found, we report the invocation count (stored on the feedback vector, which we can reach from the function) and the source range (conveniently stored on the function itself).

    Note that since invocation counts are maintained regardless of whether coverage is enabled, best-effort coverage does not introduce any runtime overhead. It also does not use dedicated data structures and thus does not need to be explicitly enabled or disabled.

    So why is this mode called best-effort, and what are its limitations? Functions that go out of scope may be freed by the garbage collector. This means that associated invocation counts are lost, and in fact we completely forget that these functions ever existed. Ergo ‘best-effort’: even though we try our best, the collected coverage information may be incomplete.

    Precise coverage (function granularity)

    In contrast to the best-effort mode, precise coverage guarantees that the provided coverage information is complete. To achieve this, we add all feedback vectors to V8’s root set of references once precise coverage is enabled, preventing their collection by the GC. While this ensures no information is lost, it increases memory consumption by keeping objects alive artificially.

    The precise coverage mode can also provide execution counts. This adds another wrinkle to the precise coverage implementation. Recall that the invocation counter is incremented each time a function is called through V8’s interpreter, and that functions can tier up and be optimized once they become hot. But optimized functions no longer increment their invocation counter, and thus the optimizing compiler must be disabled for their reported execution count to remain accurate.
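
    As a hypothetical sketch (the request id is arbitrary), an embedder that wants execution counts at function granularity would start collection with callCount set to true and detailed set to false:

    { "id": 40, "method": "Profiler.startPreciseCoverage",
      "params": { "callCount": true, "detailed": false } }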

    Precise coverage (block granularity)

    Block-granularity coverage must report coverage that is correct down to the level of individual expressions. For example, in the following piece of code, block coverage can detect that the else branch (: c) of the conditional expression a ? b : c is never executed, while function-granularity coverage would only know that the function f (in its entirety) is covered.

    function f(a) {
      return a ? b : c;
    }

    f(true);

    You may recall from the previous sections that we already had function invocation counts and source ranges readily available within V8. Unfortunately, this was not the case for block coverage and we had to implement new mechanisms to collect both execution counts and their corresponding source ranges.

    The first aspect is source ranges: assuming we have an execution count for a particular block, how can we map it to a section of the source code? For this, we need to collect relevant positions while parsing the source files. Prior to block coverage, V8 already did this to some extent. One example is the collection of function ranges due to Function.prototype.toString as described above. Another is that source positions are used to construct the backtrace for Error objects. But neither of these is sufficient to support block coverage; the former is only available for functions, while the latter only stores positions (e.g. the position of the if token for if-else statements), not source ranges.

    We therefore had to extend the parser to collect source ranges. To demonstrate, consider an if-else statement:

    if (cond) {
      /* Then branch. */
    } else {
      /* Else branch. */
    }

    When block coverage is enabled, we collect the source range of the then and else branches and associate them with the parsed IfStatement AST node. The same is done for other relevant language constructs.

    After collecting source ranges during parsing, the second aspect is tracking execution counts at runtime. This is done by inserting a new dedicated IncBlockCounter bytecode at strategic positions within the generated bytecode array. At runtime, the IncBlockCounter bytecode handler simply increments the appropriate counter (reachable through the function object).

    In the above example of an if-else statement, such bytecodes would be inserted at three spots: immediately prior to the body of the then branch, prior to the body of the else branch, and immediately after the if-else statement (such continuation counters are needed due to the possibility of non-local control flow within a branch).
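
    As a rough sketch (the counter slot numbers are invented for illustration), those three insertion points correspond to the following positions in the source:

    if (cond) {
      // IncBlockCounter [0]: then-branch counter
      /* Then branch. */
    } else {
      // IncBlockCounter [1]: else-branch counter
      /* Else branch. */
    }
    // IncBlockCounter [2]: continuation counter after the if-else statement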

    Finally, reporting block-granularity coverage works similarly to function-granularity reporting. But in addition to invocation counts (from the feedback vector), we now also report the collection of interesting source ranges together with their block counts (stored on an auxiliary data structure that hangs off the function).

    If you’d like to learn more about the technical details behind code coverage in V8, see the coverage and block coverage design documents.

    Conclusion

    We hope you’ve enjoyed this brief introduction to V8’s native code coverage support. Please give it a try and don’t hesitate to let us know what works for you, and what doesn’t. Say hello on Twitter (@schuay and @hashseed) or file a bug at crbug.com/v8/new.

    Coverage support in V8 has been a team effort, and thanks are in order to everyone that has contributed: Benjamin Coe, Jakob Gruber, Yang Guo, Marja Hölttä, Andrey Kosyakov, Alexey Kozyatinksiy, Ross McIlroy, Ali Sheikh, Michael Starzinger. Thank you!

    V8 release v6.4


    Every six weeks, we create a new branch of V8 as part of our release process. Each version is branched from V8’s Git master immediately before a Chrome Beta milestone. Today we’re pleased to announce our newest branch, V8 version 6.4, which is in beta until its release in coordination with Chrome 64 Stable in several weeks. V8 v6.4 is filled with all sorts of developer-facing goodies. This post provides a preview of some of the highlights in anticipation of the release.

    Speed

    V8 v6.4 improves the performance of the instanceof operator by 3.6×. As a direct result, uglify-js is now 15–20% faster according to V8’s Web Tooling Benchmark.

    This release also addresses some performance cliffs in Function.prototype.bind. For example, TurboFan now consistently inlines all monomorphic calls to bind. In addition, TurboFan also supports the bound callback pattern, meaning that instead of the following:

    doSomething(callback, someObj);

    you can now use:

    doSomething(callback.bind(someObj));

    This way, the code is more readable, and you still get the same performance.

    Thanks to Peter Wong’s latest contributions, WeakMap and WeakSet are now implemented using the CodeStubAssembler, resulting in performance improvements of up to 5× across the board.

    As part of V8’s on-going effort to improve the performance of array built-ins, we improved Array.prototype.slice performance ~4× by reimplementing it using the CodeStubAssembler. Additionally, calls to Array.prototype.map and Array.prototype.filter are now inlined for many cases, giving them a performance profile competitive with hand-written versions.

    After noticing this coding pattern being used in the wild, we worked to ensure that out-of-bounds loads in arrays, typed arrays, and strings no longer incur a ~10× performance hit.
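
    A hypothetical example of the kind of pattern in question: relying on an out-of-bounds read returning undefined to terminate a loop.

    // The final str[str.length] read is out of bounds and yields undefined,
    // which terminates the loop.
    function sumDigits(str) {
      let sum = 0;
      let i = 0;
      while (str[i] !== undefined) {
        sum += Number(str[i]);
        i++;
      }
      return sum;
    }

    console.log(sumDigits('12345')); // → 15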

    Memory

    V8’s built-in code objects and bytecode handlers are now deserialized lazily from the snapshot, which can significantly reduce memory consumed by each Isolate. Benchmarks in Chrome show savings of several hundred KB per tab when browsing common sites.

    Look out for a dedicated blog post on this subject early next year.

    ECMAScript language features

    This V8 release includes support for two new exciting regular expression features.

    In regular expressions with the /u flag, Unicode property escapes are now enabled by default.

    const regexGreekSymbol = /\p{Script_Extensions=Greek}/u;
    regexGreekSymbol.test('π');
    // → true

    Support for named capture groups in regular expressions is now enabled by default.

    const pattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u;
    const result = pattern.exec('2017-12-15');
    // result.groups.year === '2017'
    // result.groups.month === '12'
    // result.groups.day === '15'

    More details about these features are available in our blog post titled Upcoming regular expression features.

    Thanks to Groupon, V8 now implements import.meta, which enables embedders to expose host-specific metadata about the current module. For example, Chrome 64 exposes the module URL via import.meta.url, and Chrome plans to add more properties to import.meta in the future.
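
    For instance, inside a module loaded from a (hypothetical) URL, import.meta.url reflects that URL:

    // Inside a module loaded from https://example.com/main.mjs (hypothetical URL):
    console.log(import.meta.url);
    // → 'https://example.com/main.mjs'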

    To assist with locale-aware formatting of strings produced by internationalization formatters, developers can now use Intl.NumberFormat.prototype.formatToParts() to format a number to a list of tokens and their type. Thanks to Igalia for implementing this in V8!
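
    As a small sketch (the locale, currency, and value are chosen purely for illustration), formatting a number yields an array of typed parts:

    const formatter = new Intl.NumberFormat('en-US', {
      style: 'currency',
      currency: 'USD',
    });
    formatter.formatToParts(3500);
    // → [
    //   { type: 'currency', value: '$' },
    //   { type: 'integer', value: '3' },
    //   { type: 'group', value: ',' },
    //   { type: 'integer', value: '500' },
    //   { type: 'decimal', value: '.' },
    //   { type: 'fraction', value: '00' },
    // ]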

    V8 API

    Please use git log branch-heads/6.3..branch-heads/6.4 include/v8.h to get a list of the API changes.

    Developers with an active V8 checkout can use git checkout -b 6.4 -t branch-heads/6.4 to experiment with the new features in V8 v6.4. Alternatively you can subscribe to Chrome’s Beta channel and try the new features out yourself soon.

    Posted by the V8 team

    Chrome welcomes Speedometer 2.0!

    Ever since the initial release of Speedometer 1.0 in 2014, the Blink and V8 teams have been using the benchmark as a proxy for real-world use of popular JavaScript frameworks, and we have achieved considerable speedups on it. We verified independently that these improvements translate into real user benefits by measuring against real-world websites: improvements in the page load times of popular websites also improved the Speedometer score.

    JavaScript has rapidly evolved in the meantime, adding many new language features with ES2015 and later standards. The same is true for the frameworks themselves, and as such Speedometer 1.0 has become outdated over time. Hence using Speedometer 1.0 as an optimization indicator raises the risk of not measuring newer code patterns that are actively used.

    The Blink and V8 teams welcome the recent release of the updated Speedometer 2.0 benchmark. Applying the original concept to a list of contemporary frameworks, transpilers and ES2015 features makes the benchmark a prime candidate for optimizations again. Speedometer 2.0 is a great addition to our real-world performance benchmarking tool belt.

    Chrome's mileage so far

    The Blink and V8 teams have already completed a first round of improvements, underlining the importance of this benchmark to us and continuing our journey of focusing on real-world performance. Comparing Chrome 60 from July 2017 with the latest Chrome 64, we have achieved about a 21% improvement on the total score (runs per minute) on a mid-2016 MacBook Pro (4 core, 16GB RAM).


    Let’s zoom into the individual Speedometer 2.0 line items. We doubled the performance of the React runtime by improving Function.prototype.bind. Vanilla-ES2015, AngularJS, Preact, and VueJS improved by 15%–29% due to speeding up JSON parsing and various other performance fixes. The jQuery-TodoMVC app's runtime was reduced by improvements to Blink's DOM implementation, including more lightweight form controls and tweaks to our HTML parser. Additional tweaking of V8's inline caches in combination with the optimizing compiler yielded improvements across the board.


    A significant change over Speedometer 1.0 is the calculation of the final score. Previously, the final score was the average of all scores, which favoured working only on the slowest line items. When looking at the absolute times spent in each line item, we see for instance that the EmberJS-Debug version takes roughly 35 times as long as the fastest benchmark. Hence, to improve the overall score, focusing on EmberJS-Debug has the highest potential.


    Speedometer 2.0 uses the geometric mean for the final score, favouring equal investments into each framework. Consider our recent 16.5% improvement of Preact mentioned above: it would be rather unfair to forgo that improvement just because of its minor contribution to the total time.
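
    To illustrate with toy numbers (not real Speedometer results): a relative improvement on a fast line item barely moves an arithmetic mean of times, but shows up clearly in the geometric mean.

    const arithmeticMean = xs => xs.reduce((a, b) => a + b, 0) / xs.length;
    const geometricMean = xs => Math.exp(xs.reduce((a, b) => a + Math.log(b), 0) / xs.length);

    const before = [700, 20];         // ms: one slow line item, one fast one
    const after = [700, 20 * 0.835];  // a 16.5% improvement on the fast item only

    console.log(arithmeticMean(before) / arithmeticMean(after)); // → ≈ 1.005 (barely visible)
    console.log(geometricMean(before) / geometricMean(after));   // → ≈ 1.094 (clearly visible)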

    We look forward to bringing further performance improvements to Speedometer 2.0, and through it, to the whole web. Stay tuned for more performance high-fives.

    Posted by the Blink and V8 teams

    Optimizing hash tables: hiding the hash code


    ECMAScript 2015 introduced several new data structures such as Map, Set, WeakSet, and WeakMap, all of which use hash tables under the hood. This post details the recent improvements in how V8 v6.3+ stores the keys in hash tables.

    Hash code

    A hash function is used to map a given key to a location in the hash table. A hash code is the result of running this hash function over a given key.

    In V8, the hash code is just a random number, independent of the object value. Therefore, we can’t recompute it, meaning we must store it.

    Previously, for JavaScript objects used as keys, the hash code was stored as a private symbol on the object. A private symbol in V8 is similar to a Symbol, except that it’s not enumerable and doesn’t leak to userspace JavaScript.

    // Simplified version of V8’s self-hosted JavaScript; IS_UNDEFINED and
    // MathRandom are V8-internal helpers, not user-level functions.
    function GetObjectHash(key) {
      let hash = key[hashCodeSymbol];
      if (IS_UNDEFINED(hash)) {
        hash = (MathRandom() * 0x40000000) | 0;
        if (hash === 0) hash = 1;
        key[hashCodeSymbol] = hash;
      }
      return hash;
    }

    This worked well because we didn’t have to reserve memory for a hash code field until the object was added to a hash table, at which point a new private symbol was stored on the object.

    V8 could also optimize the hash code symbol lookup just like any other property lookup using the IC system, providing very fast lookups for the hash code. This works well for monomorphic IC lookups, when the keys have the same hidden class. However, most real-world code doesn’t follow this pattern, and often keys have different hidden classes, leading to slow megamorphic IC lookups of the hash code.

    Another problem with the private symbol approach was that it triggered a hidden class transition in the key on storing the hash code. This resulted in poor polymorphic code not just for the hash code lookup but also for other property lookups on the key and deoptimization from optimized code.

    JavaScript object backing stores

    A JavaScript object (JSObject) in V8 uses two words (apart from its header): one word for storing a pointer to the elements backing store, and another word for storing a pointer to the properties backing store.

    The elements backing store is used for storing properties that look like array indices, whereas the properties backing store is used for storing properties whose keys are strings or symbols. See this V8 blog post by Camillo Bruni for more information about these backing stores.

    const x = {};
    x[1] = 'bar'; // ← stored in elements
    x['foo'] = 'bar'; // ← stored in properties

    Hiding the hash code

    The easiest solution to storing the hash code would be to extend the size of a JavaScript object by one word and store the hash code directly on the object. However, this would waste memory for objects that aren’t added to a hash table. Instead, we could try to store the hash code in the elements store or properties store.

    The elements backing store is an array containing its length and all the elements. There’s not much to be done here, as storing the hash code in a reserved slot (like the 0th index) would still waste memory when we don’t use the object as a key in a hash table.

    Let’s look at the properties backing store. There are two kinds of data structures used as a properties backing store: arrays and dictionaries.

    Unlike the array used in the elements backing store, which does not have an upper limit, the array used in the properties backing store has an upper limit of 1022 values. V8 transitions to using a dictionary on exceeding this limit for performance reasons. (I’m slightly simplifying this — V8 can also use a dictionary in other cases, but there is a fixed upper limit on the number of values that can be stored in the array.)

    So, there are three possible states for the properties backing store:

    1. empty (no properties)
    2. array (can store up to 1022 values)
    3. dictionary

    The properties backing store is empty

    For the empty case, we can store the hash code directly in the (otherwise unused) properties offset on the JSObject.

    The properties backing store is an array

    V8 represents integers less than 2³¹ (on 32-bit systems) unboxed, as Smis. In a Smi, the least significant bit is a tag used to distinguish it from pointers, while the remaining 31 bits hold the actual integer value.

    Normally, arrays store their length as a Smi. Since we know that the maximum capacity of this array is only 1022, we only need 10 bits to store the length. We can use the remaining 21 bits to store the hash code!
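
    Back-of-the-envelope (a sketch of the arithmetic only, not V8’s actual bit layout): 1022 fits in 10 bits, which leaves 21 of the 31 Smi payload bits free.

    const LENGTH_BITS = 10;
    const HASH_BITS = 31 - LENGTH_BITS;
    console.log(2 ** LENGTH_BITS >= 1022); // → true: 10 bits cover the maximum length
    console.log(HASH_BITS);                // → 21 bits left over for the hash code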

    The properties backing store is a dictionary

    For the dictionary case, we increase the dictionary size by 1 word to store the hash code in a dedicated slot at the beginning of the dictionary. We get away with potentially wasting a word of memory in this case, because the proportional increase in size isn’t as big as in the array case.

    With these changes, the hash code lookup no longer has to go through the complex JavaScript property lookup machinery.

    Performance improvements

    The SixSpeed benchmark tracks the performance of Map and Set, and these changes resulted in a ~500% improvement.

    This change caused a 5% improvement on the Basic benchmark in ARES6 as well.

    This also resulted in an 18% improvement in one of the benchmarks in the Emberperf benchmark suite that tests Ember.js.

    Posted by Sathya Gunasekaran, keeper of hash codes

    V8 release v6.5


    Every six weeks, we create a new branch of V8 as part of our release process. Each version is branched from V8’s Git master immediately before a Chrome Beta milestone. Today we’re pleased to announce our newest branch, V8 version 6.5, which is in beta until its release in coordination with Chrome 65 Stable in several weeks. V8 v6.5 is filled with all sorts of developer-facing goodies. This post provides a preview of some of the highlights in anticipation of the release.

    Untrusted code mode

    In response to the latest speculative side-channel attack called Spectre, V8 introduced an untrusted code mode. If you embed V8, consider leveraging this mode if your application processes user-generated, untrusted code. Please note that the mode is enabled by default, including in Chrome.

    Streaming compilation for WebAssembly code

    The WebAssembly API provides a special function to support streaming compilation in combination with the fetch() API:

    const module = await WebAssembly.compileStreaming(fetch('foo.wasm'));
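
    A slightly fuller sketch of how a page might use it (the module URL, import object, and exported main function are assumptions for this sketch):

    const module = await WebAssembly.compileStreaming(fetch('foo.wasm'));
    // Instantiation can reuse the compiled module; the import object and the
    // exported function name are hypothetical.
    const instance = await WebAssembly.instantiate(module, { /* imports */ });
    instance.exports.main();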

    This API has been available since V8 v6.1 and Chrome 61, although the initial implementation didn’t actually use streaming compilation. However, with V8 v6.5 and Chrome 65 we take advantage of this API and compile WebAssembly modules already while we are still downloading the module bytes. As soon as we download all bytes of a single function, we pass the function to a background thread to compile it.

    Our measurements show that with this API, the WebAssembly compilation in Chrome 65 can keep up with up to 50 Mbit/sec download speed on high-end machines. This means that if you download WebAssembly code with 50 Mbit/sec, compilation of that code finishes as soon as the download finishes.

    For the graph below we measure the time it takes to download and compile a WebAssembly module with 67 MB and about 190,000 functions. We do the measurements with 25 Mbit/sec, 50 Mbit/sec, and 100 Mbit/sec download speed.

    When the download time is longer than the compile time of the WebAssembly module, e.g. in the graph above with 25 Mbit/sec and 50 Mbit/sec, then WebAssembly.compileStreaming() finishes compilation almost immediately after the last bytes are downloaded.

    When the download time is shorter than the compile time, then WebAssembly.compileStreaming() takes about as long as it takes to compile the WebAssembly module without downloading the module first.

    Speed

    We continued to work on widening the fast-path of JavaScript builtins in general, adding a mechanism to detect and prevent a ruinous situation called a “deoptimization loop.” This occurs when your optimized code deoptimizes, and there is no way to learn what went wrong. In such scenarios, TurboFan just keeps trying to optimize, finally giving up after about 30 attempts. This would happen if you did something to alter the shape of the array in the callback function of any of our second-order array builtins. For example, changing the length of the array — in v6.5, we note when that happens, and stop inlining the array builtin called at that site on future optimization attempts.
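
    A hypothetical example of the kind of pattern that used to trigger this (the array and callback are invented for illustration):

    const values = [1, 2, 3, 4, 5];
    values.forEach((x) => {
      if (x === 3) {
        values.length = 4; // alters the shape of the array mid-iteration
      }
      console.log(x);
    });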

    We also widened the fast-path by inlining many builtins that were formerly excluded because of a side-effect between the load of the function to call and the call itself, for example a function call. And String.prototype.indexOf got a 10× performance improvement in function calls.

    In V8 v6.4, we’d inlined support for Array.prototype.forEach, Array.prototype.map, and Array.prototype.filter. In V8 v6.5 we’ve added inlining support for:

    • Array.prototype.reduce
    • Array.prototype.reduceRight
    • Array.prototype.find
    • Array.prototype.findIndex
    • Array.prototype.some
    • Array.prototype.every

    Furthermore, we’ve widened the fast path on all these builtins. At first we would bail out on seeing arrays with floating-point numbers, or (even more bailing out) if the arrays had “holes” in them, e.g. [3, 4.5, , 6]. Now, we handle holey floating-point arrays everywhere except in find and findIndex, where the spec requirement to convert holes into undefined throws a monkey-wrench into our efforts (for now…!).
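
    A small sketch of the distinction, using the array from the text above: filter skips holes entirely, while find and findIndex must visit them as undefined.

    const arr = [3, 4.5, , 6]; // index 2 is a hole

    arr.filter((x) => x > 4);              // → [4.5, 6]: the hole is skipped
    arr.findIndex((x) => x === undefined); // → 2: the hole is visited as undefined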

    The following image shows the improvement delta compared to V8 v6.4 in our inlined builtins, broken down into integer arrays, double arrays, and double arrays with holes. Time is in milliseconds.

    V8 API

    Please use git log branch-heads/6.4..branch-heads/6.5 include/v8.h to get a list of the API changes.

    Developers with an active V8 checkout can use git checkout -b 6.5 -t branch-heads/6.5 to experiment with the new features in V8 v6.5. Alternatively you can subscribe to Chrome’s Beta channel and try the new features out yourself soon.

    Posted by the V8 team
