Android Developers Blog

Showing posts with label Renderscript. Show all posts

18 September 2013

RenderScript in the Android Support Library

The RenderScript Support Library lets you take advantage of the latest RenderScript features on devices running Android 2.2 and later.

Posted by Tim Murray, Android RenderScript team

One of the requests we hear most commonly from developers is to enable more devices to run the latest features of RenderScript. Over the past several releases of Android, we’ve added a ton of functionality to the RenderScript runtime, but the runtime's dependence on the core Android platform version has limited the range of devices that can support that new functionality. We’ve been working on a solution to this since last year, and we’re now ready to share it with all Android developers.

Today we're announcing a new RenderScript Support Library and updated SDK tools that together let you take advantage of RenderScript on plaform versions all the way back to Android 2.2 (Froyo).

With ADT v22.2, SDK Tools v22.2, and Android Build Tools v18.1.0, apps targeting Android 2.2 and later can now make use of almost all of the functionality available natively in RenderScript with Android 4.3. This includes access to the newest RenderScript features such as high-performance intrinsics and the new performance optimizations available to scripts.

Using the RenderScript Support Library

Using the RenderScript Support Library is straightforward. Once you've updated ADT and your SDK tools, there are only two things that you have to do to start using Renderscript in your apps:

In your classes that use RenderScript, import the RenderScript Support Library from android.support.v8.renderscript. If you are already using native RenderScript, you can change your import from android.renderscript to android.support.v8.renderscript.
```
import android.support.v8.renderscript.*;
```
In your project.properties, make sure you’re targeting android-18 and add the following lines:
```
renderscript.target=18
renderscript.support.mode=true
sdk.buildtools=18.1.0
```

That’s it! With the RenderScript Support Library, you can continue to use the same APIs from your app as with the native RenderScript package (with a few minor exceptions that we’ll talk about below), and you can use the same features in your own scripts as you would with the latest RenderScript toolchain.

For complete details on how to set up the RenderScript Support Library, see Accessing RenderScript Java APIs.

API and Implementation details

If you'd like to use RenderScript Support Library in your app, there are few things you should know:

First, the RenderScript Support Library supports almost all of the RenderScript API functions as the native API that's available in API level and higher. The one notable exception is that Allocation.USAGE_IO_INPUT and Allocation.USAGE_IO_OUTPUT are not currently available in the RenderScript Support Library.
Second, devices running Android 4.2 and earlier will always run their RenderScript applications on the CPU, while devices running Android 4.3 or later will run their RenderScript applications on whatever processors are available on that particular device. Because the Support Library versions of the scripts have to be precompiled to support all possible platforms, there is a performance hit when running the precompiled scripts compared to runtime compilation on Android 4.3 due to more restrictions on compiler optimizations.

We’re really pleased with how the RenderScript Support Library has turned out. We've already seen how it performs in a shipping app — it's been part of the photo editor in the Google+ Android app since May 2013, and it’s definitely proven itself in a large and widely used application. We hope you’ll be happy with it too.

Join the discussion on

+Android Developers

29 August 2013

RenderScript Intrinsics

Posted by R. Jason Sams, Android RenderScript Tech Lead

RenderScript has a very powerful ability called Intrinsics. Intrinsics are built-in functions that perform well-defined operations often seen in image processing. Intrinsics can be very helpful to you because they provide extremely high-performance implementations of standard functions with a minimal amount of code.

RenderScript intrinsics will usually be the fastest possible way for a developer to perform these operations. We’ve worked closely with our partners to ensure that the intrinsics perform as fast as possible on their architectures — often far beyond anything that can be achieved in a general-purpose language.

Table 1. RenderScript intrinsics and the operations they provide.

Name	Operation
`ScriptIntrinsicConvolve3x3`, `ScriptIntrinsicConvolve5x5`	Performs a 3x3 or 5x5 convolution.
`ScriptIntrinsicBlur`	Performs a Gaussian blur. Supports grayscale and RGBA buffers and is used by the system framework for drop shadows.
`ScriptIntrinsicYuvToRGB`	Converts a YUV buffer to RGB. Often used to process camera data.
`ScriptIntrinsicColorMatrix`	Applies a 4x4 color matrix to a buffer.
`ScriptIntrinsicBlend`	Blends two allocations in a variety of ways.
`ScriptIntrinsicLUT`	Applies a per-channel lookup table to a buffer.
`ScriptIntrinsic3DLUT`	Applies a color cube with interpolation to a buffer.

Your application can use one of these intrinsics with very little code. For example, to perform a Gaussian blur, the application can do the following:

RenderScript rs = RenderScript.create(theActivity);
ScriptIntrinsicBlur theIntrinsic = ScriptIntrinsicBlur.create(mRS, Element.U8_4(rs));;
Allocation tmpIn = Allocation.createFromBitmap(rs, inputBitmap);
Allocation tmpOut = Allocation.createFromBitmap(rs, outputBitmap);
theIntrinsic.setRadius(25.f);
theIntrinsic.setInput(tmpIn);
theIntrinsic.forEach(tmpOut);
tmpOut.copyTo(outputBitmap);

This example creates a RenderScript context and a Blur intrinsic. It then uses the intrinsic to perform a Gaussian blur with a 25-pixel radius on the allocation. The default implementation of blur uses carefully hand-tuned assembly code, but on some hardware it will instead use hand-tuned GPU code.

What do developers get from the tuning that we’ve done? On the new Nexus 7, running that same 25-pixel radius Gaussian blur on a 1.6 megapixel image takes about 176ms. A simpler intrinsic like the color matrix operation takes under 4ms. The intrinsics are typically 2-3x faster than a multithreaded C implementation and often 10x+ faster than a Java implementation. Pretty good for eight lines of code.

Figure 1. Performance gains with RenderScript intrinsics, relative to equivalent multithreaded C implementations.

Applications that need additional functionality can mix these intrinsics with their own RenderScript kernels. An example of this would be an application that is taking camera preview data, converting it from YUV to RGB, adding a vignette effect, and uploading the final image to a SurfaceView for display.

In this example, we’ve got a stream of data flowing between a source device (the camera) and an output device (the display) with a number of possible processors along the way. Today, these operations can all run on the CPU, but as architectures become more advanced, using other processors becomes possible.

For example, the vignette operation can happen on a compute-capable GPU (like the ARM Mali T604 in the Nexus 10), while the YUV to RGB conversion could happen directly on the camera’s image signal processor (ISP). Using these different processors could significantly improve power consumption and performance. As more these processors become available, future Android updates will enable RenderScript to run on these processors, and applications written for RenderScript today will begin to make use of those processors transparently, without any additional work for developers.

Intrinsics provide developers a powerful tool they can leverage with minimal effort to achieve great performance across a wide variety of hardware. They can be mixed and matched with general purpose developer code allowing great flexibility in application design. So next time you have performance issues with image manipulation, I hope you give them a look to see if they can help.

Join the discussion on

+Android Developers

14 January 2013

Evolution of Renderscript Performance

Posted by R. Jason Sams, Android Renderscript Tech Lead

It’s been a year since the last blog post on Renderscript, and with the release of Android 4.2, it’s a good time to talk about the performance work that we’ve done since then. One of the major goals of this past year was to improve the performance of common image-processing operations with Renderscript.

Figure 1. Renderscript image-processing benchmarks run on different Android platform versions (Android 4.0, 4.1, and 4.2) in CPU only on a Galaxy Nexus device.

Figure 2. Renderscript image-processing benchmarks comparing operations run with GPU + CPU to those run in CPU only on the same Nexus 10 device.

When you set out to improve performance, the first task is to measure it. To do this, we built a image-processing benchmark suite. The tests measure how long it takes to apply a given image processing operation to a roughly 1.7 million pixel bitmap. We then ran the benchmark using the same APK on the Galaxy Nexus and normalized the results from Ice Cream Sandwich to 1.0.

We made a few major improvements between ICS and Jelly Bean, which significantly reduced the overhead of short scripts as well as the cost of getting elements out of allocations. Going from Android 4.1 to Android 4.2, we added a number of performance improvements to the math library. Our hardware partners also made major contributions; ARM in particular provided numerous compiler improvements which greatly improved our ability to generate vector code.

Android 4.2 introduced another much more important change: For the first time on any mobile platform. we can use the GPU as a compute device. When run on a device that supports GPU compute, that same benchmark APK will run on the GPU. The chart in Figure 2 is normalized to the same basis as Figure 1.

The Cortex A15 in Nexus 10 is a very good CPU. However, that doesn’t mean we should leave resources idle. The Mali T604 is a very flexible and capable compute device capable of executing a large subset of RenderScript functionality. The green bar in Figure 2 shows what we can do when the Mali is enabled for RS compute. No effort is required on an app developer's part to enable this acceleration; the device will inspect each script and decide which processor to run things automatically. It’s important to note that some scripts can’t be run on the GPU, and such scripts will automatically run on the CPU.

The best part is it doesn’t end here. Performance work is an ongoing effort. RenderScript performance in applications will continue to improve over time as we continue to improve the platform.

To learn more about using Renderscript, see the Renderscript Computation developer's guide.

Join the discussion on

+Android Developers

10 January 2012

Levels in Renderscript

[This post is by R. Jason Sams, an Android Framework engineer who specializes in graphics, performance tuning, and software architecture. —Tim Bray]

For ICS, Renderscript (RS) has been updated with several new features to simplify adding compute acceleration to your application. RS is interesting for compute acceleration when you have large buffers of data on which you need to do significant processing. In this example we will look at applying a levels/saturation operation on a bitmap.

In this case, saturation is implemented by multiplying every pixel by a color matrix Levels are typically implemented with several operations.

Input levels are adjusted.
Gamma correction.
Output levels are adjusted.
Clamp to the valid range.

A simple implementation of this might look like:

for (int i=0; i < mInPixels.length; i++) {
    float r = (float)(mInPixels[i] & 0xff);
    float g = (float)((mInPixels[i] >> 8) & 0xff);
    float b = (float)((mInPixels[i] >> 16) & 0xff);

    float tr = r * m[0] + g * m[3] + b * m[6];
    float tg = r * m[1] + g * m[4] + b * m[7];
    float tb = r * m[2] + g * m[5] + b * m[8];
    r = tr;
    g = tg;
    b = tb;

    if (r < 0.f) r = 0.f;
    if (r > 255.f) r = 255.f;
    if (g < 0.f) g = 0.f;
    if (g > 255.f) g = 255.f;
    if (b < 0.f) b = 0.f;
    if (b > 255.f) b = 255.f;

    r = (r - mInBlack) * mOverInWMinInB;
    g = (g - mInBlack) * mOverInWMinInB;
    b = (b - mInBlack) * mOverInWMinInB;

    if (mGamma != 1.0f) {
        r = (float)java.lang.Math.pow(r, mGamma);
        g = (float)java.lang.Math.pow(g, mGamma);
        b = (float)java.lang.Math.pow(b, mGamma);
    }

    r = (r * mOutWMinOutB) + mOutBlack;
    g = (g * mOutWMinOutB) + mOutBlack;
    b = (b * mOutWMinOutB) + mOutBlack;

    if (r < 0.f) r = 0.f;
    if (r > 255.f) r = 255.f;
    if (g < 0.f) g = 0.f;
    if (g > 255.f) g = 255.f;
    if (b < 0.f) b = 0.f;
    if (b > 255.f) b = 255.f;

    mOutPixels[i] = ((int)r) + (((int)g) << 8) + (((int)b) << 16)
                    + (mInPixels[i] & 0xff000000);
}

This code assumes a bitmap has been loaded and transferred to an integer array for processing. Assuming the bitmaps are already loaded, this is simple.

        mInPixels = new int[mBitmapIn.getHeight() * mBitmapIn.getWidth()];
        mOutPixels = new int[mBitmapOut.getHeight() * mBitmapOut.getWidth()];
        mBitmapIn.getPixels(mInPixels, 0, mBitmapIn.getWidth(), 0, 0,
                            mBitmapIn.getWidth(), mBitmapIn.getHeight());

Once the data is processed with the loop, putting it back into the bitmap to draw is simple.

        mBitmapOut.setPixels(mOutPixels, 0, mBitmapOut.getWidth(), 0, 0,
                             mBitmapOut.getWidth(), mBitmapOut.getHeight());

The full code of the application is around 232 lines when you include code to compute the constants for the filter kernel, manage the controls, and display the image. On the devices I have laying around this takes about 140-180ms to process an 800x423 image.

What if that is not fast enough?

Porting the kernel of this image processing to RS (available at android-renderscript-samples) is quite simple. The pixel processing kernel above, reimplemented for RS looks like:

void root(const uchar4 *in, uchar4 *out, uint32_t x, uint32_t y) {
    float3 pixel = convert_float4(in[0]).rgb;
    pixel = rsMatrixMultiply(&colorMat, pixel);
    pixel = clamp(pixel, 0.f, 255.f);
    pixel = (pixel - inBlack) * overInWMinInB;
    if (gamma != 1.0f)
        pixel = pow(pixel, (float3)gamma);
    pixel = pixel * outWMinOutB + outBlack;
    pixel = clamp(pixel, 0.f, 255.f);
    out->xyz = convert_uchar3(pixel);
}

It takes far fewer lines of code because of the built-in support for vectors of floats, matrix operations, and format conversions. Also note that there is no loop present.

The setup code is slightly more complex because you also need to load the script.

        mRS = RenderScript.create(this);
        mInPixelsAllocation = Allocation.createFromBitmap(mRS, mBitmapIn,
                                                          Allocation.MipmapControl.MIPMAP_NONE,
                                                          Allocation.USAGE_SCRIPT);
        mOutPixelsAllocation = Allocation.createFromBitmap(mRS, mBitmapOut,
                                                           Allocation.MipmapControl.MIPMAP_NONE,
                                                           Allocation.USAGE_SCRIPT);
        mScript = new ScriptC_levels(mRS, getResources(), R.raw.levels);

This code creates the RS context. It then uses this context to create two memory allocations to hold the RS copy of the bitmap data. Last, it loads the script to process the data.

Also in the source there are a few small blocks of code to copy the computed constants to the script when they change. Because we reflect the globals from the script this is easy.

        mScript.set_inBlack(mInBlack);
        mScript.set_outBlack(mOutBlack);
        mScript.set_inWMinInB(mInWMinInB);
        mScript.set_outWMinOutB(mOutWMinOutB);
        mScript.set_overInWMinInB(mOverInWMinInB);

Earlier we noted that there was no loop to process all the pixels. The RS code that processes the bitmap data and copies the result back looks like this:

        mScript.forEach_root(mInPixelsAllocation, mOutPixelsAllocation);
        mOutPixelsAllocation.copyTo(mBitmapOut);

The first line takes the script and processes the input allocation and places the result in the output allocation. It does this by calling the natively compiled version of the script above once for each pixel in the allocation. However, unlike the dalvik implementation, the primitives will automatically launch extra threads to do the work. This, combined with the performance of native code can produce large performance gains. I’ll show the results with and without the gamma function working because it adds a lot of cost.

800x423 image

Device	Dalvik	RS	Gain
Xoom	174ms	39ms	4.5x
Galaxy Nexus	139ms	30ms	4.6x
Tegra 30 device	136ms	19ms	7.2x

800x423 image with gamma correction

Device	Dalvik	RS	Gain
Xoom	994ms	259ms	3.8x
Galaxy Nexus	787ms	213ms	3.7x
Tegra 30 device	783ms	104ms	7.5x

These large gains represent a large return on the simple coding investment shown above.

10 March 2011

Renderscript Part 2

[This post is by R. Jason Sams, an Android engineer who specializes in graphics, performance tuning, and software architecture. —Tim Bray]

In Introducing Renderscript I gave a brief overview of this technology. In this post I’ll look at “compute” in more detail. In Renderscript we use “compute” to mean offloading of data processing from Dalvik code to Renderscript code which may run on the same or different processor(s).

Renderscript’s Design Goals

Renderscript has three primary goals, given here from most to least important.

Portability: Application code needs to be able to run across all devices, even those with radically different hardware. ARM currently comes in several variants — with and without VFP, with and without NEON, and with various register counts. Beyond ARM, there are other CPU architectures like x86, several GPU architectures, and even more DSP architectures.

Performance: The second objective is to get as much performance as possible within the constraints of Portability. For Renderscript to make sense we need to achieve much greater performance than established solutions.

Usability: The third goal is to simplify development as much as possible. Where possible we automate steps to avoid glue code and other developer busy work.

Those three goals lead to several design trade-offs. It’s these trade-offs that separate Renderscript from the existing approaches on the device, such as Dalvik or the NDK. They should be thought of as different tools intended to solve different problems.

Core Design Choices

The first choice that needed to be made is what language should be used. When it comes to languages there are almost unlimited options. Shading style languages, C, and C++ were considered. In the end the shading style languages were ruled out due to the need to be able to manipulate data structures for graphical applications such as scene graphs. The lack of pointers and recursion were considered crippling limitations for usability. C++ on the other hand was very desirable but ran into issues with portability. The advanced C++ features are very difficult to run on non-cpu hardware. In the end we chose to base Renderscript on C99 because it offers equal performance to the other choices, is very well understood by developers, and poses no issues running on a wide range of hardware.

The next design trade-off was work flow. Specifically we focused on how to convert source code to machine code. We explored several options and actually implemented two different solutions during the development of Renderscript. The older versions (Eclair through Gingerbread) compiled the C source code all the way to machine code on the device. While this had some nice properties such as the ability for applications to generate source on the fly it turned out to be a usability problem. Having to compile your app, install it, run it, then find your syntax error was painful. Also the weaker CPU in devices limited the static analysis and scope of optimizations that could be done.

Then we switched to LLVM, moving to a model where scripts are compiled and analyzed on the host using a modified version of clang. We perform high level optimizations at this stage, then emit LLVM bitcode. The translation of the intermediate bitcode to machine code still happens on the device (along with additional device-specific optimizations).

Our last big trade-off for compute was thread launching. The trade-off here is between performance and portability. Given sufficient knowledge, existing compute solutions allow a developer to tune an application for a specific hardware platform to the detriment of others. Given unlimited time and resources developers could tune for every hardware combination. While testing and tuning a variety of devices is never bad, no amount of work allows them to tune for unreleased hardware they don’t yet have. A more portable solution places the tuning burden on the runtime, providing greater average performance at the cost of peak performance. Given that the number one goal was portability we chose to place this burden on the runtime.

A secondary effect of choosing runtime thread-launch management is that dynamic decisions can be made about where to run a script. For example, some compute hardware can support pointers and recursion while others cannot. We could have chosen to disallow these things and give developers a lowest common denominator API, but we chose to instead let the runtime analyze the scripts. This allows developers to get full use of hardware that supports these features, since there is always a fully featured CPU to fall back upon. In the end, developers can focus on writing good apps and the hardware manufacturers can compete on making the most fully featured and efficient hardware. As new features appear, applications will benefit without application code changes.

Usability was a major driver in Renderscript’s design. Most existing compute and graphics platforms require elaborate glue logic to tie the high performance code back to the core application code. This code is very bug prone and usually painful to write. The static analysis we do in the host Renderscript compiler is helpful in solving this issue. Each user script generates a Dalvik “glue” class. Names for the glue class and its accessors are derived from the contents of the script. This greatly simplifies the use of the scripts from Dalvik.

Example: The Application Level

Given these trade-offs, what does a simple compute application look like? In this very basic example we will take a normal android.graphics.Bitmap object and run a script that copies it to a second bitmap, converting it to monochrome along the way. Let’s look at the application code which invokes the script before we look at the script itself; this comes from the HelloCompute SDK sample:

    private Bitmap mBitmapIn;
    private Bitmap mBitmapOut;
    private RenderScript mRS;
    private Allocation mInAllocation;
    private Allocation mOutAllocation;
    private ScriptC_mono mScript;

    private void createScript() {
        mRS = RenderScript.create(this);

        mInAllocation = Allocation.createFromBitmap(mRS, mBitmapIn,
                                                    Allocation.MipmapControl.MIPMAP_NONE,
                                                    Allocation.USAGE_SCRIPT);
        mOutAllocation = Allocation.createTyped(mRS, mInAllocation.getType());

        mScript = new ScriptC_mono(mRS, getResources(), R.raw.mono);

        mScript.set_gIn(mInAllocation);
        mScript.set_gOut(mOutAllocation);
        mScript.set_gScript(mScript);
        mScript.invoke_filter();
        mOutAllocation.copyTo(mBitmapOut);
    }

This function assumes that the two bitmaps have already been created and are of the same size and format.

The first thing all Renderscript applications need is a context object. This is the core object used to create and manage all other Renderscript objects. This first line of the function creates the object, mRS. This object must be kept alive for the duration the application intends to use it or any objects created with it.

The next two function calls create compute allocations from the Bitmaps. Renderscript has its own memory allocator, because the memory may potentially be shared by multiple processors and possibly exist in more than one memory space. When an allocation is created its potential uses need to be enumerated so the system may choose the correct type of memory for its intended uses.

The first function createFromBitmap() creates a matching Renderscript allocation object and copies the contents of the bitmap into the allocation. Allocations are the basic units of memory used in renderscript. The second Allocation created with createTyped() generates an Allocation identical in structure to the first. The definition of that structure is retrieved from the first with the getType() query. Renderscript types define the structure of an Allocation. In this case the type was generated from the height, width, and format of the incoming bitmap.

The next line loads the script, which is named “mono.rs”. R.raw.mono identifies it; scripts are stored as raw resources in an application’s APK. Note the name of the auto-generated “glue” class, ScriptC_mono.

The next three lines set properties of the script, using generated methods in the “glue” class.

Now we have everything set up. The function invoke_filter() actually does some work for us. This causes the function filter() in the script to be called. If the function had parameters they could be passed here. Return values are not allowed as invocations are asynchronous.

The last line of the function copies the result of our compute script back to the managed bitmap; it has the necessary synchronization code built-in to ensure the script has completed running.

Example: The Script

Here’s the Renderscript stored in mono.rs which the application code above invokes:

#pragma version(1)
#pragma rs java_package_name(com.android.example.hellocompute)

rs_allocation gIn;
rs_allocation gOut;
rs_script gScript;

const static float3 gMonoMult = {0.299f, 0.587f, 0.114f};

void root(const uchar4 *v_in, uchar4 *v_out, const void *usrData, uint32_t x, uint32_t y) {
    float4 f4 = rsUnpackColor8888(*v_in);

    float3 mono = dot(f4.rgb, gMonoMult);
    *v_out = rsPackColorTo8888(mono);
}

void filter() {
    rsForEach(gScript, gIn, gOut, 0);
}

The first line is simply an indication to the compiler which revision of the native Renderscript API the script is written against. The second line controls the package association of the generated reflected code.

The three globals listed correspond to the globals which were set up in our managed code. The fourth global is not reflected because it is marked as static. Non-static, const, globals are also allowed but only generate a get reflected method. This can be useful for keeping constants in sync between scripts and managed code.

The function root() is special to renderscript. Conceptually it’s similar to main() in C. When a script is invoked by the runtime, this is the function that will be called. In this case the parameters are the incoming and outgoing pixels from our allocation. A generic user pointer is also provided along with the address within the allocation that this invocation is processing. This example ignores these parameters.

The three lines of the root function unpack the pixel from RGBA_8888 to a vector of four floats. The second line uses a built-in math function to compute the dot product of the monochrome constants with the incoming pixel data to get our grey level. Note that while dot returns a single float it can be assigned to a float3 which simply copies the value to each of the x, y, and z components of the float3. In the end we use another builtin to repack the floats back to a 32 bit pixel. This is also an example of an overloaded function as there are separate versions of rsPackColorTo8888 which take RGB (float3) or RGBA (float4) data. If A is not provided the overloaded functions assume a value of 1.0f.

The filter() function is called from managed code to do the conversion. It simply does a compute launch on each element of the allocation. The first parameter is the script to be launched - the root function of this script will be invoked for each element in the allocation. The second and third parameters are the input and output data allocations. The last parameter is the pointer for the user data if we desired to pass additional information to the root function.

The forEach will launch across multiple threads if the device has multiple processors. In the future forEach can provide a transition point where control may pass from one processor to another. In this example it is reasonable to expect that in the future filter() would get executed on the CPU and root() would occur on a GPU or DSP.

I hope this gives glimpse into the design behind Renderscript and a simple example of how it can be used.

09 February 2011

Introducing Renderscript

[This post is by R. Jason Sams, an Android engineer who specializes in graphics, performance tuning, and software architecture. —Tim Bray]

Renderscript is a key new Honeycomb feature which we haven’t yet discussed in much detail. I will address this in two parts. This post will be a quick overview of Renderscript. A more detailed technical post with a simple example will be provided later.

Renderscript is a new API targeted at high-performance 3D rendering and compute operations. The goal of Renderscript is to bring a lower level, higher performance API to Android developers. The target audience is the set of developers looking to maximize the performance of their applications and are comfortable working closer to the metal to achieve this. It provides the developer three primary tools: A simple 3D rendering API on top of hardware acceleration, a developer friendly compute API similar to CUDA, and a familiar language in C99.

Renderscript has been used in the creation of the new visually-rich YouTube and Books apps. It is the API used in the live wallpapers shipping with the first Honeycomb tablets.

The performance gain comes from executing native code on the device. However, unlike the existing NDK, this solution is cross-platform. The development language for Renderscript is C99 with extensions, which is compiled to a device-agnostic intermediate format during the development process and placed into the application package. When the app is run, the scripts are compiled to machine code and optimized on the device. This eliminates the problem of needing to target a specific machine architecture during the development process.

Renderscript is not intended to replace the existing high-level rendering APIs or languages on the platform. The target use is for performance-critical code segments where the needs exceed the abilities of the existing APIs.

It may seem interesting that nothing above talked about running code on CPUs vs. GPUs. The reason is that this decision is made on the device at runtime. Simple scripts will be able to run on the GPU as compute workloads when capable hardware is available. More complex scripts will run on the CPU(s). The CPU also serves as a fallback to ensure that scripts are always able to run even if a suitable GPU or other accelerator is not present. This is intended to be transparent to the developer. In general, simpler scripts will be able to run in more places in the future. For now we simply leverage the CPU resources and distribute the work across as many CPUs as are present in the device.

The video above, captured through an Android tablet’s HDMI out, is an example of Renderscript compute at work. (There’s a high-def version on YouTube.) In the video we show a simple brute force physics simulation of around 900 particles. The compute script runs each frame and automatically takes advantage of both cores. Once the physics simulation is done, a second graphics script does the rendering. In the video we push one of the larger balls to show the interaction. Then we tilt the tablet and let gravity do a little work. This shows the power of the dual A9s in the new Honeycomb tablet.

Renderscript Graphics provides a new runtime for continuously rendering scenes. This runtime sits on top of HW acceleration and uses the developers’ scripts to provide custom functionality to the controlling Dalvik code. This controlling code will send commands to it at a coarse level such as “turn the page” or “move the list”. The commands the two sides speak are determined by the scripts the developer provides. In this way it’s fully customizable. Early examples of Renderscript graphics were the live wallpapers and 3d application launcher that shipped with Eclair.

With Honeycomb, we have migrated from GL ES 1.1 to 2.0 as the renderer for Renderscript. With this, we have added programmable shader support, 3D model loading, and much more efficient allocation management. The new compiler, based on LLVM, is several times more efficient than acc was during the Eclair-through-Gingerbread time frame. The most important change is that the Renderscript API and tools are now public.

The screenshot above was taken from one of our internal test apps. The application implements a simple scene-graph which demonstrates recursive script to script calling. The Androids are loaded from an A3D file created in Maya and translated from a Collada file. A3D is an on device file format for storing Renderscript objects.

Later we will follow up with more technical information and sample code.