Recent changes
Table of Contents

Everything Can Be Object

Under Construction [8 March 2005]

Programming is so much easier when everything is an object. And while object can mean a lot of things, I’m referring to the idea of treating all variables and fields like references to heap-allocated objects.

Java’s Solution

In Java, this is mostly how it’s done. Almost everything is a pointer to a heap-allocated object. The exception is the set of primitive types (int, bool, double, etc.). Variables of those types store the actual value instead of a pointer to an object.

This difference causes problems when trying to deal with value generically. You can take String objects and pass them around generically as Objects but to do that with int values, you have to wrap them in Integer objects first. This problem isn’t as great as it used to be because the new compiler will do the conversion for you automatically (it’s called “autoboxing”).

The other problem is that now there are two different semantics for how values are handled. When you pass around regular objects, you’re passing by reference (you’re just passing around the pointer). When you pass around primitive types, you’re passing by value (the value is copied around). What does the above code fragment do?

x = y

Do x and y now point to the same thing, or was the value of y copied to x?

Turns out that a Java programmer doesn’t really care. First of all, there are a limited number of primitive types so it’s possible to remember which is which. This is not the case in C# in which the set of pass-by-value types is an open set. Second of all, anything that is pass-by-value in Java is also immutable. For example:

int x = 5
int y = x
x = 12
print("x = " + x + ", y = " + y)

Notice that we can’t change the value “5”. We can reassign the variable x to refer to something else, but we can’t change the actual value. This is demonstrated by the fact that the final value of y is “5” and not “12”.

Since primitives are immutable, it doesn’t matter whether we pass them by value or by reference. Some Java VM might decide to use pointers for all primitive types and allocate all the values on the heap. This would be very inefficient, but the point is that you can’t tell the difference because the behavior is exactly the same. That’s what’s so great about immutable values.

C# Value Types

The Java 1.5 compiler automatically converts between int and java.lang.Integer (and similarly for the other primitive types). But this mean that there’s a hardcoded list of types to convert between. This is kind of messy.

C# introduced the idea of “value types” which, in essense, let you create your own primitive types. More accuratly, it let you create types whose instances are passed by value instead of by reference. The great thing about this feature is that the autoboxing conversion is more uniform as the runtime system is aware of the relationship between the boxed and unboxed versions of the type.

The problem is that value types can be mutable. So now, you have to be aware of whether you’re doing pass-by-value or pass-by-reference. For example:

Thing x = new Thing()
x.Name = "Abcd"
Thing y = x
x.Name = "Hello"
Print("x = " + x.Name + ", y = " + y.Name)

The final result depends on whether Thing is a regular class or a value type. If it is a value type, the Name properties will be different because the assignment statement will copy the object.

“Well,” some may say, “shouldn’t you know what type of object you are dealing with?” Yeah. “And if you are aware of details about the object, it isn’t that hard to remember whether it is a value type or a reference type.” It is hard. The “pass-by” semantics of a type is a fundamental piece of information that you need to be aware of at all times and you need to be aware of this for every single object you use.

The other semantic difference is that value types cannot be “null”. On one hand, null” is a hack and shouldn’t have been allowed in the first place, but if you’re going to allow them on reference types, the same should hold for value types.

Why Value Types?

So why did they add value types to the CLR? Value types give the compiler some freedom to make certain optimizations.

Stack Allocation

Because value types are passed by value, we can safely allocate them on the stack. For example:

 class Range {
    int start
    int length
 }
 
 // Finds the longest run of any character in a string.
 Range FindLongestRun(String s, char c)
 {
    int pos = 0
10    int end = pos
11    while (...) {
12       ...
13    }
14    Range r = new Range(pos, end - pos)
15    return r
16 }
17 
18 void Main()
19 {
20    while (WeHaveMoreInput()) {
21       String s = ReadInput()
22       Range longest = FindLongestRun(s, '.')
23       Print(longest)
24    }
25 }

Now, the Range type is pretty small. Just two integers. If Range objects are allocated on the heap, we’re giving the garbage collector a lot of extra work because we’re going to allocate and then throw away a block of heap memory every time through the loop.

With a C# value type, you would say “struct Range” instead of “class Range” and this will force the value to be allocated on the stack. So now, on line 22, the value of longest will be copied from the value of FindLongestRun. No more allocations on the heap. No work for the garbage collector.

Note that this doesn’t necessarily mean that value types cannot exist on the heap. When they are embedded within a reference type, then they will be located wherever the reference type is located.

Embedding

If you have the type:

class Rectangle
{
   Point topLeft, bottomRight;
}

class Point
{
   int x, y;
}

If Point is a value type, then a Rectangle object can embed the Point object in it directly instead of containing references to separate Point objects. This results in fewer heap objects to keep track of.

Ignoring the optimization, another reason you may want to embed objects is so that you can map a structure directly on to memory. You have to do this a lot when sending data over a network. Though C# uses structs for both optimization and to allow mapping structures directly to memory, the two are not necessarily related.

Smaller Size

Value types don’t need any of the extra header bytes that are attached to heap-allocated objects because

So the actual representation of a value type is a lot like a C struct: there’s no extra cruft. When you’re dealing with a couple objects it doesn’t make a big difference, but when you have an array of them it does. For example, with the Range type, the storage requirement is 8 bytes (since an int is 4 bytes). But if it’s a heap-allocated object, you need at least a machine word (say, 4-bytes) for the virtual method table pointer and possibly another machine word for the garbage collector. By now you’ve doubled the storage requirements and you’ll feel it when you allocate large arrays.

Faster Access

When accessing heap-allocated objects, you only have to pass around pointers, which are small and fast. However you do have to dereference the pointer every time you want to access the actual values. This is slow and things would be faster if you stored the values directly.

While you usually don’t have to care about the performance hit, the penalty is multiplied when doing computationally-intensive stuff on arrays because you’ll never have enough CPU registers to point directly at the values.

Remote Invocation Optimizations

When you pass an object from your VM to a remote VM, the two VMs have to keep in touch with each other so that they’re both aware of what’s happening to the object. If the object gets modified in one VM, the changes have to be propagated to the other one. With value types, you’re always sending a copy and so one VM doesn’t care about the other VM’s copy and so there’s no need to communicate every time the object is changed.

Why Not Value Types?

As far as your program’s design is concerned, there is no reason to use value types over class types. The only reason to use value types would be for performance. Optimizing for performance is fine. However, this performance optimization has disgusting semantic consequences that can affect every usage of the type.

“But what about the optimizations? I need speed”

Relax, most of the optimizations are possible without screwing with the semantics.

Help… The Compiler… Help… You

Compilers are really, really smart. They try hard to do all sorts of funky analysis but they have to be conservative. If they perform an optimization, it has to work 100% of the time. If there’s a single weird case where the optimization could possibly fail, then they usually have to back off and leave your code alone.

The key is to give the compiler as much freedom as you can.

Immutability

If you have an immutable data type, the compiler is allowed to pass instances by value. It may decide not to, but at least the option is available. Any immutable object can be optimized just like a value type can be optimized. And with immutable types you don’t have to worry about the semantics because they always behave the same.

Basically, if you’re object is immutable, the compiler can decide to use pass-by-value for all instances of that type. This means that all the optimizations for value types apply here – automatically and transparently.

Escape Analysis

In a memory-managed runtime, an object has to be kept around as long as there are live references to it (well, not exactly, but close enough). If a compiler can figure out that all references to an object are gone, then it can also get rid of the object itself.

The flipside of this is that if the compiler knows that all references to a given object are contained on the stack, it can allocate the object on the stack as well (and when the stack is unwound, the object and all its references disappear peacefully). Escape analysis is a way of making determining whether references to an object “escape” some boundary.

void Run1(Thing x)
{
   x.Name = "2"
}

void Run2(Thing x)
{
   x.Name = "1"
   someGlobalArray.Add(x)
}

void DoSomething()
{
   Thing t = new Thing()
   Run(t)
}

Do we really need to allocate the Thing object on the heap or can we do it on the stack? Thing is a mutable type, so the immutability optimization doesn’t apply here.

If we call the Run1 function, then we can allocate Thing on the stack and pass a pointer to it. Why is this OK? Because Run1 doesn’t let the pointer “escape” from the stack. After the function completes, the reference to the Thing we gave it is vaporized, never to be seen again.

Why can’t we do the same thing for Run2? Well, trace through it. We call DoSomething and put the Thing object on the stack. The Run2 function gets a pointer to it and copies that pointer into a global array, which lets the pointer “escape”. Then when the DoSomething function returns, the Thing object’s stack space is no longer valid and so the pointer in the global array points to bogus memory. This would compromise the safety of the runtime environment and can’t be allowed.

So the compiler has to figure out whether a particular object reference can “escape” the stack. If not, it can safely allocate the value on the stack. This analysis can be done statically and efficiently.

The same analysis can be used to perform object embedding. The difference is that references to an embedded object shouldn’t escape from the object’s container.

What’s great is that you don’t have to make sure that all uses of SomeObject never escape the stack (which is a tough requirement to satisfy). The compiler can look at each use of SomeObject and perform the optimizations on a case-by-case basis.

Virtual Methods

All of this can happen behind the scenes without bugging the programmer. But the compiler can’t do this all the time. Both the JVM and CLR support dynamic class loading, which is great. Unfortunately, it does add to our problems. When call a virtual method, you can’t say with certainty that the parameters never escape the stack; some class down the line might decides to override the method and let the parameters escape. Even with global analysis, the VM can’t tell what classes will be loaded later on and so it can’t ever be sure that stack allocation is safe.

One way to handle this is to make the VM “optimistic” and tell it to optimize based on the information it currently has. Then, if some dynamically loaded class causes conflicts, it can undo the optimizations. This may or may not be worth it.

Another possibility is to let virtual methods be. There’s a lot of computationally intensive code that doesn’t use virtual methods at all (because virtual methods themselves are kind of slow). Such code can benefit from the optimization.

Personally, I don’t think I’d mind having to place an annotation for every parameter that can’t escape the stack:

class Set {
   ...
   virtual bool contains(Object ob [stack]) // 'ob' doesn't escape
   virtual bool add(Object ob) // 'ob' might escape
}

The compiler can assume that the contains(…) method will never let its parameter escape, since it is annotated with [stack]. Anyone who overrides the contains(…) method must obey the rules. The problem is that this is a lot of extra work finger and brain work…possibly not worth the payoff.

Optional Pass-by-Value

For the remoting optimization, it would be better to simply allow a case-by-case decision on what objects should be copied across.

Under Construction [8 March 2005]

data/everything_can_be_an_object.txt Last modified: 01.10.2008 00:37
Driven by DokuWiki