<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[ping's blog]]></title><description><![CDATA[Programming and stuff]]></description><link>https://blog.tst.sh/</link><image><url>https://blog.tst.sh/favicon.png</url><title>ping&apos;s blog</title><link>https://blog.tst.sh/</link></image><generator>Ghost 5.42</generator><lastBuildDate>Sun, 07 Dec 2025 04:19:07 GMT</lastBuildDate><atom:link href="https://blog.tst.sh/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[New stuff at notes.tst.sh]]></title><description><![CDATA[<p>I&apos;ve been wanting to share my thoughts more, but in a format that is a little less bulky, so over the weekend I set up <a href="https://notes.tst.sh/?ref=blog.tst.sh">notes.tst.sh</a>, which uses <a href="https://obsidian-publisher.netlify.app/?ref=blog.tst.sh">Obsidian Mkdocs Publisher</a>, to give you a better idea of what I&apos;ve been up to.</p><p>Writing in</p>]]></description><link>https://blog.tst.sh/notes/</link><guid isPermaLink="false">64482f158b863a50d7bdeba0</guid><dc:creator><![CDATA[ping]]></dc:creator><pubDate>Tue, 25 Apr 2023 19:59:38 GMT</pubDate><media:content url="https://blog.tst.sh/content/images/2023/04/Screenshot_20230425_160057.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.tst.sh/content/images/2023/04/Screenshot_20230425_160057.png" alt="New stuff at notes.tst.sh"><p>I&apos;ve been wanting to share my thoughts more, but in a format that is a little less bulky, so over the weekend I set up <a href="https://notes.tst.sh/?ref=blog.tst.sh">notes.tst.sh</a>, which uses <a href="https://obsidian-publisher.netlify.app/?ref=blog.tst.sh">Obsidian Mkdocs Publisher</a>, to give you a better idea of what I&apos;ve been up to.</p><p>Writing in a cross-platform markdown editor 
is way nicer than keeping a bunch of drafts in Ghost, and much easier to organize. If it goes well I might even move all of my old blog posts over and set up redirects. Stay tuned!</p>]]></content:encoded></item><item><title><![CDATA[Reverse engineering Flutter apps (Part 2)]]></title><description><![CDATA[As you have probably guessed so far, reverse engineering is not an easy task.]]></description><link>https://blog.tst.sh/reverse-engineering-flutter-apps-part-2/</link><guid isPermaLink="false">5fa9b817c9aac25a0711725a</guid><category><![CDATA[ARM]]></category><category><![CDATA[Assembly]]></category><category><![CDATA[Dart]]></category><category><![CDATA[Low Level]]></category><category><![CDATA[Mobile Dev]]></category><dc:creator><![CDATA[ping]]></dc:creator><pubDate>Wed, 24 Feb 2021 16:14:12 GMT</pubDate><media:content url="https://blog.tst.sh/content/images/2021/01/re.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.tst.sh/content/images/2021/01/re.png" alt="Reverse engineering Flutter apps (Part 2)"><p>This is a continuation of <a href="https://blog.tst.sh/reverse-engineering-flutter-apps-part-1/">Part 1</a>, which covered how Flutter compiles apps and what snapshots look like internally.</p><p>As you have probably guessed so far, reverse engineering is not an easy task.</p><hr><h3 id="calling-conventions">Calling conventions</h3><p></p><p>Let&apos;s first cover some basics about Dart&apos;s type system:</p><pre><code class="language-dart">void main() {
  void foo() {}
  int bar([int aaa]) {}
  Null biz({int aaa}) {}
  int baz(int aa, {int aaa}) {}
  
  print(foo is void Function());
  print(bar is void Function());
  print(biz is void Function());
  print(baz is void Function());
}</code></pre><p>Which functions do you think print true?</p><p>It turns out the Dart type system is much more flexible than you might expect: as long as a function takes the same positional arguments and has a compatible return type, it is a valid function subtype. Because of this, all but <code>baz</code> print true.</p><p>Here&apos;s another experiment:</p><pre><code class="language-dart">void main() {
  int foo({int a}) {}
  int bar({int a, int b}) {}
  
  print(foo is int Function());
  print(foo is int Function({int a}));
  print(bar is int Function({int a}));
  print(bar is int Function({int b}));
  print(bar is int Function({int b, int c}));
}
</code></pre><p>Here we check whether a function matches a function type that declares only a subset of its named arguments; all but the last print true.</p><p>For a formal description of function types, see <em>&quot;9.3 Type of a Function&quot;</em> in the <a href="https://dart.dev/guides/language/specifications/DartLangSpec-v2.2.pdf?ref=blog.tst.sh">Dart language specification</a>.</p><p>Mixing and matching parameter signatures is a nice feature but poses some problems when implementing them at a low level, for example:</p><pre><code class="language-dart">void main() {
  void Function({int a, int c}) foo;
  
  foo = ({int a, int b, int c}) {
    print(&quot;Hi $a $b $c&quot;);
  };
  
  foo(a: 1, c: 2);
}
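
// How do a: 1 and c: 2 find their way into the right parameters at
// run time? The caller passes a small side table along with the
// arguments (the &quot;argument descriptor&quot; described next). A rough Dart
// sketch of the lookup it enables, with hypothetical names; this is
// not VM code:
Object namedArg(List desc, List args, String name) {
  // (name, position) pairs start after the three length fields
  for (var i = 3; desc[i] != null; i += 2) {
    if (desc[i] == name) return args[desc[i + 1]];
  }
  return null; // the caller did not pass this name; use the default
}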
</code></pre><p>In order for this to work, <code>foo</code> needs some way of knowing that the caller provided <code>a</code> and <code>c</code> but not <code>b</code>; this piece of information is called an argument descriptor.</p><p>Internally, argument descriptors are defined in <code>vm/dart_entry.h</code>. The implementation is just an interface over a regular Array object, which the caller provides via the argument descriptor register.</p><p>For example:</p><pre><code class="language-dart">void bar({int x}) {
  print(&quot;Hi $x&quot;);
}

void foo() {
  bar(x: 42);
}
}</code></pre><p>Rather than using Dart&apos;s built-in disassembler I&apos;ll be using a custom one that provides proper annotations for calls, object pool entries, and other constants.</p><p>Disassembly of <code>foo</code>, the caller:</p><figure class="kg-card kg-code-card"><pre><code>#lint dartdec-dasm
034 | ...
038 | mov ip, #0x54        // 
03c | str ip, [sp, #-4]!   // push smi 42
040 | add r4, pp, #0x2000  //
044 | ldr r4, [r4, #0x4a3] // load the argument descriptor into r4
048 | bl 0x1b8             // call bar
04c | ...</code></pre><figcaption>foo</figcaption></figure><p>The argument descriptor for the call to <code>bar</code> is the following RawArray:</p><figure class="kg-card kg-code-card"><pre><code class="language-dart">[
  0, // type arguments length
  1, // argument count
  0, // positional arg count
  
  // named arguments (name, position)
  &quot;x&quot;, 0,
  
  null, // null terminator
]</code></pre><figcaption>r4</figcaption></figure><p>The descriptor is used in the prologue of the callee to map stack indices to their respective argument slots and verify the proper arguments were received. Here is the disassembly of the callee:</p><figure class="kg-card kg-code-card"><pre><code>#lint dartdec-dasm
// prologue, polymorphic entry

000 | stmdb sp!, {fp, lr}
004 | add fp, sp, #0
008 | sub sp, sp, #4

// optional parameter handling

00c | ldr r0, [r4, #0x13] // arr[2] (positional arg count)
010 | ldr r1, [r4, #0xf]  // arr[1] (argument count)
014 | cmp r0, #0          // check if we have positional args
018 | bgt 0x74            // jump to 08c

// check named args

01c | ldr r0, [r4, #0x17]  // arr[3] (first arg name)
020 | add ip, pp, #0x2000  // 
024 | ldr ip, [ip, #0x4a7] // string &quot;x&quot;
028 | cmp r0, ip           // check if arg present
02c | bne 0x20             // jump to 04c

030 | ldr r0, [r4, #0x1b]    // arr[4] (first arg position)
034 | sub r2, r1, r0         // r2 = arg_count - position
038 | add r0, fp, r2, lsl #1 // r0 = fp + r2 * 2
    |                        // this is really r2 * 4 because it&apos;s an smi
03c | ldr r0, [r0, #4]       // read arg
040 | mov r2, r0             // 
044 | mov r0, #2             // 
048 | b 12                   // jump to 054

04c | ldr r2, [thr, #0x68] // thr-&gt;objectNull
050 | mov r0, #0           // 

054 | str r2, [fp, #-4] // store arg in local

// done loading args

058 | cmp r1, r0 // check if we have read all args
05c | bne 0x30   // jump to 08c

// continue prologue

060 | ldr ip, [thr, #0x24] // thr-&gt;stackLimit
064 | cmp sp, ip           //
068 | blls -0x5af00        // stackOverflowStubWithoutFpuRegsStub

// rest of function

06c | ...

// incompatible args path

08c | ldr r6, [pp, #0x33] // Code* callClosureNoSuchMethod
090 | sub sp, fp, #0      // 
094 | ldmia sp!, {fp, lr} // exit frame
098 | ldr pc, [r6, #3]    // invoke stub</code></pre><figcaption>bar</figcaption></figure><p>To summarize, it loops over the array, assigning slots to any matching arguments and throwing a NoSuchMethodError if any are not part of the function type. Keep in mind that argument checking is only required for polymorphic calls; most (including the hello world example) are monomorphic.</p><p>This code is generated at a high level by <code>PrologueBuilder::BuildOptionalParameterHandling</code> in <code>vm/compiler/frontend/prologue_builder.cc</code>, meaning registers and subroutines may be laid out differently depending on the types of arguments and which optimizations the compiler decides to apply.</p><hr><h3 id="integer-arithmetic">Integer arithmetic</h3><p></p><p>The <code>num</code>, <code>int</code>, and <code>double</code> classes are special in the Dart type system: for performance reasons they cannot be extended or implemented.</p><p>Because of this restriction we never have to check the type of an int before doing arithmetic; if that weren&apos;t the case, the compiler would have to generate relatively expensive method calls instead.</p><p>All objects in Dart are pointers to <code>RawObject</code>; however, only pointers tagged with <code>kHeapObjectTag</code> are actual heap objects. Pointers without the tag are signed ints shifted to the left by one.</p><p>Because of pointer tagging you will see a lot of <code>tst r0, #1</code> and similar instructions in generated code; these are for discriminating between smis and heap objects. 
You will also see a lot of odd-numbered offsets in loads and stores, which subtract the heap tag.</p><p>Fun fact: the core <code>int</code> type used to be a bigint before Dart 2.0; you can find the writeup by the Dart team here: <a href="https://github.com/dart-lang/sdk/blob/2.15.1/docs/language/informal/int64.md?ref=blog.tst.sh">https://github.com/dart-lang/sdk/blob/2.15.1/docs/language/informal/int64.md</a></p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2020/02/smi-1.png" class="kg-image" alt="Reverse engineering Flutter apps (Part 2)" loading="lazy"></figure><p>Any integer that can fit within the word size minus one bit (31 bits on A32) can be stored as an smi; larger integers are stored as 64-bit mint (medium int) instances on the heap.</p><p>Smis can of course contain negative numbers too; the VM uses an arithmetic right shift to sign-extend the number back into place.</p><p>For example, here is a simple function that adds two ints:</p><pre><code class="language-dart">int hello(int x, int y) =&gt; x + y;</code></pre><p>To start, <code>x</code> and <code>y</code> are each unboxed into pairs of registers; Dart ints are 64-bit, so two registers are needed for each arg on A32:</p><figure class="kg-card kg-code-card"><pre><code>#lint dartdec-dasm
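// annotation (not disassembler output): a tagged value with bit 0
// clear encodes the integer value &gt;&gt; 1 (e.g. the 0x54 pushed for
// smi 42 earlier); bit 0 set marks a heap pointer, which is why the
// mint field loads below use odd offsets like #7 and #11.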
024 | ...
028 | ldr r1, [fp, #12]    // load argument x
02c | ldr ip, [thr, #0x68] // thr-&gt;objectNull
030 | cmp r1, ip           // check if x is null
034 | bleq -0x50954        // nullErrorStubWithoutFpuRegsStub
038 | ...
048 | mov r3, r1, asr #0x1f // sign-extend top half
04c | movs r4, r1, asr #1   // shift heap flag into carry
050 | bcc 12                // jump to 05c if heap flag is clear
054 | ldr r4, [r1, #7]      // load lower half from mint
058 | ldr r3, [r1, #11]     // load upper half from mint
05c | ...</code></pre><figcaption>x + y</figcaption></figure><p>After <code>x</code> and <code>y</code> are in pairs of registers, it can perform the actual 64-bit add:</p><figure class="kg-card kg-code-card"><pre><code>#lint dartdec-dasm
070 | adds r7, r4, r6 // bottom half
074 | adcs r2, r3, r1 // carry into top half</code></pre><figcaption>x + y</figcaption></figure><p>Before returning the result gets re-boxed:</p><figure class="kg-card kg-code-card"><pre><code>#lint dartdec-dasm
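// annotation (not disassembler output): the 64-bit result in r2:r7
// fits in an smi iff shifting it left by one and back is lossless and
// the upper word holds only sign bits; otherwise a 16-byte mint is
// bump-allocated from thr-&gt;top, falling back to a runtime call when
// the allocation does not fit.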
074 | ...
078 | mov r0, r7, lsl #1      // create smi from lower half
07c | cmp r7, r0, asr #1      // check if MSB of smi isn&apos;t clobbered
080 | cmpeq r2, r0, asr #0x1f // check if upper half is empty
084 | beq 0x34                // jump to 0b8 if smi is valid
  
// construct mint
  
088 | ldr r0, [thr, #0x3c] // thr-&gt;top
08c | adds r0, r0, #0x10   // add size of mint
090 | ldr ip, [thr, #0x40] // thr-&gt;end
094 | cmp ip, r0           // check if mint fits in pool
098 | bls 0x28             // jump to 0c0 (slow path)

// construct mint in pool

09c | str r0, [thr, #0x3c] // shift down pool start
0a0 | sub r0, r0, #15      // go back to original top
0a4 | mov ip, #0x2204      // misc tags 
0a8 | movt ip, #0x31       // mint object id
0ac | str ip, [r0, #-1]    // write tags

// store values in new mint

0b0 | str r7, [r0, #7]  // write lower half
0b4 | str r2, [r0, #11] // write upper half

// function epilogue

0b8 | sub sp, fp, #0
0bc | ldmia sp!, {fp, pc}
  
// slow path, invoke mint constructor

0c0 | stmdb sp!, {r2, r7} //
0c4 | bl 0x651f4          // new dart:core::Mint_at_0150898
0c8 | ldmia sp!, {r2, r7} //
0cc | b -0x1c             // jump to 0b0</code></pre><figcaption>x + y</figcaption></figure><p>Boxing looks more expensive than it actually is, since the value will usually be returned immediately as an smi and only hits the slow code paths when the result is larger than 31 bits.</p><hr><h3 id="instances">Instances</h3><p></p><p>The code below creates an instance by calling an allocation stub followed by a call to the constructor:</p><pre><code class="language-dart">makeFoo() =&gt; Foo&lt;int&gt;();</code></pre><p>Disassembled:</p><figure class="kg-card kg-code-card"><pre><code>#lint dartdec-dasm
014 | ...
018 | ldr ip, [pp, #0x93] //
01c | str ip, [sp, #-4]!  // push type args &lt;int&gt;
020 | bl -0x628           // Foo allocation stub
024 | add sp, sp, #4      // pop arg
028 | str r0, [fp, #-4]   // store object in frame
02c | str r0, [sp, #-4]!  // push object as arg
030 | bl -0x9f0           // Foo::Foo()
034 | add sp, sp, #4      // pop arg
038 | ldr r0, [fp, #-4]   // load object from frame into return reg
03c | ...</code></pre><figcaption>makeFoo</figcaption></figure><p>Each class has a corresponding allocation stub that allocates and initializes an instance (very similar to how boxing creates an object); these stubs are generated for any class that can be constructed.</p><p>Unfortunately for us, field information is removed from the snapshot, so we can&apos;t directly recover field names. You can however see the names of implicit getter and setter methods (assuming they haven&apos;t been inlined).</p><p>Offsets for fields are calculated in <code>Class::CalculateFieldOffsets</code>; the rules go as follows:</p><ol><li>Start at the end of the super class, otherwise start at <code>sizeof(RawInstance)</code></li><li>Reuse the type arguments field of the parent, else put it at the start</li><li>Lay out the remaining (non-static) fields sequentially</li></ol><p>Because type arguments are shared with the super, instantiating the following class gives us a type arguments field containing <code>&lt;String, int&gt;</code>:</p><pre><code class="language-dart">class Foo&lt;T&gt; extends Bar&lt;String&gt; {}
var x = Foo&lt;int&gt;(); // instance type arguments are &lt;String, int&gt;</code></pre><p>Whereas if the type arguments are the same for parent and child, the list will only contain <code>&lt;int&gt;</code>:</p><pre><code class="language-dart">class Foo&lt;T&gt; extends Bar&lt;T&gt; {}
var x = Foo&lt;int&gt;(); // instance type arguments are &lt;int&gt;</code></pre><p>Another fun feature of Dart is that all field access is done via setters and getters. This may sound very slow, but in practice Dart eliminates a ton of overhead with the following optimizations:</p><ol><li>Whole-program static analysis</li><li>Inlining calls on known types</li><li>Code de-duplication</li><li>Inline cache (via ICData)</li></ol><p>These optimizations apply to all methods, including getters and setters; in the following example the setter is inlined:</p><pre><code class="language-dart">class Foo {
  int x;
}

Foo bar() =&gt; Foo()..x = 42;</code></pre><p>Disassembled:</p><figure class="kg-card kg-code-card"><pre><code>#lint dartdec-dasm
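// annotation (not disassembler output): the implicit setter set:x is
// inlined to a single store; #3 is the first field offset (4 on A32)
// minus the heap tag.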
028 | ...
02c | ldr r0, [fp, #-4] // load foo
030 | mov ip, #0x54     // smi 42
034 | str ip, [r0, #3]  // store first field
038 | ...</code></pre><figcaption>bar</figcaption></figure><p>But when we call this setter through an interface:</p><pre><code class="language-dart">abstract class Foo {
  set x(int x);
}

class FooImpl extends Foo {
  int x;
}

void bar(Foo foo) {
  foo.x = 42;
}</code></pre><p>Disassembled:</p><figure class="kg-card kg-code-card"><pre><code>#lint dartdec-dasm
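// annotation (not disassembler output): no inlining is possible here
// because the receiver&apos;s concrete class is unknown, so the store
// goes through a patchable call site instead of a direct store.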
010 | ...
014 | ldr ip, [fp, #8]     // 
018 | str ip, [sp, #-4]!   // push foo
01c | mov ip, #0x54        // 
020 | str ip, [sp, #-4]!   // push smi 42
024 | ldr r0, [sp, #4]     // load foo into receiver
028 | add lr, pp, #0x2000  // 
02c | ldr lr, [lr, #0x4a3] // unlinkedCall stub
030 | add r9, pp, #0x2000  // 
034 | ldr r9, [r9, #0x4a7] // RawUnlinkedCall set:a
038 | blx lr               // invoke stub
03c | ...</code></pre><figcaption>bar</figcaption></figure><p>Here it invokes an unlinkedCall stub, a magic bit of code that handles polymorphic method invocation; it patches its own object pool entry so that further calls are quicker.</p><p>I&apos;d love to get into more detail about how this works at runtime, but all we need to know is that it invokes the method specified in the RawUnlinkedCall. If you are interested, there is a great article on the internals of DartVM that explains more: <a href="https://mrale.ph/dartvm/?ref=blog.tst.sh">https://mrale.ph/dartvm/</a></p><hr><h3 id="type-checking">Type Checking</h3><p></p><p>Type checking is a fundamental component of polymorphism; Dart provides it through the <code>is</code> and <code>as</code> operators.</p><p>Both operators do a subtype check, with the exception that <code>as</code> also allows null values. Here is the <code>is</code> operator in action:</p><pre><code class="language-dart">class FooBase {}
class Foo extends FooBase {}
class Bar extends FooBase {}

bool isFoo(FooBase x) =&gt; x is Foo;</code></pre><p>Disassembled:</p><figure class="kg-card kg-code-card"><pre><code>#lint dartdec-dasm
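// annotation (not disassembler output): with a single concrete
// implementer, &quot;is Foo&quot; reduces to one class-id equality check read
// straight out of the object&apos;s header word.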
024 | ...
028 | ldr r1, [fp, #8]       // load x
02c | ldrh r2, r3, [r1, #1]  // read classid
030 | mov r2, r2, lsl #1     // make smi, suboptimal
034 | cmp r2, #0x12c         // Foo classid (as smi)
038 | ldreq r0, [thr, #0x6c] // thr-&gt;boolTrue
03c | ldrne r0, [thr, #0x70] // thr-&gt;boolFalse
040 | ...</code></pre><figcaption>x is Foo</figcaption></figure><p>Since whole-program analysis determined that <code>Foo</code> only has one implementer, it can simply check the class ID for equality. But what if it has a child class?</p><pre><code class="language-dart">class Baz extends Foo {}</code></pre><p>We now get:</p><figure class="kg-card kg-code-card"><pre><code>#lint dartdec-dasm
028 | ...
02c | ldr r1, [fp, #8]      // load x
030 | ldrh r2, r3, [r1, #1] // read classid
034 | mov r2, r2, lsl #1    // make smi
038 | mov r1, #0x12c        // Foo smi classid
03c | mov r4, r1, asr #1    // unbox smi (redundant)
040 | mov r3, r4, asr #0x1f // 64 bit sign extend (redundant)
044 | mov r6, r2, asr #1    // unbox smi (redundant)
048 | mov r1, r6, asr #0x1f // 64 bit sign extend (redundant)
04c | cmp r1, r3            // always equal since top half is clear
050 | bgt 0x10              // jump to 0x60 (never)
054 | blt 0x40              // jump to 0x94 (never)
058 | cmp r6, r4            // compare x and Foo
05c | blo 0x38              // jump to 0x94 if x &lt; Foo
060 | mov r2, #0x12e        // smi 0x97
064 | mov r4, r2, asr #1    // unbox smi (redundant)
068 | mov r3, r4, asr #0x1f // 64 bit sign extend (redundant)
06c | cmp r1, r3            // always equal since top half is clear
070 | blt 0x18              // jump to 0x88 (never)
074 | bgt 12                // jump to 0x80 (never)
078 | cmp r6, r4            // compare x and 0x97
07c | bls 12                // jump to 0x88
080 | ldr r2, [thr, #0x70]  // thr-&gt;boolFalse
084 | b 8                   // jump to 0x8c
088 | ldr r2, [thr, #0x6c]  // thr-&gt;boolTrue
08c | mov r0, r2            // 
090 | b 8                   // jump to 0x98
094 | ldr r0, [thr, #0x70]  // thr-&gt;boolFalse
098 | ...</code></pre><figcaption>x is Foo</figcaption></figure><p>Gah! This code is awful, so here is a basic translation:</p><pre><code class="language-cpp">bool isFoo(FooBase* x) {
  if (x-&gt;classId &lt; FooClassId) return false;
  return x-&gt;classId &lt;= BazClassId;
}</code></pre><p>All it is doing here is checking whether the class id falls within a set of ranges; in this case there is only one range to check.</p><p>This is definitely a place where DartVM could improve on ARM: it&apos;s doing 64-bit smi range checks on 16-bit class ids instead of just comparing them directly.</p><p>The range checks also do not take into consideration the super type it&apos;s comparing from, which can cause a range to be split by a type that does not implement the super, perhaps as a result of unsoundness.</p><hr><h3 id="control-flow">Control flow</h3><p></p><p>Dart uses a relatively advanced flow graph, represented as an SSA (Static Single Assignment) intermediate representation similar to those of modern compilers like GCC and Clang. It can perform many optimizations that change the control flow structure of the program, making reasoning about its generated code a bit harder.</p><p>Here is a simple if statement:</p><pre><code class="language-dart">void hello(bool condition) {
  if (condition) {
    print(&quot;foo&quot;);
  } else {
    print(&quot;bar&quot;);
  }
}</code></pre><p>Disassembled:</p><figure class="kg-card kg-code-card"><pre><code>#lint dartdec-dasm
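// annotation (not disassembler output): true and false are canonical
// objects, so the branch is a pointer comparison against
// thr-&gt;boolTrue after the null check.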
010 | ...
014 | ldr r0, [fp, #8]      // load condition
018 | ldr ip, [thr, #0x68]  // thr-&gt;objectNull
01c | cmp r0, ip            // 
020 | bne 0x18              // jump to 038 if condition != null
024 | str r0, [sp, #-4]!    // push condition
028 | ldr r9, [thr, #0x178] // thr-&gt;nonBoolTypeErrorEntryPoint
02c | mov r4, #1            // entry argument count
030 | ldr ip, [thr, #0xd0]  // thr-&gt;callToRuntimeEntryPoint
034 | blx ip                // invoke stub
038 | ldr r0, [fp, #8]      // load condition
03c | ldr ip, [thr, #0x6c]  // thr-&gt;boolTrue
040 | cmp r0, ip            // 
044 | bne 0x1c              // jump to 060 if condition != true
048 | add ip, pp, #0x2000   // 
04c | ldr ip, [ip, #0x4a3]  // 
050 | str ip, [sp, #-4]!    // push string &quot;foo&quot;
054 | bl -0x33b1c           // call print
058 | add sp, sp, #4        // pop arg
05c | b 0x18                // jump to 074
060 | add ip, pp, #0x2000   // 
064 | ldr ip, [ip, #0x4a7]  // 
068 | str ip, [sp, #-4]!    // push string &quot;bar&quot;
06c | bl -0x33b34           // call print
070 | add sp, sp, #4        // pop arg
074 | ...</code></pre><figcaption>hello</figcaption></figure><p>That null check is an example of a &quot;runtime entry&quot; dynamic call; this is the bridge from Dart code to subroutines defined in <code>vm/runtime_entry.cc</code>.</p><p>In this case it is a specialized entry that throws a <code>Failed assertion: boolean expression must not be null</code>, as you would expect if the condition of an if statement is null.</p><p>Whole-program optimization (and sound non-nullability in the future) allows this null check to be elided; for example, if <code>hello</code> never gets called with a possibly null value then it won&apos;t do the check at all:</p><pre><code class="language-dart">void main() {
  hello(true);
  hello(false);
}

void hello(bool condition) {
  if (condition) {
    print(&quot;foo&quot;);
  } else {
    print(&quot;bar&quot;);
  }
}</code></pre><p>Disassembled:</p><figure class="kg-card kg-code-card"><pre><code>#lint dartdec-dasm
010 | ...
014 | ldr r0, [fp, #8]     // load condition
018 | ldr ip, [thr, #0x6c] // thr-&gt;boolTrue
01c | cmp r0, ip           //
020 | bne 0x1c             // jump to 03c if condition != true
024 | add ip, pp, #0x2000  // 
028 | ldr ip, [ip, #0x4a3] // 
02c | str ip, [sp, #-4]!   // push string &quot;foo&quot;
030 | bl -0x33a90          // call print
034 | add sp, sp, #4       // pop arg
038 | b 0x18               // jump to 050
03c | add ip, pp, #0x2000  // 
040 | ldr ip, [ip, #0x4a7] // 
044 | str ip, [sp, #-4]!   // push string &quot;bar&quot;
048 | bl -0x33aa8          // call print
04c | add sp, sp, #4       // pop arg
050 | ldr r0, [thr, #0x68] // thr-&gt;objectNull
054 | ...</code></pre><figcaption>hello</figcaption></figure><hr><h3 id="closures">Closures</h3><p></p><p>Closures are the implementation of first-class functions under the <code>Function</code> type; you can acquire one by creating an anonymous function or by extracting a method.</p><p>A simple function <code>hi</code> that returns an anonymous function:</p><pre><code class="language-dart">void Function() hi() {
  return () { print(&quot;Hi&quot;); };
}</code></pre><p>Disassembled:</p><figure class="kg-card kg-code-card"><pre><code>#lint dartdec-dasm
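// annotation (not disassembler output): a closure is a small heap
// object pairing a RawFunction with an optional context; nothing is
// captured here, so only the function field is filled in.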
024 | ...
028 | bl 0x3a6e0           // new dart:core::Closure_at_0150898
02c | add ip, pp, #0x2000  //
030 | ldr ip, [ip, #0x2cf] // RawFunction instance
034 | str ip, [r0, #15]    // RawClosure-&gt;function
038 | ...</code></pre><figcaption>hi</figcaption></figure><p>And to call the closure:</p><figure class="kg-card kg-code-card"><pre><code>#lint dartdec-dasm
// Call hi

010 | ...
014 | bl -0x6c           // call hi
018 | str r0, [sp, #-4]! // push arg

// Null check

01c | ldr ip, [thr, #0x68] // thr-&gt;objectNull
020 | cmp r0, ip
024 | bleq -0x3fdc4 // nullErrorStubWithoutFpuRegsStub

// Call closure

028 | ldr r1, [r0, #15]   // RawClosure-&gt;function
02c | mov r0, r1          //
030 | ldr r4, [pp, #0xfb] // arg desc [0, 1, 1, null]
034 | ldr r6, [r0, #0x2b] // RawFunction-&gt;code
038 | ldr r2, [r0, #7]    // RawFunction-&gt;uncheckedEntryPoint
03c | mov r9, #0          // null ICData
040 | blx r2              // invoke entry point
044 | add sp, sp, #4      // pop arg
048 | ...</code></pre><figcaption>hi()()</figcaption></figure><p>Pretty simple, but what if the lambda depends on a local variable from the parent function?</p><pre><code class="language-dart">int Function() hi() {
  int i = 123;
  return () =&gt; ++i;
}</code></pre><p>Disassembled:</p><figure class="kg-card kg-code-card"><pre><code>#lint dartdec-dasm
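// annotation (not disassembler output): i lives in a heap-allocated
// RawContext instead of the stack frame, so it outlives hi and is
// shared by every call to the returned closure.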
014 | ...
018 | mov r1, #1          // number of variables
01c | ldr r6, [pp, #0x1f] // RawCode allocateContext
020 | ldr lr, [r6, #3]    // RawCode-&gt;entryPoint
024 | blx lr              // invoke stub
028 | str r0, [fp, #-4]   //
02c | ...

040 | ldr r0, [fp, #-4]    // load context
044 | mov ip, #0xf6        // smi 123
048 | str ip, [r0, #11]    // store first variable

04c | bl 0x3a84c           // new dart:core::Closure_at_0150898
050 | add ip, pp, #0x2000  //
054 | ldr ip, [ip, #0x2cf] // RawFunction instance
058 | str ip, [r0, #15]    // RawClosure-&gt;function
05c | ldr r1, [fp, #-4]    // load context
060 | str r1, [r0, #0x13]  // RawClosure-&gt;context
064 | ...</code></pre><figcaption>hi</figcaption></figure><p>Instead of storing the variable <code>i</code> in the stack frame like a regular local variable, the function will store it in a <code>RawContext</code> and pass that context to the closure.</p><p>When called, the closure can access that variable from the closure argument:</p><figure class="kg-card kg-code-card"><pre><code>#lint dartdec-dasm
028 | ...
02c | ldr r1, [fp, #8]    // load first arg
030 | ldr r2, [r1, #0x13] // RawClosure-&gt;context
034 | ldr r0, [r2, #11]   // load first variable
038 | ...</code></pre><figcaption>() =&gt; ++i</figcaption></figure><p>Another way to get a closure is method extraction:</p><pre><code class="language-dart">class Potato {
  int _foo = 0;
  int foo() =&gt; _foo++;
}

int Function() extractFoo() =&gt; Potato().foo;</code></pre><p>When you call <code>get:foo</code> on Potato, Dart will generate that getter method as follows:</p><figure class="kg-card kg-code-card"><pre><code>#lint dartdec-dasm
010 | ...
014 | mov r4, #0           // entry args
018 | add ip, pp, #0x2000  // 
01c | ldr r1, [ip, #0x303] // RawFunction Potato.foo
020 | add ip, pp, #0x2000  //
024 | ldr r6, [ip, #0x2ff] // buildMethodExtractorCode
028 | ldr pc, [r6, #11]    // direct jump</code></pre><figcaption>get:foo</figcaption></figure><p><code>get:foo</code> invokes buildMethodExtractor, which eventually returns a RawClosure that stores the receiver (<code>this</code>) in its context; the receiver is loaded back into <code>r0</code> when the closure is called, just like a regular instance call.</p><hr><h3 id="where-the-fun-starts">Where the fun starts</h3><p></p><p>Now that we have a good starting point for reverse engineering real-world applications, the first big Flutter app that comes to mind is <a href="https://twitter.com/jtmcdole/status/1192853350179491840?ref=blog.tst.sh">Stadia</a>.</p><p>So let&apos;s take a crack at it. The first step is to grab an APK off of apkmirror, in this case version 2.2.289534823:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2020/03/image.png" class="kg-image" alt="Reverse engineering Flutter apps (Part 2)" loading="lazy"></figure><p>(I don&apos;t recommend downloading apps from third-party websites; it&apos;s just the easiest way to grab an APK file without a compatible Android device)</p><p>The important part here is that the version information contains <code>arm64-v8a + armeabi-v7a</code>, which are A64 and A32 respectively.</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2020/03/image-1.png" class="kg-image" alt="Reverse engineering Flutter apps (Part 2)" loading="lazy"></figure><p>The interesting bits are in the lib folder, like <code>libflutter.so</code>, which is the Flutter engine, and <code>libproduction_android_library.so</code>, which is just a renamed <code>libapp.so</code>.</p><p>Before being able to do anything with the snapshot we must know the <em>exact</em> version of Dart that was used to build the app; a quick search of <code>libflutter.so</code> in a hex editor gives us a version string:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2020/03/image-6.png" class="kg-image" 
alt="Reverse engineering Flutter apps (Part 2)" loading="lazy"></figure><p>That <code>c547f5d933</code> is a commit hash in the Dart SDK, which you can view on GitHub: <a href="https://github.com/dart-lang/sdk/tree/c547f5d933?ref=blog.tst.sh">https://github.com/dart-lang/sdk/tree/c547f5d933</a>. After some digging, this corresponds to Flutter version <code>v1.13.6</code>, or commit <code>659dc8129d</code>.</p><p>Knowing the exact version of Dart is important because it gives you a reference for how objects are laid out and provides a testbed.</p><p>Once decoded, the next step is to search for the root library; in this version of Dart it&apos;s located at index 66 of the root objects list:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2020/10/image.png" class="kg-image" alt="Reverse engineering Flutter apps (Part 2)" loading="lazy"></figure><p>Neat, we can see the package name of this app is <code>chrome.cloudcast.client.mobile.app</code>, which you might notice is not actually valid for a pub package. What&apos;s going on here?</p><p>The reason for the weird package name is that Google doesn&apos;t actually use pub for internal projects and instead uses its internal Google3 repository. You can occasionally see issues on the Flutter GitHub labelled <code>customer: ... 
(g3)</code>; this is what it refers to.</p><p>By extracting URIs from every library defined in the app, we can view the complete file structure of every package it contains.</p><p>As you might expect from a large project, it depends on quite a few packages: <a href="https://gist.github.com/pingbird/7503be142607df38582b454f3b1e8153?ref=blog.tst.sh">https://gist.github.com/pingbird/7503be142607df38582b454f3b1e8153</a></p><p>We can gather that it uses some of the following technologies:</p><ul><li>protobuf</li><li>markdown boilerplate (from AdWords?)</li><li>firebase</li><li>rx</li><li>bloc</li><li>provider</li></ul><p>Most of these appear to be internal implementations.</p><p>Going deeper, here is the root of the lib folder: <a href="https://gist.github.com/pingbird/f2029cd88d5343c0991f706403012f62?ref=blog.tst.sh">https://gist.github.com/pingbird/f2029cd88d5343c0991f706403012f62</a></p><p>I picked a random widget to look at, <code>SocialNotificationCard</code> from <code>profile/view/social_notification_card.dart</code>.</p><p>The library containing this widget is structured as follows:</p><figure class="kg-card kg-code-card"><pre><code class="language-dart">enum SocialNotificationIconType {
  avatarUrl,
  apiImage,
  defaultIcon,
  partyIcon,
}

class SocialNotificationCard extends StatelessWidget {
  SocialNotificationCard({
    dynamic socialNotificationIconType,
    dynamic title,
    dynamic body,
    dynamic timestamp,
    dynamic avatarUrl,
    dynamic apiImage,
  }) { }
  
  NessieString get title { }

  Widget _buildNotificationMessage(dynamic arg1) { }
  Widget _buildNotificationTimestamp(dynamic arg1) { }
  Widget _buildGeneralNotificationIcon() { }
  Widget _buildPartyIcon(dynamic arg1) { }
  Widget _buildAvatarUrlIcon(dynamic arg1) { }
  Widget _buildApiImage(dynamic arg1) { }
  Widget _buildNotificationIconImage(dynamic arg1) { }
  Widget _buildNotificationIcon(dynamic arg1) { }
  Widget build(dynamic arg1) { }
}</code></pre><figcaption>profile/view/social_notification_card.dart</figcaption></figure><p>The type information on these parameters is missing, but since they are build methods we can assume they all take a <code>BuildContext</code>.</p><p>The full disassembly of the <code>_buildPartyIcon</code> method goes as follows:</p><figure class="kg-card kg-code-card"><pre><code>#lint dartdec-dasm
// prologue

000 | tst r0, #1
004 | ldrhne ip, [r0, #1]
008 | moveq ip, #0x30
00c | cmp r9, ip, lsl #1
010 | ldrne pc, [thr, #0x108]
014 | stmdb sp!, {fp, lr}
018 | add fp, sp, #0

// function body

01c | bl 0x27ef84          // Image allocation stub
020 | add ip, pp, #0x54000 //
024 | ldr ip, [ip, #0xdc3] // const AssetImage&lt;AssetBundleImageKey&gt;{&quot;assets/social/party_invite.png&quot;, null, null}
028 | str ip, [r0, #7]     // assign field 1 (image)
02c | ldr ip, [thr, #0x70] // thr-&gt;false
030 | str ip, [r0, #0x43]  // assign field 16 (excludeFromSemantics)
034 | add ip, pp, #0x44000 // 
038 | ldr ip, [ip, #0x6b]  // int 48
03c | str ip, [r0, #0x13]  // assign field 4 (width)
040 | add ip, pp, #0x44000 //
044 | ldr ip, [ip, #0x6b]  // int 48
048 | str ip, [r0, #0x17]  // assign field 5 (height)
04c | add ip, pp, #0x46000 //
050 | ldr ip, [ip, #0xe2b] // const BoxFit.cover
054 | str ip, [r0, #0x27]  // assign field 9 (fit)
058 | add ip, pp, #0xd000  //
05c | ldr ip, [ip, #0xacf] // const Alignment{0, 0}
060 | str ip, [r0, #0x2b]  // assign field 10 (alignment)
064 | add ip, pp, #0x38000 //
068 | ldr ip, [ip, #0x653] // const ImageRepeat.noRepeat
06c | str ip, [r0, #0x2f]  // assign field 11 (repeat)
070 | ldr ip, [thr, #0x70] // thr-&gt;false
074 | str ip, [r0, #0x37]  // assign field 13 (matchTextDirection)
078 | ldr ip, [thr, #0x70] // thr-&gt;false
07c | str ip, [r0, #0x3b]  // assign field 14 (gaplessPlayback)
080 | add ip, pp, #0x38000 //
084 | ldr ip, [ip, #0x657] // const FilterQuality.low
088 | str ip, [r0, #0x1f]  // assign field 7 (filterQuality)
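// The store offsets above look odd (#7, #0x13, #0x43, ...) because Dart heap
// pointers carry a tag in their low bit, so a field at byte offset N is
// addressed at N - 1. A standalone sketch of that arithmetic (kHeapObjectTag
// is the Dart SDK's name for the tag; the helper function is mine):

```cpp
#include <cstdint>

// Dart heap-object pointers have their low bit set as a tag, so an
// untagged field access subtracts 1 from the real byte offset.
constexpr uintptr_t kHeapObjectTag = 1;

// Byte displacement used in a load/store for a field at `offset`.
constexpr uintptr_t field_displacement(uintptr_t offset) {
  return offset - kHeapObjectTag;
}
```

// On this 32-bit target, field n of a plain object lives at byte offset
// (n + 1) * 4, so field 1 is stored via [r0, #7] and field 16 via [r0, #0x43],
// matching the annotations in this listing.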

// epilogue

08c | sub sp, fp, #0
090 | ldmia sp!, {fp, pc}</code></pre><figcaption>SocialNotificationCard._buildPartyIcon</figcaption></figure><p>This one is quite easy to turn back into code by hand since it constructs a single <code>Image</code> widget:</p><pre><code class="language-dart">Widget _buildPartyIcon(BuildContext context) {
  return Image.asset(
    // The `name` parameter is converted into the const AssetImage we
    // saw above at compile-time, by the Image.asset constructor.
    &quot;assets/social/party_invite.png&quot;,
    fit: BoxFit.cover,
    width: 48,
    height: 48,
    // All of the other fields were assigned to their default value
  );
}</code></pre><p>Note that object construction generally happens in 3 parts:</p><ol><li>Invoke the allocation stub, passing type arguments if needed</li><li>Evaluate parameter expressions and assign them to fields in order</li><li>Call the constructor body, if any</li></ol><p>The initializer list and default parameters seem to be unconditionally inlined into the caller, leading to a bit more noise.</p><p>Finally, let&apos;s disassemble the actual <code>build</code> method of <code>SocialNotificationCard</code>:</p><figure class="kg-card kg-code-card"><pre><code>#lint dartdec-dasm
// prologue

000 | tst r0, #1              //
004 | ldrhne ip, [r0, #1]     //
008 | moveq ip, #0x30         //
00c | cmp r9, ip, lsl #1      //
010 | ldrne pc, [thr, #0x108] // thr-&gt;monomorphicMissEntry
014 | stmdb sp!, {fp, lr}     //
018 | add fp, sp, #0          //
01c | sub sp, sp, #0x14       // allocate space for local variables
020 | ldr ip, [thr, #0x24]    // thr-&gt;stackLimit
024 | cmp sp, ip              //
028 | blls -0x52ee40          // stack overflow

// construct Padding

02c | bl 0x3e3ce4          // Padding allocation stub
030 | str r0, [fp, #-4]    // assign Padding to local 0

// construct EdgeInsets

034 | bl 0x3e11f4          // EdgeInsets allocation stub
038 | str r0, [fp, #-8]    // assign EdgeInsets to local 1
03c | add ip, pp, #0x5000  //
040 | ldr ip, [ip, #0x9db] // int 0
044 | str ip, [r0, #3]     // assign field 0 (left)
048 | add ip, pp, #0xf000  //
04c | ldr ip, [ip, #0xd7]  // int 16
050 | str ip, [r0, #7]     // assign field 1 (top)
054 | add ip, pp, #0x5000  //
058 | ldr ip, [ip, #0x9db] // int 0
05c | str ip, [r0, #11]    // assign field 2 (right)
060 | add ip, pp, #0xf000  //
064 | ldr ip, [ip, #0xd7]  // int 16
068 | str ip, [r0, #15]    // assign field 3 (bottom)

// construct Row

06c | bl 0x217984          // Row allocation stub
070 | str r0, [fp, #-12]   // assign Row to local 2

// construct List&lt;Widget&gt;

074 | add ip, pp, #0x31000 //
078 | ldr ip, [ip, #0x677] // type args &lt;Widget&gt;
07c | str ip, [sp, #-4]!   // push to stack, this is used at 1cc
080 | add r1, pp, #0x31000 //
084 | ldr r1, [r1, #0x677] // type args &lt;Widget&gt;
088 | mov r2, #6           // smi 3 (new List length)
08c | ldr r6, [pp, #7]     // code allocateArray
090 | ldr lr, [r6, #3]     //
094 | blx lr               // call stub, putting new List into r0
098 | str r0, [fp, #-0x10] // assign List&lt;Widget&gt; to local 3
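// The "mov r2, #6 // smi 3" above passes a raw 6 for a list of length 3
// because Dart small integers (Smis) keep the low tag bit clear and store the
// value in the upper bits. A minimal standalone sketch of the encoding (the
// function names here are mine, not the VM's):

```cpp
#include <cstdint>

// A Smi's raw representation is simply value * 2 (low tag bit clear);
// heap-object pointers have the low bit set instead.
constexpr int32_t smi_encode(int32_t value) { return value * 2; }
constexpr int32_t smi_decode(int32_t raw) { return raw / 2; }
constexpr bool is_smi(int32_t raw) { return (raw & 1) == 0; }
```

// This is why raw immediates in these listings are twice the logical values:
// 6 encodes the list length 3, and the "mov ip, #2 // smi 1" further down
// encodes 1.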

// call _buildNotificationIcon

09c | ldr ip, [fp, #12]    // argument 0 (self)
0a0 | str ip, [sp, #-4]!   // push
0a4 | ldr ip, [fp, #8]     // argument 1 (context)
0a8 | str ip, [sp, #-4]!   // push
0ac | bl -0x180            // call _buildNotificationIcon + 0x14
0b0 | add sp, sp, #8       // pop arguments

// add notification icon to list

0b4 | ldr r1, [fp, #-0x10]   // load local 3 (List&lt;Widget&gt;)
0b8 | add r9, r1, #11        //
0bc | str r0, [r9]           // list[0] = r0

// garbage collection stuff

0c0 | tst r0, #1             //
0c4 | beq 0x1c               // skip if smi
0c8 | ldrb ip, [r1, #-1]     //
0cc | ldrb lr, [r0, #-1]     //
0d0 | and ip, lr, ip, lsr #2 //
0d4 | ldr lr, [thr, #0x30]   // thr-&gt;writeBarrierMask
0d8 | tst ip, lr             //
0dc | blne -0x52f1ac         // call arrayWriteBarrier stub

// construct Expanded

0e0 | add ip, pp, #0x31000 //
0e4 | ldr ip, [ip, #0xaab] // type args &lt;Flex&gt; (it extends ParentDataWidget&lt;Flex&gt;)
0e8 | str ip, [sp, #-4]!   // push type arguments
0ec | bl 0x39ae5c          // Expanded allocation stub
0f0 | add sp, sp, #4       // pop type arguments
0f4 | str r0, [fp, #-0x14] // assign Expanded to local 4

// call _buildNotificationMessage

0f8 | ldr ip, [fp, #12]  // argument 0 (self)
0fc | str ip, [sp, #-4]! // push
100 | ldr ip, [fp, #8]   // argument 1 (context)
104 | str ip, [sp, #-4]! // push
108 | bl -0x1ae634       // _buildNotificationMessage + 0x14
10c | add sp, sp, #8     // pop arguments

// fill in Expanded

110 | ldr r1, [fp, #-0x14] // load local 4 (Expanded)
114 | mov ip, #2           // smi 1
118 | str ip, [r1, #15]    // assign field 3 (flex)
11c | add ip, pp, #0x31000 //
120 | ldr ip, [ip, #0xab7] // const FlexFit.tight
124 | str ip, [r1, #0x13]  // assign field 4 (fit)
128 | str r0, [r1, #7]     // assign field 1 (child)

// garbage collection stuff

12c | ldrb ip, [r1, #-1]     //
130 | ldrb lr, [r0, #-1]     //
134 | and ip, lr, ip, lsr #2 //
138 | ldr lr, [thr, #0x30]   // thr-&gt;writeBarrierMask
13c | tst ip, lr             //
140 | blne -0x52f020         // call WriteBarrierWrappers stub (r1 object)

// finalize construction of Expanded, this is simply a call to the
// Diagnosticable constructor since none of its other ancestors
// have constructor bodies

144 | str r1, [sp, #-4]!   // push (Expanded)
148 | bl -0x529018         // call Diagnosticable ctor
14c | add sp, sp, #4       // pop

// add Expanded to list

150 | ldr r1, [fp, #-0x10] // load local 3 (List&lt;Widget&gt;)
154 | ldr r0, [fp, #-0x14] // load local 4 (Expanded)
158 | add r9, r1, #15      //
15c | str r0, [r9]         // list[1] = r0

// garbage collection stuff

160 | tst r0, #1             //
164 | beq 0x1c               // skip if smi
168 | ldrb ip, [r1, #-1]     //
16c | ldrb lr, [r0, #-1]     //
170 | and ip, lr, ip, lsr #2 //
174 | ldr lr, [thr, #0x30]   // thr-&gt;writeBarrierMask
178 | tst ip, lr             //
17c | blne -0x52f24c         // call arrayWriteBarrier stub

// call _buildNotificationTimestamp

180 | ldr ip, [fp, #12]  // argument 0 (self)
184 | str ip, [sp, #-4]! // push
188 | ldr ip, [fp, #8]   // argument 1 (context)
18c | str ip, [sp, #-4]! // push
190 | bl -0x894          // call _buildNotificationTimestamp + 0x14
194 | add sp, sp, #8     // pop arguments

// add result to list

198 | ldr r1, [fp, #-0x10] // load local 3 (List&lt;Widget&gt;)
19c | add r9, r1, #0x13    //
1a0 | str r0, [r9]         // list[2] = r0

// garbage collection stuff

1a4 | tst r0, #1             //
1a8 | beq 0x1c               // skip if smi
1ac | ldrb ip, [r1, #-1]     //
1b0 | ldrb lr, [r0, #-1]     //
1b4 | and ip, lr, ip, lsr #2 //
1b8 | ldr lr, [thr, #0x30]   // thr-&gt;writeBarrierMask
1bc | tst ip, lr             //
1c0 | blne -0x52f290         // call arrayWriteBarrier stub

// finalize construction of list (note push at 07c)

1c4 | ldr ip, [fp, #-0x10] // load local 3 (List&lt;Widget&gt;)
1c8 | str ip, [sp, #-4]!   // push
1cc | bl -0x52948c         // call List._fromLiteral
1d0 | add sp, sp, #8       // pop arguments

// fill in Row

1d4 | ldr r1, [fp, #-12]   // load local 2 (Row)
1d8 | add ip, pp, #0x3a000 //
1dc | ldr ip, [ip, #0x1a3] // const Axis.horizontal
1e0 | str ip, [r1, #11]    // assign field 2 (direction)
1e4 | add ip, pp, #0x31000 //
1e8 | ldr ip, [ip, #0xa8b] // const MainAxisAlignment.start
1ec | str ip, [r1, #15]    // assign field 3 (mainAxisAlignment)
1f0 | add ip, pp, #0x31000 //
1f4 | ldr ip, [ip, #0xa93] // const MainAxisSize.max
1f8 | str ip, [r1, #0x13]  // assign field 4 (mainAxisSize)
1fc | add ip, pp, #0x31000 //
200 | ldr ip, [ip, #0xa77] // const CrossAxisAlignment.start
204 | str ip, [r1, #0x17]  // assign field 5 (crossAxisAlignment)
208 | add ip, pp, #0x31000 //
20c | ldr ip, [ip, #0xa9b] // VerticalDirection.down
210 | str ip, [r1, #0x1f]  // assign field 7 (verticalDirection)
214 | str r0, [r1, #7]     // assign List&lt;Widget&gt; to field 1 (children)

// garbage collection stuff

218 | tst r0, #1             //
21c | beq 0x1c               // skip if smi
220 | ldrb ip, [r1, #-1]     //
224 | ldrb lr, [r0, #-1]     //
228 | and ip, lr, ip, lsr #2 //
22c | ldr lr, [thr, #0x30]   // thr-&gt;writeBarrierMask
230 | tst ip, lr             //
234 | blne -0x52f114         // call WriteBarrierWrappers stub (r1 object)

// finalize construction of Row

238 | str r1, [sp, #-4]! // push Row
23c | bl -0x52910c       // call Diagnosticable ctor
240 | add sp, sp, #4     // pop

// fill in Padding

244 | ldr r0, [fp, #-8]  // load local 1 (EdgeInsets)
248 | ldr r1, [fp, #-4]  // load local 0 (Padding)
24c | str r0, [r1, #11]  // assign field 2 (padding)

// garbage collection stuff

250 | ldrb ip, [r1, #-1]     //
254 | ldrb lr, [r0, #-1]     //
258 | and ip, lr, ip, lsr #2 //
25c | ldr lr, [thr, #0x30]   // thr-&gt;writeBarrierMask
260 | tst ip, lr             //
264 | blne -0x52f144         // call WriteBarrierWrappers stub (r1 object)

// fill in Padding

268 | ldr r0, [fp, #-12] // load local 2 (Row)
26c | str r0, [r1, #7]   // assign field 1 (child)

// garbage collection stuff

270 | ldrb ip, [r1, #-1]     //
274 | ldrb lr, [r0, #-1]     //
278 | and ip, lr, ip, lsr #2 //
27c | ldr lr, [thr, #0x30]   // thr-&gt;writeBarrierMask
280 | tst ip, lr             //
284 | blne -0x52f164         // call WriteBarrierWrappers stub (r1 object)

// epilogue

288 | mov r0, r1          // return Padding
28c | sub sp, fp, #0      //
290 | ldmia sp!, {fp, pc} //</code></pre><figcaption>SocialNotificationCard.build</figcaption></figure><p>There was a bit more GC-related code this time; if you are interested, these write barriers are required by the <a href="https://v8.dev/blog/concurrent-marking?ref=blog.tst.sh">tri-color invariant</a>. Hitting a write barrier is actually pretty rare, so it has minimal impact on performance while allowing parallel garbage collection.</p><p>The equivalent Dart code:</p><pre><code class="language-dart">Widget build(BuildContext context) {
  return Padding(
    padding: EdgeInsets.symmetric(vertical: 16.0),
    child: Row(
      crossAxisAlignment: CrossAxisAlignment.start,
      children: &lt;Widget&gt;[
        _buildNotificationIcon(context),
        Expanded(
          child: _buildNotificationMessage(context),
        ),
        _buildNotificationTimestamp(context),
      ],
    ),
  );
}</code></pre><p>This one is a little more tedious to reverse due to the amount of code, but still relatively easy given tools to identify object pool entries and call targets.</p><hr><h1 id="conclusion">Conclusion</h1><p>This was a super fun project and I thoroughly enjoyed picking apart assembly code. I hope this series inspires others to also learn more about compilers and the internals of Dart.</p><h3 id="can-someone-steal-my-app">Can someone steal my app?</h3><p>Technically, this has always been possible, given enough time and resources.</p><p>In practice this is not something you should worry about (yet); we are far off from having a full decompilation suite that allows someone to steal an entire app.</p><h3 id="are-my-tokens-and-api-keys-safe">Are my tokens and API keys safe?</h3><p>Nope!</p><p>There will never be a way to fully hide secrets in any client-side application. Note that keys like the <a href="https://pub.dev/packages/google_maps_flutter?ref=blog.tst.sh">google_maps_flutter</a> API key are not actually private.</p><p>If you are currently using hard-coded credentials or tokens for third-party APIs in your app, you should switch to a real backend or <a href="https://firebase.google.com/docs/functions?ref=blog.tst.sh">cloud function</a> ASAP.</p><h3 id="will-obfuscation-help">Will obfuscation help?</h3><p>Yes and no.</p><p>Obfuscation will randomize identifier names for things like classes and methods, but it won&apos;t prevent us from viewing class structure, library structure, strings, assembly code, etc.</p><p>A competent reverse engineer can still look for common patterns like HTTP API layers, state management, and widgets. It is also possible to partially symbolize code that uses publicly available packages, e.g.
you can build signatures for functions in <code>package:flutter</code> and correlate them to ones in an obfuscated snapshot.</p><p>I generally don&apos;t recommend obfuscating Flutter apps because it makes reading error messages harder without doing much for security; you can read more about it <a href="https://flutter.dev/docs/deployment/obfuscate?ref=blog.tst.sh">here</a>.</p>]]></content:encoded></item><item><title><![CDATA[Building a kernel: CS60 pset3]]></title><description><![CDATA[The goal is to implement functioning kernel virtual memory management, and use it to implement syscalls such as fork.
What we start off with is a basic kernel with example programs loaded in, along with a display that visualizes page tables.]]></description><link>https://blog.tst.sh/building-a-kernel/</link><guid isPermaLink="false">5fb44782ed39e923f34cc4ab</guid><category><![CDATA[Low Level]]></category><category><![CDATA[C/C++]]></category><dc:creator><![CDATA[ping]]></dc:creator><pubDate>Sat, 02 Jan 2021 18:13:07 GMT</pubDate><media:content url="https://blog.tst.sh/content/images/2021/01/pset3-banner.gif" medium="image"/><content:encoded><![CDATA[<hr><h3 id="background">Background</h3><img src="https://blog.tst.sh/content/images/2021/01/pset3-banner.gif" alt="Building a kernel: CS60 pset3"><p>This assignment can be found <a href="https://cs61.seas.harvard.edu/site/2020/WeensyOS/?ref=blog.tst.sh">here</a>, and the starting code can be found in the <code>pset3</code> directory <a href="https://github.com/cs61/cs61-f20-psets?ref=blog.tst.sh">here</a>.</p><p>The goal is to implement functioning kernel virtual memory management, and use it to implement syscalls such as <code>fork</code>.</p><p>What we start off with is a basic kernel with example programs loaded in, along with a display that visualizes page tables:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2020/11/09jiq.gif" class="kg-image" alt="Building a kernel: CS60 pset3" loading="lazy" width="770" height="529" srcset="https://blog.tst.sh/content/images/size/w600/2020/11/09jiq.gif 600w, https://blog.tst.sh/content/images/2020/11/09jiq.gif 770w" sizes="(min-width: 720px) 720px"></figure><p>By default, WeensyOS spawns 4 programs which map pages sequentially until their memory is exhausted.</p><p>You may notice physical memory looks nearly identical to the virtual memory of each process.
This is because there is no isolation between processes or the kernel yet: every virtual page maps to the same physical address, aka an &apos;identity mapping&apos;.</p><hr><h4 id="kernel-and-process-isolation">Kernel and process isolation</h4><p>The first step of this assignment is to implement basic isolation between processes, preventing them from touching memory they shouldn&apos;t.</p><p>I chose the strategy of first applying default protections to the kernel&apos;s page table, then copying it to each process:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">void kernel_start(const char* command) {
  ...

  // Initialize the kernel page table
  for (vmiter it(kernel_pagetable); it.va() &lt; MEMSIZE_VIRTUAL; it += PAGESIZE)  {
    if (it.va() != 0) {
      int flags;
      // Console and process memory should be user accessible
      if (it.va() == CONSOLE_ADDR || it.va() &gt;= PROC_START_ADDR) {
        flags = PTE_P | PTE_W | PTE_U;
      } else {
        flags = PTE_P | PTE_W;
      }
      it.map(it.va(), flags);
    } else {
      it.map(it.va(), 0);
    }
  }
  
  ...
}

void process_setup(pid_t pid, const char* program_name) {
  ...

  // Allocate a new page table
  x86_64_pagetable* pagetable = kalloc_pagetable();

  // Iterate both our new page table and the kernel page table
  vmiter it2(pagetable);
  for (vmiter it(kernel_pagetable); it.va() &lt; MEMSIZE_VIRTUAL; it += PAGESIZE) {
    // Copy 1:1 from kernel to process
    it2.map(it.va(), it.perm());
    it2 += PAGESIZE;
  }
  
  // Assign to the process, where it will be loaded by exception_return
  ptable[pid].pagetable = pagetable;
  
  ...
}</code></pre><figcaption>kernel.cc</figcaption></figure><p>This gives us the desired isolation where processes can only read and write their own memory:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2020/11/UARwo.gif" class="kg-image" alt="Building a kernel: CS60 pset3" loading="lazy" width="770" height="529" srcset="https://blog.tst.sh/content/images/size/w600/2020/11/UARwo.gif 600w, https://blog.tst.sh/content/images/2020/11/UARwo.gif 770w" sizes="(min-width: 720px) 720px"></figure><hr><h2 id="kernel-page-allocator">Kernel page allocator</h2><p>Processes are given dedicated space in physical memory (defined by <code>PROC_START_ADDR</code>), a limitation which prevents them from using more than <code>PROC_SIZE</code> bytes of memory.</p><p>First we need to implement a simple page allocator; I&apos;ve chosen an extremely simple free list.</p><p>In a free list, chunks of unallocated pages form a singly-linked list:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">struct free_page {
  free_page* next;
};

extern free_page* next_free_page;</code></pre><figcaption>kernel.hh</figcaption></figure><p>To fill this list, we use <code>allocatable_physical_address</code> to check whether each physical page is allocatable, pushing every allocatable page onto the list in <code>kernel_start</code>:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">free_page* next_free_page = nullptr;

void kernel_start(const char* command) {
  ...
  
  // Set up page allocator
  for (int i = NPAGES; i != 0; i--) {
    uintptr_t kmem = (i - 1) * PAGESIZE;
    if (allocatable_physical_address(kmem)) {
      auto newPage = (free_page*)kmem;
      newPage-&gt;next = next_free_page;
      next_free_page = newPage;
    }
    pages[i - 1].refcount = 0;
  }
  
  ...
}</code></pre><figcaption>kernel.cc</figcaption></figure><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2020/12/build-freelist-1.gif" class="kg-image" alt="Building a kernel: CS60 pset3" loading="lazy" width="400" height="456"></figure><p><code>kalloc</code> and <code>kfree</code> now have a trivial implementation, pushing and popping from the list:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">void* kalloc() {
  if (next_free_page == nullptr) {
    return nullptr;
  } else {
    free_page* page = next_free_page;
    auto&amp; info = pages[(uintptr_t)page / PAGESIZE];
    assert(info.refcount == 0);
    info.refcount = 1;
    next_free_page = next_free_page-&gt;next;
    return page;
  }
}
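// The push/pop behavior of this free list can be exercised outside the kernel
// with a self-contained sketch (the mini_* names are hypothetical stand-ins,
// not WeensyOS APIs):

```cpp
#include <cstddef>

// A miniature version of the kernel's free list: each free chunk
// stores the pointer to the next free chunk inside itself.
struct mini_page { mini_page* next; };

static mini_page* mini_head = nullptr;

// kfree equivalent: push the page onto the front of the list.
void mini_free(mini_page* p) {
  p->next = mini_head;
  mini_head = p;
}

// kalloc equivalent: pop the front of the list, or fail with nullptr.
mini_page* mini_alloc() {
  if (mini_head == nullptr) return nullptr;
  mini_page* p = mini_head;
  mini_head = mini_head->next;
  return p;
}
```

// Freeing pages a then b and allocating twice returns b then a: both
// operations are O(1), which is all a page allocator strictly needs.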

void kfree(void* kptr) {
  if (kptr == nullptr) return;
  check_valid_kptr(kptr);
  auto page = (free_page*)kptr;
  auto&amp; info = pages[(uintptr_t)page / PAGESIZE];
  assert(info.refcount == 1);
  info.refcount = 0;
  page-&gt;next = next_free_page;
  next_free_page = page;
}</code></pre><figcaption>kernel.cc</figcaption></figure><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2020/12/alloc-freelist.gif" class="kg-image" alt="Building a kernel: CS60 pset3" loading="lazy" width="400" height="456"></figure><p>The <code>pages</code> array provided by WeensyOS is how the memory viewer keeps track of physical pages, including a reference count:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">struct pageinfo {
  uint8_t refcount;
  bool visited;

  bool used() const { return this-&gt;refcount != 0; }
};</code></pre><figcaption>kernel.hh</figcaption></figure><p>This reference count is useful because multiple processes can share the same physical memory, whether it be from COW or read-only executable sections.</p><p>To facilitate reference counting I&apos;ve made functions to acquire and release references to shared physical pages:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">void kacquire(void* kptr) {
  if (kptr == nullptr) return;
  pageinfo&amp; info = pages[(uintptr_t)kptr / PAGESIZE];
  assert_gt(info.refcount, 0);
  info.refcount++;
}
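// The acquire/release life cycle below can be modeled with a plain counter.
// This is a sketch with made-up names that mirrors the semantics of one
// pages[] slot, not the kernel's actual data structures:

```cpp
// Models one page's refcount: it starts at 1 after allocation, and the
// page is "freed" when the last reference is released.
struct slot {
  int refcount = 1;   // as set by kalloc()
  bool freed = false;
};

void slot_acquire(slot& s) {
  s.refcount++;
}

void slot_release(slot& s) {
  if (s.refcount == 1) {
    s.freed = true;   // kfree() in the kernel version
    s.refcount = 0;
  } else {
    s.refcount--;
  }
}
```

// Two processes sharing a page each hold one reference; the physical page is
// only returned to the free list when the second one releases it.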

void krelease(void* kptr) {
  if (kptr == nullptr) return;
  pageinfo&amp; info = pages[(uintptr_t)kptr / PAGESIZE];
  assert_gt(info.refcount, 0);
  if (info.refcount == 1) {
    kfree(kptr);
  } else {
    info.refcount--;
  }
}</code></pre><figcaption>kernel.cc</figcaption></figure><p>The <code>kacquire</code> function simply increments the reference count of the page, and <code>krelease</code> decrements it, freeing the page when it hits zero.</p><hr><h2 id="virtual-memory">Virtual memory</h2><p>In order to implement virtual memory we need to replace the primitive 1:1 page mapping with mappings that take pages from <code>kalloc</code>, but first we will add support for execution protection.</p><h4 id="no-execute-nx-protection">No-execute (NX) protection</h4><p>This wasn&apos;t part of the assignment but I thought it was a cool feature to add.</p><p>Without NX protection all memory is executable, which is problematic because it makes buffer overflow exploits extremely powerful. To solve this security issue we will mark pages that do not contain code as non-executable using the NX bit.</p><p>In several places WeensyOS mistakenly uses a 32-bit integer to store page protection flags; this truncates the top 32 bits, which contain NX and some other useful OS flags.</p><p>The fix is fairly simple: just replace all of these instances of <code>int</code> with <code>uint64_t</code>:</p><figure class="kg-card kg-code-card"><pre><code class="language-diff">-int vmiter::try_map(uintptr_t pa, int perm) {
+int vmiter::try_map(uintptr_t pa, uint64_t perm) {</code></pre><figcaption>k-vmiter.cc</figcaption></figure><pre><code class="language-diff">-  inline void map(uintptr_t pa, int perm);
+  inline void map(uintptr_t pa, uint64_t perm);

-  inline void map(void* kptr, int perm);
+  inline void map(void* kptr, uint64_t perm);

-  [[gnu::warn_unused_result]] int try_map(uintptr_t pa, int perm);
-  [[gnu::warn_unused_result]] inline int try_map(void* kptr, int perm);
+  [[gnu::warn_unused_result]] int try_map(uintptr_t pa, uint64_t perm);
+  [[gnu::warn_unused_result]] inline int try_map(void* kptr, uint64_t perm);

-  int perm_;
+  uint64_t perm_;

-  static constexpr int initial_perm = 0xFFF;
+  static constexpr uint64_t initial_perm = 0xFFE0000000000FFF;

-inline void vmiter::map(uintptr_t pa, int perm) {
+inline void vmiter::map(uintptr_t pa, uint64_t perm) {

-inline void vmiter::map(void* kp, int perm) {
+inline void vmiter::map(void* kp, uint64_t perm) {

-inline int vmiter::try_map(void* kp, int perm) {
+inline int vmiter::try_map(void* kp, uint64_t perm) {</code></pre><p>Some extra defines can be added to <code>x86-64.h</code>:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">// More useful flags

#define PTE_PU 0x7UL // PTE_P | PTE_W | PTE_U
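// The composite values below are just ORs of the standard x86-64 paging bits
// (P = 1, W = 2, U = 4, XD = bit 63). A standalone compile-time check, with
// the base bits restated under k-prefixed names of my own:

```cpp
#include <cstdint>

// Base x86-64 page-table entry bits.
constexpr uint64_t kP  = 0x1;                  // present
constexpr uint64_t kW  = 0x2;                  // writable
constexpr uint64_t kU  = 0x4;                  // user-accessible
constexpr uint64_t kXD = 0x8000000000000000UL; // execute-disable

// Each combined constant is a plain bitwise OR of the bits above.
static_assert((kP | kXD)           == 0x8000000000000001UL, "PTE_PXD");
static_assert((kP | kU | kXD)      == 0x8000000000000005UL, "PTE_PUXD");
static_assert((kP | kW | kXD)      == 0x8000000000000003UL, "PTE_PWXD");
static_assert((kP | kW | kU | kXD) == 0x8000000000000007UL, "PTE_PWUXD");
```

// Since XD is bit 63, any flag set stored in a 32-bit int silently drops it,
// which is exactly the truncation bug fixed above.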

#define PTE_OS4  0x0020000000000000UL
#define PTE_OS5  0x0040000000000000UL
#define PTE_OS6  0x0080000000000000UL
#define PTE_OS7  0x0100000000000000UL
#define PTE_OS8  0x0200000000000000UL
#define PTE_OS9  0x0400000000000000UL
#define PTE_OS10 0x0800000000000000UL
#define PTE_OS12 0x1000000000000000UL
#define PTE_OS13 0x2000000000000000UL
#define PTE_OS14 0x4000000000000000UL

#define PTE_PXD   0x8000000000000001UL // PTE_P | PTE_XD
#define PTE_PUXD  0x8000000000000005UL // PTE_P | PTE_U | PTE_XD
#define PTE_PWXD  0x8000000000000003UL // PTE_P | PTE_W | PTE_XD
#define PTE_PWUXD 0x8000000000000007UL // PTE_P | PTE_W | PTE_U | PTE_XD</code></pre><figcaption>x86-64.h</figcaption></figure><p>In order to track which segments of memory are executable, a few more modifications to the kernel and its linker script are necessary:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">// Returns the start of the kernel rodata segment.
uintptr_t rodata_start_addr();

// Returns the start of the kernel (r/w) data segment.
uintptr_t data_start_addr();

// Returns the end of the kernel address space.
uintptr_t kernel_end_addr();

struct program_image_segment {
  ...

  // Return true iff the segment is executable.
  bool executable() const;

  ...
}</code></pre><figcaption>kernel.hh</figcaption></figure><figure class="kg-card kg-code-card"><pre><code class="language-cpp">// Make these offsets accessible to the rest of the kernel
uintptr_t rodata_start_addr() {
  extern char _rodata_start[];
  return (uintptr_t)_rodata_start;
}

uintptr_t data_start_addr() {
  extern char _data_start[];
  return (uintptr_t)_data_start;
}

uintptr_t kernel_end_addr() {
  extern char _kernel_end[];
  return (uintptr_t)_kernel_end;
}

bool allocatable_physical_address(uintptr_t pa) {
  return !reserved_physical_address(pa)
    &amp;&amp; (pa &lt; KERNEL_START_ADDR || pa &gt;= round_up(kernel_end_addr(), PAGESIZE)) 
    &amp;&amp; (pa &lt; KERNEL_STACK_TOP - PAGESIZE || pa &gt;= KERNEL_STACK_TOP)
    &amp;&amp; pa &lt; MEMSIZE_PHYSICAL;
}

bool program_image_segment::executable() const {
  return ph_-&gt;p_flags &amp; ELF_PFLAG_EXEC;
}</code></pre><figcaption>k-hardware.cc</figcaption></figure><figure class="kg-card kg-code-card"><pre><code class="language-ld">    ...
    
    . = ALIGN(4096);
    _rodata_start = .; /* A tag to indicate the start of rodata */
    .rodata : {
    
    ...</code></pre><figcaption>kernel.ld</figcaption></figure><p>Finally, execution protection can be enforced for the kernel in <code>kernel_start</code>:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">void kernel_start(const char* command) {
  ...

  // (re-)initialize kernel page table
  for (vmiter it(kernel_pagetable); it.va() &lt; MEMSIZE_PHYSICAL; it += PAGESIZE) {
    if (it.va() != 0) {
      uint64_t flags;
      if (it.va() == CONSOLE_ADDR) {
        // The console should be r/w to users
        flags = PTE_PWUXD;
      } else if (it.va() &gt;= KERNEL_START_ADDR &amp;&amp; it.va() &lt; rodata_start_addr()) {
        // The text segment should be read-only execute
        flags = PTE_P;
      } else if (it.va() &gt;= rodata_start_addr() &amp;&amp; it.va() &lt; data_start_addr()) {
        // The rodata segment should be read-only
        flags = PTE_PXD;
      } else {
        // The rest (data segment, heap) should be r/w
        flags = PTE_PWXD;
      }
      it.map(it.va(), flags);
    } else {
      // nullptr is inaccessible even to the kernel
      it.map(it.va(), 0);
    }
  }
  
  ...
}</code></pre><figcaption>kernel.cc</figcaption></figure><p>To check that this actually works, you can mark the text segment with the XD bit; WeensyOS will crash on startup because it&apos;s trying to execute memory that is now protected:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2020/11/tWOXN.gif" class="kg-image" alt="Building a kernel: CS60 pset3" loading="lazy" width="770" height="528" srcset="https://blog.tst.sh/content/images/size/w600/2020/11/tWOXN.gif 600w, https://blog.tst.sh/content/images/2020/11/tWOXN.gif 770w" sizes="(min-width: 720px) 720px"></figure><p>If the XD bit is cleared from the text segment, the OS continues running as normal, yay!</p><h4 id="process-allocation">Process allocation</h4><p>So far pages are manually mapped to a process from a few places; to simplify this I&apos;ve created a single function that takes care of allocation, mapping, copying, and dealing with edge cases:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">static bool proc_alloc(pid_t pid, uintptr_t vaddr, size_t sz, uint64_t flags = PTE_PWUXD, void* src = nullptr) {
  assert_eq(vaddr &amp; PAGEOFFMASK, 0);
  vmiter it(ptable[pid].pagetable);
  uintptr_t num_pages = (sz + PAGESIZE - 1) / PAGESIZE;
  for (uintptr_t page = 0; page &lt; num_pages; page++) {
    void* kptr = kalloc();
    if (kptr != nullptr) {
      it.find(vaddr + page * PAGESIZE);
      if (it.try_map(kptr, flags) &gt;= 0) {
        // Success, copy from src and continue
        if (src != nullptr) {
          memcpy(kptr, (void*)src, min(sz, PAGESIZE));
          src = (char*)src + PAGESIZE;
          sz -= PAGESIZE;
        }
        continue;
      } else {
        kfree(kptr);
      }
    }

    // Map or allocation failed, free and unmap everything
    while (page &gt; 0) {
      page--;
      it.find(vaddr + page * PAGESIZE);
      kfree(it.kptr());
      assert(it.perm() == flags);
      it.map(it.pa(), PTE_XD);
    }
    return false;
  }
  return true;
}</code></pre><figcaption>kernel.cc</figcaption></figure><p>We can then use this function along with the <code>program_image_segment::executable</code> function defined earlier in <code>process_setup</code> to load ELF segments:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">void process_setup(pid_t pid, const char* program_name) {
  ...

  for (auto seg = pgm.begin(); seg != pgm.end(); seg++) {
    if (seg.size() == 0) continue;
    uint64_t flags = PTE_PUXD;
    if (seg.writable()) {
      flags |= PTE_W;
    }
    if (seg.executable()) {
      flags &amp;= ~PTE_XD;
    }

    // Allocate and copy segment to process
    if (!proc_alloc(pid, seg.va(), seg.size(), flags, (void*)seg.data())) {
      panic(&quot;Failed to allocate process memory during setup&quot;);
    }
  }
  
  ...
}</code></pre><figcaption>kernel.cc</figcaption></figure><p>Additionally the <code>syscall_page_alloc</code> function can use <code>proc_alloc</code> rather than mapping directly:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">int syscall_page_alloc(uintptr_t addr) {
  if ((addr &amp; PAGEOFFMASK) != 0 || addr &lt; PROC_START_ADDR) {
    return -1;
  } else if (proc_alloc(current-&gt;pid, addr, PAGESIZE)) {
    return 0;
  } else {
    return -1;
  }
}</code></pre><figcaption>kernel.cc</figcaption></figure><h4 id="the-result">The result</h4><p>Now that virtual memory is implemented, WeensyOS looks like this:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2020/11/ITHFs.gif" class="kg-image" alt="Building a kernel: CS60 pset3" loading="lazy" width="770" height="548" srcset="https://blog.tst.sh/content/images/size/w600/2020/11/ITHFs.gif 600w, https://blog.tst.sh/content/images/2020/11/ITHFs.gif 770w" sizes="(min-width: 720px) 720px"></figure><p>Instead of a 1:1 mapping, pages are allocated on demand; this allows greedy processes like #4 to consume much more than <code>PROC_SIZE</code> bytes of memory.</p><hr><h2 id="fork">Fork</h2><p>The next part of the assignment is to implement the fork syscall.</p><p>If you are not familiar with <code>fork</code>, it essentially duplicates your process, including its memory and execution state; for more information see the Linux <a href="https://www.man7.org/linux/man-pages/man2/fork.2.html?ref=blog.tst.sh">fork(2)</a> man page.</p><p>Thankfully, all of the low-level stuff is taken care of already; adding a new syscall is as simple as defining it in <code>lib.hh</code> and adding a switch case to <code>syscall</code>:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">uintptr_t syscall(regstate* regs) {
  ...
  
  // To handle a new syscall number we just add a case here
  switch (regs-&gt;reg_rax) {
    ...
    case SYSCALL_FORK:
      return syscall_fork(current);
    ...
  }

  ...
}
</code></pre><figcaption>kernel.cc</figcaption></figure><p>Instead of simply copying all of the process&apos;s memory in fork, we only copy pages when they are written to. This optimization is called copy-on-write (COW) and can be implemented by marking a page read-only, then copying it when a page fault occurs.</p><p>To tell our exception handler a page is COW, we will use one of the user-defined flags in its page table entry:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">#define PTE_COW PTE_OS1 // Copy on write</code></pre><figcaption>kernel.cc</figcaption></figure><p>The implementation of <code>syscall_fork</code> goes as follows:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">int syscall_fork(proc* process) {
  pid_t pid = 0;

  // Find the first available process id
  for (pid_t i = 1; i &lt; NPROC; i++) {
    if (ptable[i].state == P_FREE) {
      pid = i;
      break;
    }
  }

  if (!pid) return -1;

  proc&amp; proc_info = ptable[pid];

  x86_64_pagetable* pt = kalloc_pagetable();

  if (pt == nullptr) return -1;

  // Copy page table of parent to child
  vmiter proc_it(pt);
  void* lastaquire = nullptr;
  for (vmiter it(process-&gt;pagetable); it.va() &lt; MEMSIZE_VIRTUAL; it.next()) {
    int perms = it.perm();
    void* kptr = it.kptr();
    lastaquire = nullptr;

    if (it.va() &gt;= PROC_START_ADDR &amp;&amp; it.present()) {
      if (it.writable()) {
        // Mark the page COW if it was writeable
        perms = (it.perm() | PTE_COW) &amp; ~PTE_W;
        // Mark it COW in the parent process too, this map doesn&apos;t need to be
        // reverted on abort because it does not change the semantics of writes
        if (it.try_map(kptr, perms) &lt; 0) {
          goto abort_fork;
        }
      }
      // Read-only pages will be shared implicitly; acquire a reference
      kacquire(kptr);
      lastaquire = kptr;
    }

    proc_it.find(it.va());
    if (proc_it.try_map(kptr, perms) &lt; 0) {
      goto abort_fork;
    }
  }

  // Page table creation success, initialize child process

  // Copy registers from parent
  proc_info.regs = process-&gt;regs;

  // Set return value of child syscall to 0
  proc_info.regs.reg_rax = 0;

  // Enqueue the process
  proc_info.wake = 0; // Field used by the sleep syscall later on
  proc_info.pagetable = pt;
  proc_info.state = P_RUNNABLE;

  // Return child process id
  return pid;

abort_fork:

  // Release pa if try_map fails after an acquire or kalloc
  if (lastaquire != nullptr) {
    krelease(lastaquire);
  }

  // Release pages we&apos;ve acquired so far
  for (vmiter it(pt); it.va() &lt; MEMSIZE_VIRTUAL; it += PAGESIZE) {
    if (it.va() &gt;= PROC_START_ADDR &amp;&amp; it.present()) {
      krelease(it.kptr());
    }
  }

  free_page_table(pt);

  return -1;
}</code></pre><figcaption>kernel.cc</figcaption></figure><p>In short, this function copies the parent&apos;s state to a new process, remapping any writeable pages as read-only plus <code>PTE_COW</code>.</p><p>Now in the <code>exception</code> function, we can check for page faults occurring on a COW page:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">void exception(regstate* regs) {
  ...

  bool usermode = (regs-&gt;reg_cs &amp; 0x3) != 0;
  
  switch (regs-&gt;reg_intno) {
    ...
    case INT_PF: {
      uintptr_t addr = rdcr2();

      bool is_write = regs-&gt;reg_errcode &amp; PFERR_WRITE;

      // Check if write exception was in process memory
      if (usermode &amp;&amp; is_write &amp;&amp; addr &gt;= PROC_START_ADDR) {
        vmiter it(current-&gt;pagetable, round_down(addr, PAGESIZE));
        // If the page is present AND PTE_OS1 is set, this is a COW page
        if ((it.perm() &amp; (PTE_P | PTE_COW)) == (PTE_P | PTE_COW)) {
          pageinfo&amp; info = pages[it.pa() / PAGESIZE];
          assert(info.used());
          if (info.refcount == 1) {
            // We have the only reference to this page, avoid copying by
            // just marking it writeable
            if (it.try_map(it.pa(), (it.perm() | PTE_W) &amp; ~PTE_COW) == 0) {
              // Success, resume process
              break;
            }
          } else {
            // Allocate and copy data to new page
            void* kptr = kalloc();
            if (kptr) {
              void* source = it.kptr();
              memcpy(kptr, source, PAGESIZE);
              // Set writeable, clear COW
              if (it.try_map(kptr, (it.perm() | PTE_W) &amp; ~PTE_COW) == 0) {
                // Success, resume process
                krelease(source);
                break;
              }
            }
          }

          // Out of memory!
          proc_kill(current);
          schedule();
        }
      }

      // Bits that log page faults to the console
      ...
    }
    ...
  }

  // We can quietly resume the process if not defunct
  if (current-&gt;state == P_RUNNABLE) {
    run(current);
  } else {
    schedule();
  }
}</code></pre><figcaption>kernel.cc</figcaption></figure><h4 id="the-result-1">The result</h4><p>Along with the default program <code>p-allocator.cc</code>, WeensyOS also comes with <code>p-fork.cc</code> and <code>p-forkexit.cc</code>, which can be booted using special key codes.</p><p>In this case we want to test the fork implementation using <code>p-fork.cc</code>, which can be run by pressing <code>f</code> at startup:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2020/11/gYnmt.gif" class="kg-image" alt="Building a kernel: CS60 pset3" loading="lazy" width="779" height="529" srcset="https://blog.tst.sh/content/images/size/w600/2020/11/gYnmt.gif 600w, https://blog.tst.sh/content/images/2020/11/gYnmt.gif 779w" sizes="(min-width: 720px) 720px"></figure><p>This program is a variation of the first but starts with a single process which calls fork 3 times rather than the kernel initializing them.</p><p>You may notice that the first few pages are now labelled <code>S</code>; this indicates that the same physical address has been mapped into multiple processes.</p><hr><h2 id="exit-and-kill">Exit and kill</h2><p>The final task of the assignment is to implement the exit syscall and test it with <code>p-forkexit.cc</code>. Later on we will also implement the sleep, kill, mmap, and munmap syscalls as part of extra credit.</p><p>Like with fork, you handle syscalls in the kernel by adding cases to the <code>syscall</code> function:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">    case SYSCALL_EXIT:
      proc_kill(current);
      schedule();

    case SYSCALL_KILL:
      if (arg0 &gt; 0 &amp;&amp; arg0 &lt; NPROC &amp;&amp; ptable[arg0].state == P_RUNNABLE) {
        proc_kill(&amp;ptable[arg0]);
        return 0;
      }
      return -1;</code></pre><figcaption>kernel.cc</figcaption></figure><p>Exit and kill basically do the same thing, so their functionality can just be combined into a single function.</p><p>The implementation of <code>proc_kill</code> turns out to be extremely simple:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">// Also used in syscall_fork previously 
static void free_page_table(x86_64_pagetable* pt) {
  if (pt == nullptr) return;
  // The ptiter utility can be used to iterate page tables recursively, as
  // opposed to vmiter which just iterates virtual memory entries
  for (ptiter it(pt); !it.done(); it.next()) {
    kfree(it.kptr());
  }
  kfree(pt);
}

static void proc_kill(proc* process) {
  // Clean up pages
  for (vmiter it(process-&gt;pagetable); it.va() &lt; MEMSIZE_VIRTUAL; it += PAGESIZE) {
    if (it.va() &gt;= PROC_START_ADDR &amp;&amp; it.present()) {
      krelease((void*)it.pa());
    }
  }

  free_page_table(process-&gt;pagetable);

  process-&gt;pagetable = nullptr;
  process-&gt;state = P_FREE;
}</code></pre><figcaption>kernel.cc</figcaption></figure><p><code>p-forkexit.cc</code> is a fairly intense test: along with consuming memory and forking, it also exits at random intervals. It is quite hard to get past this point without leaking memory; there was a lot of debugging involved.</p><p>Like <code>p-fork.cc</code>, this program is also started through a special key, <code>e</code>:</p><!--kg-card-begin: html--><video width="778" height="528" allowfullscreen controls>
<source src="https://i.tst.sh/LfdXP.mp4" type="video/mp4">
</video><!--kg-card-end: html--><p>After running it for a while we don&apos;t experience a crash and no pages are leaked!</p><p>Those large strips of <code>S</code>es are COW pages that haven&apos;t been written to yet; they are shared between children just like the shared read-only segments from before.</p><hr><h2 id="sleep">Sleep</h2><p>To implement this syscall you must first understand how the scheduler works in WeensyOS.</p><p>Here are a few notable functions we start with:</p><ul><li><code><em>void </em>schedule()</code> - Picks the next runnable process and runs it.</li><li><code><em>void </em>run(proc* p)</code> - Sets the <code>current</code> variable and uses <code>exception_return</code> to resume the process.</li><li><code><em>void </em>exception(regstate* regs)</code> - The exception handler called by <code>k-exception.S</code>, including APIC timer interrupts.</li></ul><p>On startup, <code>kernel_start</code> initializes the <a href="https://wiki.osdev.org/APIC_timer?ref=blog.tst.sh">APIC timer</a>, a neat bit of hardware that sends periodic interrupts to the processor. Along with keeping track of time with the <code>ticks</code> variable, this timer allows the kernel to prevent processes from hogging the CPU by interrupting them.</p><p>When an interrupt or exception happens, the processor will save register state, disable interrupts, enter kernel mode, and call the appropriate exception handler. 
The <code>exception</code> function copies the register state to <code>current-&gt;regs</code>, which is used in the <code>run</code> function when resuming the process.</p><p>To resume a process the kernel uses the <code>exception_return</code> function, a small bit of assembly that restores register state and invokes <code>iretq</code> to get back into user mode.</p><p>Note that the kernel&apos;s state is not saved; you are reset back to the bottom of the stack every time there is an exception (<code>schedule</code>, <code>run</code>, and <code>exception</code> never return).</p><p>With this knowledge, implementing <code>sleep</code> is much more straightforward. To start, I added a field to keep track of when a specific process should resume from sleep:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">struct proc {
  ...
  
  unsigned long wake;         // when to wake up from a sleep syscall
};</code></pre><figcaption>kernel.cc</figcaption></figure><p>Like the global <code>ticks</code> variable, the <code>wake</code> field is based on the frequency of the APIC timer (<code>HZ</code>), but the syscall will take its duration in milliseconds:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">    case SYSCALL_SLEEP:
      current-&gt;regs.reg_rax = 0;
      current-&gt;wake = ticks + (arg0 / (1000 / HZ));
      schedule();</code></pre><figcaption>kernel.cc</figcaption></figure><p>Now rewrite the <code>schedule</code> function to skip any processes that are not ready to wake:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">void schedule() {
  for (;;) {
    bool sleeping = false;
    for (int i = 1; i &lt;= NPROC; i++) {
      pid_t pid = (current-&gt;pid + i) % NPROC;
      proc&amp; p = ptable[pid];
      if (p.state != P_RUNNABLE) continue;
      if (ticks &gt;= p.wake) {
        run(&amp;p);
      } else if (p.wake) {
        sleeping = true;
      }
    }
    check_keyboard();
    if (sleeping) {
      // Interrupts are disabled, enable them with the sti instruction
      sti();
      // Put the processor in a more efficient halt state while we wait
      // for timer interrupts
      halt();
      // The halt instruction can occasionally resume, whatever, just
      // disable interrupts and try again
      cli();
    } else {
      memshow();
    }
  }
}</code></pre><figcaption>kernel.cc</figcaption></figure><p>Hold up, this introduces some unexpected behavior!</p><p>It turns out the <code>exception</code> function didn&apos;t anticipate non-fatal exceptions happening while in kernel mode.</p><p>The problem is that it unconditionally stashes register state into the previous process regardless of whether the exception actually happened in that process. This clobbers the state of the last running process with the state of the kernel while it&apos;s halted above.</p><p>A simple solution is to only set <code>current-&gt;regs</code> if the lower 2 bits of the <code>cs</code> register (the privilege level) are non-zero:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">void exception(regstate* regs) {
  bool is_user_mode = (regs-&gt;reg_cs &amp; 0x3) != 0;

  if (is_user_mode) {
    current-&gt;regs = *regs;
    regs = &amp;current-&gt;regs;
  }

  ...
}</code></pre><figcaption>kernel.cc</figcaption></figure><hr><h2 id="mmap-munmap">mmap, munmap</h2><p>Currently all memory management in a process is done through the very primitive <code>sys_page_alloc</code> syscall, which has several limitations:</p><ul><li>Can only map pages at a specific address; can&apos;t find free ones automatically.</li><li>No way to free pages (munmap).</li><li>No way to map executable or read-only pages (such as in a dynamic linker).</li><li>No way to share memory between a parent and child process.</li></ul><p>To solve these issues I&apos;ve implemented the <code>mmap</code> and <code>munmap</code> syscalls, starting with definitions in <code>lib.hh</code> / <code>u-lib.hh</code>:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">// Available flags for the sys_mmap syscall
#define MAP_PROT_READ  0x10 // Page can be read.
#define MAP_PROT_WRITE 0x20 // Page can be read and written.
#define MAP_PROT_EXEC  0x40 // Page can be executed.
#define MAP_PROT_NONE  0x00 // Page can not be accessed.

#define MAP_PRIVATE 0x00 // Changes are private.
#define MAP_SHARED  0x01 // Share changes.
#define MAP_FIXED   0x02 // Interpret addr exactly.</code></pre><figcaption>lib.hh</figcaption></figure><p>Note that <code>MAP_ANONYMOUS</code> is not defined because there is no filesystem; all mappings are anonymous by default!</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">inline void* sys_mmap(void* addr, size_t length, int flags) {
  return (void*)make_syscall(SYSCALL_MMAP, (uintptr_t)addr, length, flags);
}

inline int sys_munmap(void* addr, size_t length) {
  return make_syscall(SYSCALL_MUNMAP, (uintptr_t)addr, length);
}

// Replace previous implementation with one that uses mmap
inline int sys_page_alloc(void* addr) {
  void* result = sys_mmap(addr, PAGESIZE, MAP_PRIVATE | MAP_FIXED | MAP_PROT_WRITE);
  return result == nullptr ? -1 : 0;
}

inline int sys_page_free(void* addr) {
  return sys_munmap(addr, PAGESIZE);
}</code></pre><figcaption>u-lib.hh</figcaption></figure><p>Like other syscalls, mmap and munmap get their own functions: </p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">    case SYSCALL_MMAP:
      return syscall_mmap(current, arg0, arg1, arg2);
    
    case SYSCALL_MUNMAP:
      return syscall_munmap(current, arg0, arg1);</code></pre><figcaption>kernel.cc</figcaption></figure><p>The <code>syscall_mmap</code> function is probably the most complex so far, as it has to handle several edge cases and failure conditions; its implementation goes as follows:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">static uintptr_t next_free_vaddr(proc* process, uintptr_t num_pages) {
  // This could be made a lot faster if we used a tree to manage virtual
  // address space.
  size_t viable = 0;
  vmiter it(process-&gt;pagetable);
  for (uintptr_t addr = PROC_START_ADDR; addr &lt; MEMSIZE_VIRTUAL; addr += PAGESIZE) {
    it.find(addr);
    if (it.present()) {
      viable = 0;
    } else if (++viable == num_pages) {
      // Found free pages
      addr -= (viable - 1) * PAGESIZE;
      return addr;
    }
  }
  return 0;
}

static uintptr_t syscall_mmap(proc* process, uint64_t addr, uint64_t length, uint64_t flags) {
  if ((addr &amp; PAGEOFFMASK) != 0 || length == 0) {
    // The base address must be a multiple of PAGESIZE
    return 0;
  }

  uintptr_t num_pages = (length + PAGESIZE - 1) / PAGESIZE;
  vmiter it(process-&gt;pagetable);

  if (addr == 0) {
    // nullptr always allocates new pages
    addr = next_free_vaddr(process, num_pages);
  } else if ((flags &amp; MAP_FIXED) == 0) {
    // If the address is not fixed, reallocate if any overlap with existing maps
    for (uintptr_t i = 0; i &lt; num_pages; i++) {
      it.find(addr + i * PAGESIZE);
      if (it.present()) {
        // Page overlaps, reallocate
        addr = next_free_vaddr(process, num_pages);
        break;
      }
    }
  }

  uintptr_t end = addr + num_pages * PAGESIZE;
  if (
    addr &lt; PROC_START_ADDR
    || end &gt; MEMSIZE_VIRTUAL
    || num_pages &gt; PAGESIZE / sizeof(uint64_t)
  ) {
    // Either the range is invalid, we failed to allocate one, or there are too
    // many page entries to fit in a journal.
    return 0;
  }

  uint64_t pte = PTE_PXD;
  if (flags &amp; MAP_SHARED)     pte |= PTE_SHARED;
  if (flags &amp; MAP_PROT_READ)  pte |= PTE_PU;
  if (flags &amp; MAP_PROT_WRITE) pte |= PTE_PWU;
  if (flags &amp; MAP_PROT_EXEC)  pte &amp;= ~PTE_XD;

  // Keep a journal of the entries that we clobber, so they can be restored on failure
  auto journal = (uint64_t*)kalloc();
  if (journal == nullptr) {
    return 0;
  }

  for (uintptr_t i = 0; i &lt; num_pages; i++) {
    it.find(addr + i * PAGESIZE);
    uint64_t pa = it.pa();
    uint64_t perm = it.perm();
    journal[i] = perm | pa;
    uint64_t new_perm = pte;

    if (!(perm &amp; PTE_P)) {
      // Allocate new pages if they do not exist
      pa = (uint64_t)kalloc();
      if (pa == 0) {
        goto abort_map;
      }
    } else if (pages[pa / PAGESIZE].refcount &gt; 1 &amp;&amp; (new_perm &amp; PTE_W)) {
      // If page has multiple references and should be writable, mark it COW
      new_perm = (new_perm &amp; ~PTE_W) | PTE_COW;
    }

    // Finally apply permissions
    if (it.try_map(pa, new_perm) &lt; 0) {
      if (!it.present()) {
        kfree((void*)pa);
      }
      goto abort_map;
    }

    continue;

  abort_map:

    // Map incomplete, revert changes using journal
    for (uintptr_t p = 0; p &lt; i; p++) {
      it.find(addr + p * PAGESIZE);
      uint64_t prev = journal[p];
      if (!(prev &amp; PTE_P)) {
        kfree(it.kptr());
      }
      it.map(prev &amp; PTE_PAMASK, prev &amp; ~PTE_PAMASK);
    }
    kfree(journal);

    return 0;
  }

  // Success, all pages have been mapped
  kfree(journal);
  return addr;
}</code></pre><figcaption>kernel.cc</figcaption></figure><p>Because <code>try_map</code> / <code>kalloc</code> can fail at any point, the implementation of mmap writes existing page table entries to a journal that can be rolled back.</p><p>A similar approach is used to release pages with munmap:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">static int syscall_munmap(proc* process, uint64_t addr, uint64_t length) {
  if ((addr &amp; PAGEOFFMASK) != 0 || length == 0) {
    // The base address must be page-aligned and the length non-zero
    return -1;
  }

  uintptr_t num_pages = (length + PAGESIZE - 1) / PAGESIZE;
  vmiter it(process-&gt;pagetable);

  uintptr_t end = addr + length;
  if (addr &lt; PROC_START_ADDR || end &gt; MEMSIZE_VIRTUAL || addr &gt; end) {
    // Range is outside process memory or the end address overflows
    return -1;
  }

  // Keep a journal of the entries that we clobber
  if (num_pages &gt; PAGESIZE / sizeof(uint64_t)) {
    // Too many page entries to fit in a single journal page
    return -1;
  }
  auto journal = (uint64_t*)kalloc();
  if (journal == nullptr) {
    return -1;
  }

  for (uintptr_t i = 0; i &lt; num_pages; i++) {
    it.find(addr + i * PAGESIZE);
    uint64_t pa = it.pa();
    journal[i] = it.perm() | pa;
    // Finally apply permissions
    if (it.try_map(pa, 0) &lt; 0) {
      // Mapping incomplete, revert changes using journal
      for (uintptr_t p = 0; p &lt; i; p++) {
        it.find(addr + p * PAGESIZE);
        uint64_t prev = journal[p];
        it.map(prev &amp; PTE_PAMASK, prev &amp; ~PTE_PAMASK);
      }
      kfree(journal);

      return -1;
    }
  }

  // Success, all pages have been unmapped; release their references.
  for (uintptr_t i = 0; i &lt; num_pages; i++) {
    if (journal[i] &amp; PTE_P) {
      kfree((void*)(journal[i] &amp; PTE_PAMASK));
    }
  }

  kfree(journal);

  return 0;
}</code></pre><figcaption>kernel.cc</figcaption></figure><hr><h2 id="testing-all-the-things">Testing all the things</h2><p>In order to test all the new syscalls we need to create custom user-mode programs; thankfully the process is pretty straightforward.</p><p>First, create the program:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">#include &quot;u-lib.hh&quot;

[[noreturn]] void process_main() {
  // Let the system allocate new memory, make it shared between procs
  int* shared_memory = (int*)sys_mmap(nullptr, PAGESIZE, MAP_SHARED | MAP_PROT_WRITE);
  assert(shared_memory != nullptr);

  pid_t cp = sys_fork();
  if (cp == 0) {
    // Child process
    // Read 1, write 2, wait
    assert_eq(*shared_memory, 1);
    *shared_memory = 2;
    sys_sleep(1000);

    // Read 3
    assert_eq(*shared_memory, 3);

    // Write 4 until killed
    for (;;) {
      *shared_memory = 4;
      sys_yield();
    }
  } else {
    // Parent process
    // Write 1, wait
    *shared_memory = 1;
    sys_sleep(500);

    // Read 2, write 3, wait
    assert_eq(*shared_memory, 2);
    *shared_memory = 3;
    sys_sleep(1000);

    // Read 4
    assert_eq(*shared_memory, 4);

    // Kill, write 5, wait
    sys_kill(cp);
    *shared_memory = 5;
    sys_sleep(1000);

    // Read 5
    assert_eq(*shared_memory, 5);

    // Success
    sys_munmap(shared_memory, PAGESIZE);
    panic(&quot;Success!&quot;);
  }
}
</code></pre><figcaption>p-custom.cc</figcaption></figure><p>This test covers most of the functionality of <code>mmap</code>, <code>sleep</code>, <code>kill</code>, and <code>munmap</code>, then panics with a success message after everything checks out.</p><p>Next, add the program to the Makefile:</p><figure class="kg-card kg-code-card"><pre><code class="language-make">PROCESS_BINARIES += $(OBJDIR)/p-custom
PROCESS_OBJS += $(OBJDIR)/p-custom.uo</code></pre><figcaption>GNUMakefile</figcaption></figure><p>Finally, add it to <code>k-hardware.cc</code>:</p><figure class="kg-card kg-code-card"><pre><code class="language-cpp">// These symbols are defined by the linker
extern uint8_t _binary_obj_p_custom_start[];
extern uint8_t _binary_obj_p_custom_end[];

...

} ramimages[] = {
  ...
  {&quot;custom&quot;, _binary_obj_p_custom_start, _binary_obj_p_custom_end}};

// The ramimages table defines ranges of static memory for
// each process binary

...

int check_keyboard() {
  ...
  // Add custom key codes here and below
  if (c == &apos;a&apos; || c == &apos;f&apos; || c == &apos;e&apos; || c == &apos;c&apos;) {
    ...
    // The kernel uses this argument to load a specific program
    // from the table above
    const char* argument = &quot;fork&quot;;
    if (c == &apos;a&apos;) {
      ...
    } else if (c == &apos;c&apos;) {
      argument = &quot;custom&quot;;
    }
    ...
  }
  ...
}</code></pre><figcaption>k-hardware.cc</figcaption></figure><p>Now the program should run when you press <code>c</code>:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2021/01/5lURW.gif" class="kg-image" alt="Building a kernel: CS60 pset3" loading="lazy" width="788" height="529" srcset="https://blog.tst.sh/content/images/size/w600/2021/01/5lURW.gif 600w, https://blog.tst.sh/content/images/2021/01/5lURW.gif 788w" sizes="(min-width: 720px) 720px"></figure><p>Success!</p>]]></content:encoded></item><item><title><![CDATA[Reverse engineering Flutter apps (Part 1)]]></title><description><![CDATA[<h3 id="chapter-1-down-the-rabbit-hole">Chapter 1: Down the rabbit hole</h3><p><br>To start this journey I&apos;ll cover some backstory on the Flutter stack and how it works.</p><p>What you probably already know: Flutter was built from the ground up with its own render pipeline and widget library, allowing it to be truly cross</p>]]></description><link>https://blog.tst.sh/reverse-engineering-flutter-apps-part-1/</link><guid isPermaLink="false">5fa9b817c9aac25a07117259</guid><category><![CDATA[ARM]]></category><category><![CDATA[Assembly]]></category><category><![CDATA[C/C++]]></category><category><![CDATA[Dart]]></category><category><![CDATA[Low Level]]></category><dc:creator><![CDATA[ping]]></dc:creator><pubDate>Sat, 28 Mar 2020 09:29:14 GMT</pubDate><media:content url="https://blog.tst.sh/content/images/2020/03/HighresScreenshot00003_2.png" medium="image"/><content:encoded><![CDATA[<h3 id="chapter-1-down-the-rabbit-hole">Chapter 1: Down the rabbit hole</h3><img src="https://blog.tst.sh/content/images/2020/03/HighresScreenshot00003_2.png" alt="Reverse engineering Flutter apps (Part 1)"><p><br>To start this journey I&apos;ll cover some backstory on the Flutter stack and how it works.</p><p>What you probably already know: Flutter was built from the ground up with its own render pipeline and widget library, allowing it to be truly cross 
platform and have a consistent design and feel no matter what device it&apos;s running on.</p><p>Unlike most platforms, all of the essential rendering components of the Flutter framework (including animation, layout, and painting) are fully exposed to you in <code><a href="https://github.com/flutter/flutter/tree/master/packages/flutter?ref=blog.tst.sh">package:flutter</a></code>.</p><p>You can see these components in the official architecture diagram from <a href="https://github.com/flutter/flutter/wiki/The-Engine-architecture?ref=blog.tst.sh">wiki/The-Engine-architecture</a>:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2020/02/framework.png" class="kg-image" alt="Reverse engineering Flutter apps (Part 1)" loading="lazy"></figure><p>From a reverse engineering perspective the most interesting part is the Dart layer, since that is where all of the app logic sits.</p><p>But what does the Dart layer look like?</p><p>Flutter compiles your Dart to native assembly code and uses formats that have not been publicly documented in depth, let alone fully decompiled and recompiled.</p><p>For comparison, other platforms like React Native just bundle minified JavaScript, which is trivial to inspect and modify; additionally, the bytecode for Java on Android is well documented and there are many free decompilers for it.</p><p>Despite the lack of obfuscation (by default) or encryption, Flutter apps are still extremely difficult to reverse engineer at the moment, since it requires in-depth knowledge of Dart internals to even scratch the surface.</p><p>This makes Flutter very good from an intellectual property perspective: your code is <em>almost</em> safe from prying eyes.</p><p>Next I&apos;ll show you the build process of Flutter applications and explain in detail how to reverse engineer the code that it produces.</p><hr><h3 id="snapshots">Snapshots</h3><p>The Dart SDK is highly versatile; you can embed Dart code in many different 
configurations on many different platforms.</p><p>The simplest way to run Dart is to use the <code>dart</code> executable, which just reads Dart source files directly like a scripting language. It includes the primary components we call the front-end (parses Dart code), runtime (provides the environment for code to run in), and the JIT compiler.</p><p>You can also use <code>dart</code> to create and execute <a href="https://github.com/dart-lang/sdk/wiki/Snapshots?ref=blog.tst.sh">snapshots</a>, a pre-compiled form of Dart which is commonly used to speed up frequently used command line tools (like <code>pub</code>).</p><pre><code>#lint shell
ping@debian:~/Desktop$ time dart hello.dart
Hello, World!

real    0m0.656s
user    0m0.920s
sys     0m0.084s

ping@debian:~/Desktop$ dart --snapshot=hello.snapshot hello.dart
ping@debian:~/Desktop$ time dart hello.snapshot
Hello, World!

real    0m0.105s
user    0m0.208s
sys     0m0.016s</code></pre><p>As you can see, the start-up time is significantly lower when you use snapshots.</p><p>The default snapshot format is <a href="https://github.com/dart-lang/sdk/wiki/Kernel-Documentation?ref=blog.tst.sh">kernel</a>, an intermediate representation of Dart code equivalent to the AST.</p><p>When running a Flutter app in debug mode, the flutter tool creates a kernel snapshot and runs it in your Android app with the debug runtime + JIT. This gives you the ability to debug your app and modify code live at runtime with hot reload.</p><p>Unfortunately for us, using your own JIT compiler is frowned upon in the mobile industry due to increased concerns of RCEs. iOS actually prevents you from executing dynamically generated code like this entirely.</p><p>There are two more types of snapshots though, <code>app-jit</code> and <code>app-aot</code>; these contain compiled machine code that can be initialized quicker than kernel snapshots but aren&apos;t cross-platform.</p><p>The final type of snapshot, <code>app-aot</code>, contains only machine code and no kernel. These snapshots are generated using the <code>gen_snapshot</code> tool found in <code>flutter/bin/cache/artifacts/engine/&lt;arch&gt;/&lt;target&gt;/</code>; more on that later.</p><p>They are a little more than just a compiled version of Dart code though; in fact they are a full &quot;snapshot&quot; of the VM&apos;s heap just before main is called. This is a unique feature of Dart and one of the reasons it initializes so quickly compared to other runtimes.</p><p>Flutter uses these AOT snapshots for release builds; you can see the files that contain them in the file tree for an Android APK built with <code>flutter build apk</code>:</p><pre><code>#lint shell
ping@debian:~/Desktop/app/lib$ tree .
.
&#x251C;&#x2500;&#x2500; arm64-v8a
&#x2502;   &#x251C;&#x2500;&#x2500; libapp.so
&#x2502;   &#x2514;&#x2500;&#x2500; libflutter.so
&#x2514;&#x2500;&#x2500; armeabi-v7a
    &#x251C;&#x2500;&#x2500; libapp.so
    &#x2514;&#x2500;&#x2500; libflutter.so</code></pre><p>Here you can see the two libapp.so files, which are A64 and A32 snapshots as ELF binaries.</p><p>The fact that <code>gen_snapshot</code> outputs an ELF / shared object here might be a bit misleading: it does not expose Dart methods as symbols that can be called externally. Instead, these files are containers for the &quot;clustered snapshot&quot; format but with compiled code in the separate executable section; here is how they are structured:</p><pre><code>#lint shell
ping@debian:~/Desktop/app/lib/arm64-v8a$ aarch64-linux-gnu-objdump -T libapp.so

libapp.so:     file format elf64-littleaarch64

DYNAMIC SYMBOL TABLE:
0000000000001000 g    DF .text  0000000000004ba0 _kDartVmSnapshotInstructions
0000000000006000 g    DF .text  00000000002d0de0 _kDartIsolateSnapshotInstructions
00000000002d7000 g    DO .rodata        0000000000007f10 _kDartVmSnapshotData
00000000002df000 g    DO .rodata        000000000021ad10 _kDartIsolateSnapshotData</code></pre><p>The reason why AOT snapshots are in shared object form instead of a regular snapshot file is that machine code generated by <code>gen_snapshot</code> needs to be loaded into executable memory when the app starts, and the nicest way to do that is through an ELF file.</p><p>With this shared object, everything in the <code>.text</code> section will be loaded into executable memory by the linker, allowing the Dart runtime to call into it at any time.</p><p>You may have noticed there are two snapshots: the VM snapshot and the Isolate snapshot.</p><p>DartVM has a second isolate that does background tasks called the VM isolate; it is required for <code>app-aot</code> snapshots since the runtime can&apos;t dynamically load it in as the <code>dart</code> executable would.</p><hr><h3 id="the-dart-sdk">The Dart SDK</h3><p></p><p>Thankfully Dart is completely open source, so we don&apos;t have to fly blind when reverse engineering the snapshot format.</p><p>Before creating a testbed for generating and disassembling snapshots you have to set up the Dart SDK; there is documentation on how to build it here: <a href="https://github.com/dart-lang/sdk/wiki/Building?ref=blog.tst.sh">https://github.com/dart-lang/sdk/wiki/Building</a>.</p><p>You want to generate the libapp.so files that are normally orchestrated by the flutter tool, but there doesn&apos;t seem to be any documentation on how to do that yourself.</p><p>The Flutter SDK ships binaries for <code>gen_snapshot</code>, which is not part of the standard <code>create_sdk</code> build target you usually use when building Dart.</p><p>It does exist as a separate target in the SDK though; you can build the <code>gen_snapshot</code> tool for ARM with this command:</p><pre><code class="language-sh">./tools/build.py -m product -a simarm gen_snapshot</code></pre><p>Normally you can only generate snapshots for the architecture you are running on; to work 
around that, they have created sim targets which simulate snapshot generation for the target platform. This has some limitations, such as not being able to make aarch64 or x86_64 snapshots on a 32-bit system.</p><p>Before making a shared object you have to compile a dill file using the front-end:</p><pre><code>~/flutter/bin/cache/dart-sdk/bin/dart ~/flutter/bin/cache/artifacts/engine/linux-x64/frontend_server.dart.snapshot --sdk-root ~/flutter/bin/cache/artifacts/engine/common/flutter_patched_sdk_product/ --strong --target=flutter --aot --tfa -Ddart.vm.product=true --packages .packages --output-dill app.dill package:foo/main.dart</code></pre><p>Dill files are actually the same format as kernel snapshots; their format is specified here: <a href="https://github.com/dart-lang/sdk/blob/master/pkg/kernel/binary.md?ref=blog.tst.sh">https://github.com/dart-lang/sdk/blob/master/pkg/kernel/binary.md</a></p><p>This is the format used as a common representation of Dart code between tools, including <code>gen_snapshot</code> and <code>analyzer</code>. </p><p>With the app.dill we can finally generate a libapp.so using this command:</p><pre><code class="language-sh">gen_snapshot --causal_async_stacks --deterministic --snapshot_kind=app-aot-elf --elf=libapp.so --strip app.dill</code></pre><p>Once you are able to manually generate the libapp.so, it is easy to modify the SDK to print out all of the debug information needed to reverse engineer the AOT snapshot format.</p><p>As a side note, Dart was actually designed by some of the people who created JavaScript&apos;s V8, arguably one of the most advanced language runtimes ever made. DartVM is incredibly well engineered and I don&apos;t think people give its creators enough credit.</p><hr><h3 id="anatomy-of-a-snapshot">Anatomy of a snapshot</h3><p></p><p>The AOT snapshot itself is quite complex: it is a custom binary format with no documentation. 
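</p><p>As a taste of what&apos;s involved, here is a rough sketch of peeking at the header fields of a snapshot blob. The layout used below (a magic value, 64-bit length and kind fields, a 32-character version hash, then a space-separated features string) is an assumption based on my reading of <code>vm/snapshot.h</code> from this era, so treat the constants and offsets as hypotheses to verify against your own SDK checkout:</p>

```python
SNAPSHOT_MAGIC = 0xDCDCF5F5  # assumed magic value; verify in vm/snapshot.h

def read_i64(data, offset):
    return int.from_bytes(data[offset:offset + 8], "little", signed=True)

def read_snapshot_header(data):
    # Assumed layout: u32 magic, i64 length, i64 kind,
    # a 32-char version hash, then a null-terminated features string.
    magic = int.from_bytes(data[0:4], "little")
    if magic != SNAPSHOT_MAGIC:
        raise ValueError("missing snapshot magic")
    length = read_i64(data, 4)             # size of the snapshot data
    kind = read_i64(data, 12)              # Snapshot::Kind enum value
    version = data[20:52].decode("ascii")  # hash of the SDK version string
    end = data.index(b"\x00", 52)
    features = data[52:end].decode("ascii").split(" ")
    return {"length": length, "kind": kind,
            "version": version, "features": features}
```

<p>Even if the exact offsets drift between SDK versions, comparing the version hash against the one your <code>gen_snapshot</code> build emits is a quick way to confirm you are parsing the right thing.</p><p>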
You may be forced to step through the serialization process manually in a debugger to implement a tool that can read the format.</p><p>The source files relevant to snapshot generation can be found here:</p><ul><li>Cluster serialization / deserialization<br><code><a href="https://github.com/dart-lang/sdk/blob/7340a569caac6431d8698dc3788579b57ffcf0c6/runtime/vm/clustered_snapshot.h?ref=blog.tst.sh">vm/clustered_snapshot.h</a></code><br><code><a href="https://github.com/dart-lang/sdk/blob/7340a569caac6431d8698dc3788579b57ffcf0c6/runtime/vm/clustered_snapshot.cc?ref=blog.tst.sh">vm/clustered_snapshot.cc</a></code></li><li>ROData serialization<br><code><a href="https://github.com/dart-lang/sdk/blob/7340a569caac6431d8698dc3788579b57ffcf0c6/runtime/vm/image_snapshot.h?ref=blog.tst.sh">vm/image_snapshot.h</a></code><br><code><a href="https://github.com/dart-lang/sdk/blob/7340a569caac6431d8698dc3788579b57ffcf0c6/runtime/vm/image_snapshot.cc?ref=blog.tst.sh">vm/image_snapshot.cc</a></code></li><li>ReadStream / WriteStream<br><code><a href="https://github.com/dart-lang/sdk/blob/7340a569caac6431d8698dc3788579b57ffcf0c6/runtime/vm/datastream.h?ref=blog.tst.sh">vm/datastream.h</a></code></li><li>Object definitions<br><code><a href="https://github.com/dart-lang/sdk/blob/7340a569caac6431d8698dc3788579b57ffcf0c6/runtime/vm/object.h?ref=blog.tst.sh">vm/object.h</a></code></li><li>ClassId enum<br><code><a href="https://github.com/dart-lang/sdk/blob/7340a569caac6431d8698dc3788579b57ffcf0c6/runtime/vm/class_id.h?ref=blog.tst.sh">vm/class_id.h</a></code></li></ul><p>It took me about two weeks to implement a command line utility that is capable of parsing a snapshot, giving us complete access to the heap of a compiled app.</p><p>As an overview, here is the layout of clustered snapshot data:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2020/02/snapshot_data-1.png" class="kg-image" alt="Reverse engineering Flutter apps (Part 1)" 
loading="lazy"></figure><p>Every <code>RawObject*</code> in the Isolate gets serialized by a corresponding <code>SerializationCluster</code> instance depending on its class id. These objects can contain anything: code, instances, types, primitives, closures, constants, etc. More on that later.</p><p>After deserializing the VM isolate snapshot, every object in its heap gets added to the Isolate snapshot object pool, allowing them to be referenced in the same context.</p><p>Clusters are serialized in three stages: Trace, Alloc, and Fill.</p><p>In the trace stage, root objects are added to a queue along with the objects they reference in a breadth-first search. At the same time, a <code>SerializationCluster</code> instance is created corresponding to each class type.</p><p>Root objects are a static set of objects used by the VM in the isolate&apos;s <code>ObjectStore</code>, which we will use later to locate libraries and classes. The VM snapshot includes <code>StubCode</code> base objects, which are shared between all isolates.</p><p>Stubs are basically hand-written sections of assembly that Dart code calls into, allowing it to communicate safely with the runtime.</p><p>After tracing, cluster info is written containing basic information about the clusters, most importantly the number of objects to allocate.</p><p>In the alloc stage, each cluster&apos;s <code>WriteAlloc</code> method is called, which writes any information needed to allocate raw objects. Most of the time all this method does is write the class id and number of objects that are part of this cluster.</p><p>The objects that are part of each cluster are also assigned an incrementing object id in the order they are allocated; this is used later during the fill stage when resolving object references.</p><p>You may have noticed the lack of any indexing or cluster size information; the entire snapshot has to be read in full to get any meaningful data out of it. 
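</p><p>Reading anything out of it therefore starts with the stream primitives. The core one is DartVM&apos;s variable-length integer encoding from <code>vm/datastream.h</code>: values are written 7 bits at a time, and the final byte carries the remaining signed value plus an end marker. The sketch below is my Python reading of that code (the end marker constant of 192 is taken from the datastream source of this era), with a write helper so the two can be checked against each other:</p>

```python
# DartVM datastream varints: 7 data bits per byte, least significant
# first; the final byte holds the leftover signed value plus an end
# marker of 192, so it is always greater than 127.
END_MARKER = 192

def write_int(out, value):
    # Emit 7-bit chunks until the remainder fits in a signed 7-bit value.
    while value > 63 or -value > 64:
        out.append(value % 128)  # low 7 bits of the twos-complement value
        value = value >> 7       # arithmetic shift keeps the sign
    out.append(value + END_MARKER)

def read_int(data, pos):
    # Returns (value, new_position).
    result, shift = 0, 0
    while True:
        b = data[pos]
        pos = pos + 1
        if b > 127:  # end byte
            return result + (b - END_MARKER) * 2 ** shift, pos
        result = result + b * 2 ** shift
        shift = shift + 7
```

<p>For example, 5 round-trips as the single byte 197 (5 + 192); longer and negative values spill into extra 7-bit chunks first. The real streams build their lengths, counts, and object refs out of this primitive.</p><p>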
So to actually do any reverse engineering, you must either implement deserialization routines for 31+ cluster types (which I have done) or extract information by loading it into a modified runtime (which is difficult to do cross-architecture).</p><p>Here is a simplified example of what the structure of the clusters would be for an array <code>[123, 42]</code>:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2020/02/cluster_alloc-3.png" class="kg-image" alt="Reverse engineering Flutter apps (Part 1)" loading="lazy"></figure><p>If an object references another object like an array element, the serializer writes the object id initially assigned during the alloc phase as shown above.</p><p>In the case of simple objects like Mints and Smis, they are constructed entirely in the alloc stage because they don&apos;t reference any other objects.</p><p>After that, the ~107 root refs are written, including object ids for core types, libraries, classes, caches, static exceptions, and several other miscellaneous objects.</p><p>Finally, ROData objects are written, which are mapped directly to <code>RawObject*</code>s in memory to avoid an extra deserialization step.</p><p>The most important type of ROData is <code>RawOneByteString</code>, which is used for library / class / function names. ROData is also referenced by offset, making it the only place in the snapshot data where decoding is optional.</p><p>Similar to ROData, <code>RawInstruction</code> objects are direct pointers to snapshot data but are stored in the executable instructions symbol rather than the main snapshot data.</p><p>Here is a dump of serialization clusters that are typically written when compiling an app:</p><pre><code>#lint cluster-tbl
idx | cid | ClassId enum        | Cluster name
----|-----|---------------------|----------------------------------------
  0 |   5 | Class               | ClassSerializationCluster
  1 |   6 | PatchClass          | PatchClassSerializationCluster
  2 |   7 | Function            | FunctionSerializationCluster
  3 |   8 | ClosureData         | ClosureDataSerializationCluster
  4 |   9 | SignatureData       | SignatureDataSerializationCluster
  5 |  12 | Field               | FieldSerializationCluster
  6 |  13 | Script              | ScriptSerializationCluster
  7 |  14 | Library             | LibrarySerializationCluster
  8 |  17 | Code                | CodeSerializationCluster
  9 |  20 | ObjectPool          | ObjectPoolSerializationCluster
 10 |  21 | PcDescriptors       | RODataSerializationCluster
 11 |  22 | CodeSourceMap       | RODataSerializationCluster
 12 |  23 | StackMap            | RODataSerializationCluster
 13 |  25 | ExceptionHandlers   | ExceptionHandlersSerializationCluster
 14 |  29 | UnlinkedCall        | UnlinkedCallSerializationCluster
 15 |  31 | MegamorphicCache    | MegamorphicCacheSerializationCluster
 16 |  32 | SubtypeTestCache    | SubtypeTestCacheSerializationCluster
 17 |  36 | UnhandledException  | UnhandledExceptionSerializationCluster
 18 |  40 | TypeArguments       | TypeArgumentsSerializationCluster
 19 |  42 | Type                | TypeSerializationCluster
 20 |  43 | TypeRef             | TypeRefSerializationCluster
 21 |  44 | TypeParameter       | TypeParameterSerializationCluster
 22 |  45 | Closure             | ClosureSerializationCluster
 23 |  49 | Mint                | MintSerializationCluster
 24 |  50 | Double              | DoubleSerializationCluster
 25 |  52 | GrowableObjectArray | GrowableObjectArraySerializationCluster
 26 |  65 | StackTrace          | StackTraceSerializationCluster
 27 |  72 | Array               | ArraySerializationCluster
 28 |  73 | ImmutableArray      | ArraySerializationCluster
 29 |  75 | OneByteString       | RODataSerializationCluster
 30 |  95 | TypedDataInt8Array  | TypedDataSerializationCluster
 31 | 143 | &lt;instance&gt;          | InstanceSerializationCluster
...
 54 | 463 | &lt;instance&gt;          | InstanceSerializationCluster</code></pre><p>There are a few more clusters that could potentially be in a snapshot, but these are the only ones I have seen in a Flutter app so far.</p><p>In DartVM there is a static set of predefined class IDs defined in the <code>ClassId</code> enum, 142 IDs as of Dart 2.4.0 to be exact. IDs outside of that range (or without an associated cluster) are written with separate <code>InstanceSerializationCluster</code>s.</p><p>Finally, bringing the parser together, I can view the structure of the snapshot from the ground up, starting with the libraries list in the root object table.</p><p>Using the object tree, here&apos;s how you can locate a top-level function, in this case <code>package:ftest/main.dart</code>&apos;s <code>main</code>:</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://blog.tst.sh/content/images/2020/01/dartdec-graph-1.png" class="kg-image" alt="Reverse engineering Flutter apps (Part 1)" loading="lazy"></figure><p>As you can see above, the names of libraries, classes, and functions are included in release snapshots.</p><p>Dart can&apos;t really remove them without also obfuscating stack traces; see: <a href="https://github.com/flutter/flutter/wiki/Obfuscating-Dart-Code?ref=blog.tst.sh">https://github.com/flutter/flutter/wiki/Obfuscating-Dart-Code</a></p><p>Obfuscation is probably not worth the effort, but this will most likely change in the future and become more streamlined, similar to ProGuard on Android or source maps on the web.</p><p>The actual machine code is stored in <code>Instructions</code> objects pointed to by <code>Code</code> objects from an offset to the start of the instruction data.</p><hr><h3 id="rawobject">RawObject</h3><p></p><p>Under the hood, all managed objects in DartVM are called <code>RawObject</code>s; in true DartVM fashion, these classes are all defined in a single 3,000-line file, <code>vm/raw_object.h</code>.</p><p>In generated code 
you can access and move around <code>RawObject*</code>s however you want as long as you yield according to an incremental write barrier mask; the GC appears to be able to track references through passive scanning alone.</p><p>Here is the class tree:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2020/02/classTree-1.png" class="kg-image" alt="Reverse engineering Flutter apps (Part 1)" loading="lazy"></figure><p><code>RawInstance</code>s are the traditional <code>Object</code>s you pass around Dart code and invoke methods on; all of them have an equivalent type in Dart land. Non-instance objects, however, are internal and only exist to leverage reference tracking and garbage collection; they do not have equivalent Dart types.</p><p>Each object starts with a uint32_t containing the following tags:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2020/02/objtags-1.png" class="kg-image" alt="Reverse engineering Flutter apps (Part 1)" loading="lazy"></figure><p>Class IDs here are the same as before with cluster serialization; they are defined in <code><a href="https://github.com/dart-lang/sdk/blob/7340a569caac6431d8698dc3788579b57ffcf0c6/runtime/vm/class_id.h?ref=blog.tst.sh">vm/class_id.h</a></code> but also include user-defined classes starting at <code>kNumPredefinedCids</code>.</p><p>Size and GC data tags are used for garbage collection; most of the time they can be ignored.</p><p>If the canonical bit is set, this object is unique and no other object is equal to it, like with <code>Symbol</code>s and <code>Type</code>s.</p><p>Objects are very light, the size of a <code>RawInstance</code> usually being only 4 bytes; surprisingly, they do not use virtual methods at all either.</p><p>All of this means allocating an object and filling in its fields can be done virtually for free, something we do quite a lot in Flutter.</p><hr><h3 id="hello-world-">Hello, World!</h3><p></p><p>Cool, we can locate 
functions by name, but how do we figure out what they actually do?</p><p>As expected, reverse engineering from here on is a bit more difficult because we are digging through the assembly code contained in <code>Instructions</code> objects.</p><p>Instead of using a modern compiler backend like clang, Dart actually uses its JIT compiler for code generation, but with a couple of AOT-specific optimizations.</p><p>If you have never worked with JIT code, it is a bit bloated in some places compared to what the equivalent C code would produce. Not that Dart is doing a bad job though; it&apos;s designed to be generated quickly at runtime, and the hand-written assembly for common instructions often beats clang/gcc in terms of performance.</p><p>Generated code being less micro-optimized actually works heavily to our advantage since it more closely resembles the higher-level IR used to generate it.</p><p>Most of the relevant code generation can be found in:</p><ul><li><code>vm/compiler/backend/il_&lt;arch&gt;.cc</code></li><li><code>vm/compiler/assembler/assembler_&lt;arch&gt;.cc</code></li><li><code>vm/compiler/asm_intrinsifier_&lt;arch&gt;.cc</code></li><li><code>vm/compiler/graph_intrinsifier_&lt;arch&gt;.cc</code></li></ul><p>Here is the register layout and calling conventions for Dart&apos;s A64 assembler:</p><pre><code>#lint reg-tbl
       r0 |     | Returns
r0  -  r7 |     | Arguments
r0  - r14 |     | General purpose
      r15 | sp  | Dart stack pointer
      r16 | ip0 | Scratch register
      r17 | ip1 | Scratch register
      r18 |     | Platform register
r19 - r25 |     | General purpose
r19 - r28 |     | Callee saved registers
      r26 | thr | Current thread
      r27 | pp  | Object pool
      r28 | brm | Barrier mask
      r29 | fp  | Frame pointer
      r30 | lr  | Link register
      r31 | zr  | Zero / CSP</code></pre><p>This ABI follows the standard AArch64 calling conventions <a href="https://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf?ref=blog.tst.sh">here</a> but with a few global registers:</p><ul><li>R26 / THR: Pointer to the running vm <code>Thread</code>, see <a href="https://github.com/dart-lang/sdk/blob/7340a569caac6431d8698dc3788579b57ffcf0c6/runtime/vm/thread.h?ref=blog.tst.sh">vm/thread.h</a></li><li>R27 / PP: Pointer to the <code>ObjectPool</code> of the current context, see <a href="https://github.com/dart-lang/sdk/blob/7340a569caac6431d8698dc3788579b57ffcf0c6/runtime/vm/object.h?ref=blog.tst.sh#L4275">vm/object.h</a></li><li>R28 / BRM: The barrier mask, used for incremental garbage collection</li></ul><p> Similarly, this is the register layout for A32:</p><pre><code>#lint reg-tbl
r0 -  r1 |     | Returns
r0 -  r9 |     | General purpose
r4 - r10 |     | Callee saved registers
      r5 | pp  | Object pool
     r10 | thr | Current thread
     r11 | fp  | Frame pointer
     r12 | ip  | Scratch register
     r13 | sp  | Stack pointer
     r14 | lr  | Link register
     r15 | pc  | Program counter</code></pre><p>While A64 is a more common target, I&apos;ll mostly be covering A32 since it is simpler to read and disassemble.</p><p>You can view the IR along with the disassembly by passing <code>--disassemble-optimized</code> to <code>gen_snapshot</code>, but note that this only works on the debug/release targets, not product.</p><p>As an example, when compiling hello world:</p><pre><code class="language-dart">void hello() {
  print(&quot;Hello, World!&quot;);
}</code></pre><p>Scrolling down a bit in the disassembly you will find:</p><pre><code>#lint dartvm-dasm
Code for optimized function &apos;package:dectest/hello_world.dart_::_hello&apos; {
        ;; B0
        ;; B1
        ;; Enter frame
0xf69ace60    e92d4800               stmdb sp!, {fp, lr}
0xf69ace64    e28db000               add fp, sp, #0
        ;; CheckStackOverflow:8(stack=0, loop=0)
0xf69ace68    e59ac024               ldr ip, [thr, #+36]
0xf69ace6c    e15d000c               cmp sp, ip
0xf69ace70    9bfffffe               blls +0 ; 0xf69ace70
        ;; PushArgument(v3)
0xf69ace74    e285ca01               add ip, pp, #4096
0xf69ace78    e59ccfa7               ldr ip, [ip, #+4007]
0xf69ace7c    e52dc004               str ip, [sp, #-4]!
        ;; StaticCall:12( print&lt;0&gt; v3)
0xf69ace80    ebfffffe               bl +0 ; 0xf69ace80
0xf69ace84    e28dd004               add sp, sp, #4
        ;; ParallelMove r0 &lt;- C
0xf69ace88    e59a0060               ldr r0, [thr, #+96]
        ;; Return:16(v0)
0xf69ace8c    e24bd000               sub sp, fp, #0
0xf69ace90    e8bd8800               ldmia sp!, {fp, pc}
0xf69ace94    e1200070               bkpt #0x0
}</code></pre><p>What is printed here is slightly different from a snapshot built in product but the important part is that we can see the IR instructions alongside assembly.</p><p>Breaking it down:</p><pre><code>#lint dartvm-dasm
        ;; Enter frame
0xf6a6ce60    e92d4800               stmdb sp!, {fp, lr}
0xf6a6ce64    e28db000               add fp, sp, #0</code></pre><p>This is a standard function prologue: the caller&apos;s frame pointer and the link register are pushed to the stack, after which the frame pointer is set to the bottom of the function&apos;s stack frame.</p><p>As with the standard ARM ABI, this uses a full-descending stack, meaning it grows backwards in memory.<br></p><pre><code>#lint dartvm-dasm
        ;; CheckStackOverflow:8(stack=0, loop=0)
0xf6a6ce68    e59ac024               ldr ip, [thr, #+36]
0xf6a6ce6c    e15d000c               cmp sp, ip
0xf6a6ce70    9bfffffe               blls +0 ; 0xf6a6ce70</code></pre><p>This is a simple routine which does what you probably guessed: it checks whether the stack overflowed.</p><p>Sadly, their disassembler annotates neither thread fields nor branch targets, so you have to do some digging.</p><p>A list of field offsets can be found in <code>vm/compiler/runtime_offsets_extracted.h</code>, which defines <code>Thread_stack_limit_offset = 36</code>, telling us that the field accessed is the thread&apos;s stack limit.</p><p>After the stack pointer is compared, it calls the <code>stackOverflowStubWithoutFpuRegsStub</code> stub if it has overflowed. The branch target in the disassembly appears to be un-patched, but we can still inspect the binary afterwards to confirm.<br></p><pre><code>#lint dartvm-dasm
        ;; PushArgument(v3)
0xf6a6ce74    e285ca01               add ip, pp, #4096
0xf6a6ce78    e59ccfa7               ldr ip, [ip, #+4007]
0xf6a6ce7c    e52dc004               str ip, [sp, #-4]!</code></pre><p>Here an object from the object pool is pushed onto the stack. Since the offset is too big to fit in an ldr offset encoding, it uses an extra add instruction.</p><p>This object is in fact our &quot;Hello, World!&quot; string as a <code>RawOneByteString*</code> stored in the <code>globalObjectPool</code> of our isolate at offset 8103.</p><p>You may have noticed that offsets are misaligned; this is because object pointers are tagged with <code>kHeapObjectTag</code> from <code>vm/pointer_tagging.h</code>: in this case, all of the pointers to <code>RawObject</code>s in compiled code are offset by 1.<br></p><pre><code>#lint dartvm-dasm
        ;; StaticCall:12( print&lt;0&gt; v3)
0xf6a6ce80    ebfffffe               bl +0 ; 0xf6a6ce80
0xf6a6ce84    e28dd004               add sp, sp, #4</code></pre><p>Here <code>print</code> is called, followed by the string argument being popped from the stack.</p><p>Like before, the branch hasn&apos;t been resolved; it is a relative branch to the entry point for <code>print</code> in dart:core.<br></p><pre><code>#lint dartvm-dasm
        ;; ParallelMove r0 &lt;- C
0xf69ace88    e59a0060               ldr r0, [thr, #+96]</code></pre><p>Null is loaded into the return register, 96 being the offset to the null object field in a <code>Thread</code>.<br></p><pre><code>#lint dartvm-dasm
        ;; Return:16(v0)
0xf69ace8c    e24bd000               sub sp, fp, #0
0xf69ace90    e8bd8800               ldmia sp!, {fp, pc}
0xf69ace94    e1200070               bkpt #0x0</code></pre><p>And finally, the function epilogue: the stack frame is restored along with any callee-saved registers. Since lr was pushed last, popping it into pc will cause the function to return.</p><p>From now on I&apos;ll be using snippets from my own disassembler, which has fewer problems than the built-in one.</p><hr><h2 id="continued-in-part-2-">Continued in <a href="https://blog.tst.sh/reverse-engineering-flutter-apps-part-2/">Part 2</a>!</h2>]]></content:encoded></item><item><title><![CDATA[Cute little space ship]]></title><description><![CDATA[<p>Here are some pictures of my first project with Substance Painter!</p><p>First I modeled a space ship in Autodesk Inventor; this is the modelling software I have a lot of experience in, so it&apos;s easy peasy.</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/Zf4y.png" class="kg-image" alt loading="lazy"></figure><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/I2Qt.png" class="kg-image" alt loading="lazy"></figure><p>Why does a space ship need wings? 
who knows.</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/OPcu_2.png" class="kg-image" alt loading="lazy"></figure><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/Lm1I.png" class="kg-image" alt loading="lazy"></figure><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/Jtp8.png" class="kg-image" alt loading="lazy"></figure><p>And to finish</p>]]></description><link>https://blog.tst.sh/cute-little-rocket-ship/</link><guid isPermaLink="false">5fa9b817c9aac25a07117258</guid><category><![CDATA[CAD]]></category><category><![CDATA[Game Design]]></category><dc:creator><![CDATA[ping]]></dc:creator><pubDate>Sun, 14 Jul 2019 19:04:56 GMT</pubDate><media:content url="https://blog.tst.sh/content/images/2019/07/Untitled_3.PNG" medium="image"/><content:encoded><![CDATA[<img src="https://blog.tst.sh/content/images/2019/07/Untitled_3.PNG" alt="Cute little space ship"><p>Here are some pictures of my first project with Substance Painter!</p><p>First I modeled a space ship in Autodesk Inventor, this is the modelling software I have a lot of experience in so it&apos;s easy peasy.</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/Zf4y.png" class="kg-image" alt="Cute little space ship" loading="lazy"></figure><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/I2Qt.png" class="kg-image" alt="Cute little space ship" loading="lazy"></figure><p>Why does a space ship need wings? 
Who knows.</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/OPcu_2.png" class="kg-image" alt="Cute little space ship" loading="lazy"></figure><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/Lm1I.png" class="kg-image" alt="Cute little space ship" loading="lazy"></figure><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/Jtp8.png" class="kg-image" alt="Cute little space ship" loading="lazy"></figure><p>And to finish it off let&apos;s give it guns.</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/VmEJ.png" class="kg-image" alt="Cute little space ship" loading="lazy"></figure><p>Now for the texturing part, I started with an outline:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/Hc5u.png" class="kg-image" alt="Cute little space ship" loading="lazy"></figure><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/29li.png" class="kg-image" alt="Cute little space ship" loading="lazy"></figure><p>Textured the window and added some decals:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/eqcA.png" class="kg-image" alt="Cute little space ship" loading="lazy"></figure><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/NUmw.png" class="kg-image" alt="Cute little space ship" loading="lazy"></figure><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/a1SX.png" class="kg-image" alt="Cute little space ship" loading="lazy"></figure><p>And that&apos;s version 1. Imported it into Unreal Engine in all its PBR glory:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/ljRP_2.jpg" class="kg-image" alt="Cute little space ship" loading="lazy"></figure><p>I didn&apos;t 
really like the color scheme so I re-did everything now that I got the hang of texturing in Substance.</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/1gCn.png" class="kg-image" alt="Cute little space ship" loading="lazy"></figure><p>Started again; this time I put much less emphasis on the edges and made the black metal pop out more, along with breaking it up into panels.</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/sXCy.png" class="kg-image" alt="Cute little space ship" loading="lazy"></figure><p>Imported into Unreal Engine again:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/thw9.png" class="kg-image" alt="Cute little space ship" loading="lazy"></figure><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/wC58.png" class="kg-image" alt="Cute little space ship" loading="lazy"></figure><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/gjdQ.png" class="kg-image" alt="Cute little space ship" loading="lazy"></figure><p>Went back to add some more detail to the texture, screwing around with metal wear.</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/eTI4.png" class="kg-image" alt="Cute little space ship" loading="lazy"></figure><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/hOUy.png" class="kg-image" alt="Cute little space ship" loading="lazy"></figure><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/LZ0i.png" class="kg-image" alt="Cute little space ship" loading="lazy"></figure><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/07/2DGS.png" class="kg-image" alt="Cute little space ship" loading="lazy"></figure><p>That&apos;s all folks, thanks for tuning 
in.</p>]]></content:encoded></item><item><title><![CDATA[Raytracing in the browser]]></title><description><![CDATA[<p>As a small side project I built a raytracer in Dart:</p><figure class="kg-card kg-embed-card"><iframe width="480" height="270" src="https://www.youtube.com/embed/CmU41mE5OB4?feature=oembed" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></figure><p>It took an enormous amount of time to render this video, around 12 hours for 3600 frames at 3840x2160.</p><p>Source code: <a href="https://gist.github.com/PixelToast/84377d383c20d056664e80849c5b79e9?ref=blog.tst.sh">https://gist.github.com/PixelToast/84377d383c20d056664e80849c5b79e9</a></p>]]></description><link>https://blog.tst.sh/web-raytracer/</link><guid isPermaLink="false">5fa9b817c9aac25a07117257</guid><category><![CDATA[Web Dev]]></category><category><![CDATA[Dart]]></category><dc:creator><![CDATA[ping]]></dc:creator><pubDate>Sun, 16 Jun 2019 07:23:04 GMT</pubDate><media:content url="https://blog.tst.sh/content/images/2019/06/download--3-.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.tst.sh/content/images/2019/06/download--3-.png" alt="Raytracing in the browser"><p>As a small side project I built a raytracer in Dart:</p><figure class="kg-card kg-embed-card"><iframe width="480" height="270" src="https://www.youtube.com/embed/CmU41mE5OB4?feature=oembed" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></figure><p>It took an enormous amount of time to render this video, around 12 hours for 3600 frames at 3840x2160.</p><p>Source code: <a href="https://gist.github.com/PixelToast/84377d383c20d056664e80849c5b79e9?ref=blog.tst.sh">https://gist.github.com/PixelToast/84377d383c20d056664e80849c5b79e9</a></p>]]></content:encoded></item><item><title><![CDATA[Tangent - A discord bot with full access to a Linux 
VM]]></title><description><![CDATA[<p>Tangent is a bot I&apos;ve been working on and testing for about a week now and I think it&apos;s ready for public use.</p><p>There are a lot of reasons why allowing anyone to execute arbitrary code is a terrible idea, so I had to do a lot of</p>]]></description><link>https://blog.tst.sh/tangent/</link><guid isPermaLink="false">5fa9b817c9aac25a07117256</guid><category><![CDATA[Back End]]></category><category><![CDATA[Dart]]></category><category><![CDATA[Low Level]]></category><category><![CDATA[Servers and Networks]]></category><dc:creator><![CDATA[ping]]></dc:creator><pubDate>Wed, 05 Jun 2019 10:50:24 GMT</pubDate><media:content url="https://blog.tst.sh/content/images/2020/01/tan-flat.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.tst.sh/content/images/2020/01/tan-flat.png" alt="Tangent - A discord bot with full access to a Linux VM"><p>Tangent is a bot I&apos;ve been working on and testing for about a week now and I think it&apos;s ready for public use.</p><p>There are a lot of reasons why allowing anyone to execute arbitrary code is a terrible idea, so I had to do a lot of planning beforehand.</p><p>Think you can break it? 
Try it out here: <a href="https://discord.gg/F2F2EdE?ref=blog.tst.sh">https://discord.gg/F2F2EdE</a><br>Github: <a href="https://github.com/PixelToast/tangent?ref=blog.tst.sh">https://github.com/PixelToast/tangent</a></p><p>My design constraints were the following:</p><ol><li>Arbitrary native code execution, not just a sandboxed interpreted language like the <a href="https://ocdoc.cil.li/?ref=blog.tst.sh">OpenComputers mod</a> for Minecraft.</li><li>Limited internet access, people should not be able to use my internet connection for nefarious purposes.</li><li>Limited CPU, memory, and disk usage.</li><li>Automatic recovery, nothing a user can do should put the bot into an unrecoverable state whether it be a fork bomb, filling the filesystem, killing all processes.</li><li>The discord bot should limit buffers on anything the VM sends, you should not be able to spam files / data / process events to the bot and cause it to run out of memory.</li><li>Discord should not be trusted, if discord server or my account were compromised it should not allow attackers to gain access to my server.</li></ol><p>The most common method for sandboxing is restricting methods at the language level for example <a href="https://github.com/xpcall/-v4/blob/master/sbox.lua?ref=blog.tst.sh">my old Lua sandbox</a>, this is incredibly language-specific and some languages like C simply cannot be sandboxed like this.</p><p>Another common method is <a href="https://en.wikipedia.org/wiki/Chroot?ref=blog.tst.sh">chroot</a>, a Linux command which changes the root directory of the running program but can be <a href="http://pentestmonkey.net/blog/chroot-breakout-perl?ref=blog.tst.sh">easily broken out of</a> if you aren&apos;t careful.</p><p>Docker does everything I need in terms of sandboxing with little effort including limiting resources, custom network routing, etc but it is based on chroot meaning sandboxed applications share a kernel with the host system.</p><p>Sharing a kernel with the host 
system makes it more vulnerable to <a href="https://assured-cloud-computing.illinois.edu/files/2016/01/04132016-Ahmad.pdf?ref=blog.tst.sh">side channel attacks</a> and <a href="https://seclists.org/oss-sec/2019/q1/119?ref=blog.tst.sh">nasty privilege escalation</a> than a hypervisor based virtual machine which is not something I would feel safe leaving running publicly for a long period time especially with the amount of hardware exploits being discovered on x86 CPUs lately.</p><p>Docker also doesn&apos;t have support for qcow2 snapshots meaning resetting a containers state is much slower than a qemu / libvirt based machine, something that is very important for handling people continuously bricking the vm as a denial of service attack.</p><p>It seems pretty clear now that a full Linux VM in a hypervisor would be much better than a container for what I&apos;m trying to do.</p><h3 id="my-solution">My solution</h3><p>Right now, Tangent uses <a href="https://libvirt.org/?ref=blog.tst.sh">libvirt</a> to manage a qemu powered Debian 9 virtual machine and communicates to it through a custom json rpc like protocol.</p><p><code>tangent-server</code> contains the server that runs as an unprivileged user on the VM, allowing the bot on the host machine to start processes, read stdin/stdout, and access files safely.</p><p>In its current configuration the VM is on a closed virtual network with only the host machine being routable (<code>192.168.69.1</code>), iptables are set up so requests from the VM to the host are blocked to prevent it from attempting to connect to SSH or other services it should not have access to.</p><p>The only way for information to go in and out of the VM is through connections initiated by the host.</p><p>System resources are also heavily limited with 1 thread, 256MB of memory, and a 16GB virtual disk.</p><p>If you manage to put the VM in an unusable state like by killing the server process continuously, Tangent will automatically use virsh to reboot 
the VM which only takes around 4 seconds.</p><p>As a last resort if someone obtains root and bricks the system a qcow2 snapshot can restore the system state to brand new using the <code>qclean</code> command, this is actually much faster than rebooting the VM.</p><p>The bot itself is a Dart application and is designed to be as fault tolerant as possible, all buffers that the VM send to are capped, any malformed packets will instantly terminate the connection, and all of the async wrappers for files and processes are destroyed properly when closed.</p><p>Dart is especially good for this job because of its powerful and safe <a href="https://api.dartlang.org/stable/2.3.1/dart-async/dart-async-library.html?ref=blog.tst.sh">async library</a>, it eliminates a lot of corner cases and concurrency problems you usually get when designing asynchronous code.</p><h3 id="where-the-fun-starts">Where the fun starts</h3><p>So far I&apos;ve installed the SDKs of over 50 languages to the VM, including:<br>sh, bash, ARM assembly, x86 assembly, C, C++, Lua 5.3/5.2/5.1, LuaJIT, Python 2/3, JavaScript, Perl, Java, Lisp, Brainfuck, C#, F#, Haskell, PHP, COBOL, Golang, Ruby, APL, Prolog, OCaml, SML, Crystal, Ada, D, Groovy, Dart, Erlang, FORTH, Pascal, Fortran, Hack, Julia, Kotlin, Scala, Swift, TypeScript, Verilog, WebAssembly, Scheme, AWK, Clojure, TI-BASIC, Batch, Racket, Rust. Over 12GB of packages!<br>All with bot commands that compile and run them for a single file. 
</p><p>Here are some examples:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/06/Selection_617.png" class="kg-image" alt="Tangent - A discord bot with full access to a Linux VM" loading="lazy"></figure><h3 id="but-wait-there-s-more">But wait there&apos;s more</h3><p>You can upload and download files to it!</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/06/birb.png" class="kg-image" alt="Tangent - A discord bot with full access to a Linux VM" loading="lazy"></figure><h3 id="in-conclusion">In conclusion</h3><p>It was a fun project to work on; I hope somebody finds a good use for it.<br>Back to working on my new game.</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/06/432649674848206858.gif" class="kg-image" alt="Tangent - A discord bot with full access to a Linux VM" loading="lazy"></figure><p>Discord: <a href="https://discord.gg/F2F2EdE?ref=blog.tst.sh">https://discord.gg/F2F2EdE</a></p>]]></content:encoded></item><item><title><![CDATA[KOHCTPYKTOP 2: Electric Boogaloo]]></title><description><![CDATA[<p>I&apos;ve made a zachtronics inspired digital circuit simulator in AngularDart.</p><p>It was inspired by one of my favorite zachtronics games, KOHCTPYKTOP. 
It is a game where you design digital circuits on a grid, closely resembling the <a href="https://en.wikipedia.org/wiki/CMOS?ref=blog.tst.sh">CMOS</a> process, which is how ICs are made in real life.</p>]]></description><link>https://blog.tst.sh/kohctpyktop-2-electric-bogaloo/</link><guid isPermaLink="false">5fa9b817c9aac25a07117253</guid><category><![CDATA[Dart]]></category><category><![CDATA[Game Design]]></category><category><![CDATA[Hardware Design]]></category><category><![CDATA[Low Level]]></category><category><![CDATA[Web Dev]]></category><dc:creator><![CDATA[ping]]></dc:creator><pubDate>Fri, 10 May 2019 07:08:03 GMT</pubDate><media:content url="https://blog.tst.sh/content/images/2019/05/HR5lw-1.gif" medium="image"/><content:encoded><![CDATA[<img src="https://blog.tst.sh/content/images/2019/05/HR5lw-1.gif" alt="KOHCTPYKTOP 2: Electric Boogaloo"><p>I&apos;ve made a zachtronics inspired digital circuit simulator in AngularDart.</p><p>It was inspired by one of my favorite zachtronics games, KOHCTPYKTOP. 
It is a game where you design digital circuits on a grid, closely resembling the <a href="https://en.wikipedia.org/wiki/CMOS?ref=blog.tst.sh">CMOS</a> process which is how ICs are made in real life.</p><p>If you haven&apos;t played it already I highly recommend checking it out: <a href="http://www.zachtronics.com/kohctpyktop-engineer-of-the-people/?ref=blog.tst.sh"><a href="http://www.zachtronics.com/kohctpyktop-engineer-of-the-people/?ref=blog.tst.sh">http://www.zachtronics.com/kohctpyktop-engineer-of-the-people/</a></a></p><p>My recreation features a configurable PCB size, panning, live simulation, and more!</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/05/sNcOm.gif" class="kg-image" alt="KOHCTPYKTOP 2: Electric Boogaloo" loading="lazy"></figure><p>At the bottom there is a toolbox containing the 4 primary elements:</p><ol><li>Metal</li><li>N-Type silicon</li><li>P-Type silicon</li><li>Via</li></ol><p>These elements can be placed by click-dragging to form traces:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/05/Z6Gui.gif" class="kg-image" alt="KOHCTPYKTOP 2: Electric Boogaloo" loading="lazy"></figure><p>Metal and silicon conduct electricity provided by inputs to your IC, and the two layers can be connected with a via:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/05/GZLkG.gif" class="kg-image" alt="KOHCTPYKTOP 2: Electric Boogaloo" loading="lazy"></figure><p>Metal is placed on a layer above Silicon where it conducts electricity separately, the two types of silicon can&apos;t be overlapped but can however be used to create gates.</p><p>Gates can be created by placing different typed silicon over one another, here is a simple circuit acting as an AND gate:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/05/w2Emu.gif" class="kg-image" alt="KOHCTPYKTOP 2: Electric Boogaloo" 
loading="lazy"></figure><p>You may have noticed a little blip on the output when A turned off at the same time B turned on, this was because it takes time for gates to activate and deactivate:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/05/a2vWo.gif" class="kg-image" alt="KOHCTPYKTOP 2: Electric Boogaloo" loading="lazy"></figure><p>Here it takes time for each gate to turn but the output turns off as soon as the input does because gates are just bridges, the delay is only in it&apos;s activation and not the electricity that flows through.</p><p>To get a proper delay you would do this:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/05/h8fxy.gif" class="kg-image" alt="KOHCTPYKTOP 2: Electric Boogaloo" loading="lazy"></figure><p>Cool! What about an OR gate:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/05/tQmXE.gif" class="kg-image" alt="KOHCTPYKTOP 2: Electric Boogaloo" loading="lazy"></figure><p>Since there are no diodes the simplest way to make an OR gate is to AND both inputs with VCC and then combine the result.</p><p>If PNP transistors are used instead of NPN what you get is a simple NAND gate:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/05/tLgY7.gif" class="kg-image" alt="KOHCTPYKTOP 2: Electric Boogaloo" loading="lazy"></figure><p>XOR gates are a bit more complicated, here is a version that uses 1 PNP and 3 NPN transistors:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/05/6ZpKF-1.gif" class="kg-image" alt="KOHCTPYKTOP 2: Electric Boogaloo" loading="lazy"></figure><p>Now for something more interesting, a 2 bit full adder:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/05/u5br5.gif" class="kg-image" alt="KOHCTPYKTOP 2: Electric Boogaloo" loading="lazy"></figure><p>There is a test build up at <a 
href="https://c.tst.sh/?ref=blog.tst.sh">https://c.tst.sh/</a> let me know what you think &#x2764;&#xFE0F;</p>]]></content:encoded></item><item><title><![CDATA[Emulating the VEX Cortex]]></title><description><![CDATA[<p>If you have ever competed in VEX you know how painful programming the cortex is.</p><p>In order to develop your robot code you have to plug into the robot or controller, upload, wait several seconds, reset the field, test your code, and repeat. Sometimes you may not even have physical</p>]]></description><link>https://blog.tst.sh/emulating-the-vex-cortex/</link><guid isPermaLink="false">5fa9b817c9aac25a07117252</guid><category><![CDATA[Low Level]]></category><category><![CDATA[C/C++]]></category><category><![CDATA[Assembly]]></category><category><![CDATA[ARM]]></category><category><![CDATA[Robotics]]></category><category><![CDATA[Hardware Design]]></category><dc:creator><![CDATA[ping]]></dc:creator><pubDate>Thu, 10 Jan 2019 17:55:58 GMT</pubDate><media:content url="https://blog.tst.sh/content/images/2019/01/Selection_590.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.tst.sh/content/images/2019/01/Selection_590.png" alt="Emulating the VEX Cortex"><p>If you have ever competed in VEX you know how painful programming the cortex is.</p><p>In order to develop your robot code you have to plug into the robot or controller, upload, wait several seconds, reset the field, test your code, and repeat. Sometimes you may not even have physical access to a robot, which makes the entire process even slower.</p><p>So far there has been only one way to simulate your robot, which is RVW (Robot Virtual Worlds). 
Unfortunately RVW only works for RobotC code and you only have a handful of pre-built virtual robots to choose from, making it useless for most teams.</p><p>What compounds the issue even more is that the Cortex does not have any officially supported debugging functionality; there is no JTAG port to attach a debugger when things go wrong.</p><p>I aimed to solve this by implementing a way of simulating your robot hardware seamlessly, but there are a few technical challenges to overcome before that is possible. To explain these challenges, first let&apos;s look at a block diagram of the cortex:</p><figure class="kg-card kg-image-card"><img src="https://blog.tst.sh/content/images/2019/05/cortex--2-.png" class="kg-image" alt="Emulating the VEX Cortex" loading="lazy"></figure><p>Ideally we would do full emulation, including both the supervisor and user SoCs, so that we can execute the exact binaries that run on a real robot. I quickly realized how difficult that would be, as neither the supervisor&apos;s pinout nor the protocol it uses to communicate with VEXNet keys is documented.</p><!--kg-card-begin: markdown--><p>The user SoC is an STM32F103VD with 384K flash, 64K RAM, and a Cortex-M3 armv7-m processor. QEMU itself supports Cortex-M3 but only a very limited number of SoCs and development boards, not including anything stm32.</p>
<p>Support for stm32 specifically is extremely important because timers, interrupt control, DMA, and most other MMIO functionality are vendor-specific.</p>
<p>It turns out there is a qemu fork that adds stm32 support for the stm32f103xx SoCs <a href="https://github.com/beckus/qemu_stm32/?ref=blog.tst.sh">here</a>.<br>
This fork includes enough to get PROS running out of the box and even print messages to UART without any modifications to the kernel.</p>
<p>This is great, but unfortunately I2C and SPI aren&apos;t implemented and the ADC is incomplete (it lacks continuous conversion mode). I need SPI for the supervisor or user code won&apos;t even run, I2C is needed to simulate IMEs, and the ADC is needed to simulate analog sensors.</p>
<p>To work around these issues, instead of rewriting everything, I opted to modify the PROS kernel to do all I/O through custom hardware.</p>
<p>First step was to make a custom qemu device for the vex cortex implemented <a href="https://github.com/PixelToast/qemu_stm32/blob/vex_cortex/hw/arm/vex_cortex.c?ref=blog.tst.sh">here</a>.</p>
<p>It&apos;s pretty straightforward, just a stripped-down version of the stm32f103c8 device.</p>
<p>Next I needed to write qemu hardware <a href="https://github.com/PixelToast/qemu_stm32/blob/vex_cortex/hw/arm/vex_mgr.c?ref=blog.tst.sh">here</a> to communicate with the kernel; to do this I needed the following:</p>
<ol>
<li>A timer to update the supervisor every couple of ms</li>
<li>The irq list for the cortex so I can poke them from qemu</li>
<li>A memory region that I can control read and writes from on the qemu side</li>
</ol>
<p>Sadly the qemu internals aren&apos;t too well documented, so I had to take reference from some other hardware implementations (dma, adc, etc.)</p>
<p>Timers are simple:</p>
<pre><code class="language-c">s-&gt;circular_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, (QEMUTimerCB *)vex_mgr_stream_circular_timer, s)
</code></pre>
<p>This creates a nanosecond timer with <code>vex_mgr_stream_circular_timer</code> as a callback when it gets triggered.</p>
<pre><code class="language-c">uint64_t curr_time = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
timer_mod(s-&gt;circular_timer, curr_time + 10000000);
</code></pre>
<p>This schedules the timer 10ms in the future.</p>
<p>Next we need the irq list, which is returned by <code>stm32_init</code> and can just be passed to <code>vex_mgr_create</code> in <code>vex_cortex_init</code>.</p>
<p>Once you have the list (named <code>pic</code>) you can just trigger them with <code>qemu_irq_pulse</code>. The stm32 header conveniently provides definitions for irqs so the following example triggers the SPI1 interrupt as the SPI controller would do when there is data in the SPI buffer sent from the supervisor:</p>
<pre><code class="language-c">qemu_irq_pulse(s-&gt;pic[STM32_SPI1_IRQ]);
</code></pre>
<p>Finally, the memory region: I picked some reserved space in the stm32 memory map at 0x40021400, 0xC00 (3072) bytes wide. The following code maps it to functions that handle reading and writing:</p>
<pre><code class="language-c">static const MemoryRegionOps vex_mgr_ops = {
	.read = vex_mgr_read,
	.write = vex_mgr_write,
	.endianness = DEVICE_NATIVE_ENDIAN,
	.impl = {
		.min_access_size = 1,
		.max_access_size = 4,
	}
};

sysbus_mmio_map(SYS_BUS_DEVICE(dev), 0, 0x40021400);
// s-&gt;iomem is a MemoryRegion struct
memory_region_init_io(&amp;s-&gt;iomem, OBJECT(s), &amp;vex_mgr_ops, s, &quot;vex_mgr&quot;, 0xC00);
sysbus_init_mmio(dev, &amp;s-&gt;iomem);
</code></pre>
<p>Now if the cortex ever tries to read or write that memory region, qemu will immediately call the handlers, allowing two-way communication.</p>
<p>As with most MMIO on the cortex, in the kernel you define the device as a convenient struct with volatile fields:</p>
<pre><code class="language-c">typedef struct {
  // Control register
  __IO uint32_t CR;
  // Args
  __IO uint32_t A1;
  __IO uint32_t A2;
  __IO uint32_t A3;
  // Results
  __IO uint32_t R1;
  __IO uint32_t R2;
  __IO uint32_t R3;
  __IO uint32_t R4;
} EMULATOR_TypeDef;

#define EMULATOR ((EMULATOR_TypeDef*)0x40021400)
</code></pre>
<p>With the struct and <code>EMULATOR</code> macro we can access the hardware registers by name instead of directly reading and writing from memory.</p>
<p>The <code>emuCall</code> helper function I wrote loads the arguments and signals the control register to call a specific function on the qemu side, given a 16-bit module ID and a 16-bit function ID. It is done this way instead of passing a function name as a string (e.g. &quot;Motor_set&quot;) for simplicity, as there are only a few dozen functions.</p>
<pre><code class="language-c">inline int emuCall(uint16_t mod, uint16_t func, uint32_t a1, uint32_t a2, uint32_t a3) {
  _enterCritical();
  EMULATOR-&gt;A1 = a1;
  EMULATOR-&gt;A2 = a2;
  EMULATOR-&gt;A3 = a3;
  EMULATOR-&gt;CR = func | (mod &lt;&lt; 16);
  int out = EMULATOR-&gt;R1;
  _exitCritical();
  return out;
}
</code></pre>
<p>The enter and exit critical calls ensure interrupts are disabled while the registers are being accessed, as an interrupt in between any of these reads and writes would most likely clobber the argument and return registers.</p>
<p>After that I refactored a good chunk of the kernel to do all I/O through emuCall, you can see the huge commit <a href="https://github.com/PixelToast/pros/commit/115ac7159bc0106c46653501654e63fa44d9cea0?ref=blog.tst.sh">here</a>.</p>
<p>Here are all the functions that needed to be implemented to remove dependency on standard cortex gpio:</p>
<pre><code class="language-c">int EmuSerial_init(int port, int baud, int flags);
int EmuSerial_shutdown(int port);
int EmuSerial_putc(int port, int c);

int EmuFS_programOn();
int EmuFS_programOff();
int EmuFS_erasePage(int page);

int EmuI2C_startRead(int addr, void* data, int count);
int EmuI2C_startWrite(int addr, void* data, int count);
int EmuI2C_setAddr(int addr);

int EmuGPIO_ADCInit(uint32_t data);
int EmuGPIO_setDir(int port, int mode);
int EmuGPIO_getInput(int port);
int EmuGPIO_getOutput(int port);
int EmuGPIO_setOutput(int port, int value);
int EmuGPIO_setInterrupt(int port, int edges);

int EmuMotor_get(int channel);
int EmuMotor_set(int channel, int value);
int EmuMotor_stop();

int EmuComp_init();
int EmuComp_enableStandalone();
int EmuComp_setName(char* name);
int EmuComp_getStatus(void* buff);

int EmuSystem_exit();
int EmuSystem_break();
</code></pre>
<p><a href="https://github.com/PixelToast/pros/blob/emulator/include/emulator.h?ref=blog.tst.sh">Header where these are defined</a></p>
<p>On the qemu side most of this will remain stubbed until I have an actual robot simulation to send and receive all the data from.</p>
<p>For now only UART and the supervisor are fully implemented, but that&apos;s enough to get the PROS kernel and user code fully running in an emulator:</p>
<p><img src="https://blog.tst.sh/content/images/2019/01/Selection_589.png" alt="Emulating the VEX Cortex" loading="lazy"></p>
<p>And that&apos;s what I have so far! In the future I plan on hooking this up to an Unreal Engine powered robot simulation; tune in for more updates.</p>
<p>Saucy: <a href="https://github.com/PixelToast/qemu_stm32/tree/vex_cortex/?ref=blog.tst.sh">QEMU Fork</a>, <a href="https://github.com/PixelToast/pros/tree/emulator?ref=blog.tst.sh">PROS Fork</a></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[MC6000 in hardware - Part 2: XBus]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>So, we have the instruction set; the next step is to design the schematics for the MC6000 in an HDL. HDLs are programming languages which compile to transistor-level circuit schematics or program FPGAs, which can simulate logic gates many times faster than a CPU could.</p>
<p>In SHENZHEN I/</p>]]></description><link>https://blog.tst.sh/shenzhen-io-in-hardware-2/</link><guid isPermaLink="false">5fa9b817c9aac25a07117250</guid><category><![CDATA[Low Level]]></category><category><![CDATA[Hardware Design]]></category><dc:creator><![CDATA[ping]]></dc:creator><pubDate>Sat, 21 Jul 2018 15:47:02 GMT</pubDate><media:content url="https://blog.tst.sh/content/images/2018/07/2018-05-22_17_56_19-Window_2.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://blog.tst.sh/content/images/2018/07/2018-05-22_17_56_19-Window_2.png" alt="MC6000 in hardware - Part 2: XBus"><p>So, we have the instruction set; the next step is to design the schematics for the MC6000 in an HDL. HDLs are programming languages which compile to transistor-level circuit schematics or program FPGAs, which can simulate logic gates many times faster than a CPU could.</p>
<p>In SHENZHEN I/O components can send each other data over a protocol called XBus:</p>
<p><img src="https://blog.tst.sh/content/images/2018/06/2018-06-01-04_05_17-Window.png" alt="MC6000 in hardware - Part 2: XBus" loading="lazy"></p>
<p>In this screenshot the microcontroller on the left is sending 69 on the x1 pin and the microcontroller on the right is saving it to acc from the x0 pin.</p>
<p>The difference between this and analog signals is that XBus guarantees the data was received by another component, and you can use the <code>SLX</code> instruction to sleep until a specific bus has data.</p>
<p>Traditionally buses have a single master and multiple slaves, where the master generates the clock signal and slaves can only talk to the master and vice versa. XBus, however, appears to be a multi-master bus like <a href="https://en.wikipedia.org/wiki/CAN_bus?ref=blog.tst.sh">CAN</a>, which allows any component on the bus to talk to any other.<br>
You can probably spot an issue with this: if two components want to talk at the same time they will talk over each other and the data will be corrupted.</p>
<p>To solve this issue we can use uid-based arbitration: before sending any data you transmit the uid, and the highest wins. You transmit your uid MSB first while comparing it against what you are receiving at the same time; if what you are receiving differs from what you are sending, you have lost arbitration.</p>
<p>Here is an example:</p>
<p><img src="https://blog.tst.sh/content/images/2018/06/busexample-2.png" alt="MC6000 in hardware - Part 2: XBus" loading="lazy"></p>
<p>This shows what the MCs are transmitting vs what is seen on the bus; the bus is pull-down, and the MCs float when low and pull up when high. The red means the MC failed arbitration and floats its output for the remaining bits; you can see this for MC1 and MC2 after they write a 0 but read a 1, meaning someone was trying to transmit with a higher id.</p>
<p>I used this method to write a multi-master XBus in Verilog that can handle an<br>
arbitrary number of devices attempting to send data in the same clock cycle. Here it is in action:</p>
<p><img src="https://blog.tst.sh/content/images/2018/07/2018-05-22_17_56_19-Window--1-.png" alt="MC6000 in hardware - Part 2: XBus" loading="lazy"></p>
<p>This is a screenshot of <a href="https://www.mentor.com/company/higher_ed/modelsim-student-edition?ref=blog.tst.sh">ModelSim PE</a> running a test on my xbus circuit where xb1 is trying to write 0x69 in the same clock as xb2 tries to write 0x7f0; xb1 loses arbitration because it has a lower id, and when arbitration is over xb2 goes into the write state.</p>
<p>To make it a little easier to understand, here is the state machine diagram of the xbus controller:</p>
<p><img src="https://blog.tst.sh/content/images/2018/07/2018-07-21-11_25_25-Window.png" alt="MC6000 in hardware - Part 2: XBus" loading="lazy"></p>
<p>When arbitration happens the d1 line goes high for 1ns; the controllers that are trying to write go into the arbitration state and the rest go into the read state.</p>
<p>Note that all controllers but the one sending are required to actively read the data, and you might fail arbitration indefinitely if a controller with a higher id is constantly sending data, which could cause a deadlock.</p>
<p>Verilog code: <a href="https://gist.github.com/PixelToast/64f05b06537d3043f47ca20d065759c4?ref=blog.tst.sh">https://gist.github.com/PixelToast/64f05b06537d3043f47ca20d065759c4</a></p>
<p>Now I&apos;ll work on designing the MC6000 CPU itself!</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[MC6000 in hardware - Part 1: The assembler]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p><a href="http://www.zachtronics.com/shenzhen-io/?ref=blog.tst.sh">SHENZHEN-IO</a> is an interactive circuit building and programming puzzle game with a programmable microcontroller called the MC6000; it has an extremely simple instruction set and no memory besides 2 registers that can only store numbers from -999 to 999.</p>
<p>Each instruction consists of a label, condition, instruction, and comment:</p>
<pre><code>foo:</code></pre>]]></description><link>https://blog.tst.sh/shenzhen-io-in-hardware/</link><guid isPermaLink="false">5fa9b817c9aac25a0711724f</guid><category><![CDATA[Low Level]]></category><category><![CDATA[Dart]]></category><category><![CDATA[Hardware Design]]></category><category><![CDATA[Assembly]]></category><dc:creator><![CDATA[ping]]></dc:creator><pubDate>Tue, 15 May 2018 05:25:17 GMT</pubDate><media:content url="https://blog.tst.sh/content/images/2018/05/2018-05-14-04_43_03-SHENZHEN-I_O.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://blog.tst.sh/content/images/2018/05/2018-05-14-04_43_03-SHENZHEN-I_O.png" alt="MC6000 in hardware - Part 1: The assembler"><p><a href="http://www.zachtronics.com/shenzhen-io/?ref=blog.tst.sh">SHENZHEN-IO</a> is an interactive circuit building and programming puzzle game with a programmable microcontroller called the MC6000; it has an extremely simple instruction set and no memory besides 2 registers that can only store numbers from -999 to 999.</p>
<p>Each instruction consists of a label, condition, instruction, and comment:</p>
<pre><code>foo: +mov 50 x2 # puts 50 to XBus 2
</code></pre>
<p>Conditions can be <code>+</code>, <code>-</code>, or blank, and control whether the instruction should execute after a comparison. Labels are optional and tell the jmp instruction where to jump to; this is just sugar to make things easier to keep track of. Registers consist of acc, dat, and 6 virtual registers corresponding to the 6 I/O ports on the MC6000. The game comes with a more in-depth manual including a language specification here: <a href="https://u.pxtst.com/QAgvo8UJR6fah.pdf?ref=blog.tst.sh">https://u.pxtst.com/QAgvo8UJR6fah.pdf</a></p>
<p>The first step to implementing this in actual hardware is to lay out the machine code:</p>
<center><h3>Registers</h3></center>
<style>
    .reglist tr td:nth-child(2n+0) {
        border-right-color: rgb(127, 136, 143) !important;
    }
</style>
<table class="reglist">
    <tr>
        <td>000</td><td>acc</td>
        <td>001</td><td>dat</td>
        <td>010</td><td>p0</td>
        <td>011</td><td>p1</td>
    </tr>
    <tr>
        <td>100</td><td>x0</td>
        <td>101</td><td>x1</td>
        <td>110</td><td>x2</td>
        <td>111</td><td>x3</td>
    </tr>
</table>
<p>You may notice the lack of the register <code>null</code>, which does nothing when you write to it and returns 0 when you read from it. I did not include it because it can simply be replaced with the literal 0, except when writing to it with <code>MOV</code>; because of this I added the flag <code>E</code> to the instruction <code>SLX</code>, which when 1 will eat a value from the bus and do nothing with it.</p>
<center><h3>Condition codes</h3></center>
<table>
    <tr>
        <td>00</td><td>always execute</td>
    </tr>
    <tr>
        <td>01</td><td>only execute on - flag</td>
    </tr>
    <tr>
        <td>10</td><td>only execute on + flag</td>
    </tr>
    <tr>
        <td>11</td><td>only execute once</td>
    </tr>
</table>
<p>Internally there are two execution flags, <code>+</code> and <code>-</code>, which are set by the test instructions <code>TEQ</code>, <code>TGT</code>, <code>TLT</code>, and <code>TCP</code>. Every test instruction but <code>TCP</code> sets the flags differentially, meaning only one of + and - can be true at a time; <code>TCP</code> does the same except it disables both flags when the operands are equal.</p>
<p>Because I want to reduce instruction size, only the first operand of a test instruction can be an immediate value. This means operands have to be shifted around when assembling:</p>
<pre><code>TGT acc 69 -&gt; TLT 69 acc
TEQ 69 69 -&gt; TST 1 0
TCP acc 42 -&gt; TPC 42 acc
</code></pre>
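<p>The rewrite can be sketched in Python like so (names are mine, not the Dart assembler; immediates are ints, registers are name strings, and only the all-immediate <code>TEQ</code> fold is shown):</p>

```python
# Assemble-time operand shuffle: tests with a trailing immediate are
# rewritten so the immediate lands in the first slot, using the mirrored
# opcode where the comparison is asymmetric.
MIRROR = {"TGT": "TLT", "TLT": "TGT", "TEQ": "TEQ", "TCP": "TPC"}

def canonicalize(op, a, b):
    if isinstance(b, int):          # second operand is an immediate
        if isinstance(a, int):      # both immediate: fold into TST +/-
            if op == "TEQ":
                return ("TST", int(a == b), int(a != b))
            raise NotImplementedError(op)
        return (MIRROR[op], b, a)   # swap via the mirrored opcode
    return (op, a, b)               # already in canonical form
```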
<p>You may notice I&apos;ve added 2 extra test instructions: <code>TST</code> and <code>TPC</code>.<br>
<code>TST</code> takes 2 operands, <code>+</code> and <code>-</code>, and sets the corresponding flags directly; this is always emitted when both operands of a test instruction are immediates.<br>
<code>TPC</code> is the same as <code>TCP</code> but with the operands reversed; this happens when the second operand is an immediate.</p>
<style>
    .instruction td {
        text-align: center;
    }
    .numrow td {border: none !important;}
    .instruction td:last-child {
        border: none !important;
        font-weight: bold;
        text-align: left;
    }
    .instruction .dead {
        background-color: rgba(0,0,0,0.15);
    }
</style>
<center><h3>Register/Immediate values</h3></center>
<table class="instruction">
<tr class="numrow">
    <td>10</td><td>9</td><td>8</td><td>7</td><td>6</td><td>5</td><td>4</td><td>3</td><td>2</td><td>1</td><td>0</td><td></td>
</tr>
<tr>
    <td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td>
    <td colspan="3">reg</td>
    <td>Register</td>
</tr>
<tr>
    <td colspan="11">immediate</td>
    <td>Immediate</td>
</tr>
</table>
<p>Immediate values can range from -999 to 999 and are encoded with <a href="https://en.wikipedia.org/wiki/Two&apos;s_complement?ref=blog.tst.sh">two&apos;s complement</a>. Because that range doesn&apos;t completely fill the 11 bits (2048 possible values), we can store a register number without adding an extra bit to signal whether or not it&apos;s a register. To tell the difference between the two you simply check whether the first 8 bits are 10000000, i.e. whether the value read as an immediate would be between -1024 and -1017.</p>
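<p>A short Python sketch of that encoding (helper names are mine):</p>

```python
# Sketch of the 11-bit Register/Immediate field. Registers occupy the
# two's-complement range -1024..-1017 (top 8 bits 10000000), which plain
# immediates (-999..999) can never reach.
REGS = ["acc", "dat", "p0", "p1", "x0", "x1", "x2", "x3"]

def encode_ri(v):
    if isinstance(v, str):                 # register name
        return 0b10000000000 | REGS.index(v)
    assert -999 <= v <= 999
    return v & 0x7FF                       # 11-bit two's complement

def decode_ri(bits):
    if bits >> 3 == 0b10000000:            # first 8 bits are 10000000
        return REGS[bits & 0b111]
    return bits - 0x800 if bits & 0x400 else bits  # sign-extend
```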
<center><h3>Register/Select values</h3></center>
<table class="instruction">
<tr class="numrow">
    <td>3</td><td>2</td><td>1</td><td>0</td><td></td>
</tr>
<tr>
    <td>1</td>
    <td colspan="3">reg</td>
    <td>Register</td>
</tr>
<tr>
    <td>0</td><td class="dead"></td>
    <td colspan="2">imm</td>
    <td>Selection</td>
</tr>
</table>
<center><h3>Register/Digit values</h3></center>
<table class="instruction">
<tr class="numrow">
    <td>4</td><td>3</td><td>2</td><td>1</td><td>0</td><td></td>
</tr>
<tr>
    <td>1</td><td class="dead"></td>
    <td colspan="3">reg</td>
    <td>Register</td>
</tr>
<tr>
    <td>0</td>
    <td colspan="4">immediate</td>
    <td>Digit</td>
</tr>
</table>
<p>Register/Digit values work similarly to Register/Immediate values but store single digits from 0 to 9, and require a flag bit to differentiate registers from immediates. If the digit is out of bounds it should be encoded as 0b1111, which is interpreted as a no-op.</p>
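<p>Sketched in Python (names are mine):</p>

```python
# Sketch of the 5-bit Register/Digit field: a leading flag bit picks
# register (1) or digit (0); out-of-range digits encode as the
# no-op value 0b1111.
REGS = ["acc", "dat", "p0", "p1", "x0", "x1", "x2", "x3"]

def encode_rd(v):
    if isinstance(v, str):                 # register: flag bit set
        return 0b10000 | REGS.index(v)
    return v if 0 <= v <= 9 else 0b1111    # digit, or no-op if out of range
```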
<center><h3>Instructions</h3></center>
<table class="instruction">
<tr class="numrow">
    <td>18</td><td>17</td><td>16</td><td>15</td><td>14</td><td>13</td><td>12</td><td>11</td><td>10</td><td>9</td><td>8</td><td>7</td><td>6</td><td>5</td><td>4</td><td>3</td><td>2</td><td>1</td><td>0</td><td></td>
</tr>
<tr>
    <td colspan="2">cond</td>
    <td>0</td><td>0</td><td>0</td>
    <td colspan="11">R/I arg</td>
    <td colspan="3">reg</td>
    <td>MOV</td>
</tr>
<tr>
    <td colspan="2">cond</td>
    <td>0</td><td>1</td><td>0</td>
    <td>0</td><td>0</td>
    <td colspan="8" class="dead"></td>
    <td colspan="4">line</td>
    <td>JMP</td>
</tr>
<tr>
    <td colspan="2">cond</td>
    <td>0</td><td>1</td><td>0</td>
    <td>0</td><td>1</td>
    <td colspan="1" class="dead"></td>
    <td colspan="11">R/I sleep amount</td>
    <td>SLP</td>
</tr>
<tr>
    <td colspan="2">cond</td>
    <td>0</td><td>1</td><td>0</td>
    <td>1</td><td>0</td><td>E</td>
    <td colspan="9" class="dead"></td>
    <td colspan="2">xpin</td>
    <td>SLX</td>
</tr>
<tr>
    <td colspan="2">cond</td>
    <td>0</td><td>1</td><td>0</td>
    <td>1</td><td>1</td>
    <td colspan="1" class="dead"></td>
    <td colspan="11">R/I value</td>
    <td>ADD</td>
</tr>
<tr>
    <td colspan="2">cond</td>
    <td>0</td><td>1</td><td>1</td>
    <td>0</td><td>0</td>
    <td colspan="9" class="dead"></td>
    <td colspan="3">reg</td>
    <td>SUB</td>
</tr>
<tr>
    <td colspan="2">cond</td>
    <td>0</td><td>1</td><td>1</td>
    <td>0</td><td>1</td>
    <td colspan="1" class="dead"></td>
    <td colspan="11">R/I value</td>
    <td>MUL</td>
</tr>
<tr>
    <td colspan="2">cond</td>
    <td>0</td><td>1</td><td>1</td>
    <td>1</td><td>1</td><td>0</td>
    <td colspan="11" class="dead"></td>
    <td>NOT</td>
</tr>
<tr>
    <td colspan="2">cond</td>
    <td>0</td><td>1</td><td>1</td>
    <td>1</td><td>0</td><td>0</td>
    <td colspan="7" class="dead"></td>
    <td colspan="4">R/S selection</td>
    <td>DGT</td>
</tr>
<tr>
    <td colspan="2">cond</td>
    <td>0</td><td>1</td><td>1</td>
    <td>1</td><td>0</td><td>1</td>
    <td colspan="2" class="dead"></td>
    <td colspan="5">R/D value</td>
    <td colspan="4">R/S selection</td>
    <td>DST</td>
</tr>
<tr>
    <td colspan="2">cond</td>
    <td>1</td><td>0</td><td>0</td>
    <td colspan="11">R/I arg</td>
    <td colspan="3">reg</td>
    <td>TEQ</td>
</tr>
<tr>
    <td colspan="2">cond</td>
    <td>1</td><td>0</td><td>1</td>
    <td colspan="11">R/I arg</td>
    <td colspan="3">reg</td>
    <td>TGT</td>
</tr>
<tr>
    <td colspan="2">cond</td>
    <td>1</td><td>1</td><td>0</td>
    <td colspan="11">R/I arg</td>
    <td colspan="3">reg</td>
    <td>TLT</td>
</tr>
<tr>
    <td colspan="2">cond</td>
    <td>1</td><td>1</td><td>1</td>
    <td colspan="11">R/I arg</td>
    <td colspan="3">reg</td>
    <td>TCP</td>
</tr>
<tr>
    <td colspan="2">cond</td>
    <td>0</td><td>0</td><td>1</td>
    <td colspan="11">R/I arg</td>
    <td colspan="3">reg</td>
    <td>TPC</td>
</tr>
<tr>
    <td colspan="2">cond</td>
    <td>0</td><td>1</td><td>1</td>
    <td>1</td><td>1</td><td>1</td>
    <td colspan="9" class="dead"></td>
    <td>+</td><td>-</td>
    <td>TST</td>
</tr>
</table>
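<p>As a quick sanity check of the layout, here is a minimal Python sketch (names are mine, not from the Dart assembler) that packs a MOV instruction; <code>encode_ri</code> reimplements the 11-bit Register/Immediate field from the section above:</p>

```python
# Pack a 19-bit MOV per the table: cond (2 bits) | 000 | R/I arg (11 bits)
# | dest reg (3 bits). cond 00 means "always execute".
REGS = ["acc", "dat", "p0", "p1", "x0", "x1", "x2", "x3"]

def encode_ri(v):
    if isinstance(v, str):                 # register name
        return 0b10000000000 | REGS.index(v)
    return v & 0x7FF                       # 11-bit two's complement

def encode_mov(cond, src, dst):
    return (cond << 17) | (0b000 << 14) | (encode_ri(src) << 3) | REGS.index(dst)
```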
<p>Using this layout I made an assembler and disassembler in Dart:</p>
<p><a href="https://dartpad.dartlang.org/1398b0d59ce1f7292c5d5d1064b591b5?ref=blog.tst.sh">https://dartpad.dartlang.org/1398b0d59ce1f7292c5d5d1064b591b5</a></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[My own file uploading service]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>The motivation behind this project was that I needed a very simple way for me and my friends to securely put files on my server for the various projects we use it for and the insane gigabit download speeds. Previously we just used rsync, a Linux command line utility that</p>]]></description><link>https://blog.tst.sh/simple-file-uploader/</link><guid isPermaLink="false">5fa9b817c9aac25a0711724e</guid><category><![CDATA[Web Dev]]></category><category><![CDATA[Dart]]></category><category><![CDATA[Back End]]></category><dc:creator><![CDATA[ping]]></dc:creator><pubDate>Sun, 13 May 2018 08:10:03 GMT</pubDate><media:content url="https://blog.tst.sh/content/images/2018/05/2018-05-13-02_53_18-Window-1.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://blog.tst.sh/content/images/2018/05/2018-05-13-02_53_18-Window-1.png" alt="My own file uploading service"><p>The motivation behind this project was that I needed a very simple way for me and my friends to securely put files on my server for the various projects we use it for, plus the insane gigabit download speeds. Previously we just used rsync, a Linux command line utility that uses ssh to transfer files to and from another machine. This has many problems, mainly that you need ssh and rsync installed, which is a pain on Windows, and doing it from a mobile device is out of the question. Secondly, you need to add your ssh public key to the target server, which has security implications because everyone gets an actual Linux user that can run arbitrary code, and after the infamous Meltdown exploit I&apos;m not willing to take that risk.</p>
<p>Of course we could just go back to using Google Drive but the free storage space is limited. We could use one of the many basic PHP uploaders but no, let&apos;s do it my way.</p>
<p>Over the span of 2 days I threw together some server-side Dart and some front-end html/css based off of my home website to make a unique way of uploading files:</p>
<p><img src="https://blog.tst.sh/content/images/2018/05/2018-05-13-03_31_38-Window.png" alt="My own file uploading service" loading="lazy"></p>
<p>Basically how it works is: I generate a token which grants a certain amount of storage space, and when you upload a file it uses up that token&apos;s storage space.<br>
You can add multiple tokens and choose which ones to use first; the file list is based on which files use the token, and when you delete a file it gives you back the storage space the file took.</p>
<p><img src="https://blog.tst.sh/content/images/2018/05/2018-05-13-03_39_28-Window.png" alt="My own file uploading service" loading="lazy"></p>
<p>In the settings menu you can delete and re-order tokens; it fills your storage space from the top down and supports sharing files between multiple tokens, so that if you had two 100MB tokens you could upload a 200MB file.</p>
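<p>The top-down fill amounts to something like this Python sketch (hypothetical names, not the actual Dart code):</p>

```python
# Sketch of top-down token allocation. Each token has a quota; a file
# consumes free space from tokens in order and may span several of them,
# so two 100MB tokens can together hold one 200MB file.
def allocate(tokens, size):
    """tokens: list of dicts with 'quota' and 'used'. Returns the
    per-token amounts the file occupies, or None if it won't fit."""
    if size > sum(t["quota"] - t["used"] for t in tokens):
        return None
    spans, left = [], size
    for t in tokens:
        take = min(left, t["quota"] - t["used"])
        t["used"] += take
        spans.append(take)
        left -= take
        if left == 0:
            break
    return spans
```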
<p><img src="https://blog.tst.sh/content/images/2018/05/2018-05-13-03_47_10-Window.png" alt="My own file uploading service" loading="lazy"></p>
<p>Drag-dropping files works as a bonus!</p>
<p>Setting it up is really simple. The backend is written in Dart (as usual) and can either run self-contained as its own webserver or behind another webserver that is better at serving static files, like what I did with NGINX.</p>
<p>NGINX config:</p>
<pre><code>server {
    listen 80;
    server_name u.pxtst.com;
    location / {
        rewrite ^(.*)$ https://u.pxtst.com$1 permanent;
    }
}


server {
        listen 443;
        listen [::]:443;

        ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
        ssl_prefer_server_ciphers on;
        ssl_stapling on;
        ssl_stapling_verify on;
        ssl_session_cache shared:SSL:10m;
        ssl_session_timeout 10m;
        add_header X-Frame-Options DENY;
        add_header X-Content-Type-Options nosniff;
        ssl on;
        server_name u.pxtst.com;
        ssl_certificate /etc/letsencrypt/live/u.pxtst.com/fullchain.pem;
        ssl_certificate_key /etc/letsencrypt/live/u.pxtst.com/privkey.pem;

        root /home/upload/;

        index index.html index.htm index.nginx-debian.html;

        location /api {
                proxy_pass http://localhost:17132;
                proxy_set_header    Host            $host;
                proxy_set_header    X-Real-IP       $remote_addr;
                proxy_set_header    X-Forwarded-for $remote_addr;
                port_in_redirect off;
                proxy_redirect   http://IP:17132/  /;
                proxy_connect_timeout 300;
                client_max_body_size 1G;
        }
        
        location /upload {
                alias /home/upload/www;
                index index.html;
                try_files $uri $uri/ =404;
        }
        
        location / {
                gzip            on;
                gzip_min_length 1000;
                gzip_types text/plain text/css application/json application/javascript application/x-javascript text/xml application/xml application/xml+rss text/javascript application/vnd.ms-fontobject application/x-font-ttf font/opentype image/svg+xml image/x-icon;
                expires 7d;
                alias /home/upload/uploads/;
                try_files $uri =403;
        }
}
</code></pre>
<p>Sauce: <a href="https://lab.pxtst.com/PixelToast/file-uploader?ref=blog.tst.sh">https://lab.pxtst.com/PixelToast/file-uploader</a></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[A small ARM assembler]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>I made a small ARM assembler in Dart!</p>
<p>It just takes a string, like <code>mov r0, #69</code> and emits the equivalent machine code: <code>e3a00045</code>. Initial tests with dissasemblers show it works pretty well though this isn&apos;t a full assembler, you can&apos;t do linking, macros, labels or</p>]]></description><link>https://blog.tst.sh/a-small-arm-assembler/</link><guid isPermaLink="false">5fa9b817c9aac25a0711724d</guid><category><![CDATA[Low Level]]></category><category><![CDATA[Dart]]></category><category><![CDATA[ARM]]></category><dc:creator><![CDATA[ping]]></dc:creator><pubDate>Thu, 10 May 2018 23:09:40 GMT</pubDate><media:content url="https://blog.tst.sh/content/images/2018/05/2018-05-10-21_12_43-ARM7-TDMI-manual-pt2.pdf-1.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://blog.tst.sh/content/images/2018/05/2018-05-10-21_12_43-ARM7-TDMI-manual-pt2.pdf-1.png" alt="A small ARM assembler"><p>I made a small ARM assembler in Dart!</p>
<p>It just takes a string, like <code>mov r0, #69</code>, and emits the equivalent machine code: <code>e3a00045</code>. Initial tests with disassemblers show it works pretty well, though this isn&apos;t a full assembler: you can&apos;t do linking, macros, labels, or even pseudo-ops like push. It also has very few safeguards against incorrect flags on some instructions, for example a pre-indexed <code>ldrt</code>.</p>
<p>Example output:</p>
<pre><code>e3a00045 mov r0, #69
e12fff1e bx lr
09bc3ffc ldmedeq ip!, {a3-sp}
87e5d1ce strbhi r13, [v2, lr, asr #3]!
96992a4d ldrls a3, [sb], sp, asr #0x14
</code></pre>
<p>Live demo: <a href="https://dartpad.dartlang.org/9a4bb914d3d0d640061564517cf2e1ac?ref=blog.tst.sh">https://dartpad.dartlang.org/9a4bb914d3d0d640061564517cf2e1ac</a></p>
<p>The following instructions are fully supported:</p>
<pre><code>Data processing:
    AND{S}, EOR{S}, SUB{S}, RSB{S}
    ADD{S}, ADC{S}, SBC{S}, RSC{S}
    TST{S}, TEQ{S}, CMP{S}, CMN{S}
    ORR{S}, MOV{S}, BIC{S}, MVN{S}
Multiply:
    MUL{S}, MLA{S}
    UMULL{S}, UMLAL{S}, SMULL{S}, SMLAL{S}
Branching:
    BX, BL, B
Load/Store:
    STR{B}{T}, LDR{B}{T}
    LDM{FD|ED|FA|EA|IA|IB|DA|DB}
    STM{FD|ED|FA|EA|IA|IB|DA|DB}
    SWP{B}
Syscall:
    SWI, SVC
Condition codes:
    EQ, NE, CS, CC
    MI, PL, VS, VC,
    HI, LS, GE, LT
    GT, LE, AL
</code></pre>
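<p>To illustrate the data-processing encoding, here is a minimal Python sketch (not the Dart implementation) that reproduces the <code>mov r0, #69</code> example, handling only MOV with a small unrotated literal:</p>

```python
# ARM data-processing immediate layout: cond (4) | 00 | I | opcode (4) |
# S | Rn (4) | Rd (4) | 4-bit rotate + 8-bit immediate. MOV is opcode
# 1101 and ignores Rn.
def arm_mov_imm(rd, imm):
    cond = 0b1110              # AL, always execute
    i, opcode, s, rn = 1, 0b1101, 0, 0
    assert 0 <= imm <= 0xFF    # small literal: rotate field stays 0
    return (cond << 28) | (i << 25) | (opcode << 21) | (s << 20) \
        | (rn << 16) | (rd << 12) | imm
```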
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[RIP Dedi]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>After 4 years of constant use in the worst conditions imaginable, my first home server codenamed &quot;Dedi&quot; has finally died.</p>
<p>You will be missed.</p>
<img src="https://i.imgur.com/uOh716h.png">
<!--kg-card-end: markdown-->]]></description><link>https://blog.tst.sh/rip-dedi/</link><guid isPermaLink="false">5fa9b817c9aac25a0711724c</guid><category><![CDATA[Servers and Networks]]></category><dc:creator><![CDATA[ping]]></dc:creator><pubDate>Thu, 10 May 2018 21:48:06 GMT</pubDate><media:content url="https://blog.tst.sh/content/images/2018/05/20161210_044945.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://blog.tst.sh/content/images/2018/05/20161210_044945.jpg" alt="RIP Dedi"><p>After 4 years of constant use in the worst conditions imaginable, my first home server codenamed &quot;Dedi&quot; has finally died.</p>
<p>You will be missed.</p>
<img src="https://i.imgur.com/uOh716h.png" alt="RIP Dedi">
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[What 28 hours of app development looks like]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Over the past 2 weeks I have been recording myself making <a href="https://flutter.io/?ref=blog.tst.sh">Flutter</a> apps, and here they are!</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/m49bP5alwPU" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
<iframe width="560" height="315" src="https://www.youtube.com/embed/6MjsK7lpaR8" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
<!--
<iframe width="560" height="315" src="https://www.youtube.com/embed/Yy-lB6TAR9Q" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe> --><!--kg-card-end: markdown-->]]></description><link>https://blog.tst.sh/what-28-hours-of-app-development-looks-like/</link><guid isPermaLink="false">5fa9b817c9aac25a07117249</guid><category><![CDATA[Mobile Dev]]></category><dc:creator><![CDATA[ping]]></dc:creator><pubDate>Wed, 21 Feb 2018 21:15:00 GMT</pubDate><media:content url="https://blog.tst.sh/content/images/2018/05/maxresdefault.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://blog.tst.sh/content/images/2018/05/maxresdefault.jpg" alt="What 28 hours of app development looks like"><p>Over the past 2 weeks I have been recording myself making <a href="https://flutter.io/?ref=blog.tst.sh">Flutter</a> apps, and here they are!</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/m49bP5alwPU" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
<iframe width="560" height="315" src="https://www.youtube.com/embed/6MjsK7lpaR8" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
<!--
<iframe width="560" height="315" src="https://www.youtube.com/embed/Yy-lB6TAR9Q" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe> --><!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>