Bachelor's Thesis · BSc Artificial Intelligence · 2025

Exploring Reinforcement Learning for Profiling Self-Adaptive Systems

Evaluating the policies that drive self-adaptive systems is slow and expensive. I built a reusable, simulated 'profile' of a city-scale IoT network so reinforcement-learning policies could be benchmarked offline, then used it to ask which policies actually win.

Degree

BSc Artificial Intelligence

Institution

Vrije Universiteit Amsterdam

First Supervisor

Ilias Gerostathopoulos

Second Reader

Kim Baraka

DingNet

City-scale IoT exemplar profiled

3

Environments modelled (Forest · City · Plain)

14

Power settings benchmarked per environment

0.75

Top median utility (ε-greedy, ε = 0.2)

Evaluating self-adaptation is expensive

Self-adaptive systems (SAS) autonomously adjust their behaviour at runtime to cope with changing conditions such as resource variability, shifting demand, and faults. Many use multi-armed bandit (MAB) policies to decide how to adapt: the system takes an action, receives a reward signalling how well it worked, and learns to balance exploring new options against exploiting what already works.
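To make the explore/exploit loop concrete, here is a minimal sketch of an ε-greedy bandit. It is illustrative only and is not MockSAS's implementation; the function names, the three-arm problem, and the Gaussian reward model are all assumptions made for this example.

```python
import random


def epsilon_greedy(reward_fn, n_arms, epsilon, n_rounds, seed=0):
    """Run an epsilon-greedy bandit; return (mean-reward estimates, pull counts)."""
    rng = random.Random(seed)
    estimates = [0.0] * n_arms  # running mean reward per arm
    counts = [0] * n_arms       # number of times each arm was pulled
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            # exploit: pick the arm with the best estimate so far
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = reward_fn(arm, rng)
        counts[arm] += 1
        # incremental update of the running mean
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


# Hypothetical three-arm problem with noisy rewards (not thesis data).
TRUE_MEANS = [0.2, 0.5, 0.8]


def noisy_reward(arm, rng):
    return rng.gauss(TRUE_MEANS[arm], 0.1)


estimates, counts = epsilon_greedy(noisy_reward, n_arms=3, epsilon=0.2, n_rounds=2000)
print(counts)
```

With ε = 0.2 the policy spends about a fifth of its rounds exploring at random and the rest pulling whichever arm currently looks best.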

The trouble is evaluating and comparing those policies. A trustworthy comparison needs many runs across many different contexts, and that becomes prohibitively expensive when you multiply policies by contexts and run everything against the real system in real time.

Profile once, benchmark many times

MockSAS tackles this by building a statistical 'profile' (a mock) of a self-adaptive system. By measuring the system's metrics under different contexts and actions, it derives a reward distribution for each (context, action) pair. The profile only has to be created once; after that, any number of MAB policies can be benchmarked against it offline, avoiding the time and compute cost of repeated live runs.
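The idea of a profile can be sketched in a few lines. This is not MockSAS's actual profile format or API; the function names, the sample measurements, and the choice to model each (context, action) pair as a Gaussian are assumptions made purely for illustration.

```python
import random
import statistics
from collections import defaultdict


def build_profile(measurements):
    """Summarise observed rewards as (mean, stdev) per (context, action) pair."""
    grouped = defaultdict(list)
    for context, action, reward in measurements:
        grouped[(context, action)].append(reward)
    return {
        key: (statistics.mean(values), statistics.pstdev(values))
        for key, values in grouped.items()
    }


def sample_reward(profile, context, action, rng):
    """Draw a mock reward from the profiled distribution (assumed Gaussian)."""
    mean, stdev = profile[(context, action)]
    return rng.gauss(mean, stdev)


# Hypothetical measurements: (context, action, observed reward). Not thesis data.
measurements = [
    ("City", 1, 0.40), ("City", 1, 0.44), ("City", 1, 0.42),
    ("City", 2, 0.61), ("City", 2, 0.59),
    ("Forest", 1, 0.30), ("Forest", 1, 0.34),
]

profile = build_profile(measurements)  # built once from live measurements
reward = sample_reward(profile, "City", 2, random.Random(0))  # reused offline
print(profile[("City", 2)], reward)
```

Once the profile exists, a policy only needs sample_reward to be evaluated, so the real system never has to run again.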

MockSAS had previously been used to profile SWIM, a web-infrastructure simulator. My contribution was to bring this profiling approach into a fundamentally different domain, IoT self-adaptation, and then use the resulting profile to evaluate how different reinforcement-learning policies compare.

A profile of a city-scale IoT network

I extended MockSAS with a new profile for DingNet, an exemplar that simulates a LoRaWAN sensor network across the entire city of Leuven, with motes logging air-quality data (particulate matter, CO₂, soot, ozone) and relaying it to gateways. The self-adaptation problem is to tune each mote's transmission power to minimise energy use while keeping packet-delivery reliability high.

Getting there took real systems engineering. I first targeted DeltaIoT (a 25-mote campus exemplar) but abandoned it after weeks of fighting its OpenVPN-dependent simulator and switching hardware from ARM to x86-64 to resolve architecture incompatibilities. I then pivoted to the larger-scale DingNet. Crucially, DingNet logs simulation data but not the number of packets sent and lost, so I had to instrument its source code to track those counters before I could measure reliability at all.

  • Ran a two-mote setup (a 'variable' mote sweeping power settings 1 to 14 against a fixed 'control' mote) with roughly three runs each, confirming that results were consistent between runs.
  • Built a Python pipeline parsing DingNet's XML output into pandas DataFrames, merging and preprocessing it into a clean dataset.
  • Designed the utility function utilityDingNet, defined as U(Ps, Ppf) = N(Psf(Ps)) − Ppf · N(Pf(Ps)), which balances normalised packet-success against a penalised, normalised power cost.
  • Generated context profiles for three environments (Forest, City, Plain), each treating all 14 power settings as separate MAB 'arms'.
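The utility function above can be sketched as follows. The reading of its symbols is an assumption: Ps is taken as the power setting, Ppf as the power-penalty factor, Psf(Ps) as the packet-success measure at that setting, Pf(Ps) as the power cost at that setting, and N as min-max normalisation. The thesis may define N and the inputs differently, and the numbers below are hypothetical.

```python
def min_max(values):
    """Min-max normalise a list to [0, 1]; a constant list maps to all 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def utility_per_setting(success, power_cost, penalty_factor):
    """Per setting: N(success) - penalty_factor * N(power_cost)."""
    norm_success = min_max(success)
    norm_cost = min_max(power_cost)
    return [s - penalty_factor * c for s, c in zip(norm_success, norm_cost)]


# Hypothetical measurements for four power settings (not thesis data).
success = [0.50, 0.80, 0.95, 1.00]  # packet-success ratio per setting
power_cost = [1.0, 2.0, 3.0, 4.0]   # energy cost per setting

utilities = utility_per_setting(success, power_cost, penalty_factor=0.5)
best_setting = max(range(len(utilities)), key=lambda i: utilities[i])
print(utilities, best_setting)
```

In this toy data the highest power setting delivers the most packets but is not the best choice, because the penalty on power cost outweighs the small gain in reliability. Raising or lowering penalty_factor shifts that trade-off.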

Imitating a system isn't the same as beating it

Running MockSAS's predefined MAB policies against the DingNet profile produced a clear, counter-intuitive result: the policies that best mirrored DingNet's own decision-making were not the ones that achieved the best outcomes.

  • An ε-greedy policy with low exploration (ε = 0.2) achieved the highest median utility (0.75) and the best overall ranking, yet aligned with DingNet's native decisions 0% of the time.
  • Conversely, DUCB-0.95 and DUCB-0.97 matched DingNet's decisions 100% of the time, and ε-greedy (ε = 0.8) matched them 93% of the time, but none of these delivered the top utility.
  • The takeaway: alignment with a system's existing strategy does not imply optimal performance. Strategies that diverge from DingNet's own behaviour can outperform it, which suggests room to improve the system's decision-making itself.
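The distinction between alignment and utility can be illustrated with a small sketch. How the thesis actually computes alignment is not shown here, so the per-decision matching below is an assumption, and the decision logs are invented for illustration.

```python
import statistics


def alignment_rate(policy_choices, reference_choices):
    """Fraction of decisions where the policy picked the same arm as the reference."""
    if not policy_choices or len(policy_choices) != len(reference_choices):
        raise ValueError("choice sequences must be non-empty and of equal length")
    matches = sum(p == r for p, r in zip(policy_choices, reference_choices))
    return matches / len(policy_choices)


# Hypothetical decision logs (arm = power setting). Not thesis data.
reference = [14, 14, 14, 14]
logs = {
    "imitator": {"choices": [14, 14, 14, 14], "utilities": [0.50, 0.52, 0.48, 0.51]},
    "diverger": {"choices": [6, 6, 7, 6], "utilities": [0.74, 0.76, 0.75, 0.73]},
}

summary = {
    name: (alignment_rate(log["choices"], reference),
           statistics.median(log["utilities"]))
    for name, log in logs.items()
}
for name, (rate, median_utility) in summary.items():
    print(name, rate, median_utility)
```

The two numbers are computed independently, which is why a policy can score perfectly on one and poorly on the other.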

A reusable recipe, and where it strains

Beyond the DingNet result, the thesis serves as a step-by-step guide to creating a MockSAS profile and an honest assessment of the tool's feasibility. Profiling proved genuinely useful for cheap, repeatable policy evaluation, but I hit real friction: MockSAS's grammar couldn't handle string inputs (which forced the environment-as-arm design), and its power-penalty factor had to be tuned manually.

I flag the lack of a direct baseline comparison against DingNet itself as the study's main limitation, and point to clear next steps: a self-adaptive power-penalty factor, that direct baseline, and improving MockSAS's flexibility so profiling generalises across domains.