speedrunning embodied ai as a software engineer
it was a couple of years back, when i was tasked as a software engineer to build a natural language sql layer on top of an existing product analytics codebase, that i first got introduced to LLMs and applied AI. it started with tinkering with the nascent openai API, figuring out vector databases and semantic search, building rag pipelines and eventually agent graphs. with knowledge work being automated and the software intelligence layer feeling like a solved problem in the near future, my brain naturally went towards the next big thing to work on: hardware and physical intelligence. and i knew jack shit. didn't even know how a simple neural network worked at a math level, let alone how a transformer generates the next token.
i also built a small proof of concept around that time, a raspberry pi robot arm that could take language instructions, scan a room and describe what it saw using an LLM as the brain. total toy project but it got me paying attention to robotic intelligence companies like figure ai, tesla and 1x building humanoids and i got convinced that this is the space i want to contribute to over the next span of my career. wrote about that build here.
but none of the surface level exploration gave me depth. reading blog posts about embodied AI, vaguely understanding what teleoperation means and skimming papers just gave me superficial knowledge. and as an engineer, i hate being in that state of vagueness. so alongside my job i started building foundations. bought sebastian raschka’s build an llm from scratch, learned each component of the transformer architecture (had to revise a lot of math, but luckily i have a masters in math which helped me get going quickly), built a 30M parameter transformer trained on tinystories, did mechanistic interpretability experiments, ablating attention heads, removing layers, visualizing attention weights during inference. wrote about all of that here. the important thing this gave me was an intuition for how neural networks learn representations and how data flows through stacked layers, which transferred directly when i later had to reason about policy networks and dynamics models.
but a 9-10 hour job left me with very little time to properly explore and after months of being frustrated with myself for not learning fast enough, i quit my job. joined an AI research lab because their problem sounded cool, but in a month i realised i want to go deeper into robot learning, all in.
next i planned to get a good enough breadth of intuition across embodied AI. from classical control to RL based policies to world modelling and VLAs, i set up a full stack robot learning project where every code snippet, learning and piece of work has been documented. from zero simulation experience to a fine-tuned VLA hitting 90% success on a reaching task in about 3 weeks.
what 10xed my learning was not having to figure out which papers to reference, which stackoverflow thread to dig through, or how to structure code from scratch. all of that was handled by claude. i used it as a thought partner too. every concept came through hundreds of back-and-forth conversations because i wanted to understand each small thing really well before moving on. all the code came from AI but was manually typed out by me. this was deliberate, i wanted to think through the code conceptually instead of copying it from my agent chat panel. once i had the fundamentals down i let it agentically make corrections and test hypotheses, but the core code is something i wanted to hand type, at least once.
simulation and classical control
my broad goal was to learn how to make a robot act intelligently. a really small poc of it was to make it perform an action based on what it perceives. i decided to work with a robot arm because that’s what i was seeing a lot of people experiment with. asked claude how to go about it and it pointed me towards simulation software where you can model physics and interact through code. it suggested isaac sim but unfortunately i had a mac and it doesn’t work without an nvidia gpu and requires a ridiculous amount of ram. got to know about MuJoCo which is deepmind’s physics engine and blindly went head on with it after reading some reddit threads.
the mental model for MuJoCo is two objects: MjModel (the robot’s blueprint, geometry, masses, joint limits, loaded once from XML) and MjData (the live state, positions, velocities, forces, updated every step). you write a control signal to data.ctrl, call mj_step, and the physics moves forward.
started by building my own 3-joint toy arm in XML. shoulder yaw, shoulder pitch, elbow pitch. revised PID control from my electronics classes and learned about inverse kinematics (IK) which is basically the math to figure out “my fingertip needs to be there, what angles should my joints be at?” with PID you can control a robot without any ML at all, just error correction in a loop. the math is straightforward but i couldn’t get the arm to do anything useful lol.
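the PID loop itself is tiny. a sketch on a hypothetical 1-D point mass instead of the 3-joint arm, just to show the error-correction structure (the gains here are made up):

```python
# toy PID driving a 1-D point mass toward a target position.
# gains are invented for this toy; tuning them is the whole game.
kp, ki, kd = 8.0, 0.1, 2.0
dt = 0.01

pos, vel = 0.0, 0.0
target = 1.0
integral = 0.0
prev_err = target - pos   # avoids a derivative spike on the first step

for _ in range(2000):     # 20 simulated seconds
    err = target - pos
    integral += err * dt
    deriv = (err - prev_err) / dt
    force = kp * err + ki * integral + kd * deriv   # the PID law
    prev_err = err
    vel += force * dt     # unit mass, so acceleration = force
    pos += vel * dt

print(round(pos, 3))      # settles near the target
```

on one decoupled axis with no gravity this just works. the arm fails because it is three of these loops fighting each other through the physics.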
three things made it hard. coupling: torque on one joint creates reaction forces on the others, so they fight each other. one joint corrects itself, knocks the other off balance, which corrects and knocks the first one back, endless oscillation. gravity: the PID was spending most of its torque budget just holding the arm up, nothing left for tracking toward the target. i watched joints collapse, one would hold while the other kept swinging, none could reach the target. environment variables: things like damping and joint limits that the IK math doesn’t account for but the physics engine does, so your computed angles might be correct on paper but physically unreachable.
realised my hand-built arm might just be badly constructed, so i switched to the Franka Panda from deepmind’s model zoo, a proper 7-DOF robot arm with real measured physics. dissecting this model taught me more than building the toy arm did. the Franka uses position-controlled actuators, you send a target joint angle and it figures out the torque internally. same as the real robot’s firmware.
this is where the math gets interesting and worth getting comfortable with. the jacobian answers “if i rotate each joint a bit, how does the fingertip move?” i want the reverse though, i want the fingertip to go towards the target and know which joints to move for that to happen. that’s the pseudo-inverse. the problem is singularities, configurations where some joints can’t contribute to fingertip motion at all (imagine the arm fully stretched out, moving the elbow does nothing). the math tries to command infinite joint velocities which breaks everything. the damped pseudo-inverse $J^+ = J^T (J J^T + \lambda^2 I)^{-1}$ adds a $\lambda^2$ term that regularizes the inverse so it doesn’t blow up near singularities. you lose some precision but the arm stays stable.
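a numpy sketch of why the damping matters, using a made-up near-singular jacobian (two almost-parallel rows standing in for a stretched-out arm):

```python
import numpy as np

def damped_pinv(J, lam=0.05):
    # J+ = J^T (J J^T + lam^2 I)^-1, bounded even when J J^T is near-singular
    m = J.shape[0]
    return J.T @ np.linalg.inv(J @ J.T + lam**2 * np.eye(m))

# made-up near-singular jacobian: two almost-parallel rows, meaning the
# joints can barely produce independent fingertip motion
J = np.array([[1.0, 0.0, 0.5],
              [1.0, 1e-6, 0.5]])

dx = np.array([0.01, -0.01])          # small desired fingertip motion
dq_naive = np.linalg.pinv(J) @ dx     # plain pseudo-inverse: explodes
dq_damped = damped_pinv(J) @ dx       # damped: tiny but stable

print(np.linalg.norm(dq_naive), np.linalg.norm(dq_damped))
```

the damped version barely moves here, which is exactly the precision you trade away for stability near the singularity.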
a 7-DOF arm with only 3 position constraints (x, y, z of fingertip) has a 4-dimensional null space, joint motions that don't move the fingertip at all. think of moving your elbow around while keeping your hand flat on a table. without managing this the arm drifts into bad configurations. MuJoCo also gives you qfrc_bias, the exact torques to cancel gravity at each joint, so you add those to your control and the controller just focuses on tracking. a lot of this might be jargon and that's okay, i expect you to copy paste these paragraphs and ask your AI agent for more clarity.
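the null-space projector is one line of numpy. a sketch with a random 3×7 matrix standing in for the real jacobian:

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.standard_normal((3, 7))   # random 3x7 stand-in for the real jacobian

J_pinv = np.linalg.pinv(J)
N = np.eye(7) - J_pinv @ J        # null-space projector (rank 4 here)

v = rng.standard_normal(7)        # any joint motion, e.g. drift toward a comfortable pose
v_null = N @ v                    # projected into the null space

print(np.linalg.norm(J @ v_null))   # ~0: this motion doesn't move the fingertip
```

in practice you add `v_null` on top of the IK solution, so the arm stays near comfortable configurations without disturbing the fingertip.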
even with all of this, the arm kept failing. weak wrist joints couldn’t keep up with sideways motion. singularities where the math broke down. table blocking approach angles. each fixable individually but they kept showing up in different combinations. the real issue is that IK is fundamentally reactive, it greedily minimizes distance one step at a time with no concept of the full trajectory. which brought me to the question: can the robot learn to plan ahead, know which configurations to avoid, and figure out approach angles on its own?
that’s reinforcement learning.
reinforcement learning
didn’t really know what RL was beyond vague ideas from my engineering degree. spent time with claude reading about fundamentals, PPO, discrete vs continuous policies, and landed on SAC (soft actor-critic) as the best fit for continuous robot control.
the core idea: run a lot of simulation steps (300K in my case) where two neural networks learn together. the actor predicts what action to take given the current state. the critic predicts how good a (state, action) pair is. reward is just negative distance between fingertip and target, closer = better.
the math backbone is the Bellman equation:
$Q(s,a) = r + \gamma \cdot Q(s', a')$
$Q(s,a)$ is "how good is it to take action $a$ in state $s$." $r$ is the immediate reward. $\gamma$ is a discount factor (0.99, meaning future rewards matter but you slightly prefer sooner ones). $Q(s', a')$ is how good the next state-action turns out to be. so the value of what you're doing now equals reward now plus discounted value of where you end up. the critic learns to estimate this. you can update at every step instead of waiting for a full episode, and you can learn from data collected by any policy because the equation describes the environment, not the controller.
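the bellman update is a one-liner. a tabular sketch on a hypothetical 3-state chain, not SAC, but the same backbone:

```python
# tiny deterministic chain: state 0 -> 1 -> 2 (goal), one action ("right"),
# reward -1 per step until the goal. gamma matches the post's 0.99.
gamma = 0.99
Q = [0.0, 0.0, 0.0]   # Q(s, right); the terminal state stays 0

for _ in range(100):   # sweep the bellman update until it converges
    for s in (0, 1):
        r, s_next = -1.0, s + 1
        Q[s] = r + gamma * Q[s_next]   # Q(s,a) = r + gamma * Q(s', a')

print(Q)  # [-1.99, -1.0, 0.0]
```

the critic is the neural-network version of this table: same target, just estimated from continuous states instead of looked up.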
the actor’s loss:
$L_{actor} = \alpha \cdot \log \pi(a \vert s) - Q(s, a)$
first term is entropy (stay exploratory, weighted by temperature $\alpha$), second is the critic’s rating. the actor balances exploring new actions with exploiting what the critic says is good. critic’s loss is just Bellman error, how wrong was my $Q$ prediction compared to what actually happened. SAC uses twin critics (two Q-networks, take the min) because a single critic tends to overestimate and the actor starts chasing those errors.
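with made-up numbers, both losses are just arithmetic (no networks here, the values are illustrative, not from my runs):

```python
alpha = 0.2                  # temperature: how much entropy is worth
log_pi = -1.3                # actor's log-prob of the sampled action
q1, q2 = 5.0, 7.5            # the twin critics' ratings of (state, action)

# actor loss: entropy term minus the pessimistic critic's rating
q_min = min(q1, q2)          # take the min to curb overestimation
actor_loss = alpha * log_pi - q_min

# critic loss: squared bellman error against a soft (entropy-augmented) target
gamma, r = 0.99, -0.4
next_log_pi = -1.1
q_next = min(4.8, 5.1) - alpha * next_log_pi   # soft value of the next state
td_target = r + gamma * q_next
critic_loss = (q1 - td_target) ** 2

print(actor_loss, critic_loss)
```

in the real algorithm these scalars come out of networks and get backpropagated, but the shape of the computation is exactly this.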
observation going in: 20 floats (7 joint angles + 7 velocities + 3 fingertip + 3 target). action: 7 joint angle deltas. reward: $-dist(fingertip, target)$ per step.
5 failed runs and the actual fix
remember $\alpha$ from the actor loss? it controls explore vs exploit. ran 4 experiments tweaking it: default where $\alpha$ crashed to near-zero and the policy locked onto one mediocre trajectory forever, higher starting $\alpha$ that still collapsed, fixed too high where the policy preferred being random over actually reaching, fixed too low which changed nothing. i was sure the exploration balance was the problem.
then i wanted to see if the arm is EVEN ONCE reaching the target. sampled 100K random joint configurations to check what fraction could even reach the target zone. the original target (table surface at z=0.33m) was reachable by 0.49% of random configs. the arm could barely get there even by accident. $\alpha$ collapsing was a symptom. the task itself was nearly impossible.
raised target z from 0.33 to [0.40, 0.75], doubled the action scale. reachability went to 2.9%, about 6x better. probably the most useful thing i learned in this entire sprint: fix the environment before tuning hyperparameters. RL needs enough “accidentally good” experiences in the replay buffer to bootstrap from. if the robot can never stumble into success by chance, no amount of hyperparameter tuning will save you.
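the check itself is just forward kinematics plus counting. a sketch on a made-up 2-link planar arm instead of the Franka, same idea:

```python
import numpy as np

# toy stand-in for the check: a 2-link planar arm (each link 0.3m, so max
# reach 0.6m), target zone = a 0.05m disk. the real version sampled 100K
# random configs of the 7-DOF Franka.
rng = np.random.default_rng(0)
L1 = L2 = 0.3

def fingertip(q):
    # forward kinematics: joint angles -> fingertip (x, y)
    x = L1 * np.cos(q[:, 0]) + L2 * np.cos(q[:, 0] + q[:, 1])
    y = L1 * np.sin(q[:, 0]) + L2 * np.sin(q[:, 0] + q[:, 1])
    return np.stack([x, y], axis=1)

q = rng.uniform(-np.pi, np.pi, size=(100_000, 2))
tips = fingertip(q)

targets = {
    "outside the workspace": np.array([0.7, 0.0]),  # beyond max reach
    "inside the workspace": np.array([0.3, 0.0]),
}
fracs = {name: float(np.mean(np.linalg.norm(tips - t, axis=1) < 0.05))
         for name, t in targets.items()}
print(fracs)
```

if the fraction for your target comes back near zero, no reward signal ever enters the replay buffer and training is doomed before it starts.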
final run: 300K steps, 100% success on eval (20/20), ~95 steps on average to reach. watching the average episode length drop below 200 around 80K steps, meaning the arm was actually getting there before time ran out, gave me a crazy kick :D
world models
next thing i wanted to explore: instead of running the physics engine for every step, what if you train a neural net to predict what happens next? give it (current state, action) and it predicts the next state. then you can “imagine” transitions and train RL on those instead. in theory much more sample-efficient, which matters on a real robot where every step costs wall-clock time.
trained an MLP on 100K transitions from the SAC policy. for the first 20 or so steps it could correctly predict the next state, getting the fingertip to move in the right direction. but errors compound when you feed predictions back as input, and by step 100 it was off by 8.5cm. velocity was the hardest thing to predict since it's driven by acceleration (the MLP i trained had no knowledge of physics, just the state).
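a toy picture of the compounding, with made-up scalar dynamics instead of the arm: the "learned" model is off by half a percent per step, invisible one step at a time and large after a hundred:

```python
# made-up scalar dynamics standing in for the arm: x' = 0.99*x + u.
# the "learned" model is almost right (0.985 instead of 0.99).
true_a, model_a = 0.99, 0.985
x_true = x_model = 1.0
u = 0.01                             # a fixed action every step
errors = []

for t in range(100):
    x_true = true_a * x_true + u         # the real environment
    x_model = model_a * x_model + u      # model fed its own prediction back
    errors.append(abs(x_true - x_model))

print(errors[0], errors[-1])   # tiny one-step error vs error after 100 steps
```

one-step validation loss looks great; it's the autoregressive rollout that exposes the drift.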
remember the RL policy was trained on steps from the actual physics engine. the idea here was to have those steps generated by the world model instead, so you don’t need to run MuJoCo for every single transition.
tried MBPO first where you co-train the world model and the RL policy together. the world model generates imagined steps, RL trains on those, and as the RL policy improves you retrain the world model on newer data. sounds good in theory but it spiralled. the world model trained on early random data produced garbage predictions, the RL policy trained on that garbage got worse, which produced even worse data for the next world model retrain. exponential degradation.
then tried Dyna where instead of co-training, you use the already pre-trained expert world model (trained on 100K good SAC transitions) to generate steps. the idea being that the RL policy gets to train on high quality imagined data and becomes sample efficient. but that also failed, the expert model only knew what expert trajectories look like. when the early RL policy visited random states far outside what the expert ever saw, the model’s predictions were 20cm off. it couldn’t help the policy learn from the messy exploration phase because it had never seen mess.
model-free SAC was simply the workable solution here, with better results. 300K sim steps is cheap. model-based would matter when each real step costs you something, like on actual hardware.
vision-language-action models
everything until now used perfect state, 20 floats telling you exact joint angles, velocities, positions. real robots have cameras, not ground truth state vectors. so the question: can a model that sees pixels and reads a language instruction do what the RL policy did?
used Octo, a 93M parameter VLA pre-trained on 800K real robot trajectories from the open x-embodiment dataset. it takes images plus a language instruction and outputs actions through a diffusion head. added overhead and wrist cameras to the MuJoCo scene. zero-shot with "move to the green target" failed completely, the arm drifted away every episode because sim renders look nothing like real images and the action spaces didn't match. fine-tuning was necessary.
ran 3 experiments that failed before figuring out what matters. single overhead camera couldn’t judge depth, arm would approach the target then drift back. adding a wrist camera helped training loss but the model memorized one average trajectory instead of learning conditional behavior, same action output regardless of where the target was. tried fixing a drift issue by changing normalization, wrong diagnosis, made things worse.
what actually worked: 3x more demonstration data collected with a non-deterministic policy (so the same target gets different approach paths), both cameras, 25K fine-tuning steps on a Vast.ai GPU. 90% success, 9 out of 10 episodes, ~60 steps average. the key insight was that data diversity mattered more than anything else. the model needed to see enough variety to learn “move toward the green thing” instead of memorizing one specific trajectory. Octo’s pre-training on 800K real robot demos did the heavy lifting for visual understanding, 30K sim demos alone would never teach vision and control from scratch.
one thing that clicked: training loss went from 0.48 (memorizing model, 0% success) to 1.63 (generalizing model, 90% success). higher loss on diverse data means generalization, not failure. SAC with perfect state still beat Octo with pixels (100% vs 90%), expected, but Octo learned to reach from raw images with no physics and no reward function, just watching demonstrations. will write more about the technical details of RL, VLAs and positive transfer in a separate post.
everything here was the minimum poc for each concept. didn’t do domain randomization, didn’t use gaussian splats for world models, didn’t implement ensemble disagreement, skipped stress tests. every section was about building intuition before going deeper. the breadth is done. i still have a lot of knowledge gaps, but the goal of gaining confidence to tackle tougher problems is achieved.
next is closing the real-world loop: 3D-print a robot arm, build a MuJoCo model for it, teleoperate to collect real episodes, mix with simulation data, fine-tune a stronger VLA like π0 or OpenVLA. sim-to-real with a real data anchor. that’s the hard part.