{"id":13606,"date":"2026-04-10T01:30:24","date_gmt":"2026-04-10T01:30:24","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=13606"},"modified":"2026-04-10T01:30:24","modified_gmt":"2026-04-10T01:30:24","slug":"rules-of-mechanical-sympathy","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=13606","title":{"rendered":"Rules of Mechanical Sympathy"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p>Over the previous decade, {hardware} has seen super advances, from unified<br \/>\n    reminiscence that is redefined how shopper GPUs work, to neural engines that may<br \/>\n    run billion-parameter AI fashions on a laptop computer.<\/p>\n<p>And but, software program is <i>nonetheless<\/i> sluggish, from seconds-long chilly begins for<br \/>\n    easy serverless capabilities, to hours-long ETL pipelines that merely<br \/>\n    rework CSV information into rows in a database.<\/p>\n<p>Again in 2011, a high-frequency buying and selling engineer named Martin Thompson<br \/>\n    observed these points, <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/mechanical-sympathy.blogspot.com\/2011\/07\/why-mechanical-sympathy.html\">attributing<br \/>\n    them<\/a><br \/>\n    to a scarcity of <i>Mechanical Sympathy<\/i>. He borrowed this phrase from a Components<br \/>\n    1 champion:<\/p>\n<blockquote>\n<p>You do not have to be an engineer to be a racing driver, however you do want<br \/>\n      Mechanical Sympathy.<\/p>\n<p class=\"quote-attribution\">&#8212; Sir Jackie Stewart, Components 1 World Champion<\/p>\n<\/blockquote>\n<p>Though we&#8217;re not (normally) driving race vehicles, this concept applies to<br \/>\n    software program practitioners. By having \u201csympathy\u201d for the {hardware} our software program<br \/>\n    runs on, we are able to create surprisingly performant programs. 
The mechanically sympathetic <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/martinfowler.com\/articles\/lmax.html\">LMAX Architecture<\/a> processes millions of events per second on a single Java thread.<\/p>\n<p>Inspired by Martin&#8217;s work, I&#8217;ve spent the past decade creating performance-sensitive systems, from AI inference platforms serving millions of products at Wayfair, to <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.codas.dev\">novel binary encodings<\/a> that outperform Protocol Buffers.<\/p>\n<p>In this article, I cover the principles of mechanical sympathy I use every day to create systems like these &#8211; principles that can be applied most anywhere, at <i>any<\/i> scale.<\/p>\n<section id=\"Not-so-randomMemoryAccess\">\n<h2>Not-So-Random Memory Access<\/h2>\n<p>Mechanical sympathy begins with understanding how CPUs store, access, and share memory.<\/p>\n<div class=\"figure \" id=\"cpu-memory-structure.png\"><img decoding=\"async\" src=\"https:\/\/martinfowler.com\/articles\/mechanical-sympathy-principles\/cpu-memory-structure.png\" \/><\/p>\n<p class=\"photoCaption\">Figure 1: An abstract diagram of how CPU memory is organized<\/p>\n<\/div>\n<p>Most modern CPUs &#8211; from Intel&#8217;s chips to Apple&#8217;s silicon &#8211; organize memory into <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/mechanical-sympathy.blogspot.com\/2013\/02\/cpu-cache-flushing-fallacy.html\">a hierarchy of registers, buffers, and caches<\/a>, each with different <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/mechanical-sympathy.blogspot.com\/2011\/08\/inter-thread-latency.html\">access latencies<\/a>:<\/p>\n<ul>\n<li>Each CPU core has its own high-speed <i>registers and buffers<\/i>, which are used for storing things like local variables and in-flight instructions.<\/li>\n<li>Each CPU core has its own <i>Level 1 (L1) Cache<\/i>, which is much larger than the core&#8217;s registers and buffers, but a little slower.<\/li>\n<li>Each CPU core has its own <i>Level 2 (L2) Cache<\/i>, which is even larger than the L1 cache, and is used as a kind of buffer between the L1 and L3 caches.<\/li>\n<li>Multiple CPU cores share a <i>Level 3 (L3) Cache<\/i>, which is by far the largest cache, but is <i>much<\/i> slower than the L1 or L2 caches. This cache is used to share data between CPU cores.<\/li>\n<li>All CPU cores share access to main memory, AKA <i>RAM<\/i>. This memory is, by an order of magnitude, the slowest for a CPU to access.<\/li>\n<\/ul>\n<p>Because CPUs&#8217; buffers are so small, programs frequently need to access slower caches or main memory. 
To hide the cost of this access, CPUs play a betting game:<\/p>\n<ul>\n<li>Memory accessed recently will <i>probably<\/i> be accessed again soon.<\/li>\n<li>Memory <i>near<\/i> recently accessed memory will <i>probably<\/i> be accessed soon.<\/li>\n<li>Memory access will <i>probably<\/i> follow the same pattern.<\/li>\n<\/ul>\n<p><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/mechanical-sympathy.blogspot.com\/2012\/08\/memory-access-patterns-are-important.html\">In practice<\/a>, these bets mean linear access outperforms access within the same <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Page_(computer_memory)\">page<\/a>, which in turn vastly outperforms random access across pages.<\/p>\n<p class=\"tl-dr\">\n        Prefer algorithms and data structures that enable predictable, sequential access to data. For example, when building an ETL pipeline, perform a sequential scan over the entire source database and filter out irrelevant keys instead of querying for entries one at a time by key.\n      <\/p>\n<\/section>\n<section id=\"CacheLinesAndFalseSharing\">\n<h2>Cache Lines and False Sharing<\/h2>\n<p>Within the L1, L2, and L3 caches, memory is usually stored in \u201cchunks\u201d called <b>Cache Lines<\/b>. 
Cache lines are always a contiguous power of two in length, and are typically 64 bytes long.<\/p>\n<p>CPUs always load (\u201cread\u201d) or store (\u201cwrite\u201d) memory in multiples of a cache line, which leads to a subtle problem: What happens if two CPUs write to two separate variables in the same cache line?<\/p>\n<div class=\"figure \" id=\"cpu-false-sharing.png\"><img decoding=\"async\" src=\"https:\/\/martinfowler.com\/articles\/mechanical-sympathy-principles\/cpu-false-sharing.png\" \/><\/p>\n<p class=\"photoCaption\">Figure 2: An abstract diagram of how two CPUs accessing two different variables can still conflict if the variables are in the same cache line.<\/p>\n<\/div>\n<p>You get <b>False Sharing<\/b>: two CPUs fighting over access to two different variables in the same cache line, <i>forcing the CPUs to take turns accessing the variables via the shared L3 cache<\/i>.<\/p>\n<p>To prevent false sharing, many low-latency applications will \u201cpad\u201d cache lines with empty data so that each line effectively contains <i>one<\/i> variable. <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/mechanical-sympathy.blogspot.com\/2011\/07\/false-sharing.html\">The difference<\/a> can be staggering:<\/p>\n<ul>\n<li>Without padding, cache line false sharing causes a near-linear increase in latency as threads are added.<\/li>\n<li>With padding, latency is nearly constant as threads are added.<\/li>\n<\/ul>\n<p>Importantly, false sharing only appears when variables are being <i>written<\/i> to. 
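<\/p>\n<p>As a rough sketch of the padding trick above &#8211; illustrative only, since plain Python can&#8217;t control memory layout the way C, Rust, or Java can &#8211; we can mimic the idea by spacing \u201chot\u201d counters one assumed 64-byte cache line apart in a flat array:<\/p>\n<pre><code>import array\n\nLINE_SLOTS = 8  # assumed 64-byte cache line \/ 8-byte slots\n\ndef padded_counters(n):\n    # One counter per cache line: counter i lives at slot i * LINE_SLOTS,\n    # and the slots after it are never written, so threads bumping\n    # different counters never contend for the same line.\n    return array.array(\"q\", [0] * (n * LINE_SLOTS))\n\ndef bump(counters, i):\n    counters[i * LINE_SLOTS] += 1\n\ndef read(counters, i):\n    return counters[i * LINE_SLOTS]\n<\/code><\/pre>\n<p>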
When they&#8217;re being <i>read<\/i>, each CPU can copy the cache line into its local caches or buffers, and doesn&#8217;t need to worry about synchronizing the state of those cache lines with other CPUs&#8217; copies.<\/p>\n<p>Because of this behavior, one of the most common victims of false sharing is atomic variables. These are one of only a few data types (in most languages) that can be safely shared <i>and<\/i> modified between threads (and by extension, CPU cores).<\/p>\n<p class=\"tl-dr\">If you&#8217;re chasing the final bit of performance in a multithreaded application, check whether there&#8217;s <i>any<\/i> data structure being written to by multiple threads &#8211; and whether that data structure might be a victim of false sharing.<\/p>\n<\/section>\n<section id=\"TheSingleWriterPrinciple\">\n<h2>The Single Writer Principle<\/h2>\n<p>False sharing isn&#8217;t the only problem that arises when building multithreaded systems. 
There are safety and correctness issues (like race conditions), the cost of context-switching when threads outnumber CPU cores, and the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/mechanical-sympathy.blogspot.com\/2013\/08\/lock-based-vs-lock-free-concurrent.html\">brutal overhead of mutexes (\u201clocks\u201d)<\/a>.<\/p>\n<p>These observations bring me to the mechanically sympathetic principle I use <i>the most<\/i>: the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/mechanical-sympathy.blogspot.com\/2011\/09\/single-writer-principle.html\"><b>Single Writer Principle<\/b><\/a>.<\/p>\n<p>In theory, the principle is simple: if there&#8217;s some data (like an in-memory variable) or resource (like a TCP socket) that an application writes to, all of those writes should be made by a single thread.<\/p>\n<p>Let&#8217;s consider a minimal example of an HTTP service that consumes text and produces vector embeddings of that text. These embeddings will be generated within the service via a text embedding AI model. For this example, we&#8217;ll assume it&#8217;s an ONNX model, but Tensorflow, PyTorch, or any other AI runtime would work.<\/p>\n<div class=\"figure \" id=\"multiple-writers.png\"><img decoding=\"async\" src=\"https:\/\/martinfowler.com\/articles\/mechanical-sympathy-principles\/multiple-writers.png\" \/><\/p>\n<p class=\"photoCaption\">Figure 3: An abstract diagram of a naive text embedding service<\/p>\n<\/div>\n<p>This service would quickly run into a problem: most AI runtimes can only execute <i>one<\/i> inference call to a model at a time. 
In the naive architecture above, we use a mutex to work around this problem. Unfortunately, if multiple requests hit the service at the same time, they&#8217;ll queue for the mutex and quickly succumb to <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Head-of-line_blocking\">head-of-line blocking<\/a>.<\/p>\n<div class=\"figure \" id=\"single-writer.png\"><img decoding=\"async\" src=\"https:\/\/martinfowler.com\/articles\/mechanical-sympathy-principles\/single-writer.png\" \/><\/p>\n<p class=\"photoCaption\">Figure 4: An abstract diagram of a text embedding service using the single-writer principle with batching<\/p>\n<\/div>\n<p>We can eliminate these issues by refactoring with the single-writer principle. First, we wrap access to the model in a dedicated <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Actor_model\">Actor<\/a> thread. Instead of request threads competing for a mutex, they now send asynchronous messages to the actor.<\/p>\n<p>Because the actor is the single writer, it can group independent requests into a <i>single<\/i> batch inference call to the underlying model, and then asynchronously send the results back to individual request threads.<\/p>\n<p class=\"tl-dr\">Avoid protecting writable resources with a mutex. Instead, dedicate a single thread (an \u201cactor\u201d) to own every write, and use asynchronous messaging to submit writes from other threads to the actor.<\/p>\n<\/section>\n<section id=\"NaturalBatching\">\n<h2>Natural Batching<\/h2>\n<p>Using the single-writer principle, we&#8217;ve removed the mutex from our simple AI service and added support for batch inference calls. 
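<\/p>\n<p>As a minimal sketch of that single-writer refactor &#8211; with illustrative names, and a placeholder callable standing in for a real ONNX session &#8211; the actor owns the only thread that ever touches the model:<\/p>\n<pre><code>import queue\nimport threading\n\nclass EmbeddingActor:\n    def __init__(self, model):\n        self._model = model  # assumed callable: text in, embedding out\n        self._inbox = queue.Queue()\n        threading.Thread(target=self._run, daemon=True).start()\n\n    def submit(self, text):\n        # Called from any request thread: enqueue work and return an\n        # (event, slot) pair to await, instead of blocking on a mutex.\n        done, slot = threading.Event(), {}\n        self._inbox.put((text, done, slot))\n        return done, slot\n\n    def _run(self):\n        # The single writer: the only thread that calls the model.\n        while True:\n            text, done, slot = self._inbox.get()\n            slot[\"embedding\"] = self._model(text)\n            done.set()\n<\/code><\/pre>\n<p>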
But how should the actor <i>create<\/i> these batches?<\/p>\n<p>If we wait for a predetermined batch size, requests <i>may<\/i> block for an unbounded amount of time until enough requests come in. If we create batches at a fixed interval, requests <i>will<\/i> block for a bounded amount of time between each batch.<\/p>\n<p>There&#8217;s a better way than either of these approaches: <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/mechanical-sympathy.blogspot.com\/2011\/10\/smart-batching.html\"><b>Natural Batching<\/b><\/a>.<\/p>\n<p>With natural batching, the actor starts creating a batch as soon as requests are available in its queue, and completes the batch as soon as the maximum batch size is reached <i>or the queue is empty<\/i>.<\/p>\n<p>Borrowing a worked example from Martin&#8217;s original post on natural batching, we can see how it amortizes per-request latency over time:<\/p>\n<table class=\"dark-head\">\n<thead>\n<tr>\n<th>Strategy<\/th>\n<th>Best (\u00b5s)<\/th>\n<th>Worst (\u00b5s)<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Timeout<\/td>\n<td>200<\/td>\n<td>400<\/td>\n<\/tr>\n<tr>\n<td>Natural<\/td>\n<td>100<\/td>\n<td>200<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>This example assumes each batch has a fixed latency of <code>100\u00b5s<\/code>.<\/p>\n<p>With a timeout-based batching strategy, assuming a timeout of <code>100\u00b5s<\/code>, the best-case latency will be <code>200\u00b5s<\/code>, when all requests in the batch are received simultaneously (<code>100\u00b5s<\/code> for the request itself, and <code>100\u00b5s<\/code> waiting for more requests before sending a batch). 
The worst-case latency will be <code>400\u00b5s<\/code>, when some requests are received a little late.<\/p>\n<p>With a natural batching strategy, the best-case latency will be <code>100\u00b5s<\/code>, when all requests in the batch are received simultaneously. The worst-case latency will be <code>200\u00b5s<\/code>, when some requests are received a little late.<\/p>\n<p>In both cases, the performance of natural batching is twice as good as a timeout-based strategy.<\/p>\n<p class=\"tl-dr\">If a single writer handles batches of writes (or reads!), build each batch greedily: start the batch as soon as data is available, and finish when the queue of data is empty or the batch is full.<\/p>\n<p>These principles work well for individual apps, but they scale to entire systems. Sequential, predictable data access applies to a big data lake as much as to an in-memory array. 
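<\/p>\n<p>The greedy batch-building rule above can be sketched in a few lines (illustrative; it assumes a single consumer draining the queue, which is exactly what the single-writer principle guarantees):<\/p>\n<pre><code>import queue\n\ndef next_batch(inbox, max_batch=32):\n    # Start a batch as soon as any work arrives...\n    batch = [inbox.get()]\n    # ...and finish it when the queue is empty or the batch is full.\n    while len(batch) != max_batch and not inbox.empty():\n        batch.append(inbox.get())\n    return batch\n<\/code><\/pre>\n<p>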
The single-writer principle can boost the performance of an IO-intensive app, or provide a strong foundation for a <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/martinfowler.com\/bliki\/CQRS.html\">CQRS architecture<\/a>.<\/p>\n<p>When we write software that is mechanically sympathetic, performance follows naturally, at every scale.<\/p>\n<p>But before you go: prioritize observability before optimization. <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/martinfowler.com\/ieeeSoftware\/yetOptimization.pdf\">You can&#8217;t improve what you can&#8217;t measure.<\/a> Before applying any of these principles, define your <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.withcaer.com\/c\/vale\/\">SLIs, SLOs, and SLAs<\/a> so you know where to focus and when to stop.<\/p>\n<p class=\"tl-dr\">Prioritize observability before optimization: before applying these principles, measure performance and understand your goals.<\/p>\n<\/section>\n<hr class=\"bodySep\" \/>\n<\/div>\n\n","protected":false},"excerpt":{"rendered":"<p>Over the past decade, hardware has seen tremendous advances, from unified memory that has redefined how consumer GPUs work, to neural engines that can run billion-parameter AI models on a laptop. 
And yet, software is still slow, from seconds-long cold starts for simple serverless functions, to hours-long ETL pipelines that merely transform CSV [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":13608,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[56],"tags":[8594,4397,8595],"class_list":["post-13606","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-software","tag-mechanical","tag-principles","tag-sympathy"],"_links":{"self":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/13606","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=13606"}],"version-history":[{"count":1,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/13606\/revisions"}],"predecessor-version":[{"id":13607,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/13606\/revisions\/13607"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/media\/13608"}],"wp:attachment":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=13606"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=13606"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=13606"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}