{"id":10215,"date":"2025-12-28T23:10:02","date_gmt":"2025-12-28T23:10:02","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=10215"},"modified":"2025-12-28T23:10:02","modified_gmt":"2025-12-28T23:10:02","slug":"breaking-the-hardware-barrier-software-program-fp8-for-older-gpus","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=10215","title":{"rendered":"Breaking the {Hardware} Barrier: Software program FP8 for Older GPUs"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<h2 class=\"wp-block-heading\"\/>\n<p class=\"wp-block-paragraph\">As deep studying fashions develop bigger and datasets increase, practitioners face an more and more widespread bottleneck: GPU reminiscence bandwidth. Whereas cutting-edge {hardware} gives FP8 precision to speed up coaching and inference, most information scientists and ML engineers work with older GPUs that lack this functionality. <\/p>\n<p class=\"wp-block-paragraph\">This hole within the ecosystem is what motivated me to construct <strong><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/SuriyaaMM\/feather\" data-type=\"link\" data-id=\"https:\/\/github.com\/SuriyaaMM\/feather\">Feather<\/a><\/strong>, an open-source library that utilises a software-based method to ship FP8-like efficiency enhancements on broadly obtainable {hardware}. I created this software to make environment friendly deep studying extra accessible to the broader ML group, and I welcome contributions<\/p>\n<h3 class=\"wp-block-heading\">Notation &amp; Abbreviations<\/h3>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>FPX:<\/strong> X-bit floating level quantity<\/li>\n<li class=\"wp-block-list-item\"><strong>UX: <\/strong>X-bit unsigned integer<\/li>\n<li class=\"wp-block-list-item\"><strong>GPU: <\/strong>Graphics processing unit<\/li>\n<li class=\"wp-block-list-item\"><strong>SRAM: <\/strong>Static RAM (on-chip GPU Cache)<\/li>\n<li class=\"wp-block-list-item\"><strong>HBM: <\/strong>Excessive bandwidth reminiscence (GPU VRAM)<\/li>\n<li class=\"wp-block-list-item\"><strong>GEMV: <\/strong>Normal Matrix-Vector multiplication<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h2 class=\"wp-block-heading\">Motivation<\/h2>\n<p class=\"wp-block-paragraph\" style=\"border-radius:0px\">FP8 processing has confirmed efficient within the Deep Studying group <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/2209.05433\"><strong>[1]<\/strong><\/a>; nonetheless, solely particular current {hardware} architectures (Ada and Blackwell) assist it, limiting its advantages for practitioners and researchers to utilise it. I actually have an `<em>Nvidia RTX 3050 6GB Laptop computer GPU<\/em>`, which sadly doesn\u2019t assist FP8 operations on the {hardware} degree.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">Impressed by software-based options like (software-accelerated rendering on computer systems that don\u2019t assist native {hardware} acceleration for gaming), the article proposes an attention-grabbing resolution that may utilise the facility of FP8 datatypes<strong\/><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h3 class=\"wp-block-heading\">Packing FP8 &amp; FP16 in FP32 containers<\/h3>\n<p class=\"wp-block-paragraph\"><strong\/>Impressed by bitwise operations and packing strategies, the article presents an algorithm that packs two FP16s or 4 FP8s right into a single FP32. 
This lets twice or four times as many values travel in the same number of bytes, lowering the memory footprint while sacrificing only a small amount of precision.</p>
<p>One might argue that we are performing redundant computation: "<em>Pack -&gt; Load -&gt; Unpack -&gt; Compute</em>." However, consider typical deep learning operations: most of the time they are memory-bound rather than compute-bound. This is the same bottleneck that algorithms like FlashAttention address; FlashAttention uses tiling to keep data in fast SRAM, whereas Feather compresses data to reduce memory traffic.</p>
<hr/>
<h2>GPU Memory Hierarchy</h2>
<figure><img src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/12/feather_gpu_mem_chart-3-1024x512.png" alt="GPU memory hierarchy and bandwidth chart"/><figcaption>GPU Memory Hierarchy &amp; Bandwidth Chart. (Adapted from <a href="https://arxiv.org/abs/2205.14135">FlashAttention</a>.) (Note: the values shown do not represent RTX 3050 cards.)</figcaption></figure>
<p>Take a look at this diagram. SRAM is the fastest accessible GPU memory region and has the highest bandwidth (excluding the registers themselves), but it is limited to only about 20 MB. HBM can be thought of as the VRAM of the GPU itself, with roughly <strong>1/7th</strong> the bandwidth of SRAM.</p>
<p>The GPU cores are fast enough to finish the computation almost instantly, but they spend most of their time sitting idle, waiting for data to finish loading and writing back. That is what I mean by memory-bound: the bottleneck here is not the math, but the data transfer across the GPU's memory hierarchy.</p>
<hr/>
<h3>Lower-Precision Types &amp; Bandwidth</h3>
<p>Most of the time, the values involved in a computation are confined to ranges around zero due to normalisation. Engineers therefore developed lower-precision types such as FP8 and FP16, which allow for higher effective bandwidth. It may not be obvious how reducing precision raises bandwidth: look closer and you will see that we are effectively loading two values in the place of one for the FP16 type, and four values in the place of one for the FP8 type. We trade precision for higher effective bandwidth to tackle memory-bound operations.</p>
<h3>Hardware-Level Support</h3>
<p>Just like AVX-512 instructions, which are supported only on a limited set of hardware platforms, FP8 and FP16 instructions and registers are gated by hardware and available only on recent architectures. If you are on an RTX 30- or RTX 20-series GPU from Nvidia, you cannot make use of the lower-precision FP8 type. This is exactly the problem that <strong><a href="https://github.com/SuriyaaMM/feather">Feather</a></strong> attempts to solve.</p>
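<p>To put rough numbers on the memory-bound argument (a back-of-the-envelope estimate, not a measurement): a single GEMV over the 16384 × 16384 matrix used in the benchmark later in this article reads about 16384 × 16384 × 4 bytes ≈ 1.07 GB of matrix data in FP32, but only about 268 MB when four FP8 values ride in each 32-bit word. Because a memory-bound kernel's runtime scales roughly with the bytes it moves, this is where the theoretical ~4x ceiling quoted in the Results section comes from.</p>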
<hr/>
<h2>Packing Strategy</h2>
<p>Using bitwise operators, one can easily pack the FP16 type into an FP32. The algorithm is described below; a NumPy sketch of both directions follows the two lists.</p>
<h3>Packing FP16</h3>
<ul>
<li>Cast the input FP32 to FP16; this step can be done easily using NumPy's <em>astype</em> function.</li>
<li>Cast it to U16 and then to U32; this sets the upper 16 bits to 0 and places the FP16 bit pattern in the lower 16 bits.</li>
<li>Shift one of the two values left by 16 using the bitwise <em>LSHIFT</em> operator, and combine both using the bitwise <em>OR</em> operator.</li>
</ul>
<h3>Unpacking FP16</h3>
<ul>
<li>Extract the lower 16 bits using the bitwise <em>AND</em> operator with the mask 0xFFFF.</li>
<li>Extract the upper 16 bits using the <em>RSHIFT</em> operator by 16, then perform a bitwise <em>AND</em> with the mask 0xFFFF.</li>
<li>Cast both U16 values back to FP16, and on to FP32 if needed.</li>
</ul>
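<p>A minimal NumPy sketch of both steps; the function names are illustrative only and are not Feather's actual API:</p>
<pre><code class="language-python">import numpy as np

def pack_fp16_pair(a, b):
    """Pack two FP32 arrays into one U32 array: b lands in the upper 16 bits, a in the lower."""
    lo = a.astype(np.float16).view(np.uint16).astype(np.uint32)
    hi = b.astype(np.float16).view(np.uint16).astype(np.uint32)
    return (hi << np.uint32(16)) | lo

def unpack_fp16_pair(packed):
    """Reverse the packing and upcast both halves back to FP32."""
    lo = (packed & np.uint32(0xFFFF)).astype(np.uint16).view(np.float16).astype(np.float32)
    hi = (packed >> np.uint32(16)).astype(np.uint16).view(np.float16).astype(np.float32)
    return lo, hi

a = np.float32([1.5, -2.25])
b = np.float32([3.0, 0.125])
x, y = unpack_fp16_pair(pack_fp16_pair(a, b))
assert np.allclose(x, a) and np.allclose(y, b)  # these sample values are exactly representable in FP16</code></pre>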
<h3>Packing FP8</h3>
<p>FP8 has two widely used formats: <strong>E5M2 and E4M3.</strong> One cannot reuse the exact algorithm used for packing two FP16s into an FP32, because the CPU does not support FP8 types natively the way it does FP16 (half precision); this is also the reason <em>np.float8</em> does not exist.</p>
<figure><img src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/12/feather_fp_1-scaled.png" alt="Bit layouts of FP8-E5M2 and FP16"/><figcaption>FP8-E5M2 &amp; FP16 format (Adapted from <a href="https://en.wikipedia.org/wiki/Half-precision_floating-point_format">Half-Precision</a>)</figcaption></figure>
<p>Casting an FP16 to FP8-E5M2 is straightforward, as the figure shows, because both have the same number of exponent bits and differ only in their fraction.</p>
<h4>FP8-E5M2 Packing</h4>
<ul>
<li>Cast the input FP32 to FP16 (again easily done with NumPy's <em>astype</em> function), or take the input as FP16 directly.</li>
<li>Cast to U16, then <em>RSHIFT</em> by 8 to isolate the upper 8 bits (the sign, the 5 exponent bits and the top 2 mantissa bits); this byte is the E5M2 value.</li>
<li>Do this for all four FP32s or FP16s.</li>
<li>Now, using the <em>LSHIFT</em> operator, shift the four bytes by 0, 8, 16 and 24 bits and combine them using the bitwise <em>OR</em> operator.</li>
</ul>
<p>Once again, unpacking is straightforward; it is the exact reverse of packing (see the sketch below).</p>
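<p>A minimal NumPy sketch of E5M2 packing and unpacking following those steps (note that simply dropping the low byte truncates towards zero rather than rounding to nearest, which a production implementation may handle differently):</p>
<pre><code class="language-python">import numpy as np

def pack_fp8_e5m2_quad(vals):
    """Pack four FP32 values into one U32 as FP8-E5M2 bytes (low byte first)."""
    bits16 = np.asarray(vals, dtype=np.float32).astype(np.float16).view(np.uint16)
    bytes8 = (bits16 >> 8).astype(np.uint32)       # sign + 5 exponent bits + top 2 mantissa bits
    shifts = np.uint32([0, 8, 16, 24])
    return np.bitwise_or.reduce(bytes8 << shifts)  # single uint32 holding four E5M2 values

def unpack_fp8_e5m2_quad(packed):
    """Recover four FP32 values from one packed U32."""
    shifts = np.uint32([0, 8, 16, 24])
    bytes8 = ((packed >> shifts) & np.uint32(0xFF)).astype(np.uint16)
    return (bytes8 << np.uint16(8)).view(np.float16).astype(np.float32)

packed = pack_fp8_e5m2_quad([1.0, -0.5, 3.25, 0.0])
print(unpack_fp8_e5m2_quad(packed))  # -> [1.0, -0.5, 3.0, 0.0]; 3.25 loses its last mantissa bit</code></pre>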
<p>Packing an FP8-E4M3 is not as easy and straightforward as packing an FP16 or FP8-E5M2, because the exponent widths no longer match.</p>
<figure><img src="https://contributor.insightmediagroup.io/wp-content/uploads/2025/12/feather_fp_2.png" alt="Bit layout of FP8-E4M3"/><figcaption>FP8-E4M3 format (Adapted from <a href="https://en.wikipedia.org/wiki/Minifloat">Minifloat</a>)</figcaption></figure>
<p>Instead of implementing the conversion from scratch, the library uses the <a href="https://github.com/jax-ml/ml_dtypes"><em><strong>ml_dtypes</strong></em></a> library, which already does the casting math.</p>
<p>The <em>ml_dtypes</em> library provides support for the commonly used FP8 standards, such as E5M2 and E4M3, as NumPy-compatible dtypes. Using the same <em>astype</em> function, we can cast just as we did for the FP16 type. The packing algorithm is then exactly analogous to the FP16 case, so I am skipping it here.</p>
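<p>For completeness, a small sketch of that path. I assume the common "fn" flavour of E4M3 that <em>ml_dtypes</em> exposes as <em>float8_e4m3fn</em>; the packing shifts simply mirror the E5M2 case and are not necessarily how Feather lays the bytes out internally:</p>
<pre><code class="language-python">import numpy as np
import ml_dtypes

vals = np.float32([0.5, -1.25, 2.0, 3.75])

# ml_dtypes handles the rounding and exponent-bias math of the FP32 -> E4M3 cast
fp8 = vals.astype(ml_dtypes.float8_e4m3fn)

# reinterpret each E4M3 value as a raw byte and pack four of them into one uint32, low byte first
bytes8 = fp8.view(np.uint8).astype(np.uint32)
shifts = np.uint32([0, 8, 16, 24])
packed = np.bitwise_or.reduce(bytes8 << shifts)

# unpacking reverses the shifts, views the bytes as E4M3 again and upcasts to FP32
unpacked = (((packed >> shifts) & np.uint32(0xFF))
            .astype(np.uint8).view(ml_dtypes.float8_e4m3fn).astype(np.float32))
print(unpacked)  # all four sample values are exactly representable in E4M3</code></pre>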
<hr/>
<h3>Triton GPU Kernels</h3>
<p>Once we pack, we need an algorithm (a kernel) that understands this packed data type and performs the computation. Passing the packed data to a kernel implemented for FP32 or FP64 produces meaningless results, because the bit patterns no longer represent valid FP32 or FP64 values. Writing a kernel that takes the packed data type as input in CUDA is not a straightforward task and is error-prone. This is exactly where <a href="https://dl.acm.org/doi/10.1145/3315508.3329973"><strong>Triton</strong></a> shines; it is a domain-specific language that leverages a custom intermediate representation for GPU kernels. In layman's terms, it allows one to write GPU kernels in Python itself without the need to write CUDA kernels in C.</p>
<p>The Triton kernels do exactly what was described previously; the algorithm is as follows:</p>
<ul>
<li>Load the packed array from memory</li>
<li>Unpack it and upcast to FP32 for accumulation</li>
<li>Perform the computation</li>
</ul>
<p>The reader should note that upcasting is used during the computation to prevent overflow, so from a pure compute perspective there is no advantage. From the bandwidth perspective, however, we are moving two or four times as many values for the same amount of memory traffic.</p>
<h5>Triton Kernel Implementation (pseudocode)</h5>
<pre><code class="language-python">@triton.jit
def gemv_fp8_kernel(packed_matrix_ptr, packed_vector_ptr, out_ptr):
    # Get the current row to process
    row_id = get_program_id()

    # Initialize the accumulator for the dot product
    accumulator = 0

    # Iterate over the row in blocks
    for each block in row:
        # Load packed FP32 values (each contains four FP8s)
        packed_matrix = load(packed_matrix_ptr)
        packed_vector = load(packed_vector_ptr)

        # Unpack each FP32 into four FP8 values
        m_a, m_b, m_c, m_d = unpack_fp8(packed_matrix)
        v_a, v_b, v_c, v_d = unpack_fp8(packed_vector)

        # Upcast to FP32 and compute the partial dot products
        accumulator += (m_a * v_a) + (m_b * v_b) + (m_c * v_c) + (m_d * v_d)

    # Store the final result
    store(out_ptr, accumulator)</code></pre>
<hr/>
<h2>Results</h2>
<p><strong>Hardware:</strong> <em>NVIDIA GeForce RTX 3050 6GB VRAM</em></p>
<p><strong>CUDA Version:</strong> 13.0</p>
<p><strong>Python Version:</strong> 3.13.9</p>
<p><strong>GEMV Benchmark</strong> (M = 16384, N = 16384) (M×N matrix)</p>
<table>
<thead>
<tr><th>Implementation</th><th>Time (microseconds)</th><th>Speedup</th></tr>
</thead>
<tbody>
<tr><td>PyTorch (FP32)</td><td>5,635</td><td>(Baseline)</td></tr>
<tr><td>Feather (FP8-E4M3)</td><td>2,703</td><td><strong>2.13x</strong></td></tr>
<tr><td>Feather (FP8-E5M2)</td><td>1,679</td><td><strong>3.3x</strong></td></tr>
</tbody>
</table>
<p>The theoretical performance boost achievable is 4x; 3.3x compares very well, with the remaining overhead stemming mainly from pack/unpack operations and kernel launch costs.</p>
<p>E5M2 is faster than E4M3 because it is easier to unpack, while E4M3 offers better precision.
Nevertheless, it&#8217;s considerably extra advanced to unpack (Feather makes use of a separate GPU kernel to unpack the E4M3 format).<\/p>\n<p class=\"wp-block-paragraph\"><strong>Flash Consideration Benchmark<\/strong> (Sequence Size = 8192, Embedding Dimension = 512)<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<tbody>\n<tr>\n<td><strong>Implementation<\/strong><\/td>\n<td><strong>Time (microseconds)<\/strong><\/td>\n<td><strong>Speedup<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Pytorch (FP32)<\/td>\n<td>33,290<\/td>\n<td>(Baseline)<\/td>\n<\/tr>\n<tr>\n<td>Feather (FP8-E5M2)<\/td>\n<td>9,887<\/td>\n<td><strong>~3.3x<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<h3 class=\"wp-block-heading\">Accuracy &amp; Precision<\/h3>\n<p class=\"wp-block-paragraph\">Testing with random matrices (integer distributions within the vary [-3, 3] and customary regular distributions) reveals that each E4M3 and E5M2 keep numerical outcomes inside sensible tolerances for deep studying operations. The buildup errors stay manageable for typical workload sizes; nonetheless, customers requiring strict numerical precision ought to validate their particular use case.<strong\/><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h3 class=\"wp-block-heading\">When must you use Feather?<\/h3>\n<p class=\"wp-block-paragraph\">Use instances for Feather aren&#8217;t restricted; one can use Feather wherever FP8 packing and unpacking have a bonus, corresponding to\u00a0<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Giant matrix-vector merchandise, the place loading and unloading are the bottlenecks.<\/li>\n<li class=\"wp-block-list-item\">Consideration-like memory-bound kernels.<\/li>\n<li class=\"wp-block-list-item\">Inference or fine-tuning on native RTX 30 or 20 sequence.<\/li>\n<li class=\"wp-block-list-item\">Batch processing, the place packing overhead is amortised<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\">When must you not use Feather?<\/h3>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">You&#8217;ve RTX 40-series or H100 GPUs (native FP8 is quicker).<\/li>\n<li class=\"wp-block-list-item\">Workloads are compute-bound reasonably than bandwidth- or memory-bound.<\/li>\n<li class=\"wp-block-list-item\">You want assured precision.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\">Limitations of Feather<\/h3>\n<p class=\"wp-block-paragraph\"><strong\/>Feather is at the moment within the early levels of prototyping with a number of areas for enchancment.\u00a0<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Restricted assist for operations; at the moment, <strong><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/SuriyaaMM\/feather\" data-type=\"link\" data-id=\"https:\/\/github.com\/SuriyaaMM\/feather\">Feather<\/a><\/strong> helps solely the dot product, GEMV subroutine and FlashAttention.\u00a0<\/li>\n<li class=\"wp-block-list-item\">Accuracy validation for full ML workloads; at the moment, <strong><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/SuriyaaMM\/feather\">Feather\u2019s<\/a> <\/strong>accuracy is validated just for operations, not for end-to-end ML workloads.<\/li>\n<li class=\"wp-block-list-item\">Integration is at the moment restricted; <strong><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/SuriyaaMM\/feather\" data-type=\"link\" data-id=\"https:\/\/github.com\/SuriyaaMM\/feather\">Feather<\/a><\/strong> is a 
<hr/>
<h3>When should you use Feather?</h3>
<p>Use cases for Feather are not restricted; one can use Feather anywhere FP8 packing and unpacking have an advantage, such as:</p>
<ul>
<li>Large matrix-vector products, where loading and storing data is the bottleneck.</li>
<li>Attention-like memory-bound kernels.</li>
<li>Inference or fine-tuning on local RTX 30- or 20-series GPUs.</li>
<li>Batch processing, where the packing overhead is amortised.</li>
</ul>
<h3>When should you not use Feather?</h3>
<ul>
<li>You have RTX 40-series or H100 GPUs (native FP8 is faster).</li>
<li>Your workloads are compute-bound rather than bandwidth- or memory-bound.</li>
<li>You need guaranteed precision.</li>
</ul>
<h3>Limitations of Feather</h3>
<p>Feather is currently in the early stages of prototyping, with several areas for improvement.</p>
<ul>
<li>Limited operation coverage; currently, <strong><a href="https://github.com/SuriyaaMM/feather">Feather</a></strong> supports only the dot product, the GEMV subroutine and FlashAttention.</li>
<li>Accuracy validation for full ML workloads; currently, <strong><a href="https://github.com/SuriyaaMM/feather">Feather's</a></strong> accuracy is validated only for individual operations, not for end-to-end ML workloads.</li>
<li>Integration is currently limited; <strong><a href="https://github.com/SuriyaaMM/feather">Feather</a></strong> is a standalone implementation. Integration with PyTorch and support for autograd would make it more production-ready.</li>
</ul>
<p>The project is open source; community contributions are welcome! You can try out the code by simply following the instructions on <a href="https://github.com/SuriyaaMM/feather">GitHub</a>.</p>
<p><strong>Image License:</strong> All images were made by the author. Adaptation sources are clearly mentioned in the respective captions.</p>