{"id":12216,"date":"2026-02-27T09:39:25","date_gmt":"2026-02-27T09:39:25","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=12216"},"modified":"2026-02-27T09:39:25","modified_gmt":"2026-02-27T09:39:25","slug":"closing-the-hole-between-textual-content-and-speech-understanding-in-llms","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=12216","title":{"rendered":"Closing the Gap Between Text and Speech Understanding in LLMs"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p>Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts, and even cascaded pipelines, on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD (Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation), which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. 
Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.<\/p>\n<ul class=\"links-stacked\">\n<li>\u2020 Universit\u00e9 de Toulon, Aix Marseille Universit\u00e9, CNRS, LIS<\/li>\n<\/ul>\n<\/div>\n\n","protected":false},"excerpt":{"rendered":"<p>Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts, and even cascaded pipelines, on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":12218,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[55],"tags":[1915,1433,1112,1233,3085,2742],"class_list":["post-12216","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","tag-closing","tag-gap","tag-llms","tag-speech","tag-text","tag-understanding"],"_links":{"self":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/12216","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=12216"}],"version-history":[{"count":1,"href":"https:\/\/techtrendfee
d.com\/index.php?rest_route=\/wp\/v2\/posts\/12216\/revisions"}],"predecessor-version":[{"id":12217,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/12216\/revisions\/12217"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/media\/12218"}],"wp:attachment":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=12216"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=12216"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=12216"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}