TechTrendFeed

Paper Insights: CONTRASTIVE AUDIO-VISUAL MASKED AUTOENCODER | by Shanmuka Sadhu | Jun, 2025

By Admin
June 4, 2025


Shanmuka Sadhu

I'm currently working as a Machine Learning researcher at the University of Iowa, and the specific modality I'm working with is audio. Since I'm just starting the project, I've been reading current state-of-the-art papers and other relevant papers to understand the landscape. The Contrastive Audio-Visual Masked Autoencoder (CAV-MAE) was introduced by researchers from MIT CSAIL and other AI labs. The paper builds on the Masked Autoencoder and allows the audio and visual modalities to be used together.

The Masked Autoencoder (MAE), published in 2022, was state-of-the-art at the time, and it was the main inspiration for this paper. Additionally, the paper states that prior to its publication, contrastive learning and masked data modeling had not been used together in audio-visual learning. By leveraging these two methodologies, the model presented in the paper was able to match or outperform SOTA results on audio and visual classification.

For the visual modality, videos are processed using a vanilla Vision Transformer. For the audio modality, an Audio Spectrogram Transformer (AST) is used. AST was introduced by the same group in 2021 and applies Transformers to spectrograms. For pre-training and fine-tuning, it is common to see audio models use 10-second video clips with the corresponding audio clip. For audio, the waveform is converted to a sequence of 128-dimensional log mel filterbank features computed with a 25-ms Hanning window every 10 ms. A Hanning window is a smoothing function commonly used in audio processing to reduce spectral leakage. The resulting spectrogram is then split into 512 patches of 16×16 and input into the model. To make video processing computationally cheaper, the paper uses a frame aggregation strategy: in a 10-second video clip, 10 RGB frames are uniformly sampled. For training, 1 RGB frame is used, but during inference, the predictions for all 10 RGB frames are averaged. Each RGB frame is cropped, resized, and split into 196 square patches of 16×16.
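The patch counts above follow from simple arithmetic, which the sketch below works through alongside a Hann window and the inference-time frame averaging. The 1024-frame padded spectrogram length, the 224×224 frame size, and the 16 kHz sample rate are assumptions that this post does not state, and the function names are only illustrative.

```python
import math

def hann_window(length: int) -> list[float]:
    """Symmetric Hann (Hanning) window: 0 at both ends, 1 in the middle."""
    if length < 2:
        return [1.0] * length
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def num_patches(height: int, width: int, patch: int = 16) -> int:
    """Count non-overlapping patch x patch squares in a height x width grid."""
    return (height // patch) * (width // patch)

def average_predictions(frame_predictions: list[list[float]]) -> list[float]:
    """Average per-class scores over the sampled frames (inference only)."""
    count = len(frame_predictions)
    return [sum(scores) / count for scores in zip(*frame_predictions)]

# Assumption: a 10-second clip at a 10 ms hop gives about 1000 frames,
# padded here to 1024; each frame has 128 mel bins.
audio_patches = num_patches(1024, 128)   # 64 * 8 = 512

# Assumption: each RGB frame is 224 x 224 after cropping and resizing.
image_patches = num_patches(224, 224)    # 14 * 14 = 196

# Assumption: 16 kHz audio, so a 25 ms window spans 400 samples.
window = hann_window(400)
```

Because the window tapers to zero at both ends, each 25-ms frame fades in and out instead of being cut off abruptly, which is what limits spectral leakage.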

Contrastive Audio-Visual Learning (CAV) takes an audio-visual pair and feeds each half into an independent audio or visual encoder. After the encoder outputs pass through a linear projection, a contrastive loss is minimized. We will go into more detail on how CAV is implemented in the model in a later section.
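As a rough illustration of the idea, here is a minimal sketch of a one-directional, InfoNCE-style contrastive loss over a batch of paired embeddings. It is not the paper's exact formulation: the linear projection is omitted, the temperature value is an assumption, and the function names are hypothetical.

```python
import math

def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def contrastive_loss(audio_embs: list[list[float]],
                     visual_embs: list[list[float]],
                     temperature: float = 0.05) -> float:
    """Mean cross-entropy of matching visual item i to audio item i in the batch.

    The temperature default is an assumed value, not taken from this post.
    """
    audio = [l2_normalize(a) for a in audio_embs]
    visual = [l2_normalize(v) for v in visual_embs]
    batch = len(audio)
    total = 0.0
    for i in range(batch):
        # Similarity of visual item i to every audio item in the batch.
        logits = [sum(p * q for p, q in zip(visual[i], audio[k])) / temperature
                  for k in range(batch)]
        # Log-sum-exp with the max subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / batch

matched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each visual embedding is closest to its own audio embedding (`matched`) and large when the pairs are swapped (`mismatched`), which is the signal that pulls true pairs together.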

Masking has been a popular method in self-supervised learning for creating a supervisory signal, as well as in supervised learning for data augmentation. Masked data modeling is a major self-supervised framework, and the masked autoencoder is one approach to it. For an input sample, an MAE masks a portion of the input and feeds only the unmasked portion into a Transformer model, which learns to "reconstruct" the masked tokens by minimizing an MSE loss. This allows the transformer to learn a meaningful representation of the input data.

The authors then combine contrastive learning with an audio-visual masked autoencoder. They tokenize the audio and visual inputs, add modality-specific positional embeddings, and apply 75% masking. Just like before, the unmasked tokens pass through independent audio and visual encoders. The encoder outputs are then passed through a joint audio-visual encoder 3 times (audio tokens only, visual tokens only, and both concatenated), each pass using different normalization layers. The audio and visual single-modality streams are used for contrastive learning, and the output of the audio-visual multi-modal stream is used for reconstruction. After adding modality-specific embeddings, they put the tokens through a decoder to reconstruct the input audio and image, minimizing an MSE loss. Finally, they sum the contrastive loss and the reconstruction loss to get the loss for CAV-MAE.
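The data flow described above can be sketched with the encoders passed in as plain callables. Everything here is illustrative: `cav_mae_streams`, `cav_mae_loss`, the `mode` argument that stands in for choosing normalization layers, the mean pooling of the single-modality streams, and the `contrastive_weight` parameter are all assumptions, not the authors' code.

```python
from typing import Callable

Tokens = list[list[float]]

def mean_pool(tokens: Tokens) -> list[float]:
    """Average a sequence of token vectors into one vector."""
    return [sum(column) / len(tokens) for column in zip(*tokens)]

def cav_mae_streams(audio_tokens: Tokens, visual_tokens: Tokens,
                    audio_encoder: Callable[[Tokens], Tokens],
                    visual_encoder: Callable[[Tokens], Tokens],
                    joint_encoder: Callable[[Tokens, str], Tokens]):
    """Run the three passes through the joint encoder.

    Returns pooled audio and visual vectors (for the contrastive loss)
    and the fused token sequence (for the reconstruction decoder).
    """
    audio_hidden = audio_encoder(audio_tokens)
    visual_hidden = visual_encoder(visual_tokens)
    # `mode` is a hypothetical switch for the per-stream normalization layers.
    audio_only = joint_encoder(audio_hidden, "audio")
    visual_only = joint_encoder(visual_hidden, "visual")
    fused = joint_encoder(audio_hidden + visual_hidden, "joint")
    return mean_pool(audio_only), mean_pool(visual_only), fused

def cav_mae_loss(contrastive: float, reconstruction: float,
                 contrastive_weight: float = 1.0) -> float:
    """Sum the two losses; the weight is an assumed knob, 1.0 is a plain sum."""
    return reconstruction + contrastive_weight * contrastive
```

Passing the encoders in as arguments keeps the sketch independent of any deep learning framework; in practice each callable would be a stack of Transformer layers.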

Contrastive loss
Reconstruction loss
Final CAV-MAE loss
Diagram of CAV-MAE from the paper

Overall, the authors were able to draw 3 main conclusions. First, contrastive learning and masking are complementary and perform better together than individually. Second, multi-modal pretraining allows the model to perform better on single-modal tasks, since each modality provides richer information to the other than a binary label would. Also, when performing audio retrieval given a video, CAV-MAE outperformed both MAE and CAV individually. The final point is that self-supervised learning of the CAV-MAE model is far more computationally efficient than the other SOTA models.

Audio Retrieval Comparison
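For readers unfamiliar with the retrieval task, the sketch below shows one common way to retrieve audio given a video: rank candidate audio embeddings by cosine similarity to the video embedding. This is an assumed setup for illustration, not the paper's evaluation code, and `retrieve_audio` is a hypothetical name.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(p * q for p, q in zip(u, v))
    return dot / (math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v)))

def retrieve_audio(video_emb: list[float], audio_embs: list[list[float]],
                   top_k: int = 1) -> list[int]:
    """Return indices of the top_k audio clips most similar to the video."""
    scored = [(cosine(video_emb, audio), idx) for idx, audio in enumerate(audio_embs)]
    # Highest similarity first; ties are broken by the lower index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [idx for _, idx in scored[:top_k]]

candidates = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
ranking = retrieve_audio([1.0, 0.0], candidates, top_k=2)
```

A retrieval is counted as correct when the true paired audio clip appears among the returned indices.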