Final Report Submitted to the Department of Electrical and Systems Engineering

Real-Time ML-Based Audio Classification Device for Accessibility and Safety

Everyday environments are filled with informational sounds — such as fire alarms, doorbells, microwave beeps, ringing phones, running water — that people who have auditory disabilities, or are simply out of earshot, can miss.

Group Members
Shua Jeon · Kevin Gutierrez Velasco · Keshav Vasanth
BS Electrical Engineering
McKelvey School of Engineering
Faculty Instructor & Project Client
Prof. Dorothy Wang
Dept. of Electrical & Systems Engineering
Project Period
February 2026 – April 2026
Date of Submission
May 1, 2026
1

Executive Summary

Overall Objective and Problem Statement

Everyday environments are filled with informational sounds (such as fire alarms, doorbells, microwave beeps, ringing phones, running water) that people who have auditory disabilities, or are simply out of earshot, can miss. The objective of this project was to design and build a real-time, ML-based audio classification device that reliably identifies a fixed set of accessibility-relevant sounds (even in noisy conditions) and communicates them through a visual display. The requirements included real-time operation on two microphone circuits and usable classification accuracy at low signal-to-noise ratios (SNRs).

Aims and Objectives

We aimed to design and construct a circuit using two ESP32 microphone units that transfer audio data over Wi-Fi to a Raspberry Pi, with the data then passed on to a laptop for real-time visual display. We then aimed to build a digital audio-processing and classification pipeline capable of cleaning noisy audio and classifying a set of five sound events: fire alarm, phone ringtone, microwave beep, doorbell, and water running. Fire alarms received particular emphasis because they can signal a life-threatening situation. Lastly, we aimed to integrate the hardware and software into a continuously running system, test its accuracy across different noise levels, and tune the operating parameters for the best compromise between accuracy and latency.

Key Outcomes

The final prototype is a real-time audio classifier running on two ESP32 microphone inputs, with a graphical user interface (GUI) that can be displayed on a personal computer. After digital noise processing, the pipeline recovered a large fraction of the accuracy lost to room noise without retraining the ML model (YAMNet). A tuned saturation coefficient Ω_K was found, and overall classification accuracy increased by at least 75% at every tested SNR value. The system maintains above 50% average correct-label accuracy across all five labels and reaches 100% accuracy on the “Water Running” event down to −1.5 dB SNR. The ESP32 microphone hardware is inexpensive, which would keep an actual product affordable.

2

Background and Justification

Approximately 1 in 8 people in the United States have hearing loss in both ears, and many more experience temporary or situational hearing limitations (sleeping, wearing headphones, being in a different room of the home) [1]. Commercial products addressing this problem (such as visual smoke alarms, synced lighting throughout a house, and app notifications) are usually device-specific, expensive, and difficult to install. A general-purpose sound classifier that operates on a small circuit and wirelessly connects to a visual display (such as a laptop) would fill this gap.

This project also builds on the ESE 3310 microphone-circuit laboratory developed by Prof. Wang, redesigning the microphone (model CMA-6542PF) amplifier circuit into an integrated audio-visual interface that detects and classifies a fixed set of sounds across a wide range of auditory environments, and displays the resulting identification visually in real time [2].

The contribution is not a new neural network, but rather an integrated hardware and software system in which audio pre-processing, conditional detections and boosts, and score fusion are combined to make a stock ML model usable in noisy real-world conditions.

YAMNet is an open-source audio classification model trained on 521 audio classes [3]. It was chosen as the foundation of the audio classification program because of its broad classification capability and the large amount of available instructions, literature, and walkthroughs covering its use.

3

Methods

System Overview

The complete system consists of two ESP32 microphone circuits with analog filtration and wireless data transfer to a laptop, and software implementing the processing and classification pipeline shown in Figure 1.

Figure 1. Audio event processing pipeline.

Pre-Processing Front End

Spectral Gate Noise Suppressor: This part of the system listens for steady background sounds in the room, such as air conditioning, fans, and low hums, and turns them down. To do this, it uses a Short-Time Fourier Transform (STFT), which divides the incoming audio into short slices and measures which frequencies are present in each slice. To determine what counts as noise, the program takes the quietest recent slices and treats them as the room's baseline noise. It then creates a mask that lowers the magnitude of those frequencies rather than removing them completely, which prevents potentially important audio from being lost entirely. Because the spectral profile of noise differs between environments, the program continually updates its estimate of the baseline noise.
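The spectral gate described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names, the frame and hop sizes, the 10th-percentile noise estimate, the threshold factor, and the 12 dB reduction are all assumed values chosen for the example. Resynthesis of the audio (inverse STFT with overlap-add) is omitted.

```python
import numpy as np

def stft_frames(x, frame_len=512, hop=256):
    """Split x into overlapping Hann-windowed frames and return their rFFT."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, n_bins)

def spectral_gate_gain(mag, percentile=10.0, threshold=2.0, reduction_db=12.0):
    """Per-bin gain: bins near the noise floor are turned down, not zeroed.

    All default values here are assumptions for illustration.
    """
    # Noise floor per frequency bin, estimated from the quietest frames.
    noise_floor = np.percentile(mag, percentile, axis=0)
    floor_gain = 10.0 ** (-reduction_db / 20.0)
    # Bins that rise well above the floor pass through unchanged.
    is_signal = mag > threshold * noise_floor
    return np.where(is_signal, 1.0, floor_gain)
```

Multiplying the complex STFT by this gain attenuates a steady hum while leaving a newly arriving tone untouched. Recomputing the percentile over a rolling buffer of recent frames would let the estimate follow a changing room.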

Voice-Band Notch Filter: This second stage turns down the range of frequencies that human voices occupy, roughly 100 to 3,400 Hz. The reduction is deliberately capped at 8 dB so that events such as doorbells and running water (which occasionally fall in this range as well) are not fully cut off. The band stops at 3.4 kHz because alarms and ringtones contain many high-pitched overtones above that frequency.
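As a rough illustration of a capped band reduction, the sketch below scales every FFT bin between 100 Hz and 3,400 Hz by −8 dB. It is an assumption-laden simplification: the function name is hypothetical, it operates on a whole buffer at once, and it uses hard band edges, whereas a real-time implementation would more likely use a block-based or IIR/FIR filter with smoother transitions.

```python
import numpy as np

def attenuate_voice_band(x, fs, low_hz=100.0, high_hz=3400.0, cut_db=8.0):
    """Turn the voice band down by cut_db without removing it entirely."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    gain = np.ones_like(freqs)
    in_band = (freqs >= low_hz) & (freqs <= high_hz)
    gain[in_band] = 10.0 ** (-cut_db / 20.0)  # -8 dB is about 0.40 in amplitude
    return np.fft.irfft(spectrum * gain, n=len(x))
```

An 8 dB cut leaves roughly 40% of the in-band amplitude, which is why in-band events such as doorbell tones remain detectable after this stage.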

Classifier and Auxiliary Detectors

The cleaned 5-second audio window is fed into YAMNet, which returns a matrix of confidence scores, one per label. Because YAMNet underweights very short transient sounds (knocks and single beeps) and very long sustained tonal sounds (some alarms and ringtones), two detectors run in parallel and conditionally adjust the scores in the pipeline:

  • Transience detector. Computes spectral flux (the difference between consecutive frames) to find the sharp onsets created by knocks and beeps. If an onset is detected, a multiplier of 1.5 (a 50% boost) is applied to the labels under the “knock” and “beep” categories (e.g. “door”, “bump”, “beep boop”, “ping”) to increase their confidence scores.
  • Harmonic peak detector. Searches for the opposite of a transient: narrow, steady tones, which are usually characteristic of fire alarms and ringtones. If such a tone is detected, the same multiplier of 1.5 is applied to the labels under the “alarm” and “ringtone” categories (e.g. “ring”, “music”, “boom”).
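The two detectors and the conditional boost could be sketched as below, operating on an STFT magnitude array of shape (frames, bins). Only the 1.5 multiplier comes from the text. The function names, the detection ratios, and the choice of positive-only flux are assumptions made for illustration, not the project's implementation.

```python
import numpy as np

BOOST = 1.5  # multiplier stated in the text (50% boost)

def spectral_flux(mag):
    """Sum of positive magnitude increases between consecutive frames."""
    diff = np.diff(mag, axis=0)
    return np.sum(np.maximum(diff, 0.0), axis=1)

def has_transient(mag, ratio=5.0):
    """Flag a sharp onset when peak flux dwarfs the median flux (ratio assumed)."""
    flux = spectral_flux(mag)
    return bool(flux.max() > ratio * (np.median(flux) + 1e-9))

def has_steady_tone(mag, ratio=10.0):
    """Flag a narrow steady tone: one bin dominates the time-averaged spectrum."""
    mean_spectrum = mag.mean(axis=0)
    return bool(mean_spectrum.max() > ratio * (np.median(mean_spectrum) + 1e-9))

def apply_boost(scores, labels_to_boost, detected):
    """Return a copy of scores with the chosen labels multiplied by BOOST."""
    if not detected:
        return dict(scores)
    return {k: v * BOOST if k in labels_to_boost else v
            for k, v in scores.items()}
```

For example, `apply_boost(scores, {"beep", "ping"}, has_transient(mag))` raises only the transient-related labels, and only when an onset was actually found.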

Event Fusion and Scoring

Since YAMNet spreads its confidence across many output labels rather than committing to a single event, the five target events (fire alarm, phone ringtone, microwave beep, doorbell, and water running) are each defined by a list of related YAMNet sub-labels (e.g. “water”, “faucet”, “drip”, and “shower” all fall under the event “water running”). For the event with the highest confidence in a given sample, the matching sub-label scores are summed and passed through an exponential-saturation function:

Final_confidence = 1 − exp( −Ω_K · Σ sublabel_scores )

where Ω_K is the coefficient that controls how strongly the sum of sub-label confidence values supports the claim of the top sound. The exponential form guarantees that the final confidence value stays between 0 and 1 regardless of how large the summed score becomes. This equation is crucial for classification because it allows evidence to be combined without overstating the overall confidence level.
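A minimal sketch of the fusion formula is shown below. The function name and the example sub-label scores are made up for illustration. Only the formula and the Ω_K = 1.0 default come from the report.

```python
import math

def fuse_event_confidence(sublabel_scores, omega_k=1.0):
    """Final_confidence = 1 - exp(-omega_k * sum of sub-label scores)."""
    total = sum(sublabel_scores)
    return 1.0 - math.exp(-omega_k * total)

# Hypothetical scores for "water", "faucet", "drip", "shower" (sum = 0.70).
water = fuse_event_confidence([0.30, 0.25, 0.10, 0.05])
print(round(water, 3))  # about 0.503

# Even a very large summed score stays below 1.
print(fuse_event_confidence([0.9, 0.9, 0.9, 0.9]) < 1.0)  # True
```

Four weak sub-label scores, none of which would pass a 0.20 threshold on its own except the first two, combine into a single event confidence of about 0.5.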

An event is reported only after it passes a multi-frame voting check. VOTES_REQUIRED is the number of times a specific sound must rank first among the last 10 inferences (VOTE_WINDOW) before it is declared the confirmed sound. Every reported event then has a minimum cooldown before it can be triggered again. The full set of tunable parameters and their final values is summarized in Table 1.

Table 1. Tunable parameters of the audio classification program and their final operating values.
| Parameter | Final value | Role |
| --- | --- | --- |
| MIN_SCORE | 0.20 | Minimum per-event confidence that counts as a vote. |
| VOTE_WINDOW | 10 | Rolling buffer of the last N inferences used for voting. |
| VOTES_REQUIRED | 2 | Minimum agreeing votes within the window before an event fires. |
| COOLDOWN_S | 6.0 s | Minimum gap between two triggers of the same event. |
| INFER_WINDOW_S | 5.0 s | Length of audio window passed to YAMNet on each inference call. |
| INFER_EVERY_S | 0.5 s | How often inference runs (heavy window overlap). |
| Ω_K | 1.0 | Coefficient in the exponential-saturation event fusion; controls how influential the sum of sub-label confidences is. |
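The voting and cooldown logic governed by these parameters might look like the sketch below. The constant values are the final values from Table 1. The class name, method name, and exact ordering of the checks are assumptions for illustration.

```python
from collections import deque

MIN_SCORE = 0.20
VOTE_WINDOW = 10
VOTES_REQUIRED = 2
COOLDOWN_S = 6.0

class EventVoter:
    """Hypothetical helper: fires an event after enough recent agreeing votes."""

    def __init__(self):
        self.recent = deque(maxlen=VOTE_WINDOW)  # last N top-event votes
        self.last_fired = {}                     # event name -> time it last fired

    def update(self, top_event, confidence, now_s):
        """Record one inference; return the event name if it fires, else None."""
        vote = top_event if confidence >= MIN_SCORE else None
        self.recent.append(vote)
        if vote is None:
            return None
        if sum(1 for v in self.recent if v == vote) < VOTES_REQUIRED:
            return None
        last = self.last_fired.get(vote)
        if last is not None and now_s - last < COOLDOWN_S:
            return None  # still cooling down
        self.last_fired[vote] = now_s
        return vote
```

With inference every 0.5 s, two agreeing votes mean an event can fire about half a second after it is first recognized, after which the same event stays silent for 6 s.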

Hardware Implementation

Two integrated ESP32 microphone circuits were built. Each board has an op-amp-based gain stage ahead of the ESP32, which transmits the signals to the Raspberry Pi over Wi-Fi. The Pi drives a GUI that displays the most recently detected event.

Figure 2. Circuit diagram of one ESP32 microphone setup.

Development Cycle

The work was divided into hardware and software. During software development, YAMNet was analyzed to understand which of the initial parameters had to be changed in order to improve accuracy and decrease false positives. This included tailoring VOTES_REQUIRED to specific sounds: for short burst sounds with long pauses in between, such as a doorbell (and possibly a phone ring), it makes sense for that number to be lower than for the rest. Additionally, the spectral-flux transience detector and the harmonic peak detector were developed and added to the algorithm to help with these kinds of distinctions between sounds. Before the software and hardware were integrated, all testing of the algorithm was done on a laptop to save time while the device was being developed. Testing was done in three different environments (a library, a dorm kitchen, and a building hallway), which were used to profile realistic noise floors.

The hardware development started with the microphone amplifier circuit from the ESE 3310 lab. First, we modified this circuit so that the input from the microphone passes through an analog-to-digital converter (ADC) to be processed on a Raspberry Pi.

Figure 3. Modified preamplifier circuit from Mic Lab [2].

Several changes were made to the original lab circuit. Normally, sound inputs in the time domain are centered at 0 amplitude, but this does not work for our circuit. Because the Raspberry Pi runs on 3.3 V, any voltage outside 0 to 3.3 V cannot be interpreted. Therefore, a voltage bias of 1.65 V must be set so that the sound inputs are centered at 1.65 V, and the circuit must limit the voltage to a maximum of 3.3 V and a minimum of 0 V. The bias is applied by changing R4 to 10 kΩ, and the voltage is lowered by a voltage divider before the signal is fed to the MCP3008. VCC was decreased to 9 V for future use of a 9 V battery in PCB designs. To test this new circuit, we used an oscilloscope and MATLAB to observe the output of the op-amp through the MCP3008 circuit.
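To make the role of the 1.65 V bias concrete, the sketch below converts a raw reading from the 10-bit MCP3008 back into a zero-centered audio sample. It assumes the ADC reference is tied to 3.3 V, which the report does not state, and the function name is hypothetical.

```python
V_REF = 3.3          # assumed ADC reference voltage
V_BIAS = V_REF / 2   # 1.65 V mid-supply bias
ADC_MAX = 1023       # largest code from a 10-bit converter

def adc_code_to_sample(code):
    """Map an ADC code (0..1023) to a sample in roughly [-1, 1]."""
    voltage = code * V_REF / ADC_MAX
    return (voltage - V_BIAS) / V_BIAS

print(adc_code_to_sample(0))     # most negative swing
print(adc_code_to_sample(1023))  # most positive swing
```

A silent input sits near code 512, which maps to a sample near zero. Without the bias, the negative half of every waveform would fall below 0 V and be clipped.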

From listening tests, it was clear that the USB microphone produced higher-quality audio than the MCP3008 microphone circuit. Furthermore, the FFT of the output showed that the data captured through the ADC was limited to 2.4 kHz. This limitation of the ADC circuit prevented any analysis of fire alarm and microwave sounds, which show peaks at 2 kHz and 3 kHz respectively. This was a serious problem because detecting these signals is a priority for our circuit, so we explored other options for the microphone circuits.

Our second approach was the ESP32 microphone circuit. The ESP32 is advantageous because it has Wi-Fi capability, which can be used to transfer audio data to the Raspberry Pi, and a built-in ADC. As shown in Figure 2, the microphone circuit is now attached to the ESP32 rather than the Raspberry Pi. This design gives us the flexibility to collect audio data from multiple ESP32 units, each of which transfers data to the Raspberry Pi. The circuit changed in two ways: because the ESP32 unit is powered by 3.3 V, we powered all parts of the circuit with 3.3 V instead of 9 V, and we removed the voltage divider.

We collected audio data from the ESP32 by uploading a program through the Arduino IDE that starts data collection and streams it over a specified Wi-Fi port. We then plotted a sample of the data with a Python script run from the terminal. Data collection showed that the ESP32 captured frequencies high enough for the microwave and fire alarm signals. We used the ESP32 circuit for the rest of the project, and later built a second ESP32 unit and added an AA battery holder for portable use.

Initial Errors and Mitigations

  • Hardware mismatch. The ESP32 circuits initially displayed incorrectly in the program because of inconsistent normalization and unstable Wi-Fi connections through the institution network. Both were eventually rectified. The ESP32s can only handle audio at 10 kHz, so the algorithm must resample the received audio before passing it to YAMNet. Before the software and hardware were integrated, the algorithm was developed using the microphone of a personal computer, which is of higher quality than the one on the device. Consequently, initial testing led to multiple changes to the algorithm to account for the shortcomings of the microphone and ESP32, including resampling and averaging to clean up the received audio signal.
  • Noise-floor drift. A fixed noise estimate failed when the device was moved between environments (especially from the library to the kitchen). The fix was a rolling percentile-based estimate.
  • Latency vs. accuracy. A 10-second analysis window gave the best accuracy but felt too unresponsive. A 5-second window with overlapping 0.5-second loops was chosen as the final compromise.
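The resampling step mentioned under "Hardware mismatch" can be illustrated with SciPy's polyphase resampler. This is a sketch under assumptions: the report does not say which resampling method was used, and the function and constant names are hypothetical. The 16 kHz target is the input rate YAMNet expects.

```python
import numpy as np
from scipy.signal import resample_poly

ESP32_RATE_HZ = 10_000   # rate the ESP32 units deliver, per the text
YAMNET_RATE_HZ = 16_000  # input rate expected by YAMNet

def to_yamnet_rate(x):
    """Resample 10 kHz audio to 16 kHz (ratio 8/5) with a polyphase filter."""
    y = resample_poly(x, up=8, down=5)
    return y.astype(np.float32)  # YAMNet takes a mono float32 waveform

# One second of a 1 kHz tone captured at 10 kHz becomes 16,000 samples.
t = np.arange(ESP32_RATE_HZ) / ESP32_RATE_HZ
resampled = to_yamnet_rate(np.sin(2 * np.pi * 1000 * t))
print(len(resampled))  # 16000
```

Resampling changes only the sample spacing. Content above 5 kHz (the Nyquist limit of the 10 kHz capture) is never recorded and cannot be recovered afterwards.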

4

Results

Expected vs. Actual Audio Plots

As mentioned previously, initial testing of the algorithm was done on a computer with a higher-quality microphone. As a result, the patterns and distinctions visible in the audio plots of the five target sounds were not as clear when the same sounds were captured through the ESP32 and lab circuit.

Figure 4. Waveform and FFT of the microwave audio recorded on the laptop (left) and prototype (right).

As seen in Figure 4 (left), the microwave audio has two clear distinguishing features: short ringing bursts in the waveform and a strong concentration of amplitude around 2 kHz. When the same audio is analyzed through the prototype (Figure 4, right), the bursts are not as clear and the amplitude is no longer concentrated at 2 kHz. These changes required adjustments to the algorithm, because filter ranges had been placed on certain sounds to reduce noise. In addition, the checks implemented to prevent fire alarms from being misidentified as microwaves (and vice versa) also had to be adjusted.

Effect of the Fusion Coefficient Ω_K

The number of correctly triggered events was measured as Ω_K was varied (Figure 5). The detection rate rose as Ω_K increased from 0 toward 1, then plateaued. Above Ω_K = 1 there was no measurable gain in accuracy, so Ω_K = 1.0 was selected as the final operating value.

Figure 5. Accuracy of triggered events as a function of the fusion coefficient Ω_K.

Effect of SNR on Triggering and Accuracy

With Ω_K = 1.0 fixed, the system was evaluated against test recordings mixed to SNRs ranging from −1.5 dB to 9.2 dB. Figure 6 shows the number of events triggered as a function of SNR, and Figure 7 shows the corresponding correct-label rate (the fraction of triggers that matched the actual event).

Figure 6. Number of triggered events per label as a function of SNR (dB).
Figure 7. Accuracy per event as a function of SNR (dB).
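For readers who want to reproduce a test at a given SNR, the sketch below scales a noise recording so that the power ratio matches a target value in dB before adding it to a clean event recording. The report does not describe how its test recordings were mixed, so this is one standard approach and the function names are hypothetical.

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Scale noise so 10*log10(P_signal / P_noise) equals snr_db, then add it."""
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    scaled_noise = noise * np.sqrt(target_p_noise / p_noise)
    return signal + scaled_noise, scaled_noise

def measured_snr_db(signal, noise):
    """SNR in dB from the mean power of each component."""
    return 10.0 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))
```

At −1.5 dB the noise carries about 1.4 times the power of the event itself, which gives a sense of how demanding the lowest tested condition is.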

Headline Performance Numbers

  • With the two-stage denoising, the conditional detector boosts, and the exponential-saturation fusion enabled, the system produced event triggering down to −1.5 dB SNR, for which the stock YAMNet produced no triggers.
  • Across the tested SNR range, score fusion with Ω_K = 1.0 increased label accuracy by at least 75% relative to the same pipeline run without score fusion.
  • Averaged across all five labels, the system maintained above 50% label accuracy across all SNRs.
  • The “Water Running” event was the strongest performer, achieving 100% correct-label accuracy down to −1.5 dB SNR, because running water is highly distinct from the other four event sounds. When one of the five event sounds was played, the final program triggered an event nearly 100% of the time across all tested SNRs, but its accuracy in distinguishing among fire alarm, phone ringtone, microwave beep, and doorbell was around 50%. These four sounds overlap significantly: a loud microwave beep can be indistinguishable from a distant fire alarm, and a short phone ringtone might have exactly the same tone as a doorbell. Even a human might have trouble classifying them when they do not match the stereotypical sound associated with each event.

GUI Implementation

A Flask web application was created on the Raspberry Pi to publish the detection results, so that a computer can open the page using the Pi's IP address. The GUI displays the detected sound, its confidence value, the current time, and which microphone detected the sound.

Figure 8. Example display of GUI when sound is detected vs. no sound detected.
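A minimal Flask sketch of such a status page is shown below. It is an illustration only: the `/status` route, the `report_event` helper, the `latest` dictionary, and the JSON layout are hypothetical and do not describe the project's actual GUI code. The four fields mirror the information the GUI is said to display.

```python
from datetime import datetime
from flask import Flask, jsonify

app = Flask(__name__)

# Most recent detection, overwritten by the classifier loop (hypothetical layout).
latest = {"event": None, "confidence": 0.0, "mic": None, "time": None}

def report_event(event, confidence, mic):
    """Store the newest detection so the web page can show it."""
    latest.update(event=event, confidence=confidence, mic=mic,
                  time=datetime.now().strftime("%H:%M:%S"))

@app.route("/status")
def status():
    """Return the latest detection as JSON for the browser to render."""
    return jsonify(latest)

# To serve on the local network, one option is: app.run(host="0.0.0.0", port=5000)
```

A browser on another computer would then poll the Pi's address at `/status` and update the display whenever the event field changes.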
5

Discussion

Satisfaction of Project Aims

All three project aims were met. 1) Hardware: produced two functional ESP32 microphone front ends with wireless data transfer and a visual display. 2) Software: produced a digital filtering and machine-learning pipeline that classifies the five target events. 3) Integration and Validation: produced a continuously running system whose accuracy was characterized over realistic SNR ranges and whose key parameters were tuned to operating points that balance accuracy and responsiveness.

Strengths and Limitations of the Approach

The ESP32 circuits and auxiliary detectors do not attempt to outperform YAMNet. Instead, they pre-clean the input and cover its weak spots, while the label fusion turns YAMNet's diffuse confidence into more accurate event decisions. The voice-band notch is a deliberate tradeoff: it improves performance in chatty rooms but slightly attenuates events whose spectral signature lies inside the speech band, such as some doorbell tones. The lowest SNR value (−1.5 dB) was set by the test recordings used, and performance below that level was not directly characterized.

Performance Tradeoffs

Several performance tradeoffs had to be evaluated. Longer windows (10 s) improve accuracy on sustained sounds but can keep short events from triggering and make the system feel laggy. The chosen 5 s window with 0.5 s loops keeps latency low while still averaging out short-term noise. Additionally, an 8 dB notch was empirically the largest reduction that did not noticeably suppress alarm and ringtone sounds in the same band. Finally, increasing Ω_K beyond 1.0 produced no accuracy gain but did increase the rate at which weakly supported events crossed the firing threshold.

Future Work

Future work could add events such as glass breaking, baby crying, and dog barking, all of which exist as YAMNet sub-labels and would only require new aggregation lists. A larger set of events would let users choose the ones that match their needs and preferences. It would also be useful to let users register their own personally relevant sounds (a specific ringtone, or a specific microwave beep) by training a small auxiliary ML model on a handful of recorded examples. Finally, supporting more microphones would extend coverage to more of the user's space.

6

Conclusions

The project demonstrates that careful audio pre-processing and category score fusion can vastly improve the accuracy of a stock pretrained audio-classification model in noisy environments without retraining the model. This entire pipeline is achievable on inexpensive ESP32 hardware, with open-source resources, and with much room for customizability and user-specific refinement.

7

Deliverables

Proposed and Completed Deliverables

The final design includes two functional ESP32 microphone front ends with analog filtration, Wi-Fi data transfer, and a laptop GUI that displays the currently detected event. A complete Python software pipeline was designed and implemented, with parameters that can be tuned to specific environments. The software is able to digitally filter noisy audio and classify a fixed list of sounds using a machine-learning model. A circuit capable of audio detection, wireless communication, and visual display was developed and tested. Finally, this report was written and a project webpage was published.

8

Schedule and Timeline

The project was conducted from February–April 2026. Table 2 lists the originally planned milestones and the team's adherence to them. Per-member responsibilities are detailed in Appendix B.

Table 2. Project schedule, planned completion dates, and adherence.
| Tasks | Planned Date | Status |
| --- | --- | --- |
| Preliminary Research | 01/12 – 01/26 | On Time |
| One Page Proposal | 01/19 – 01/23 | On Time |
| ESE 3310 Lab testing | 01/22 – 01/24 | On Time |
| Drafting Hardware and Software Design | 01/26 – 01/29 | On Time |
| ML Model Implementation | 01/27 – 02/04 | On Time |
| Basic GUI Creation | 01/28 – 01/31 | On Time |
| Raspberry Pi Testing | 01/28 – 02/11 | Delayed |
| Create Full Proposal | 02/02 – 02/06 | On Time |
| Test First Hardware Design | 02/09 – 02/13 | Delayed |
| Debug and Implementations to ML Model | 02/09 – 02/18 | On Time |
| Connect Hardware with Software | 02/23 – 02/27 | Delayed |
| Test First Prototype | 03/02 – 03/09 | Delayed |
| Fixes and Debugging | 03/16 – 03/20 | On Time |
| Additions and Experimenting | 03/23 – 04/08 | On Time |
| Final Testing and Debugging | 04/10 – 04/24 | On Time |
| Final Report | 04/15 – 04/24 | On Time |
B
Appendix B

Project Team

Each member contributed across hardware, software, and documentation, with primary responsibilities as listed below.

Shua Jeon
BS Electrical Engineering

Hardware design and integration, including the ESP32 microphone front-end circuits, analog filtration, and troubleshooting Wi-Fi data transfer.

Kevin Gutierrez Velasco
BS Electrical Engineering

Led debugging of laptop-vs-ESP32 normalization differences and Wi-Fi stability. Researched resampling strategies as well as filters for denoising. Developed the GUI and the exponential event fusion function.

Keshav Vasanth
BS Electrical Engineering

Developed and iteratively refined the main program pipeline, incorporating YAMNet, de-noising functions, auxiliary classical detectors (spectral-flux transience and harmonic-peak detection), and the voting and cooldown logic. Recorded noise-environment characterization data and final program operation/accuracy data.

Project Client

The project client and faculty instructor was Prof. Dorothy Wang. The client's role was to define the high-level scope and project motivations and provide weekly technical feedback. The project also builds directly on the ESE 3310 microphone-circuit laboratory developed by Prof. Wang, whose original analog front-end design was the starting point for the ESP32 microphone circuits.

A
Appendix A

Engineering Design Considerations (ABET)

A.1 — Relation to the BS EE Curriculum

The project relates to the core electrical-engineering curriculum in many ways. ESE 3310 (the microphone-circuit laboratory developed by Prof. Wang) directly provided the analog front-end starting point. Coursework in signals and systems, Fourier analysis, and digital signal processing supported the STFT-based spectral-gate suppressor, the voice-band notch filter, the spectral-flux transience detector, and the harmonic-peak detector. Embedded-systems coursework supported the ESP32 firmware, sampling, and wireless data-transfer pipeline.

A.2 — Incorporation of Engineering Standards

The software was developed in Python using standard scientific-computing libraries (NumPy, SciPy) and TensorFlow/TensorFlow Hub for YAMNet inference, which together served as the de facto standards for audio DSP and ML inference. Audio was sampled at 16 kHz, which is the input rate expected by YAMNet and a common standard for speech-band acoustic processing. Wi-Fi data transfer between the ESP32 boards and the host follows the IEEE 802.11 wireless standard (implemented natively by the ESP32).

A.3 — Constraints Inherent in the Project

The system had to operate continuously and in real time on the given hardware from ESE 3310, classify a fixed five-event vocabulary, and remain useful at low SNRs (in other words, in noisy environments). All components were chosen from the available supply in the Urbauer labs to keep the total cost under the capstone budget. The one-semester timeline bounded both the scope of the event vocabulary and the complexity of the hardware design. Since the device is intended primarily for users with auditory disabilities, the visual display had to convey detected events unambiguously. Fire-alarm and water-running events are safety-relevant, so false negatives would be dangerous. This was the reasoning behind the relatively conservative voting and cooldown thresholds and de-noising parameters.

C
Appendix C

References

  1. H. W. Lin et al., “Self-reported Hearing Difficulty and Risk of Accidental Injury in US Adults, 2007 to 2015,” JAMA Otolaryngology–Head & Neck Surgery, vol. 144, no. 5, pp. 413–417, 2018. [Online]. doi:10.1001/jamaoto.2018.0039.
  2. WashU, “ESE 3310 Electronics Laboratory,” 2026.
  3. TensorFlow, “Transfer learning with YAMNet for environmental sound classification,” TensorFlow Tutorials, Aug. 16, 2024.
  4. Google, “YAMNet (AudioSet) model card,” Kaggle Models.
  5. TensorFlow, “YAMNet README,” tensorflow/models.