When a student presses play on a text-to-speech tool and just listens, they are only getting half the benefit. The real gains in reading comprehension come when students can see each word light up on screen at the exact moment it is spoken. This is not minor UI polish. It is the difference between passive listening and active reading practice, and decades of research explain why.
Dual-Coding Theory: Two Channels Are Better Than One
In 1971, psychologist Allan Paivio proposed dual-coding theory, which holds that human cognition operates through two distinct but interconnected systems: a verbal system that processes language and a nonverbal (visual-spatial) system that processes images, spatial information, and visual patterns.
When information enters through only one channel, it creates a single mental representation. When it enters through both channels simultaneously, it creates two linked representations. This dual encoding makes the information significantly easier to retrieve later.
Text-to-speech with word highlighting activates both channels at the same time. The auditory channel processes the spoken words. The visual channel processes the highlighted text moving across the page. The brain creates connections between what is heard and what is seen, reinforcing the association between the written word and its spoken form.
For struggling readers, this dual input is especially important. Many students with reading difficulties have a disconnect between their decoding ability (turning letters into sounds) and their comprehension ability (understanding meaning). When they try to read silently, so much cognitive effort goes into decoding that little is left for comprehension. TTS with highlighting offloads the decoding work to the audio while keeping the student visually engaged with the text, allowing more cognitive resources for understanding.
Multimodal Learning and the Multimedia Principle
Richard Mayer's cognitive theory of multimedia learning, developed through extensive experimental research at UC Santa Barbara, extends this idea. Mayer demonstrated that people learn better from words and pictures together than from words alone. His multimedia principle has been replicated across hundreds of studies.
The key insight for TTS is that the visual highlighting acts as a form of pictorial annotation. It draws the student's attention to the specific word being spoken, creating a tight temporal and spatial alignment between the auditory and visual information streams. This alignment reduces the cognitive load required to integrate the two streams, which is critical for students who already have limited working memory bandwidth for reading tasks.
Mayer's temporal contiguity principle is particularly relevant: corresponding words and visual elements should be presented simultaneously rather than sequentially. A TTS tool that reads a sentence aloud and then highlights the sentence afterward is less effective than one that highlights each word at the exact moment it is spoken. The synchronization is what creates the learning benefit.
Eye-Tracking Research: Where Students Look
Eye-tracking studies provide direct evidence of how synchronized highlighting changes reading behavior. Research conducted by Woody and colleagues (2018) at the University of Central Florida tracked eye movements of students using TTS with and without word-level highlighting.
Students using TTS without highlighting showed scattered gaze patterns. Their eyes wandered across the page, often settling on images, headers, or unrelated text while the audio played. Essentially, many students were not reading at all; they were just listening.
Students using TTS with word-level highlighting showed fundamentally different patterns. Their gaze followed the highlighted word across each line, mimicking the eye movement patterns of proficient readers. Their fixation durations on individual words were longer and more consistent, and they made fewer regressive saccades (backward eye movements that indicate confusion or lost tracking).
This matters because eye-tracking patterns are not just a measure of attention. They are a measure of reading practice. A student whose eyes are following the highlighted text is rehearsing the left-to-right, word-by-word tracking pattern that fluent reading requires. Over time, this practice strengthens the neural pathways for reading, even when the TTS support is removed.
Sentence Highlighting vs. Word Highlighting
Not all highlighting is created equal. Many TTS tools highlight at the sentence level: the entire sentence turns yellow (or another color) while it is being read aloud. This is easier to implement technically, but it is far less effective for struggling readers.
Sentence highlighting tells the student roughly where the audio is in the text, but it does not help with word-level tracking. A sentence might contain 15 to 20 words. A student with tracking difficulties cannot maintain their place within a highlighted block of 20 words; they need the specific word marked.
Research by Hecker and colleagues found that word-level highlighting was significantly more effective than sentence-level highlighting for students with learning disabilities, particularly for:
- Word recognition: Students exposed to word-level highlighting showed greater improvement in recognizing individual words on subsequent tests.
- Tracking fluency: Students developed better left-to-right tracking habits when each word was individually highlighted.
- Engagement duration: Students spent more time actively reading (eyes on text) with word-level highlighting compared to sentence-level highlighting.
There is also a practical consideration. When a sentence is highlighted as a block, a student who gets lost cannot easily find their place. When individual words are highlighted in sequence, a student can re-orient instantly by looking for the highlighted word.
How AWS Polly Speech Marks Enable Precise Highlighting
Achieving accurate word-level highlighting is technically challenging. The TTS audio must be synchronized with the text display at the individual word level, which requires knowing the exact timestamp (in milliseconds) when each word begins and ends in the audio stream.
Most TTS engines return an opaque audio file: you send in text, you get back an MP3. To synchronize highlighting, you would need to run a separate forced-alignment (speech-to-text) pass on the generated audio, which is computationally expensive and error-prone.
AWS Polly solves this problem with a feature called Speech Marks. When generating audio, you can simultaneously request a speech marks output, which returns a JSON stream with the exact timing of every word:
{"time": 0, "type": "word", "start": 0, "end": 3, "value": "The"}
{"time": 122, "type": "word", "start": 4, "end": 7, "value": "cat"}
{"time": 287, "type": "word", "start": 8, "end": 11, "value": "sat"}
{"time": 510, "type": "word", "start": 12, "end": 14, "value": "on"}
{"time": 621, "type": "word", "start": 15, "end": 18, "value": "the"}
{"time": 744, "type": "word", "start": 19, "end": 22, "value": "mat"}
Each entry tells you: at 122 milliseconds into the audio, the word "cat" begins, and it spans offsets 4 (inclusive) through 7 (exclusive) of the original input text. This allows the highlighting engine to schedule a highlight event for each word at the precise millisecond it is spoken.
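As a sketch of how a client might consume this stream: parse the newline-delimited JSON and reduce it to (time, start, end) triples that drive the highlighter. The helper below is illustrative, not part of any AWS SDK; the stream itself is what Polly's synthesize_speech returns when called with OutputFormat="json" and SpeechMarkTypes=["word"].

```python
import json

def parse_speech_marks(ndjson: str) -> list[dict]:
    """Parse Polly's newline-delimited speech-mark JSON into event dicts."""
    return [json.loads(line) for line in ndjson.splitlines() if line.strip()]

marks = parse_speech_marks(
    '{"time": 0, "type": "word", "start": 0, "end": 3, "value": "The"}\n'
    '{"time": 122, "type": "word", "start": 4, "end": 7, "value": "cat"}\n'
)

# Each triple: when to fire (ms into the audio) and which character span
# of the source text to highlight at that moment.
schedule = [(m["time"], m["start"], m["end"]) for m in marks if m["type"] == "word"]
# schedule -> [(0, 0, 3), (122, 4, 7)]
```

Because the offsets refer to the original input text, the client never has to re-tokenize or guess word boundaries; it highlights exactly the span Polly spoke.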
ReadingVox uses Polly's Neural TTS engine, which produces natural-sounding speech, combined with speech marks to achieve millisecond-accurate word highlighting. The speech marks are generated server-side alongside the audio, cached together, and sent to the client. The extension's playback engine then schedules highlight events using the timestamps, ensuring the visual highlighting stays perfectly synchronized with the audio even at different playback speeds.
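ReadingVox's playback engine is not shown here, but the core lookup can be sketched. The hypothetical helper below finds which word to highlight for a given playback position. One robust design choice: because the position is measured in audio time, polling it keeps the highlight correct at any playback rate; a timer-based scheduler would instead divide each timestamp by the rate.

```python
from bisect import bisect_right

def active_word_index(word_times_ms: list[int], position_ms: float) -> int:
    """Index of the word to highlight at `position_ms` into the audio.

    word_times_ms: ascending start times taken from the speech marks.
    Returns -1 if playback has not yet reached the first word.
    """
    # bisect_right finds how many words have started by position_ms;
    # the last of those is the one currently being spoken.
    return bisect_right(word_times_ms, position_ms) - 1

# Start times from the "The cat sat on the mat" speech marks above.
times = [0, 122, 287, 510, 621, 744]
# At 300 ms into the audio, "sat" (index 2, started at 287 ms) is active.
word = active_word_index(times, 300)  # -> 2
```

At 2x speed the audio position advances twice as fast in wall-clock time, but the lookup is unchanged, which is why position-based highlighting survives speed changes that break naive timer-based approaches.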
Why This Is a Key Differentiator for TTS Tools
When evaluating TTS tools for a school, word-level highlighting should be near the top of the feature checklist. Here is why it matters at a practical level.
IEP and 504 compliance. Many IEPs specify "text-to-speech with highlighting" as an accommodation. Sentence-level highlighting technically meets this requirement, but word-level highlighting is what the accommodation is intended to provide. If a parent or advocate presses the point, the distinction matters.
Building reading independence. The ultimate goal of TTS accommodations is not permanent dependence on audio. It is building reading skills so the student needs the tool less over time. Word-level highlighting actively teaches tracking and word recognition. Audio alone does not.
Student engagement. Teachers consistently report that students are more engaged and on-task when using TTS with word highlighting compared to audio alone. The highlighting gives students a visual anchor that keeps their attention on the text rather than drifting.
Assessment validity. When students use TTS on assessments (where permitted by their accommodation plan), word highlighting ensures they are actually reading the questions and passages, not just listening passively. This makes the assessment results more valid as a measure of comprehension rather than listening ability.
What to Look For
If you are selecting a TTS tool for your school or district, ask these specific questions about highlighting:
- Does the tool highlight at the word level or the sentence level?
- Is the highlighting synchronized with the audio in real time, or is there a noticeable delay?
- Does the highlighting work at different playback speeds (0.5x, 1.5x, 2x)?
- Does the highlighting work on Google Docs and other non-standard web pages, or only on regular web content?
- Can students customize the highlight color for visual comfort?
The answers to these questions will tell you whether the tool is providing a genuine reading support or just playing audio over text. For struggling readers, that difference is everything.