I290 eyetracking

Eye tracking experiment web pages



Project Presentation

Project Goals

The goal of our project is to identify subjects’ eye-gaze patterns when they read on a screen while listening to accompanying audio, either synchronized (verbatim) speech or unsynchronized natural speech, and to measure comprehension of the content read/heard. We also compare these with the gaze patterns and comprehension for a single stimulus (text only or audio only). We use this to gain insight that might help us make a case for (or against) audio-assisted reading.


We stumbled on this idea while brainstorming with Prof. Marti Hearst at the I School, Berkeley. She was very interested in audiobooks and their effectiveness for reading and comprehension. We thought the eye tracking provided by the SMI eye-tracking device would be a great way to validate existing hypotheses and research about audio-assisted reading by giving evidence of readers’ gaze patterns, especially parameters such as fixations and saccades.

Experiment Design

We designed an experiment with 4 variants:
1. Text only
2. Audio only
3. Audio synchronized with text (verbatim)
4. Audio unsynchronized with text (lecture/talk format, free form, with text as a more formal and verbose reference)

We recruited participants on a voluntary basis and gave each of them two comprehension tasks: one with single-stimulus content and the other with double-stimulus content. We assigned them randomly to one of the following four comparison groups:
a) Text alone versus Synchronized audio-assisted text
b) Audio alone versus Synchronized audio-assisted text
c) Text alone versus Unsynchronized audio-assisted text
d) Audio alone versus Unsynchronized audio-assisted text

We were interested only in these four specific groups, not in all possible combinations of the four variants.
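The assignment procedure above can be sketched in a few lines of Python. This is a hypothetical sketch (the group labels and function names are ours, chosen for illustration), showing how each volunteer could be randomly mapped to one of the four comparison groups, each of which fixes the single-stimulus task (Part 1) and the double-stimulus task (Part 2):

```python
import random

# The four comparison groups from the experiment design:
# (single-stimulus task, double-stimulus task)
GROUPS = [
    ("text_alone",  "sync_audio_text"),    # a) Text vs. synchronized audio-assisted text
    ("audio_alone", "sync_audio_text"),    # b) Audio vs. synchronized audio-assisted text
    ("text_alone",  "unsync_audio_text"),  # c) Text vs. unsynchronized audio-assisted text
    ("audio_alone", "unsync_audio_text"),  # d) Audio vs. unsynchronized audio-assisted text
]

def assign_group(participant_id, rng=random):
    """Randomly pick a (Part 1, Part 2) task pair for one participant."""
    single, double = rng.choice(GROUPS)
    return {"id": participant_id, "part1": single, "part2": double}

assignment = assign_group(7)
```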

Before the experiment, participants were given information and instructions about the experiment as well as a consent form. After the experiment, we collected their self-reported English proficiency level (the text and audio were in English) on the ILR scale and general demographic information (age, gender), and also asked them informal, qualitative questions about their experience.

In the actual experiment, participants were required to do two comprehension tasks. Each task had about 4 minutes of reading and/or listening (with the goal of answering questions later, which participants were informed of before they started). After the reading/listening, they were presented with a set of questions based on the text/audio. Questions were multiple choice and a mix of inference/reasoning (e.g., What do you think was the cause of X?) and recall (e.g., For how many days did person A do X?). The questions were not timed (participants could take as long as they liked to answer), but we recorded the time taken per question. The reading/listening, however, was timed. Participants were not allowed to refer back to the text while answering questions, so we took this into consideration when setting the questions. They were also not allowed to take notes during the reading/listening, nor to pause the audio to read for longer.

We recorded eye-gaze data for each experiment. For the audio-only condition, we showed a static image consistent with the content instead of accompanying text, and we did not record gaze during listening. We did, however, record gaze while the participant answered questions. During questions, we expected fixations on different words in the questions/options to give us a hint about doubt and deliberation, and about the difficulty level of the question, whereas a lack of fixation could suggest confidence or carelessness. We also collected mouse-click data, which hinted at uncertainty when a person switched between answer options multiple times before submitting an answer.
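The click-based uncertainty signal described above reduces to counting option changes before submission. A minimal sketch, assuming the click log for one question is simply the sequence of option labels clicked in order (the log format and function name are our assumptions, not the SMI toolkit's):

```python
def answer_switches(click_log):
    """Count how many times a participant changed their selected option
    before submitting. click_log is a list of option labels in click order;
    more switches suggests more uncertainty on that question."""
    switches = 0
    for prev, cur in zip(click_log, click_log[1:]):
        if cur != prev:
            switches += 1
    return switches

# A participant who clicked B, then C, then back to B before submitting:
answer_switches(["B", "C", "B"])  # -> 2
```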

We used content from http://www.manythings.org/listen/, a repository of texts with accompanying audio. We selected two:
http://www.manythings.org/voa/how/2076.html (for single stimulus)
http://www.manythings.org/voa/things/7027.html (for double stimulus)

Each subject received both of these. The texts and questions did not require or assume any specialized prior knowledge on the part of the participants. We built web pages from this text, embedded the respective audio files in them, and used these pages to run the experiments.

We modeled our questions on the ETS Reading Comprehension sample questions.

For the unsynchronized audio, we recorded a natural, informal rendition of the text in our own voice and words, retaining all the essential facts while leaving out extraneous formal details. Here the content was interspersed with natural expressions, emphasis, repetitions, prompts, etc., as deemed necessary.

We used the time taken to answer questions and the accuracy of the answers for each task as our metrics for the effectiveness of comprehension.


We controlled the experiment through the following steps:

  • Each participant received two different pieces of content for their two comprehension tasks, to eliminate a learning effect on the content. There might still be some learning effect with the UI itself, but this could not be controlled and would hopefully be negligible.
  • We used the same two pieces of content for all four variants mentioned above, with the same overall time for reading/listening. This ensured the difficulty level was the same across variants.
  • For audio, we controlled the playback speed and ensured it was the same in the cases that had audio in both parts.
  • The same questions were asked of all participants.
  • Audio speed did differ between the synchronized and unsynchronized audio, because we compare audiobook-standard audio with the natural speech one would use in a lecture/talk.

    Some pictures from the experiment -


    Once the participants completed the experiment, we did a small qualitative interview to understand and record their experience. Some questions we asked them were:

  • What was your experience during the experiment?
  • Which task did you like better?
  • Which task did you find easier/tougher?
  • Which task did you think you did better on?
  • What were some difficulties you faced in any/both tasks?
  • Do you listen to audiobooks often? Why/why not? What is your experience with them?
  • When/under what conditions do you think audio-assisted reading would be useful to you?

We also showed them their gaze patterns immediately afterwards and let them walk us through what and how they read. This was interesting and useful for correlating their accounts with the findings from their gaze patterns and comprehension performance metrics.

The SMI toolkit provides visualizations such as heat maps, focus maps, swarm path charts, and scan charts for each user’s gaze data, which helped us compare differences in users’ reading styles.


    Gaze patterns

This video shows the gaze patterns of two users reading text. No audio was involved in this case.

This video shows the gaze patterns of two users reading text along with unsynchronized/paraphrased audio playing in the background.

This video shows the gaze patterns of two users reading text along with synchronized audio playing in the background.

Key observations from gaze patterns

    • People fixate more on nouns, beginnings of paragraphs, and fact-rich areas
    • Saccades in the text-only and synchronized-audio groups are more linear
    • Saccades in the unsynchronized-audio group are scattered across the page
    • Some people read the beginnings of paragraphs and trail off, mentally predicting/inferring the rest of the sentence
    • Some people concentrate on the middle of paragraphs, which have more facts, and skip the noise in the text
    • People keep backtracking to comprehend/retain facts when they know they will be tested on them later
    • Fixations, saccades, and mouse clicks gave us insight into uncertainty and deliberation during questions
    • Reactions to single stimulus vs. two stimuli were mixed
    • People liked the natural, reinforcing/introductory prompts in the unsynchronized audio
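The fixation/saccade distinction underlying these observations can be sketched with a dispersion-threshold classifier (the classic I-DT idea): gaze samples that stay within a small spatial window for long enough form a fixation, and the jumps between fixations are saccades. This is a minimal illustrative sketch, not the SMI toolkit's algorithm; the thresholds and the sample data are invented for the example:

```python
def idt_fixations(samples, max_dispersion=25.0, min_duration_ms=100.0):
    """Detect fixations in gaze samples [(t_ms, x, y), ...] by dispersion
    thresholding: a fixation is a run of samples whose x-spread + y-spread
    stays under max_dispersion for at least min_duration_ms.
    Returns a list of (start_ms, duration_ms, centroid_x, centroid_y)."""
    fixations = []
    i, n = 0, len(samples)
    while i < n:
        j = i
        # Grow the window while its dispersion stays under the threshold.
        while j + 1 < n:
            xs = [s[1] for s in samples[i:j + 2]]
            ys = [s[2] for s in samples[i:j + 2]]
            if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion:
                break
            j += 1
        duration = samples[j][0] - samples[i][0]
        if j > i and duration >= min_duration_ms:
            xs = [s[1] for s in samples[i:j + 1]]
            ys = [s[2] for s in samples[i:j + 1]]
            fixations.append((samples[i][0], duration,
                              sum(xs) / len(xs), sum(ys) / len(ys)))
            i = j + 1  # next window starts after this fixation
        else:
            i += 1     # saccade sample; slide forward
    return fixations

# Invented example: a stable cluster, a saccade, then a second cluster.
samples = [(0, 100, 100), (50, 102, 101), (100, 101, 99), (150, 103, 100),
           (200, 400, 400), (250, 402, 401), (300, 401, 399), (350, 403, 400)]
fx = idt_fixations(samples)  # two fixations, one per cluster
```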

    Measuring reading comprehension

    Task                           Average Accuracy   Average Time (ms)
    Text + Audio Synchronized      73%                11526
    Text + Audio Unsynchronized    60%                9953
    Order of effectiveness: Audio > Text > Text + Sync Audio > Text + Unsync Audio


    Case (Part 1 / Part 2)    Avg Time Part 1 (ms)   Avg Time Part 2 (ms)   Accuracy Part 1   Accuracy Part 2
    Text / Sync Audio         11637                  11988                  70%               70%
    Audio / Sync Audio        13511                  12210                  100%              90%
    Text / Faster Audio       13418                  10445                  80%               70%
    Audio / Faster Audio      8968                   11461                  60%               60%
    Text / Unsync Audio       12765                  10307                  70%               60%
    Audio / Unsync Audio      13591                  9600                   70%               60%

    Participant ID   Single Stimulus (Part 1)    Double Stimulus (Part 2)
                     Preference   Accuracy       Preference   Accuracy
    2                Yes          100%           No           100%
    3                Yes          80%            No           80%
    4                No           60%            Yes          60%
    5                No           100%           Yes          80%
    6                No           60%            Yes          80%
    7                No           60%            Yes          40%
    8                No           60%            Yes          60%
    9                No           80%            Yes          60%
    10               Yes          60%            No           60%
    11               Yes          80%            No           60%
    12               Yes          60%            No           60%

    More visualizations used for analysis:

    Image showing heatmap for reading patterns for a subject

    Image showing heatmap for reading patterns for a subject

    Image showing heatmap for specific regions for reading patterns for a subject

    Image showing focus map for reading patterns for a subject

    Image showing focus map for reading patterns for a subject

    Qualitative feedback

  • “It would be great if the positioning of the facts would match the text, but I like the prompts in the natural audio”
  • “Though it was annoying, slow audio helped me pay more attention, so I managed to retain more. I think I did better on that.”
  • “I gave priority to audio...(and stopped reading)”
  • “I just read the text and ignored the audio”
  • “I performed better on the first one but liked the second one better”

    Future scope/work

  • More experiments per comparison group
  • Effect of annotations in text on eye-gaze data
  • Effect of accompanying graphics with text on eye-gaze data
  • Increasing complexity, context/type of text
  • Depending on the results of the experiment, we plan to come up with design principles for audiobooks and presentations. We also plan to suggest specific scenarios/applications better suited to each of the four variants of audio and/or text.


Limitations

  • The eye-tracking device does not consider peripheral vision
  • The device cannot track eye-gaze data when the subject is wearing glasses
  • There was no automated way to record answer scores: we had to manually inspect click and gaze data on each question page
  • Fixation does not always imply attention, and vice versa


    Google Drive Folder


    Content and Questions