Despite significant developments in automated image captioning, current approaches are not well-aligned with the needs of people with visual impairments. People who are blind or with low vision share a real challenge: their visual impairment leaves them with the time-consuming, and sometimes impossible, task of learning what an image contains without visual assistance. As such, these communities often seek a visual assistant to describe photos they take themselves or find online.
In an ideal world, a fully automated computer vision (CV) approach would provide such descriptions. However, this artificial intelligence (AI) process is riddled with challenges. Not only does existing CV work lack images taken by this population, but people who are blind or with low vision must also passively listen to one-size-fits-all descriptions of images to locate the information that interests them. In addition, CV algorithms often deliver incomplete or incorrect information. Because of these shortcomings, reliable image captioning systems continue to depend on humans to provide descriptions of photos to people with visual impairments.
Determined to improve image captioning for blind and low vision communities, principal investigator and Texas iSchool Assistant Professor Danna Gurari and Associate Professor Ken Fleischmann believe there is a more efficient and effective solution, one that reduces human effort and produces accurate results for communities who are blind or with low vision. They recently embarked on a new project to “design algorithms and systems that close the gap between CV algorithm and human performance for describing pictures taken by both sighted and visually impaired photographers.”
But the Texas School of Information professors weren’t the only ones thinking about how to improve image captioning for people who are blind or with low vision. A team of researchers at Microsoft Research recently announced a similar vision and goal: to train AI systems to provide more detailed captions that offer a richer understanding and a more accurate representation of images for people who are blind or with low vision. In light of this mission, Microsoft Research developed a new project called the Microsoft Ability Initiative.
According to Microsoft Research Principal Researcher and Research Manager Meredith Ringel Morris, “the companywide initiative aims to create a public dataset that ultimately can be used to advance the state of the art in AI systems for automated image captioning.”
After a competitive process involving a select number of universities, Microsoft Research chose The University of Texas at Austin, School of Information as its academic research partner for the new venture. The proposed work of Gurari and Fleischmann was the only project selected through this competition.
The Texas iSchool research team proposed two main tasks: (1) introducing the first publicly available image captioning dataset from people with visual impairments, paired with a community AI challenge and workshop, and (2) identifying the values and preferences of people with visual impairments to inform the design of next-generation image captioning systems and datasets.
“The collaboration builds upon prior Microsoft research that has identified a need for new approaches at the intersection of computer vision and accessibility,” explained Morris.
The Microsoft Research team, which includes Ed Cutrell, Roy Zimmermann, Meredith Ringel Morris, and Neel Joshi, plans to collaborate with UT Austin, School of Information over an 18-month period. Gurari and Fleischmann will lead the UT Austin team, which will also include three PhD students and one postdoctoral fellow.
The Microsoft Ability Initiative builds on the interdisciplinary team’s expertise in computer vision, human-computer interaction, accessibility, ethics, and value-sensitive design. Gurari’s team is experienced in establishing new datasets, designing human-machine partnerships, creating human-computer interaction systems, and developing accessible technology. As co-founder of the ECCV VizWiz Grand Challenge in 2018, Gurari is skilled in community-building and has a track record of creating public datasets that advance the state of the art in AI and accessibility.
Fleischmann’s team offers complementary experience in the ethics of AI and in understanding users’ values to inform technology design. Given his expertise in the role of human values in the design and use of information technologies, Fleischmann will lead the effort focused on uncovering the needs and values of people with visual impairments, which will ultimately inform the design of future image captioning systems.
The Microsoft researchers involved in this initiative have specialized experience in accessible technologies, human-centric AI systems, and computer vision. “Our efforts are complemented by colleagues in other divisions of the company, including the AI for Accessibility program, which helps fund the initiative, and Microsoft 365 accessibility,” explained Morris.
The initiative has been dubbed “a collaborative quest to innovate in image captioning for people who are blind or with low vision.” Morris explained that “the Microsoft Ability Initiative is one of an increasing number of initiatives at Microsoft in which researchers and product developers are coming together in a new, cross-company push to spur innovative and exciting new research and development in the area of accessible technologies.”
Gurari believes that the initiative “will not only advance the state of the art of vision-to-language technology, but it will also continue the progress Microsoft has made with such tools and resources as the Seeing AI mobile phone application and the Microsoft Common Objects in Context (MS COCO) dataset. It will also serve as a great teaching opportunity for Texas iSchool students.”
The Texas iSchool team will employ a user-centered approach to the problem, including working with communities who are blind or with low vision to improve understanding of their expectations of image captioning tools. The team will also host community challenges and workshops to accelerate progress on algorithm development and facilitate the development of more accessible methods to assist people who are blind or with low vision.
Gurari and Fleischmann explain that “this work can empower people with visual impairments to more rapidly and accurately learn about the diversity of visual information, while contributing to solving related problems including image search, visual question answering, and robotics.”
The Microsoft Research team launched the new collaboration with the Texas iSchool during a two-day visit to Austin in January. Morris noted that the Microsoft Research team came away from the meeting at The University of Texas at Austin, School of Information, “even more energized about the potential for this initiative to have real impact in the lives of millions of people around the world.” “We couldn’t be more excited,” she said.
The Texas iSchool professors share the Microsoft Research team’s excitement about their upcoming collaboration. “To be selected for this gift is a great honor,” said Gurari and Fleischmann. “We look forward to working with the Microsoft Research team over the months, and are eager to make progress with our shared goal: to better align image captioning systems with the needs of those who are blind or with low vision.”