VisionNav is built with a focus on aiding the visually impaired who are looking for an easily accessible way to navigate their daily lives. The motivation behind creating the app is personal: most of our team knows someone in their family who faces difficulties with vision. Researching the topic further, we found that 7.3 million adults in the U.S. are classified as visually impaired, with 1 million fully blind, of whom roughly 70% are unemployed. This was the primary motivation behind building VisionNav, which aims to give its users an affordable way to move through their daily lives more easily.
VisionNav uses the built-in camera and LIDAR on an iPhone to stream a real-time view of the user's surroundings, and delivers audio through devices like AirPods to guide the user through that environment. Within this setting, VisionNav aims to accomplish three categories of tasks. The first is navigating around obstacles using 3D audio cues played in either the left or right AirPod, directing the user to move in that direction; which cue plays is decided from the proximity and depth of each obstacle, as measured from the real-time streaming data collected by the app. The second is the user's ability to ask the VisionNav AI assistant to locate a specific item in a scene. The assistant interprets the spoken command and passes the requested object to the YOLO model, which searches the frame for it. Once the object is confirmed to be in the frame, the user places their hand in the frame as well, and the app guides the hand toward the target object using the same cues as the navigation behavior. The final behavior covers miscellaneous tasks, such as finding a path to a nearby location, locating an available seat in a room, or even reading a book. These tasks combine the AI assistant and the navigation ability: the app identifies candidate destinations in the frame and navigates toward them while avoiding obstacles, and the assistant switches to navigation mode automatically based on the user's command.
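As a rough illustration of the first behavior, the sketch below shows how a left or right cue might be chosen from per-column depth readings. The function name, column split, and distance threshold are our own illustrative assumptions, not VisionNav's exact logic.

```python
# Minimal sketch of the obstacle-cue decision (names and thresholds are
# illustrative, not VisionNav's actual code).

OBSTACLE_THRESHOLD_M = 1.2  # assumed distance at which an obstacle triggers a cue


def choose_cue(column_depths_m):
    """Pick which AirPod cue to play from per-column minimum depths (meters).

    column_depths_m: list of 5 floats, left-to-right closest distances.
    Returns "left", "right", or None (no cue needed).
    """
    left = min(column_depths_m[:2])    # two leftmost columns
    center = column_depths_m[2]        # middle column
    right = min(column_depths_m[3:])   # two rightmost columns

    # Only react when something directly ahead is closer than the threshold.
    if center >= OBSTACLE_THRESHOLD_M:
        return None

    # Steer toward whichever side has more free space.
    return "left" if left > right else "right"


if __name__ == "__main__":
    # Obstacle dead ahead, more room on the right -> cue the right AirPod.
    print(choose_cue([0.8, 1.0, 0.9, 2.5, 3.0]))  # "right"
```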
VisionNav has a few distinct phases. First, to access the iPhone camera and LIDAR, the app was built in Xcode using Swift, which provided good compatibility with the required services. Next, the real-time frames captured by the iPhone were streamed to a local web server, where a Python script read them. Additional Python functions divided each frame into five equal-width columns and tracked the depth and distance of objects within each column. Each frame was then passed to the YOLO model to identify objects. This data was fed into further Python functions that decided which audio cue to play. Finally, we used the Gemini API to handle the user's voice commands, with Gemini acting as the AI assistant that interprets requests such as locating a specific target.
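The sketch below outlines what this server-side pipeline could look like, assuming the phone posts a JPEG frame and a raw depth map to a local Flask endpoint and that the `ultralytics` YOLO package is used. The endpoint path, payload format, and depth-map dimensions are assumptions for illustration, not the app's actual interface.

```python
# Sketch of the server-side pipeline: receive a frame + depth map, split the depth
# map into five columns, and run YOLO. Endpoint and payload details are assumed.
import numpy as np
import cv2
from flask import Flask, request, jsonify
from ultralytics import YOLO

app = Flask(__name__)
model = YOLO("yolov8n.pt")  # small pretrained model; VisionNav's weights may differ


def column_min_depths(depth_map, n_cols=5):
    """Split the depth map into n equal-width columns and return each column's
    minimum (i.e. closest) distance in meters."""
    cols = np.array_split(depth_map, n_cols, axis=1)
    return [float(np.nanmin(c)) for c in cols]


@app.route("/frame", methods=["POST"])
def handle_frame():
    # Decode the JPEG camera frame sent by the phone.
    frame = cv2.imdecode(
        np.frombuffer(request.files["image"].read(), np.uint8), cv2.IMREAD_COLOR
    )
    # Depth map sent as raw float32 bytes with known dimensions (assumed format).
    depth = np.frombuffer(request.files["depth"].read(), np.float32).reshape(192, 256)

    detections = model(frame, verbose=False)[0]
    labels = [model.names[int(c)] for c in detections.boxes.cls]

    return jsonify({
        "column_depths_m": column_min_depths(depth),
        "objects": labels,
    })


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```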
The main challenge during the implementation of VisionNav was accurately handling multiple objects in close proximity: when the app detected two obstacles at once, it struggled to produce audio cues clear enough to distinguish between them. There was also considerable difficulty with the automated switching between "find an object" mode and "navigate around the obstacle" mode, where the app struggled to treat the identified object as the end goal while still avoiding obstacles on the way toward it. Finally, the voice recognition accuracy of the Gemini speech-to-text integration was a struggle, as transcriptions were not consistently accurate.
The accomplishment we are most proud of is the hand-navigation feature, which guides the user's hand to a target object in the frame. 3D audio cues signal left and right through the corresponding AirPod, while a separate rapid ticking sound marks forward motion and speeds up as the hand gets closer to the target. Gemini interprets the user's request to find the target object, and the YOLO model locates and identifies it in the frame.
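A minimal sketch of that guidance loop is shown below, assuming hand and target bounding boxes in pixel coordinates. The alignment threshold and the mapping from distance to tick interval are illustrative guesses rather than the app's tuned values.

```python
# Sketch of the hand-to-target guidance loop. Bounding boxes are (x1, y1, x2, y2)
# in pixels; thresholds and the tick-interval mapping are illustrative only.
import math


def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def guidance(hand_box, target_box, frame_width):
    """Return (direction_cue, tick_interval_s) for the current frame."""
    hx, hy = center(hand_box)
    tx, ty = center(target_box)

    # Left/right cue depends on where the target sits relative to the hand.
    dx = tx - hx
    if abs(dx) < 0.05 * frame_width:      # roughly aligned: no side cue
        direction = None
    else:
        direction = "right" if dx > 0 else "left"

    # Ticking speeds up as the hand closes in on the target.
    distance = math.hypot(tx - hx, ty - hy)
    normalized = min(distance / frame_width, 1.0)
    tick_interval = 0.05 + 0.45 * normalized   # ~0.5 s far away -> 0.05 s when close

    return direction, tick_interval


if __name__ == "__main__":
    # Hand on the left of the frame, target object near the right edge.
    print(guidance(hand_box=(100, 400, 220, 520),
                   target_box=(900, 350, 1020, 470),
                   frame_width=1280))
```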
VisionNav was a project that provided valuable learning experiences. Starting with the sensing side, our team learned how to activate and extract the data collected by the camera and LIDAR built into the iPhone from Xcode using Swift. We also experimented with depth and distance data, and used Python audio libraries to play the appropriate audio cue for each condition. We worked with the YOLO model, a real-time object detection algorithm that is vital for identifying the objects present in a given frame. We also learned how to use Gemini's API to integrate speech-to-text into our app. With all these pieces in place, we had to learn how to combine them efficiently into one application, which became VisionNav.
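For the Gemini piece, a hedged sketch of how a recorded voice command could be turned into a target label using the `google-generativeai` Python package is shown below. The model name, prompt, and audio handling are assumptions for illustration and may differ from what the app actually does.

```python
# Hedged sketch of the Gemini voice-command step. Model name, prompt, and audio
# format are assumptions; the real app may wire this up differently.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")


def command_to_target(audio_path):
    """Transcribe a recorded voice command and extract the object to find."""
    audio = genai.upload_file(audio_path)  # e.g. a short clip recorded on the phone
    prompt = (
        "The user is asking the app to locate an object. "
        "Reply with only the object's name, e.g. 'water bottle'."
    )
    response = model.generate_content([audio, prompt])
    return response.text.strip().lower()


# Example (hypothetical file): target = command_to_target("find_my_keys.m4a")
# The returned label would then be matched against the classes YOLO detects in the frame.
```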
With the current implementation of VisionNav, we have created straightforward functionality that aims to make mobility easier for visually impaired users. We plan to improve the accuracy of the audio cues by reducing latency and refining the underlying processing. We also want to bring the app to wearable hardware, such as smart glasses, with a strict focus on the app's core functionality, which could make capturing the user's view more natural while keeping costs low.