Apple developed the MM1 multimodal model for image and text interpretation
Apple researchers have created a new artificial intelligence model called MM1 that can interpret both image and text data. The company has published a paper on arXiv describing a family of multimodal large language models (MLLMs) and their test results.
Here's What We Know
According to the developers, the MM1 family of multimodal models has made significant advances in image captioning, visual question answering, and search queries by integrating text and image data. The largest models in the family have up to 30 billion parameters.
The models are trained on datasets consisting of captioned images, interleaved image-text documents, and plain text. The researchers claim that MM1 can count objects, identify them in images, and use "common sense" to provide users with useful information.
In addition, MM1 is capable of in-context learning, drawing on knowledge from the current dialogue rather than starting from scratch with each prompt. For example, when an image of a menu is uploaded, the model can then calculate the cost of drinks for a group based on the prices shown.
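As a purely illustrative sketch of what such an interaction could look like, the snippet below uses a made-up chat-style client: the MultimodalChat class, its send method, and the "mm1-30b" model name are assumptions for illustration only, since Apple has not released a public API for MM1.

```python
# Hypothetical sketch of multimodal in-context prompting. MultimodalChat,
# its methods, and the "mm1-30b" name are invented for illustration;
# Apple has not published an API for MM1.

class MultimodalChat:
    """Toy stand-in for a chat-style client of a multimodal model."""

    def __init__(self, model: str):
        self.model = model
        self.history = []  # accumulated dialogue turns (text + images)

    def send(self, text: str, image_path: str | None = None) -> str:
        # A real client would encode the image, append this turn to the
        # dialogue history, and query the model; here we only record the
        # turn to show how context accumulates across the conversation.
        self.history.append({"text": text, "image": image_path})
        return f"<model reply based on {len(self.history)} turns of context>"


chat = MultimodalChat(model="mm1-30b")  # hypothetical model name
chat.send("Here is the menu for our table.", image_path="menu.jpg")

# The follow-up question gives no prices; the model would rely on the
# menu image already present in the dialogue context.
print(chat.send("How much would two beers and a lemonade cost?"))
```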
Flashback
While large language models (LLMs) have received a great deal of press coverage, Apple has decided not to rely on third-party developments and to focus instead on building its own next-generation LLM with multimodal capabilities.
Multimodal AI combines and processes different types of input data, such as visual, audio, and textual information. This allows systems to understand complex data better and provide more accurate, context-aware interpretations than unimodal models.
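As a rough sketch of how this combination is commonly done (a generic pattern, not a description of Apple's architecture; all names and dimensions below are illustrative), image features are projected into the language model's embedding space and concatenated with text token embeddings into a single sequence:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions, chosen only for illustration.
d_model = 64           # shared embedding width
n_text_tokens = 12     # tokens from the text prompt
n_image_patches = 16   # patch embeddings from a vision encoder

# In a real system these come from a text embedding table and a vision
# encoder (e.g. a ViT); random placeholders stand in for them here.
text_embeddings = rng.normal(size=(n_text_tokens, d_model))
image_embeddings = rng.normal(size=(n_image_patches, d_model))

# A connector/adapter projects image features into the language model's
# embedding space; a single linear map stands in for it here.
adapter = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
projected_image = image_embeddings @ adapter

# Core idea: concatenate both modalities into one token sequence that a
# single transformer then processes jointly.
fused_sequence = np.concatenate([projected_image, text_embeddings], axis=0)
print(fused_sequence.shape)  # (28, 64): one sequence, two modalities
```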
Source: TechXplore