By Kushal Kafle, Senior Research Scientist
This year’s CVPR (the IEEE / CVF Computer Vision and Pattern Recognition Conference) was the largest ever, with more than 12,000 in-person attendees from more than 75 countries and a record-breaking number of paper submissions. It was an extraordinarily exciting time for the community to gather. It feels as if there are new breakthroughs and novel applications nearly every day, and many of us (including myself) were delighted to attend our first large, in-person conference since the pandemic.
Adobe was a gold-level sponsor of the event for 2024, and my Adobe Research colleagues had a strong presence at the conference. We shared over 40 papers, helped organize conference workshops, delivered keynote talks, served on expert panels, and gave workshop presentations. Adobe Research’s work is on the cutting edge of visual intelligence and carries strong momentum in all of the conference’s trending areas.
As I took in the presentations, panels, and papers at CVPR, I noticed four big trends that are shaping the world of computer vision right now.
Trend 1: Large foundation models are speeding up the pace of development—a lot
The trend of developing and leveraging large foundation models continues to have strong momentum in all areas of machine learning and AI, and computer vision is no exception. This year, a lot of research from both academia and industry focused on developing and refining huge foundation models or on using existing foundation models for downstream applications.
Due to the unprecedented ability of big models to generalize, new applications are now being developed at a record pace. The once difficult and time-consuming task of developing computer vision models for specific applications has become significantly simpler because large foundation models provide a strong starting point for building bespoke solutions and do much of the heavy lifting.
Trend 2: Multimodality means vision research is going beyond vision toward text, audio, user interactions, and more
As models get bigger, there is also an increased push towards consolidating computer vision with other modalities. This trend is not entirely new. Vision-and-language problems, such as image captioning, visual question answering, and limited forms of user interaction via referring expression recognition, have formed a distinct sub-field of computer vision for a while. However, the advent of large, openly available foundation models has opened up many more use cases for the research community in the past couple of years.
At this year’s CVPR, I saw an increased push towards creating models that enable large language models to “see” visual content. Besides just images and text, there are also efforts towards developing broader multimodal models incorporating audio, user interactions, and beyond. It’s clear that large multimodal models are going to be one of the hottest trends in the coming years.
Trend 3: Data curation and synthetic data generation have everyone’s attention, especially in industry
With ever-growing model sizes, the importance of training data has come into sharp focus. While data has always been a crucial part of developing AI models, the research community is becoming increasingly aware of its outsized effect on final model quality. This year, I saw a huge amount of interest in sourcing, curating, filtering, and annotating data to produce the biggest and cleanest training sets possible. An important part of this trend was the generation and use of synthetic data, i.e., data that is procedurally generated using various tools, rules, heuristics, and other machine learning models, and is then used to train subsequent AI models.
Understandably, the buzz was bigger among industry folks than academics, since the development of large models is mostly being carried out by industrial AI labs. Furthermore, AI solutions from industry need to power real-world products, which means the quality of the training data is crucially important. To this end, I noticed a large portion of the exhibition booths were from companies and startups focused on labeling, curating, or procuring high-quality data. As model designs become consolidated across modalities, it looks like the race to obtain the biggest and cleanest data to train these models will be a crucial part of the “secret sauce” for key players in this space.
Trend 4: New frontiers are taking computer vision researchers into unexpected territory
I went to CVPR expecting to see a lot of excitement and work around foundation models and the use and adaptation of LLMs for computer vision tasks. I was impressed by all of the progress in this area but perhaps not surprised to see the latest trends.
I was, however, pleasantly surprised by a much bigger-than-expected pool of papers, workshops, and hallway discussions about new greenfield opportunities in computer vision and AI. I observed this in two distinct ways. First, I saw a huge expansion of mainstream computer vision research as previously distinct subfields intermingled with it. The lines between, for example, computer graphics and computer vision seem to be blurring, which is leading to a much larger presence of computer graphics, 3D modeling, sensors, and medical and physics-based models than in previous years. Second, I noticed a significant share of work on building robotic, embodied, and digital agents to unlock new abilities, ranging from digital assistants to autonomous technologies. I’m eager to see which new technologies will mature and what new opportunities are uncovered by next year’s conference!
Adobe Research at CVPR
The conversations at CVPR are especially meaningful to our team because a growing number of Adobe products and offerings are powered by large foundation models, including the Firefly foundation model suite, which powers image generation, generative fill, and related features. Adobe also places high importance on the responsible use of data for training our models and is actively involved in research to better procure, curate, and synthetically generate training data. Finally, Adobe Research continues to break new ground in visual intelligence across diverse areas, including 3D vision, videos, vector graphics, audio-visual models, and creative digital assistants.
Wondering what else is happening inside Adobe Research? Check out our latest news here and learn more about our researchers here!