The incredible diversity of visual systems in the animal kingdom is a result of millions of years of coevolution between eyes and brains, adapting to process visual information efficiently in different environments. We introduce the generative design of visual intelligence (GenVI), which leverages computational methods and generative artificial intelligence to explore a vast design space of potential visual systems and cognitive capabilities. By cogenerating artificial eyes and brains that can sense, perceive, and enable interaction with the environment, GenVI enables the study of the evolutionary progression of vision in nature and the development of novel and efficient artificial visual systems. We anticipate that GenVI will provide a powerful tool for vision scientists to test hypotheses and gain new insights into the evolution of visual intelligence while also enabling engineers to create unconventional, task-specific artificial vision systems that rival their biological counterparts in terms of performance and efficiency.
Keywords: generative AI; vision sciences; evolutionary biology; computer vision; computational imaging; embodied AI
Author Disclosures: Tzofi Klinghoffer and Aaron Young contributed to this paper equally; their name order was determined by the flip of a coin. Tzofi Klinghoffer was supported by the Draper Scholars Program. Aaron Young was supported by the NSF GRFP (no. 2022339767).
Imagine a world where artificial intelligence (AI) could see with the incredible speed of a fly, detect targets from miles away like an eagle, navigate using polarized light like a bee, and possess the acute night vision of an owl, all while consuming only a fraction of the energy required by current human-engineered systems. Why have humans been unable to design artificial visual systems that rival the capabilities and efficiency found in nature? The answer to this question lies in the fundamental differences between how nature and humans approach the design of visual intelligence (VI)—systems that sense visual information and make inferences to interact with the world.
The design of biological VI in nature is driven by the coevolution of animal eyes and brains and has yielded great success in producing highly efficient visual systems. The driving factors behind the development of animal vision are evolution, natural selection, learning, and survival imperatives. Evolution guides the development process of eyes, while natural selection preserves the most advantageous features. As the eyes develop, the brain also develops by learning to process new visual stimuli. The symbiotic relationship between evolution and learning leads to the coevolution of animal sensory hardware, nervous systems, and cognitive abilities.1 This coevolution has produced a diverse range of animal eyes and brains over millions of years that efficiently perceive and interact with the world in unique ways.
Inspired by this success in biological coevolution, generative design of visual intelligence (GenVI) aims to automatically generate novel VI systems by computationally emulating the evolution of natural visual systems, as illustrated in Figure 1. While natural VI emerges through the slow process of evolution and learning over millions of years, GenVI seeks to accelerate this process by leveraging computational methods and advances in generative AI (GenAI). Doing so enables exploration of a vast design space of potential visual systems and cognitive capabilities.
The primary difference between the proposed GenVI framework and the existing GenAI approaches is that GenVI focuses on creating physically realizable visual systems, composed of artificial eyes and brains, capable of sensing, perceiving, and interacting with their environment (Figure 5). In contrast, recent advancements in GenAI have primarily focused on generating static outputs such as text,2 images,3 speech,4,5 or proteins.6
This paper is structured as follows: In section 2, we provide a conceptual overview of GenVI by (1) defining VI in more detail, (2) proposing the building blocks that enable the generation of VI, and (3) defining the generative process that enables the coevolution of eyes and brains. In section 3, we discuss the potential of GenVI to significantly impact various disciplines that study natural vision, such as vision sciences and evolutionary biology. In section 4, we discuss GenVI’s application in the engineering domain to create artificial vision systems. Lastly, in section 5, we review computational methods and advances that enable GenVI.
This paper is focused on the sensing and behavior of a visual system. Sensing defines the hardware responsible for what visual information is extracted from the scene (e.g., color, depth, edges). Behavior, following its usage in cognitive science, refers either to the action that the system takes or to the visual processing that supports a downstream task. Based on this definition of behavior, VI is always defined either within the context of an embodied agent that can act in an environment (i.e., locomotion in an environment) or a visual system that performs a perception task (e.g., motion processing or classification). Such perception tasks typically lie within the fields of computer vision and computational imaging. Defining VI in terms of sensing and behavior allows us to design diverse visual systems. For example, designing visual systems allows us to study vision science based on downstream behavior (section 3) or design visual systems based on perceptual tasks in computer vision (section 4).
In existing GenAI frameworks, the design space defines the building blocks for generating new outputs, such as the alphabet for text or RGB pixels for images. The natural design space of VI consists of biological photoreceptors that convert light into signals that are sent to the brain, and neurons (fundamental units of the brain) that process these signals.
Drawing inspiration from the biological design space, in section 2.1 we define the GenVI design space. In section 2.2, we discuss the role of simulation as a way to generate new VI designs. In section 2.3, we discuss the cogeneration of sensing and behavior as a result of our formulation.
To enable the cogeneration of artificial eyes and brains, the design space for GenVI must include sensing and learning primitives (building blocks). Inspired by natural VI, we propose a sensing primitive that measures the properties of incoming light and a learning primitive that learns from visual input.
To design the sensing primitives for artificial vision, we focus on the properties of light that can be measured at a point on the eye or sensor. We represent such a point as a single sensing building block that samples desirable properties from incoming light. We term this block an artificial photoreceptor. An artificial photoreceptor can then be designed to sample from the plenoptic function,7 which describes how light rays propagate at every point through a 3D scene. By sampling from this function, the artificial photoreceptor can measure various properties of light, such as intensity, direction, wavelength (color), and polarization at a given position in space and time. These photoreceptors can be arranged in different configurations to create visual systems with varying resolutions, fields of view, or multiple viewpoints, enabling the capture of different aspects of the plenoptic function. This ability to arrange photoreceptors is analogous to natural VI, which uses stacks of biological photoreceptors, such as compound eyes in arthropods or camera-like eyes in humans, to sample incoming light in ways best suited for the environment.
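To make the sensing primitive concrete, the following is a minimal sketch (in Python, assuming NumPy) of an artificial photoreceptor as a point sampler of the plenoptic function; the `plenoptic` callable, the property names, and the `compound_eye` helper are hypothetical placeholders rather than an established API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ArtificialPhotoreceptor:
    """A single sensing primitive that samples chosen properties of light
    (e.g., intensity, direction, wavelength, polarization) at one point."""
    position: np.ndarray        # 3D location on the eye/sensor surface
    direction: np.ndarray       # optical axis (unit vector)
    properties: tuple = ("intensity",)

    def sample(self, plenoptic, t: float) -> dict:
        # `plenoptic` is a hypothetical callable approximating the plenoptic
        # function: given a position, a direction, and a time, it returns a
        # dict of measured properties of the arriving light ray.
        ray = plenoptic(self.position, self.direction, t)
        return {p: ray[p] for p in self.properties}

# Photoreceptors can be arranged into different eye layouts, e.g., a crude
# 2D compound eye: many receptors pointing in slightly different directions.
def compound_eye(center, n=64, fov_deg=180.0):
    angles = np.linspace(-np.radians(fov_deg) / 2, np.radians(fov_deg) / 2, n)
    return [ArtificialPhotoreceptor(position=np.asarray(center, dtype=float),
                                    direction=np.array([np.cos(a), np.sin(a), 0.0]),
                                    properties=("intensity", "wavelength"))
            for a in angles]
```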
To design the learning primitives for VI, we consider the space of computational units that can perform visual processing and decision-making from sensory information captured by the artificial photoreceptors. We propose to use the artificial neuron, such as the perceptron,8,9 as a basic unit for this purpose. Artificial neurons, stacked in neural networks trained using backpropagation,10 can learn to map sensory inputs to desired behavior outputs, enabling the system to adapt based on environmental stimuli. This is analogous to how biological neurons process and learn from information captured by photoreceptors in natural VI. The complexity of the network topology and the learning of weights based on sensory feedback allows for the development of sophisticated visual behaviors. In practice, efficient network architectures, such as convolutional networks11 or transformers,12 can be employed to process and learn from the sensory information, facilitating the development of advanced artificial visual systems.
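As a toy sketch of the learning primitive, the perceptron below illustrates the basic computational unit; in practice, as noted above, one would use a deep learning framework with backpropagation and architectures such as convolutional networks or transformers. All names below are illustrative.

```python
import numpy as np

class Perceptron:
    """Minimal artificial neuron: a weighted sum of inputs plus a bias,
    passed through a sigmoid nonlinearity."""
    def __init__(self, n_inputs: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(scale=0.1, size=n_inputs)  # learnable weights
        self.b = 0.0                                   # learnable bias

    def forward(self, x: np.ndarray) -> float:
        return 1.0 / (1.0 + np.exp(-(self.w @ x + self.b)))

# Stacking neurons into layers yields a network mapping photoreceptor
# readings to a behavior output, e.g., a steering command for an agent.
readings = np.random.default_rng(1).random(64)   # 64 photoreceptor outputs
hidden = [Perceptron(64, seed=i) for i in range(8)]
steer = Perceptron(8).forward(np.array([h.forward(readings) for h in hidden]))
```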
The sensor hardware proposed by GenVI needs to be verified, and the associated neural network must learn to process visual input and output behavior through interactions with an environment, similar to how animals learn. This learning process is crucial for the development of advanced VI, as it allows the system to adapt to different environments and tasks.
Computer vision has relied heavily on the creation of large datasets—such as ImageNet13 and PASCAL,14 among others15,16—that are used to train algorithms for tasks such as classification, detection, and segmentation. However, relying on data collection as a method to generate novel designs for VI is limiting, because (1) a fixed dataset implies that the hardware and sensing components are fixed, which hinders the discovery of more advanced and adaptable VI, and (2) reliance on a human in the loop limits the rate at which novel VI designs can be proposed and evaluated.
To address these limitations, GenVI leverages physically-based rendering and simulations. Simulations allow GenVI to generate diverse datasets with various sensing configurations, enabling the exploration of a larger design space for hardware and sensing components. This addresses the issue of a fixed dataset by allowing the creation of datasets with different sensing designs. Additionally, physically-based rendering and simulations enable GenVI to generate datasets with a wide range of visual phenomena, such as different lighting conditions, materials, and geometries. This enables learning more robust and generalizable functions that map visual sensory input to the desired behavior. Recently, computational imaging and computer vision have had success incorporating physically-based renderers such as Mitsuba,17,18 Blender,19 PyRedner,20 and Unity.21
For embodied AI applications, robotic and autonomous vehicle simulations such as CARLA,22 MuJoCo,23 and Drake24 have enabled rapid evaluation and iteration of designs and algorithms.25 GenVI can also use these simulations to iterate visual system designs based on interaction with the environment. These simulations allow GenVI to propose novel designs for various embodied tasks, such as navigation, manipulation, and obstacle avoidance, or various environments or terrains.
An important characteristic of GenVI is the cogeneration of artificial eyes and brains using artificial photoreceptor primitives and artificial neurons, respectively. The cogeneration in GenVI is crucial for the development of advanced VI. By allowing the learning process to influence the choice of sensory feedback and vice versa, GenVI can discover novel and efficient VI designs that are well adapted to specific environments and tasks, mirroring the coevolution of eyes and brains in nature.26
GenVI employs a generation block (Figure 1, part b) that proposes new VI designs by stacking or modifying artificial primitives. These designs are then passed to a simulation block (Figure 1, part d), where the visual system learns its behavior by interacting with the environment. Observations from the environment are captured in simulation with the eyes generated for the agent. The performance of the system is evaluated and fed back to the selection block (Figure 1, part c) which selects designs based on their ability to achieve the desired behavior. This information is used by the generation block to propose better VI designs. The simulation block serves not only as a verification of the designs proposed by the generation block but also as part of the design itself.
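The loop described above can be summarized in pseudocode. The `design_space` and `simulator` objects below are hypothetical interfaces standing in for the generation, simulation, and selection blocks of Figure 1, not a reference implementation.

```python
def genvi_loop(design_space, simulator, n_generations=100, pop_size=32):
    """Sketch of the GenVI loop: generation -> simulation (learning) ->
    selection -> feedback to the generation block."""
    population = [design_space.sample() for _ in range(pop_size)]
    for _ in range(n_generations):
        # Simulation block: instantiate eyes + brain, learn, and evaluate.
        scored = []
        for design in population:
            agent = simulator.instantiate(design)   # build artificial eyes/brain
            agent.learn()                           # brain adapts via interaction
            scored.append((simulator.evaluate(agent), design))
        # Selection block: keep designs that best achieve the desired behavior.
        scored.sort(key=lambda s: s[0], reverse=True)
        survivors = [design for _, design in scored[:pop_size // 2]]
        # Generation block: propose new designs informed by the survivors.
        population = survivors + [design_space.mutate(d) for d in survivors]
    return scored[0][1]   # best design from the final generation
```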
Our approach aims to gain scientific insights into interesting phenomena that occur in nature. Through simulation, we can control the factors that lead to the emergence of features observed in natural VI. By manipulating the simulated environment and the evolutionary and environmental pressures, we can study the underlying mechanisms that drive the development of these features. For example, by increasing the complexity of the environment or introducing specific predator-prey dynamics, we can observe how visual systems adapt and evolve to meet the challenges posed by these factors. This ability allows us to test hypotheses about the evolutionary progression of VI and identify the key environmental and biological conditions that have shaped the diverse array of visual systems found in nature today. In this section, we explore the applications of our approach in the study of natural vision and provide perspective on how it could lead to new discoveries and insights about VI in the natural world. A roadmap for this section with specific examples is shown in Figure 2.
Today, scientists can observe the visual systems of various species, from the simplest to the most complex, and study their evolutionary progression. However, the underlying mechanisms and mutations that have diversified life into the complex behaviors and morphologies we see today are often ambiguous.27 Our approach can help scientists understand the evolutionary progression of visual systems by creating a virtual world that mirrors the natural world. With this “digital twin,” i.e., the virtual replica of the real world, a one-to-one comparison between the outcomes observed in reality and simulation (and their underlying mechanisms) can be made, as depicted in the GenVI portion of Figure 1. This tool gives scientists the freedom to make observations in reality, ask entirely novel questions, and subsequently test their hypotheses to validate the principles and conditions that led to the diversity in VI we see today (as outlined in section 3.2).
It is often posited that vision played a key role in the rapid diversification of early life, as it allowed animals to interact with their environment with greater acuity and precision.28 The evidence supporting this hypothesis comes from fossil records and comparative anatomy with modern-day species.29,30 Leveraging advances in computational hardware and simulation technologies, it is now possible to examine the driving factors and individual branches of the evolutionary tree that led to the diversity of visual systems we see today.31 Moreover, by focusing on the codependence of hardware and software in evolutionary contexts, scientists can analyze the behavioral aspects of ancient animals, something fossils rarely reveal.
Although validating evolution, as described in section 3.1, is powerful, it does not necessarily indicate why certain evolutionary outcomes occurred. In order to do so, the scientific method must be employed, which involves testing hypotheses and analyzing outcomes of causal interactions. For evolution, this process requires manipulating the environment and observing the effects to pinpoint environmental and biological factors that drive the emergence of specific aspects of an animal’s morphology. Doing so builds intuition and explainability into the evolutionary model. In lab settings, the scientific method can be applied to cellular processes because their evolution can be observed over short timescales,32 but doing so is challenging for the VI of complex animals, given their long life spans. Hence, simulation is required to study VI and introduce counterfactuals to analyze causal effects. For instance, by equipping animals with zoom-like vision components in a simulated environment, it was shown that foveated vision (i.e., areas of high and low focus in an animal’s field-of-view) effectively serves as a biologically viable solution for achieving focused vision.33 This intervention is an example of a counterfactual (zoom-like vision) that causally demonstrates when a fovea appears and when it does not, providing evidence for why animal morphologies (foveae) evolved as they did. This exploration into counterfactuals is a powerful tool exposed through GenVI for analyzing both hardware and software aspects of VI.
At the foundation of analysis through causality and counterfactuals is the ability to isolate and target specific environmental factors for which we have hypotheses. This ability is straightforward in a simulated environment, where multiple interventions can be run with and without specific variables to analyze their impact on the learned behavior of the agent or its morphology. Furthermore, the questions that can be asked using this framework differ from those that can be asked in reality: tweaking individual aspects of nature comes at the cost of scale (experiments would need to be run in a closed, lab-like environment) and precision (it is difficult to finely tweak a natural ecosystem to study a specific hypothesis).
Reasoning about the causes of various evolutionary traits is a fundamental scientific endeavor to uncover the unknown. As such, beyond strictly analyzing the outcomes we can observe, the scientific method would benefit from being able to test theories or ideas that are impossible to observe in reality. Similar to studying counterfactuals through the virtual world proposed with GenVI, simulating and drastically manipulating environmental conditions can lead to (1) a new perspective on evolution and (2) possible novel natural imaging systems and behaviors. By asking questions like “What if the asteroid didn’t wipe out the dinosaurs?” or “Would land mammals be better off with three or more eyes?” GenVI can provide insights into why animals did not evolve in certain ways and what the possible outcomes could have been.
In a similar vein to section 3.3, where impossible past scenarios are simulated, GenVI can be used to simulate the future of the evolution of VI. This methodology would yield inherently novel predictions about animal evolution, as well as design solutions for possible future environmental conditions. Then, with this framework, counterfactuals can be explored to understand whether the current conditions (i.e., the initial conditions for any future evolution) are conducive to the evolution of more intelligent visual systems. For instance, by projecting climate models, we can observe how animals would evolve to survive under those conditions and make predictions about possible population losses or traits that might be selected.
GenVI not only enables the design and study of natural vision but also the creation of new forms of artificial vision. We define artificial vision as human- or AI-engineered vision systems, consisting of both sensing hardware and data processing. Researchers and industry have long used generative approaches to create solutions in other areas of engineering, such as architecture,34 topology optimization,35 material science,36 aerospace,37 and much more.38 In contrast, GenVI aims to create new forms of VI through the cogeneration of sensing hardware and algorithms.
GenVI could significantly accelerate research and development across application areas that otherwise rely on independent, manual hardware and algorithm design, while matching or surpassing the efficiency of systems found in nature. While nature has served as inspiration for the development of new technologies,39,40 as illustrated in Figure 5b, evolution can take millions of years to converge on a solution. In contrast, human-engineered solutions, although less efficient, are developed in a fraction of the time. GenVI can bridge this gap, enabling high development speed while maintaining efficiency. This is critical across engineering applications where development can otherwise be a slow and tedious process, requiring many teams and much trial and error.
GenVI can also enable the democratization of VI. Whereas large corporations have the resources to invest in research and development to manually create and optimize VI, smaller companies and individuals often do not. As a result, tools like GenVI provide an avenue for cheap, automated design in simulation, reducing both the cost and the expertise required.
When applied to artificial vision applications, such as robotics, biomedical imaging design, and more, GenVI can enable (1) the maximization of performance while enforcing constraints on size, weight, power, and cost (SWaP-C), as well as manufacturability and sustainability, (2) the automatic discovery and exploitation of available signals or cues in the environment, and (3) the generation of new vision and imaging methods in engineering applications (Figure 3). We discuss each of these three benefits for engineering in more detail in the following sections.
In nature, SWaP-C constraints shape the morphology of animals’ bodies, vision, and brains. While higher-resolution eyes allow better detection of prey and avoidance of predators, they require more power than lower-resolution vision and necessitate larger brains to process the captured information. As a result, animals have evolved to balance SWaP-C and performance given their environment. However, this trade-off remains challenging to balance in human-engineered VI. We argue that GenVI can significantly help in balancing SWaP-C and performance. GenVI allows SWaP-C and other constraints, such as manufacturability and sustainability, to be easily incorporated into the fitness function, reward, and/or loss, depending on the search mechanism used (different search mechanisms are discussed in section 5.2). For example, by constraining the number of cameras placed on an autonomous vehicle, a novel camera rig design emerges with high perception performance (Figure 4),41 which could enable the vehicle to better act in its environment (e.g., in the CARLA simulator).42
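As one illustration of how such constraints might enter the objective, SWaP-C terms can appear as weighted penalties; the attribute names and weight values below are hypothetical.

```python
def fitness(design, task_score: float,
            weights=(1.0, 0.1, 0.1, 0.1, 0.1)) -> float:
    """Task performance traded off against SWaP-C penalties. The `design`
    object is assumed (hypothetically) to expose size, weight, power, and
    cost estimates; the weights are arbitrary illustrative values."""
    w_perf, w_size, w_weight, w_power, w_cost = weights
    return (w_perf * task_score
            - w_size * design.size
            - w_weight * design.weight
            - w_power * design.power
            - w_cost * design.cost)
```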
Signal-to-noise ratio (SNR) is a fundamental constraint in designing visual systems in challenging environments such as biomedical imaging or night vision. For such applications, VI is considered to be the sensor hardware and reconstruction algorithms instead of an embodied agent. GenVI could propose designs that address this physical constraint directly using physics-based simulations. Consider the case of microscopy, where engineers have designed visual systems that use phase shift and illumination cues for phase-contrast microscopy and light field microscopy, respectively, to overcome the SNR challenge.43 For microscopy, the desired output is high-quality image reconstruction. GenVI can iterate sensor designs and learned algorithms using a simulated environment that imposes fundamental trade-offs in SNR, resolution, or field-of-view. In addition, when SNR is low, GenVI can use other cues in the environment to increase the signal. For example, in remote sensing, faint cues (such as shadows and reflections) could provide information not otherwise visible.
Many forms of human-engineered VI are inspired by natural vision. A classic example is stereo vision, which is based on the binocular vision that enables depth sensing in humans and many other animals. While natural vision is a result of evolution and environmental constraints, we can use GenVI to generate new forms of vision. If GenVI were exposed to a task and environment never seen before in nature, could it invent new ways to jointly sense and learn? Consider the problem of non–line-of-sight imaging, which focuses on creating cameras to see around corners and through occlusions. Scientists have designed many solutions for this problem using multiple bounces of light to infer hidden geometry,44 but GenVI could be used to automatically find new approaches. For example, could GenVI create a drone that can fly at high speeds through a dense forest? To do so, the drone may need to exploit cues, such as time of flight, reflections, and more, to anticipate what is around each turn.
The ability to cogenerate hardware (i.e., eyes and brain topologies) and software (i.e., perception and behavior) can unlock both the ability to understand VI from the past (section 3) and the ability to generate new forms of VI to enable engineering systems for the future (section 4). We anticipate a new paradigm in artificial VI resulting from the cogeneration of vision sensing and behavior. By cogenerating sensing and behavior, just as is done in nature, new forms of VI can be generated that are more performant for their task and environment than otherwise possible. Existing human-engineered VI typically relies on off-the-shelf sensors to capture observations of the environment. In the era of deep learning, these observations are used for both training and inference of neural networks, which map the observations and make predictions about the environment, such as in reconstructions, detections, or actions to take in the environment. When cogenerating sensing and behavior to generate VI, whether from the past or for the future, there are two key differences from cogeneration in nature: (1) the use of a human-defined design space, and (2) the ability to use advanced techniques from optimization and GenAI to search the design space. These differences are highlighted in Figure 5, which shows the coevolution of sensing and behavior in nature contrasted with our proposed paradigm of cogeneration for creating artificial vision and studying natural vision. In the following sections, we explore these differences and corresponding methods.
Whereas the design space of natural systems is defined by the basic building blocks of biology, which we refer to as biological primitives, the design space of human-engineered systems depends on available resources, known constraints, and simulation capabilities. We propose creating design spaces that consist of both sensing components (e.g., hardware of the eyes) and behavior components (e.g., artificial neurons of the brain). The way we structure the design space informs how it can then be searched. In the next section, we discuss how a design space can be created that allows the search to be directly applied to cogenerate sensing and behavior.
We can think of the design space as a language with an alphabet (symbols) and rules that define how these symbols can be combined to create sentences. When the alphabet consists of symbols for the different hardware needed for VI, such as lenses and vision sensors, the sentences correspond to vision systems. Context-free grammars (CFGs) describe context-free languages and are commonly used to define design spaces across disciplines, ranging from drone design45 and molecular design46 to imaging system design.47 Arising from formal language theory, a CFG consists of production rules, which define how symbols in the language can be combined. A CFG G can be represented as G = (V, Σ, P, S), where V contains nonterminal symbols, Σ contains terminal symbols, P contains production rules, and S is the start symbol. A CFG can define both the components being searched and how they can be combined in a physically plausible manner. For example, a production rule might indicate that a lens can only be placed in front of a sensor, not behind it. By defining our design space with a CFG, we confine the search to a space of plausible solutions. In addition, design spaces ranging from those found in nature (e.g., genotypes and phenotypes) to those found in engineering (e.g., catalogs of parts) can be defined with CFGs. For example, in the design space for a natural system, a list of mutations to a genotype or phenotype can be defined with a simple CFG whose alphabet contains the genotypes and phenotypes and whose production rules contain the possible mutations. While a CFG is the most general form of a design space, it can be transformed into a list of parameters and mutations (e.g., for genetic algorithms48) or an action space (e.g., for reinforcement learning algorithms).
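A small, self-contained example of such a grammar follows; the symbols and production rules are toy placeholders, with the SYSTEM rule encoding the lens-before-sensor constraint mentioned above.

```python
import random

# A toy CFG G = (V, Σ, P, S) for physically plausible imaging systems.
# Nonterminals (V) expand via production rules (P); terminals (Σ) are
# hardware components; the start symbol S is "SYSTEM".
PRODUCTIONS = {
    "SYSTEM": [["OPTICS", "SENSOR"]],            # lens stack precedes the sensor
    "OPTICS": [["LENS"], ["LENS", "OPTICS"], []],
    "LENS":   [["convex_lens"], ["concave_lens"], ["polarizer"]],
    "SENSOR": [["rgb_sensor"], ["event_sensor"], ["tof_sensor"]],
}

def derive(symbol="SYSTEM", rng=random.Random(0)):
    """Randomly expand `symbol` into a sentence (a concrete design)."""
    if symbol not in PRODUCTIONS:                # terminal: a hardware component
        return [symbol]
    expansion = rng.choice(PRODUCTIONS[symbol])
    return [part for s in expansion for part in derive(s, rng)]

print(derive())  # e.g., ['convex_lens', 'polarizer', 'rgb_sensor']
```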
Rather than manually defining the design space, a data-driven approach can be adopted to generate it with either of the above structures. In recent years, applications of large language models (LLMs) have exploded across domains. Given a pre-trained LLM, prompt tuning can be used to instruct it to provide a design space in the desired format (e.g., a list of properties and mutations, CFG, etc.). In addition, multiple recent works have shown promise in fine-tuning LLMs with a limited number of examples, which could enable such models to automatically create a plausible design space that can then be searched over.
Once a design space has been defined, a search can be done over both sensing and behavior to generate new forms of artificial VI. The goal of this search can be defined depending on the hypothesis under study. For example, an evolutionary biologist may want to search the design space in order to simulate different forms of natural VI that evolved over billions of years. Conversely, an engineer may want to use a search to codesign a self-driving car’s sensor rig with its navigation policy. In the following sections, we discuss different search strategies that can be employed to cogenerate sensing and behavior.
Inspired by nature, genetic algorithms, such as evolutionary search,49 are one way to search the design space. Genetic algorithms are broadly based on the concept of natural selection from biological evolution. A population is initialized and then repeatedly mutated over time. The top-performing individuals within a mutated population are selected based on a fitness function, which serves as the overall objective for evolution. Only the fittest individuals in the population survive through future generations. After many generations, the individual with the best fitness is selected. This search strategy is complementary to design spaces defined as mutations, as discussed in section 5.1. Because genetic algorithms mimic natural selection, their use may be ideal in scientific applications where simulating evolution is of interest (e.g., understanding the evolutionary pathway of eyes). Genetic algorithms can also be applied to engineering use cases and have been used for gradient-free optimization in areas such as joint optimization of the image signal processor and algorithm.50 Genetic algorithms have also been used to codesign sensing with robotic perception tasks.51 Other nature-inspired optimization strategies beyond genetic algorithms, such as the Artificial Bee Colony algorithm,52 particle swarm optimization,53 simulated annealing,54 and more, can also be used for search, whether for scientific or engineering applications.
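For concreteness, a minimal genetic algorithm over a real-valued design vector might look like the sketch below; it is a generic illustration, not the method of any cited work.

```python
import numpy as np

def genetic_search(fitness, dim=8, pop_size=20, generations=50,
                   mutation_std=0.1, rng=np.random.default_rng(0)):
    """Minimal genetic algorithm over a real-valued design vector.
    `fitness` maps a design vector to a scalar score (higher is better)."""
    pop = rng.normal(size=(pop_size, dim))               # initialize population
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        elite = pop[np.argsort(scores)[-pop_size // 2:]]  # natural selection
        children = elite + rng.normal(scale=mutation_std, # mutation
                                      size=elite.shape)
        pop = np.vstack([elite, children])
    scores = np.array([fitness(ind) for ind in pop])
    return pop[np.argmax(scores)]                        # fittest individual

# Example: maximize a toy fitness (negative distance to a target design).
best = genetic_search(lambda v: -np.sum((v - 1.0) ** 2))
```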
While not explicitly designed to mimic biological processes found in nature, reinforcement learning (RL) offers a way to perform search that is consistent with how organisms learn and adapt to their environments. RL has been applied to many search problems, ranging from neural architecture search55 to designing imaging systems.56 By formulating the search as a Markov decision process, RL models the dynamics of state transitions over time with states, actions, and rewards. Constraints and penalties, similar to those that might be modeled in the fitness function of genetic algorithms (e.g., on size, weight, and power), can be incorporated into the reward used by the RL algorithm. RL methods commonly suffer from the exploration-versus-exploitation problem, in which finding the optimal balance between exploring new areas of the design space and exploiting areas with known high reward is challenging. Despite this challenge, RL offers a promising direction for search in high-dimensional spaces, such as those encountered in the proposed GenVI regime. In addition, RL can be used whether or not the computational environment, i.e., the simulator, is differentiable, widening its applicability.
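As an illustration of the MDP formulation, the sketch below applies tabular Q-learning to sequential design construction. The `env` interface (reset, step, actions) is hypothetical, states are assumed hashable, and a practical GenVI search would likely use deep RL rather than a table.

```python
import random
from collections import defaultdict

def q_learning_design_search(env, episodes=500, alpha=0.1, gamma=0.95,
                             eps=0.1, rng=random.Random(0)):
    """Tabular Q-learning over a hypothetical design-construction MDP:
    env.reset() -> state; env.step(a) -> (state, reward, done);
    env.actions(state) -> available actions (e.g., "add_lens", "add_sensor")."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            acts = env.actions(s)
            a = (rng.choice(acts) if rng.random() < eps        # explore
                 else max(acts, key=lambda act: Q[(s, act)]))  # exploit
            s2, r, done = env.step(a)
            best_next = max((Q[(s2, a2)] for a2 in env.actions(s2)), default=0.0)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```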
Gradient-based methods, such as stochastic gradient descent,57 not only dominate much of supervised machine learning but also can be used for search in cogeneration of sensing and behavior when generating new forms of artificial VI. Unlike nature-inspired searches, the use of gradient-based searches requires that the computational environment, i.e., the simulator, be differentiable with respect to the parameters in the design space. This differentiability becomes challenging when the design space contains discrete parameters, such as the number of eyes on an animal. Nonetheless, gradient-based approaches have become widely used in the field of deep optics, where a small number of optics parameters are optimized with respect to the performance of an algorithm on a downstream task.58 Since the parameters are continuous and the forward model is differentiable, gradient-based optimization can be applied. For a more thorough introduction to deep optics, please refer to Klinghoffer et al.59 Unlike deep optics, our proposed paradigm of GenVI focuses on a much more general design space and on maximizing the performance of an embodied agent.
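A toy deep-optics-style example follows, assuming PyTorch: a single continuous optics parameter is optimized end to end with a small decoder network against a reconstruction loss. The `render` forward model is a crude hypothetical stand-in for a differentiable simulator, not a physically accurate one.

```python
import torch

sigma = torch.tensor(1.0, requires_grad=True)       # continuous optics parameter
decoder = torch.nn.Linear(64, 64)                   # "brain" (processing) weights

def render(scene: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    # Hypothetical differentiable image formation: attenuation by sigma
    # plus sensor noise.
    return scene * torch.exp(-sigma) + 0.01 * torch.randn_like(scene)

opt = torch.optim.Adam([sigma, *decoder.parameters()], lr=1e-2)
for _ in range(200):
    scene = torch.rand(64)                          # random training scene
    measurement = render(scene, sigma)              # sensing (eye)
    recon = decoder(measurement)                    # processing (brain)
    loss = torch.mean((recon - scene) ** 2)         # downstream task loss
    opt.zero_grad(); loss.backward(); opt.step()    # end-to-end gradients
```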
Recent advances in GenAI can be leveraged to perform a search over a large design space containing both continuous and discrete parameters. In particular, optimizing discrete parameters can be challenging. Past work has shown that these parameters can be encoded into the continuous space of a learned latent code.60 Variational autoencoders, generative adversarial networks, LLMs, and other recent advances in GenAI provide different mechanisms for learning a rich latent space. Once encoded in a continuous manner, this latent space can be optimized with stochastic gradient descent. However, these methods typically require that the generative model be learned over some fixed dataset. Once learned, this model could be used to search discrete parameters by interacting with the environment.
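A sketch of this latent-space search, assuming PyTorch, a hypothetical pretrained `vae` whose decoder provides a differentiable (relaxed) mapping from latent code to design, and a hypothetical differentiable performance surrogate `evaluate`:

```python
import torch

def latent_search(vae, evaluate, latent_dim=16, steps=300, lr=1e-2):
    """Search a discrete design space through a learned continuous latent code.
    `vae.decode` and `evaluate` are hypothetical differentiable functions."""
    z = torch.zeros(latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        design = vae.decode(z)        # continuous relaxation of the design
        loss = -evaluate(design)      # maximize predicted performance
        opt.zero_grad(); loss.backward(); opt.step()
    return vae.decode(z).detach()
```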
Simulation is a critical step in GenVI. As shown in Figure 4, simulation allows generated designs to be tested. The simulator allows the agent to interact with the environment using generated hardware (e.g., eyes and sensors). The simulated data gathered through interactions is then used for learning and selection, as discussed in section 5.4. Simulation enables GenVI to operate at much faster speeds than natural evolution, since interactions can be executed in parallel on GPUs. In addition, humans can intervene and add specific scenarios to the simulation that cause agents with desired behaviors to emerge. For example, if we want to optimize autonomous vehicle performance in tunnels, we can increase the number of tunnel scenarios given to the agent in simulation. This controllability afforded by GenVI is not possible in nature, as shown in Figure 5.
While recent advances in simulation have enabled GenVI, simulation remains an open challenge. In particular, simulation fidelity limits the experiments that can be run. For example, to study the effect of climate change on the evolution of animal visual systems, the climate and its interactions with the environment have to be modeled. Even if an experiment can be run in simulation, the performance of an agent trained in simulation might not hold if it is deployed in reality. This challenge, often referred to as the sim-to-real gap, is an active area of research. Since simulation is a core component of GenVI, continued investment in simulation technology can help bring GenVI to reality.
Once hardware has been sampled via generation, data from this hardware can be rendered based on the agent’s interactions with the simulated environment. Thus, the agent can learn by interacting with the simulated world and gathering more data. The agent’s learned behavior dictates whether it is able to successfully accomplish its task. For example, in the case of autonomous driving, the task might be navigation. If the hardware is not well suited to the environment and task—whether the agent’s eyes (imaging) or its brain capacity (number of neurons)—then the learned behavior is unlikely to be effective. Thus, jointly learning behavior while searching over hardware is crucial for GenVI to discover a system that is effective. Learning directly affects which hardware is selected, since hardware that enables better learning (e.g., higher accuracy) will be selected over other hardware.
The promise of GenVI goes beyond simply generating static outputs, such as text and images, to generating novel visual systems. This can lead to scientific discoveries in the vision sciences and to efficient vision systems for engineering applications.
Today, simulation technologies have advanced to the point where they can model reality effectively, and despite the persistent sim-to-real gap, they provide a practical foundation for realizing GenVI.
GenVI has the potential to revolutionize the development of artificial vision systems and the study of natural vision. However, to fully realize the potential of GenVI, key stakeholders must collaborate and address the limitations and challenges associated with this approach.
Scientists studying biological vision can use GenVI to build simulated environments that test specific hypotheses and counterfactuals about the evolution and workings of biological vision. Simulating every detail of the biosphere is infeasible, so problems must be meticulously formulated so that they can be tractably simulated and studied.
GenVI can generate novel visual systems for engineering applications by incorporating engineering-specific constraints. Sensing and computer vision engineers should consider which design processes can be automated and also develop better simulators that minimize the sim-to-real gap, such that designs generated in simulation function reliably in reality.
GenVI-driven design can revolutionize the vision sciences and engineering; to unlock its full capabilities, researchers must apply this AI-driven design approach to traditional vision sensor design and algorithm development processes. Additionally, faster optimization techniques that generate more optimal designs are another promising focus.