The website thispersondoesnotexist.com shows off images created by StyleGAN, a kind of machine learning model trained on around 70,000 headshots from Flickr. Although the photos on the site resemble real ones, they are wholly made up: StyleGAN can generate an unlimited number of convincing faces on the fly, each unique and never to be seen again.
Under the hood, StyleGAN starts with a set of 512 numbers, known as the “latent state”, like:
[-2.0763590228145516, 1.6445363589507724, 0.5918615055274108, -0.0264190200752720, -0.2397198207971201, ...
Normally these are chosen arbitrarily, and the model will generate a unique face for every possible list. But we can also dig into what the numbers mean. Similar lists produce similar faces, making it possible to manipulate StyleGAN’s output in interesting ways. For example, we can create an animation by showing all the faces between two starting points.
We can also ask whether the generated faces have some quality – for example, smiling or not smiling, or appearing male or female – and then tell StyleGAN to generate images that are likely to be classified one way or the other. We can even tweak one trait while fixing others, like age and pose.1
Of course, the model doesn’t really know about gender (or anything else), but blindly spits out any features of the dataset that correlate with our classifications. If Flickr’s men tend to be pictured on grey days, for example, StyleGAN will act as if overhead clouds were crucial to the art of manliness. As it happens, the women in the dataset tend to have makeup and longer hair, as well as wearing different outfits, so the results reflect those conventions. Still, highlighting that skew in the data can be useful, just as when analysis of word embeddings reveals bias in language use.
The images below were created by first generating a StyleGAN face as normal, then “flipping” the latent state in an attempt to alter the classification. The multi-panelled images interpolate more ambiguous faces in between the two extremes. The animations extend this to a continuum, and they are squarely in the uncanny valley. Partly that’s because they look so surreal and unnatural when moving; yet when paused, each frame is realistic.
The method is basically the same as the one used for word embeddings in this post. StyleGAN’s latent space lets us define a geometry on faces; by subtracting one face from another we can represent the difference between the two. The average difference between faces with and without some trait defines a direction we can move along to add or remove that trait from the final image. ↩