Org Status: Active | Cloudflare: N/A | Last Audited: 2026-04-28
You want your AI assistant to have a face. Not a static icon, not a pulsing circle, but a real animated face that blinks, lip-syncs to speech, and shifts expression based on system state. Think VIKI from I, Robot, or a VTuber-style avatar that lives in your macOS menu bar. This article maps every viable path from "I want a programmable face" to "it's rendering at 60fps in a 50x60pt SwiftUI view."
What you'll learn:
- How the major face animation systems work (Live2D, SceneKit, Three.js, sprite sheets) and which ones you can actually control programmatically
- The ARKit blend shape standard and why it matters even when you're not doing face tracking
- How to map TTS output to mouth shapes in real-time (Rhubarb, OVR Visemes, Azure, Apple AVSpeechSynthesizer)
- Which cloud avatar APIs (HeyGen, Simli, D-ID) offer embeddable streaming and which are just video generators
- Why a 2D sprite-sheet approach might beat full 3D for a notch-sized display
- Production-ready code for every approach: SceneKit blend shapes, SpriteKit state machines, WKWebView + Three.js, and pure SwiftUI animation
Every AI assistant looks the same: a colored circle that pulses when it's "thinking." Siri has one. Alexa has one. ChatGPT has one. This is a missed opportunity. Humans are wired to read faces: micro-expressions, gaze direction, mouth movement. A face creates presence in a way that no amount of smooth gradient animation can match.
But building a programmable face is hard for several specific reasons:
1. Most face engines assume face tracking, not face generation. ARKit's blend shapes are designed to read your face via a TrueDepth camera, not to drive a synthetic face from code. You need to invert the pipeline.
2. Game engines are overkill. Unity and Unreal can absolutely render a talking face. But embedding a game engine runtime in a macOS notch HUD (a 50x60pt overlay window) is absurd. You need something lightweight.
3. Lip sync is a separate problem from expression. Getting the mouth to match speech requires phoneme-to-viseme mapping with frame-accurate timing. Most "avatar" products handle this internally but don't expose the control surface.
4. The display is tiny. At 50x60 points (100x120 pixels on Retina), you don't need subsurface scattering or physically-based skin shading. You need readable expression at low resolution. This changes the entire technical calculus.
Here's the concrete scenario: you're building "Jane," an AI assistant that lives in a macOS notch-area HUD. The HUD is a borderless NSPanel that overlays the MacBook screen. Jane's face occupies roughly 50x60pt in the left ear of the notch. She needs to:
- Show 4-6 discrete expression states (calm, thinking, alert, alarmed, happy, speaking)
- Lip sync to TTS audio output
- Blink and perform idle animations (subtle head movement, gaze shifts)
- Transition smoothly between states
- Render at 60fps with negligible CPU/GPU impact
- Be driven entirely by code: no camera, no tracking, no user input
// The target API: what we want to be able to do
protocol AvatarController {
func setExpression(_ expression: Expression, duration: TimeInterval)
func setViseme(_ viseme: Viseme)
func startIdleAnimation()
func stopIdleAnimation()
func blink()
func setGazeDirection(_ direction: CGPoint) // -1...1 range
}
enum Expression: String, CaseIterable {
case calm, thinking, alert, alarmed, happy, speaking, listening
}
enum Viseme: String, CaseIterable {
case sil // silence
case pp // p, b, m
case ff // f, v
case th // th
case dd // t, d
case kk // k, g
case ch // ch, j, sh
case ss // s, z
case nn // n, l
case rr // r
case aa // a
case ee // e
case ih // i
case oh // o
case ou // u
}
If you get this right, your assistant goes from a colored dot to a character with presence. Users will anthropomorphize it. They'll say "she looks worried" or "she's thinking about it." That's the goal.
Blend Shapes (Morph Targets)
A blend shape is a named deformation of a base mesh. You have a neutral face mesh, and a set of target meshes that represent extremes: mouth fully open, left eyebrow fully raised, eyes fully closed. Each blend shape has a weight from 0.0 to 1.0. At runtime, you interpolate between the neutral mesh and the target meshes based on weights.
// The blend shape concept as a type
interface BlendShapeState {
// Mouth shapes (visemes)
jawOpen: number; // 0.0 - 1.0
mouthFunnel: number;
mouthPucker: number;
mouthLeft: number;
mouthRight: number;
mouthSmileLeft: number;
mouthSmileRight: number;
mouthFrownLeft: number;
mouthFrownRight: number;
// Eyes
eyeBlinkLeft: number;
eyeBlinkRight: number;
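eyeLookUpLeft: number;
eyeLookUpRight: number; // right-eye fields are needed by the presets below
eyeLookDownLeft: number;
eyeLookDownRight: number;
eyeLookInLeft: number;
eyeLookInRight: number;
eyeLookOutLeft: number;
eyeLookOutRight: number;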
// Brows
browDownLeft: number;
browDownRight: number;
browInnerUp: number;
browOuterUpLeft: number;
browOuterUpRight: number;
}
// An expression is just a preset blend shape configuration
const EXPRESSIONS: Record<string, Partial<BlendShapeState>> = {
calm: {
mouthSmileLeft: 0.1,
mouthSmileRight: 0.1,
eyeBlinkLeft: 0.05,
eyeBlinkRight: 0.05,
},
thinking: {
browInnerUp: 0.3,
eyeLookUpLeft: 0.2,
eyeLookUpRight: 0.2,
mouthPucker: 0.15,
},
alarmed: {
eyeBlinkLeft: 0.0,
eyeBlinkRight: 0.0,
browInnerUp: 0.7,
browOuterUpLeft: 0.5,
browOuterUpRight: 0.5,
jawOpen: 0.2,
mouthFrownLeft: 0.3,
mouthFrownRight: 0.3,
},
};
Key insight: ARKit defines 52 blend shape locations. This has become the de facto standard, not because ARKit is the only system, but because many major avatar platforms (Ready Player Me, VRChat, MetaHuman) support ARKit-compatible blend shapes. If your face model uses ARKit naming, it will work with most animation pipelines with little or no remapping.
The ARKit blend shape set is documented at ARFaceAnchor.BlendShapeLocation. The critical thing to understand is that these values are just numbers. You don't need a TrueDepth camera to use them. You can set jawOpen = 0.6 from code and render the result in SceneKit.
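To make the interpolation concrete, here is a minimal sketch of what a renderer does with those weights. It assumes additive morph targets (each target stored as a full copy of the mesh) and flat `[x, y, z, ...]` vertex arrays. The names `applyBlendShapes` and `MorphTarget`, and the toy one-vertex "mesh", are illustrative and not taken from ARKit or SceneKit.

```typescript
// Illustrative types: a mesh is a flat list of vertex coordinates [x0, y0, z0, x1, ...].
type Vertices = number[];

interface MorphTarget {
  name: string;       // e.g. "jawOpen" (ARKit-style naming)
  vertices: Vertices; // same length as the base mesh
}

// Additive blending: result = base + sum_i weight_i * (target_i - base).
function applyBlendShapes(
  base: Vertices,
  targets: MorphTarget[],
  weights: Partial<Record<string, number>>,
): Vertices {
  const result = base.slice();
  for (const target of targets) {
    // Missing weights count as 0; clamp to the 0.0 - 1.0 range used in this article.
    const w = Math.min(1, Math.max(0, weights[target.name] ?? 0));
    if (w === 0) continue;
    for (let i = 0; i < base.length; i++) {
      result[i] += w * (target.vertices[i] - base[i]);
    }
  }
  return result;
}

// Toy example: a single "chin" vertex that jawOpen pulls down by 2 units on the y axis.
const neutral: Vertices = [0, 0, 0];
const jawOpenTarget: MorphTarget = { name: "jawOpen", vertices: [0, -2, 0] };

const posed = applyBlendShapes(neutral, [jawOpenTarget], { jawOpen: 0.6 });
console.log(posed); // the y coordinate moves 60% of the way toward -2
```

A real engine runs the same sum on the GPU for thousands of vertices, but the control surface is identical: a dictionary of names and weights.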
Visemes
A viseme is the visual representation of a phoneme: the mouth shape that corresponds to a speech sound. The most widely used standard is the Oculus OVR LipSync set of 15 visemes:
// The OVR viseme set: 15 mouth shapes cover all English speech
interface VisemeSet {
sil: BlendShapeState; // Silence / neutral
PP: BlendShapeState; // p, b, m (lips pressed together)
FF: BlendShapeState; // f, v (lower lip to upper teeth)
TH: BlendShapeState; // th (tongue between teeth)
DD: BlendShapeState; // t, d (tongue to palate)
kk: BlendShapeState; // k, g (back of tongue up)
CH: BlendShapeState; // ch, j, sh (wide mouth)
SS: BlendShapeState; // s, z (teeth together, lips slightly apart)
nn: BlendShapeState; // n, l (tongue to palate, mouth relaxed)
RR: BlendShapeState; // r (lips slightly rounded)
aa: BlendShapeState; // a (wide open)
E: BlendShapeState; // e (wide, mid-height)
ih: BlendShapeState; // i (narrow, high)
oh: BlendShapeState; // o (rounded, mid-height)
ou: BlendShapeState; // u (rounded, small)
}
// Mapping visemes to ARKit blend shapes
const VISEME_TO_BLENDSHAPE: Record<string, Partial<BlendShapeState>> = {
sil: { jawOpen: 0.0, mouthFunnel: 0.0, mouthPucker: 0.0 },
PP: { jawOpen: 0.0, mouthPucker: 0.3, mouthFunnel: 0.0 },
FF: { jawOpen: 0.05, mouthFunnel: 0.2, mouthSmileLeft: 0.1 },
TH: { jawOpen: 0.1, mouthFunnel: 0.1 },
DD: { jawOpen: 0.15, mouthSmileLeft: 0.05, mouthSmileRight: 0.05 },
kk: { jawOpen: 0.2, mouthFunnel: 0.1 },
CH: { jawOpen: 0.15, mouthSmileLeft: 0.2, mouthSmileRight: 0.2 },
SS: { jawOpen: 0.05, mouthSmileLeft: 0.15, mouthSmileRight: 0.15 },
nn: { jawOpen: 0.1, mouthSmileLeft: 0.05, mouthSmileRight: 0.05 },
RR: { jawOpen: 0.15, mouthPucker: 0.2, mouthFunnel: 0.15 },
aa: { jawOpen: 0.5, mouthSmileLeft: 0.0, mouthSmileRight: 0.0 },
E: { jawOpen: 0.3, mouthSmileLeft: 0.2, mouthSmileRight: 0.2 },
ih: { jawOpen: 0.15, mouthSmileLeft: 0.3, mouthSmileRight: 0.3 },
oh: { jawOpen: 0.35, mouthPucker: 0.15, mouthFunnel: 0.25 },
ou: { jawOpen: 0.2, mouthPucker: 0.35, mouthFunnel: 0.3 },
};
Key insight: 15 visemes is a practical sweet spot. With fewer, the mouth looks robotic. With more, adjacent shapes become hard to tell apart and to blend cleanly at real-time speeds. The OVR set is largely language-agnostic: the same 15 shapes cover English, Japanese, French, and most other languages reasonably well because visemes map to mouth shapes, not linguistic units.
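As a rough sketch of how timed lip-sync data might drive these shapes: assume the TTS engine or an offline tool yields a list of cues with start times in seconds. The `VisemeCue` shape below is hypothetical, not the output format of any specific tool. Sampling at playback time `t` finds the active cue and crossfades into it over a short window so the mouth does not snap between shapes.

```typescript
type MouthShape = Record<string, number>; // blend shape name -> weight

interface VisemeCue {
  start: number;  // seconds from the start of the audio
  viseme: string; // key into the viseme table, e.g. "aa"
}

// A small subset of the viseme table above, enough for the example.
const MOUTH_PRESETS: Record<string, MouthShape> = {
  sil: { jawOpen: 0.0, mouthPucker: 0.0 },
  PP: { jawOpen: 0.0, mouthPucker: 0.3 },
  aa: { jawOpen: 0.5, mouthPucker: 0.0 },
};

function lerp(a: number, b: number, t: number): number {
  return a + (b - a) * t;
}

// Returns the mouth shape at time t, blending from the previous cue into the
// current one during the first `fade` seconds of the current cue.
function sampleVisemes(cues: VisemeCue[], t: number, fade = 0.06): MouthShape {
  const silence = MOUTH_PRESETS["sil"];
  // Index of the last cue that has started at or before t (-1 if none has).
  let idx = -1;
  for (let i = 0; i < cues.length; i++) {
    if (cues[i].start <= t) idx = i;
  }
  if (idx < 0) return { ...silence };

  const current = MOUTH_PRESETS[cues[idx].viseme] ?? silence;
  const previous = idx > 0 ? (MOUTH_PRESETS[cues[idx - 1].viseme] ?? silence) : silence;
  const k = fade > 0 ? Math.min(1, (t - cues[idx].start) / fade) : 1;

  const out: MouthShape = {};
  for (const key of [...Object.keys(previous), ...Object.keys(current)]) {
    out[key] = lerp(previous[key] ?? 0, current[key] ?? 0, k);
  }
  return out;
}

const cues: VisemeCue[] = [
  { start: 0.0, viseme: "PP" },
  { start: 0.2, viseme: "aa" },
  { start: 0.5, viseme: "sil" },
];
console.log(sampleVisemes(cues, 0.23)); // halfway through the PP -> aa fade
```

Calling `sampleVisemes` once per rendered frame with the audio player's current time keeps the mouth locked to playback even if frames are dropped, because the result depends only on `t`.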
Expression State Machine
Expressions aren't instantaneous; you transition between them. A state machine manages this:
interface ExpressionTransition {
from: string;
to: string;
duration: number; // seconds
easing: EasingFunction;
blendShapeDeltas: Partial<BlendShapeState>;
}
type EasingFunction = 'linear' | 'easeIn' | 'easeOut' | 'easeInOut' | 'spring';
// Expression layers: allow lip sync and expression to coexist
interface AnimationLayer {
name: string;
weight: number;
blendShapes: Partial<BlendShapeState>;
priority: number; // higher-priority layers are applied later and override lower ones
}
// Layer system: expression + lip sync + idle can all run simultaneously
// Layer 0: Idle animation (blinks, micro-movements)
// Layer 1: Expression state (calm, thinking, alarmed)
// Layer 2: Lip sync (viseme animation; overrides mouth-related shapes only)
// Layer 3: Overrides (forced blink, gaze direction)
function composeLayers(layers: AnimationLayer[]): BlendShapeState {
// Copy before sorting so the caller's array is not mutated
const sorted = [...layers].sort((a, b) => a.priority - b.priority);
const result: Partial<BlendShapeState> = {};
for (const layer of sorted) {
for (const [key, value] of Object.entries(layer.blendShapes)) {
const existing = result[key as keyof BlendShapeState] ?? 0;
// Blend based on layer weight
result[key as keyof BlendShapeState] =
existing * (1 - layer.weight) + (value as number) * layer.weight;
}
}
return result as BlendShapeState;
}
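The `ExpressionTransition` type names an easing, but the snippet above does not show how a transition is evaluated over time. A minimal sketch follows, assuming simple quadratic easing curves and omitting `'spring'`, which needs velocity state. `transitionPose` and the `Pose` alias are illustrative names, not part of any framework.

```typescript
type Easing = "linear" | "easeIn" | "easeOut" | "easeInOut";
type Pose = Partial<Record<string, number>>; // blend shape name -> weight

// Quadratic easing curves mapping raw progress 0..1 to eased progress 0..1.
const EASING: Record<Easing, (t: number) => number> = {
  linear: (t) => t,
  easeIn: (t) => t * t,
  easeOut: (t) => 1 - (1 - t) * (1 - t),
  easeInOut: (t) => (t < 0.5 ? 2 * t * t : 1 - 2 * (1 - t) * (1 - t)),
};

// Pose at `elapsed` seconds into a transition that lasts `duration` seconds.
function transitionPose(
  from: Pose,
  to: Pose,
  elapsed: number,
  duration: number,
  easing: Easing,
): Pose {
  const raw = duration > 0 ? elapsed / duration : 1;
  const t = EASING[easing](Math.min(1, Math.max(0, raw)));
  const out: Pose = {};
  for (const key of [...Object.keys(from), ...Object.keys(to)]) {
    const a = from[key] ?? 0; // shapes missing from a preset rest at 0
    const b = to[key] ?? 0;
    out[key] = a + (b - a) * t;
  }
  return out;
}

const calm: Pose = { mouthSmileLeft: 0.1, mouthSmileRight: 0.1 };
const alarmedPose: Pose = { browInnerUp: 0.7, jawOpen: 0.2 };

// Halfway through a 0.3s calm -> alarmed transition.
console.log(transitionPose(calm, alarmedPose, 0.15, 0.3, "easeInOut"));
```

The output of `transitionPose` would typically become the `blendShapes` of the expression layer passed to `composeLayers`, so transitions and lip sync stay independent.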
Key insight: Never apply lip sync and expression on the same blend shapes without a layering system. If "alarmed" sets `jawOpen: 0.2` and the current viseme sets `jawOpen: 0.5`, you need a rule for resolving the conflict. The standard approach is to let lip sync override mouth shapes while expression controls brows and eyes.
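One way to encode that rule is an ownership mask: lip sync owns a fixed set of mouth shapes while speech is active, and the expression layer owns everything else. The sketch below is an assumption about how you might structure it (the `MOUTH_SHAPES` set and `resolvePose` name are illustrative), not the only valid policy.

```typescript
type Pose = Partial<Record<string, number>>; // blend shape name -> weight

// Shapes that lip sync is allowed to take over (illustrative subset).
const MOUTH_SHAPES = new Set([
  "jawOpen", "mouthFunnel", "mouthPucker", "mouthSmileLeft", "mouthSmileRight",
]);

// While speaking, mouth shapes come only from the viseme and everything else
// comes from the expression. When silent, the expression controls the whole face.
function resolvePose(expression: Pose, viseme: Pose, speaking: boolean): Pose {
  const out: Pose = {};
  for (const [name, weight] of Object.entries(expression)) {
    if (speaking && MOUTH_SHAPES.has(name)) continue; // lip sync owns this shape
    out[name] = weight;
  }
  if (speaking) {
    for (const name of Array.from(MOUTH_SHAPES)) {
      out[name] = viseme[name] ?? 0; // unspecified mouth shapes relax to 0
    }
  }
  return out;
}

const alarmed: Pose = { browInnerUp: 0.7, jawOpen: 0.2 };
const visemeAa: Pose = { jawOpen: 0.5 };

console.log(resolvePose(alarmed, visemeAa, true)["jawOpen"]);     // 0.5 (viseme wins)
console.log(resolvePose(alarmed, visemeAa, true)["browInnerUp"]); // 0.7 (expression kept)
console.log(resolvePose(alarmed, visemeAa, false)["jawOpen"]);    // 0.2 (silent, expression wins)
```

A softer variant would blend the two mouth values by a speech-activity weight instead of switching outright, which avoids a visible pop when speech starts or stops.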
Before diving into specific engines, you need to make a fundamental choice. At 50x60pt (100x120 pixels on Retina), do you even need 3D?
Option A: Full 3D (SceneKit / RealityKit / Metal)
A 3D face mesh with blend shapes, lit and rendered in real-time.
Pros:
- Smooth interpolation between any expression
- Continuous viseme blending (not discrete frames)
- Head rotation, lighting changes, particle effects
- Looks sophisticated even at small sizes, since 3D depth cues read well
Cons:
- Requires a 3D model with properly weighted blend shapes
- GPU overhead (small but nonzero)
- More complex pipeline β model authoring, rigging, shader setup
Option B: 2D Sprite Sheets (SpriteKit / SwiftUI Animation)
Pre-rendered frames for each expression and viseme, played as sprite animations.
Pros:
- Negligible GPU overhead: it's just texture swapping
- Perfect quality at target size (pre-rendered at exact resolution)
- Dead simple implementation
- Artist-friendly: any illustrator can make the frames
Cons:
- Combinatorial explosion: 7 expressions x 15 visemes x 3 blink states = 315 frames minimum
- Transitions between expressions are stepped, not smooth
- No continuous head rotation or gaze direction
- Adding a new expression means drawing more frames
Option C: Hybrid (2D layers with programmatic composition)
Separate layers for face base, eyes, mouth, brows β each independently animated.
Pros:
- Reduces combinatorial explosion (per-layer asset counts add up instead of multiplying)
- Smooth animation via layer-level transforms
- Good quality at small sizes
- Moderate complexity
Cons:
- Parallax and depth cues are limited
- Layer registration requires precision
- Still need pre-rendered assets per layer
Option D: WebView-based (WKWebView + Three.js / TalkingHead)
A web-based renderer embedded in a native macOS app via WKWebView.
Pros:
- Enormous ecosystem of Three.js face animation tools
- Ready Player Me avatars work out of the box
- TalkingHead library provides complete lip sync pipeline
- Cross-platform compatible
Cons:
- WKWebView has overhead (separate process, IPC)
- WebGL performance in WKWebView is inconsistent on macOS
- Bridge between Swift and JavaScript adds latency
- Memory footprint is much larger than native approaches
Recommendation for the Jane Use Case
For a 50x60pt notch display, Option C (Hybrid 2D layers) is the best starting point, with Option A (SceneKit) as the upgrade path. Here's why:
At this size, 3D depth cues are barely perceptible. What matters is readable expression. A well-designed set of 2D layers (stylized, high-contrast, designed for the target resolution) will read better than a realistic 3D face shrunk to 100x120 pixels. The VIKI aesthetic (translucent, holographic, stylized) actually plays to 2D's strengths.
If you later want the avatar to appear in a larger context (a floating panel, a full-screen takeover), that's when you upgrade to SceneKit with blend shapes.
SceneKit with blend shapes is the gold standard for native macOS 3D face rendering. SceneKit is Apple's high-level 3D framework, built on Metal, with first-class SwiftUI integration.
Setup: Loading a Face Model with Morph Targets
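You need a 3D model with blend shape targets in a format SceneKit can load (USDZ, DAE, or SCN). SceneKit has no built-in glTF importer, so a GLTF/GLB file has to be converted first or loaded through a third-party loader. The model should have ARKit-compatible blend shape names. You can create one in Blender using the ARKit Blendshape Helper addon, or use a Ready Player Me avatar exported as GLB and then converted.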
import SceneKit
import SwiftUI
class FaceSceneController: ObservableObject {
let scene = SCNScene()
private var faceNode: SCNNode?
private var morpher: SCNMorpher?
// Blend shape target indices, mapped from ARKit names
private var blendShapeIndices: [String: Int] = [:]
// Animation layers
private var expressionLayer: [String: CGFloat] = [:]
private var visemeLayer: [String: CGFloat] = [:]
private var idleLayer: [String: CGFloat] = [:]
// Idle animation state
private var idleTimer: Timer?
private var blinkTimer: Timer?
private var nextBlinkTime: TimeInterval = 0
func loadFaceModel(named modelName: String) {
guard let modelScene = SCNScene(named: modelName) else {
fatalError("Could not load face model: \(modelName)")
}
// Find the face mesh node (assumes it's named "Face" or "Head")
guard let face = modelScene.rootNode.childNode(
withName: "Face", recursively: true
) else {
fatalError("No 'Face' node found in model")
}
faceNode = face
scene.rootNode.addChildNode(face)
// Access the morpher (blend shape controller)
guard let m = face.morpher else {
fatalError("Face node has no morpher; ensure the model has blend shapes")
}
morpher = m
m.calculationMode = .additive
// Build index map from target names
for (index, target) in m.targets.enumerated() {
if let name = target.name {
blendShapeIndices[name] = index
}
}
setupCamera()
setupLighting()
}
private func setupCamera() {
let camera = SCNCamera()
camera.fieldOfView = 30
camera.zNear = 0.1
camera.zFar = 100
let cameraNode = SCNNode()
cameraNode.camera = camera
cameraNode.position = SCNVector3(0, 0, 0.5) // Close-up on face
cameraNode.look(at: SCNVector3Zero)
scene.rootNode.addChildNode(cameraNode)
}
private func setupLighting() {
// Soft ambient + directional for readable features at small sizes
let ambient = SCNLight()
ambient.type = .ambient
ambient.color = NSColor(white: 0.4, alpha: 1)
let ambientNode = SCNNode()
ambientNode.light = ambient
scene.rootNode.addChildNode(ambientNode)
let directional = SCNLight()
directional.type = .directional
directional.color = NSColor(white: 0.8, alpha: 1)
directional.castsShadow = false
let dirNode = SCNNode()
dirNode.light = directional
dirNode.eulerAngles = SCNVector3(-Float.pi / 4, Float.pi / 6, 0)
scene.rootNode.addChildNode(dirNode)
}
// MARK: - Blend Shape Control
func setBlendShape(_ name: String, weight: CGFloat, animated: Bool = true) {
guard let index = blendShapeIndices[name] else { return }
if animated {
SCNTransaction.begin()
SCNTransaction.animationDuration = 0.1
morpher?.setWeight(weight, forTargetAt: index)
SCNTransaction.commit()
} else {
morpher?.setWeight(weight, forTargetAt: index)
}
}
func setExpression(_ expression: Expression, duration: TimeInterval = 0.3) {
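// blendShapeTargets, expressionShapeNames, and mouthShapeNames are assumed to be
// defined in an extension on Expression (name-to-weight presets, not shown here)
let targets = expression.blendShapeTargets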
SCNTransaction.begin()
SCNTransaction.animationDuration = duration
SCNTransaction.animationTimingFunction =
CAMediaTimingFunction(name: .easeInEaseOut)
// Reset all expression-controlled shapes
for name in Expression.expressionShapeNames {
if let index = blendShapeIndices[name] {
let visemeValue = visemeLayer[name] ?? 0
let targetValue = targets[name] ?? 0
// Mouth shapes take the larger of the expression and current viseme weights;
// a strict lip-sync override would use the viseme value alone
if Expression.mouthShapeNames.contains(name) {
morpher?.setWeight(max(targetValue, visemeValue), forTargetAt: index)
} else {
morpher?.setWeight(targetValue, forTargetAt: index)
}
}
}
SCNTransaction.commit()
expressionLayer = targets
}
func setViseme(_ viseme: Viseme, weight: CGFloat = 1.0) {
let targets = viseme.blendShapeTargets
// Visemes update immediately, with no animation duration
// The calling code handles interpolation timing
for (name, value) in targets {
guard let index = blendShapeIndices[name] else { continue }
let scaledValue = value * weight
morpher?.setWeight(scaledValue, forTargetAt: index)
visemeLayer[name] = scaledValue
}
}
// MARK: - Idle Animation
func startIdleAnimation() {
// Poll for blinks every 0.1s; scheduleNextBlink() spaces them 2-6 seconds apart
blinkTimer = Timer.scheduledTimer(withTimeInterval: 0.1, repeats: true) {
[weak self] _ in
self?.updateBlink()
}
scheduleNextBlink()
// Micro-movement timer: subtle head sway
idleTimer = Timer.scheduledTimer(withTimeInterval: 1.0 / 30.0, repeats: true) {
[weak self] _ in
self?.updateIdleMovement()
}
}
func stopIdleAnimation() {
blinkTimer?.invalidate()
idleTimer?.invalidate()
blinkTimer = nil
idleTimer = nil
}
private func scheduleNextBlink() {
nextBlinkTime = CACurrentMediaTime() + Double.random(in: 2.0...6.0)
}
private func updateBlink() {
let now = CACurrentMediaTime()
guard now >= nextBlinkTime else { return }
// Blink animation: close over 0.05s, hold 0.05s, open over 0.1s
SCNTransaction.begin()
SCNTransaction.animationDuration = 0.05
setBlendShape("eyeBlinkLeft", weight: 1.0, animated: false)
setBlendShape("eyeBlinkRight", weight: 1.0, animated: false)
SCNTransaction.commit()
DispatchQueue.main.asyncAfter(deadline: .now() + 0.1) { [weak self] in
SCNTransaction.begin()
SCNTransaction.animationDuration = 0.1
self?.setBlendShape("eyeBlinkLeft", weight: 0.0, animated: false)
self?.setBlendShape("eyeBlinkRight", weight: 0.0, animated: false)
SCNTransaction.commit()
}
scheduleNextBlink()
}
private func updateIdleMovement() {
let time = CACurrentMediaTime()
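// Two slow sine waves at different rates give a subtle, slowly varying head sway
let yaw = sin(time * 0.3) * 0.02
let pitch = sin(time * 0.5 + 1.0) * 0.01
// SCNVector3 components are CGFloat on macOS, so convert explicitly
faceNode?.eulerAngles.y = CGFloat(yaw)
faceNode?.eulerAngles.x = CGFloat(pitch)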
}
func setGazeDirection(_ direction: CGPoint) {
// Map -1...1 to blend shape weights
let lookRight = max(0, direction.x)
let lookLeft = max(0, -direction.x)
let lookUp = max(0, direction.y)
let lookDown = max(0, -direction.y)
setBlendShape("eyeLookOutLeft", weight: lookLeft * 0.5)
setBlendShape("eyeLookInRight", weight: lookLeft * 0.5)
setBlendShape("eyeLookInLeft", weight: lookRight * 0.5)
setBlendShape("eyeLookOutRight", weight: lookRight * 0.5)
setBlendShape("eyeLookUpLeft", weight: lookUp * 0.3)
setBlendShape("eyeLookUpRight", weight: lookUp * 0.3)
setBlendShape("eyeLookDownLeft", weight: lookDown * 0.3)
setBlendShape("eyeLookDownRight", weight: lookDown * 0.3)
}
}
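The blink timing in `updateBlink()` (close over 0.05s, hold 0.05s, open over 0.1s) can also be expressed as a pure function of time, which is easier to test and works with any renderer. This is an engine-agnostic sketch in TypeScript, like the concept snippets earlier. `blinkWeight` and `nextBlinkDelay` are illustrative names, and the 2-6 second interval mirrors `scheduleNextBlink()` above.

```typescript
// Phase durations in seconds, matching the comment in updateBlink().
const BLINK_CLOSE_S = 0.05;
const BLINK_HOLD_S = 0.05;
const BLINK_OPEN_S = 0.1;

// Eyelid weight (0 = open, 1 = closed) at `elapsed` seconds after a blink starts.
function blinkWeight(elapsed: number): number {
  if (elapsed < 0) return 0;
  if (elapsed < BLINK_CLOSE_S) return elapsed / BLINK_CLOSE_S; // closing ramp
  if (elapsed < BLINK_CLOSE_S + BLINK_HOLD_S) return 1;        // held shut
  if (elapsed < BLINK_CLOSE_S + BLINK_HOLD_S + BLINK_OPEN_S) {
    // opening ramp back down to 0
    return 1 - (elapsed - BLINK_CLOSE_S - BLINK_HOLD_S) / BLINK_OPEN_S;
  }
  return 0; // blink finished
}

// Random gap before the next blink. `random` is injectable so tests can be deterministic.
function nextBlinkDelay(random: () => number = Math.random): number {
  return 2 + random() * 4; // uniform in [2, 6)
}

console.log(blinkWeight(0.025)); // halfway through closing
console.log(blinkWeight(0.075)); // held shut
console.log(blinkWeight(0.5));   // open again
```

Feeding the returned weight into both `eyeBlinkLeft` and `eyeBlinkRight` each frame reproduces the same blink without nested transactions or delayed dispatch.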
Embedding in SwiftUI
import SwiftUI
import SceneKit
struct AvatarSceneView: NSViewRepresentable {
@ObservedObject var controller: FaceSceneController
func makeNSView(context: Context) -> SCNView {
let scnView = SCNView()
scnView.scene = controller.scene
scnView.backgroundColor = .clear
scnView.allowsCameraControl = false
// At 50x60pt, antialiasing is hard to notice and 30fps is sufficient for expressions
scnView.antialiasingMode = .none
scnView.preferredFramesPerSecond = 30
scnView.isJitteringEnabled = false
return scnView
}
func updateNSView(_ nsView: SCNView, context: Context) {
// Scene updates happen through the controller, not through SwiftUI diffing
}
}
// Usage in the HUD
struct AvatarZone: View {
@StateObject private var faceController = FaceSceneController()
var body: some View {
AvatarSceneView(controller: faceController)
.frame(maxWidth: .infinity, maxHeight: .infinity)
.clipShape(UnevenRoundedRectangle(
topLeadingRadius: 12,
bottomLeadingRadius: 12,
bottomTrailingRadius: 0,
topTrailingRadius: 0
))
.onAppear {
faceController.loadFaceModel(named: "jane_face.usdz")
faceController.startIdleAnimation()
faceController.setExpression(.calm)
}
}
}
Key insight: At 50x60pt, turn off antialiasing (`antialiasingMode = .none`) and drop to 30fps. The visual difference is hard to notice at this size, but the GPU savings are real. You can always render at higher quality when the avatar appears in a larger context.
For the absolute lightest-weight approach, use pre-rendered sprite frames driven by a state machine. This is ideal for the Jane notch HUD.
import SpriteKit
import SwiftUI
// MARK: - Avatar Sprite Sheet Definition
struct AvatarSpriteSheet {
// Each expression has a set of mouth frames (one per viseme)
// and eye states (open, half, closed)
struct ExpressionSet {
let baseFace: String // Asset name for the base face
let visemeFrames: [Viseme: String] // Asset name per viseme
let eyeOpen: String
let eyeHalf: String
let eyeClosed: String
let browTexture: String // Expression-specific brow
}
static let expressions: [Expression: ExpressionSet] = [
.calm: ExpressionSet(
baseFace: "face_calm",
visemeFrames: Viseme.allCases.reduce(into: [:]) { $0[$1] = "mouth_calm_\($1.rawValue)" },
eyeOpen: "eyes_calm_open",
eyeHalf: "eyes_calm_half",
eyeClosed: "eyes_calm_closed",
browTexture: "brow_calm"
),
.thinking: ExpressionSet(
baseFace: "face_thinking",
visemeFrames: Viseme.allCases.reduce(into: [:]) { $0[$1] = "mouth_thinking_\($1.rawValue)" },
eyeOpen: "eyes_thinking_open",
eyeHalf: "eyes_thinking_half",
eyeClosed: "eyes_thinking_closed",
browTexture: "brow_thinking"
),
.alarmed: ExpressionSet(
baseFace: "face_alarmed",
visemeFrames: Viseme.allCases.reduce(into: [:]) { $0[$1] = "mouth_alarmed_\($1.rawValue)" },
eyeOpen: "eyes_alarmed_open",
eyeHalf: "eyes_alarmed_half",
eyeClosed: "eyes_alarmed_closed",
browTexture: "brow_alarmed"
),
// ... more expressions
]
}
// MARK: - SpriteKit Avatar Scene
class AvatarSpriteScene: SKScene {
// Sprite layers (back to front)
private let faceSprite = SKSpriteNode()
private let mouthSprite = SKSpriteNode()
private let eyesSprite = SKSpriteNode()
private let browSprite = SKSpriteNode()
private let overlaySprite = SKSpriteNode() // holographic effects
// State
private var currentExpression: Expression = .calm
private var currentViseme: Viseme = .sil
private var isBlinking = false
private var nextBlinkTime: TimeInterval = 0
private var gazeOffset: CGPoint = .zero
override func didMove(to view: SKView) {
backgroundColor = .clear
scaleMode = .aspectFill
// Layer setup: all centered
let layers: [SKSpriteNode] = [faceSprite, mouthSprite, eyesSprite, browSprite, overlaySprite]
for (index, layer) in layers.enumerated() {
layer.position = CGPoint(x: size.width / 2, y: size.height / 2)
layer.zPosition = CGFloat(index)
layer.size = size
addChild(layer)
}
setExpression(.calm)
scheduleNextBlink()
}
// MARK: - Expression Control
func setExpression(_ expression: Expression, animated: Bool = true) {
guard let set = AvatarSpriteSheet.expressions[expression] else { return }
currentExpression = expression
let update = {
self.faceSprite.texture = SKTexture(imageNamed: set.baseFace)
self.eyesSprite.texture = SKTexture(imageNamed: set.eyeOpen)
self.browSprite.texture = SKTexture(imageNamed: set.browTexture)
self.updateMouthForCurrentViseme()
}
if animated {
// Dip the alpha, swap textures once at the low point, then restore the alpha
let fadeOut = SKAction.fadeAlpha(to: 0.7, duration: 0.1)
let fadeIn = SKAction.fadeAlpha(to: 1.0, duration: 0.15)
let fade = SKAction.sequence([fadeOut, fadeIn])
for sprite in [faceSprite, eyesSprite, browSprite] { sprite.run(fade) }
// Run the texture update a single time instead of once per sprite
run(SKAction.sequence([SKAction.wait(forDuration: 0.1), SKAction.run(update)]))
} else {
update()
}
}
// MARK: - Viseme Control
func setViseme(_ viseme: Viseme) {
currentViseme = viseme
updateMouthForCurrentViseme()
}
private func updateMouthForCurrentViseme() {
guard let set = AvatarSpriteSheet.expressions[currentExpression] else { return }
if let textureName = set.visemeFrames[currentViseme] {
mouthSprite.texture = SKTexture(imageNamed: textureName)
}
}
// MARK: - Blink System
private func scheduleNextBlink() {
nextBlinkTime = CACurrentMediaTime() + Double.random(in: 2.5...5.5)
}
override func update(_ currentTime: TimeInterval) {
// Blink check
if !isBlinking && CACurrentMediaTime() >= nextBlinkTime {
performBlink()
}
// Subtle eye position from gaze
eyesSprite.position = CGPoint(
x: size.width / 2 + gazeOffset.x * 2, // 2pt max eye shift
y: size.height / 2 + gazeOffset.y * 1.5
)
}
private func performBlink() {
guard let set = AvatarSpriteSheet.expressions[currentExpression] else { return }
// Set the flag only after the guard so a missing expression set cannot leave it stuck
isBlinking = true
let halfClose = SKAction.run {
self.eyesSprite.texture = SKTexture(imageNamed: set.eyeHalf)
}
let fullClose = SKAction.run {
self.eyesSprite.texture = SKTexture(imageNamed: set.eyeClosed)
}
let fullOpen = SKAction.run {
self.eyesSprite.texture = SKTexture(imageNamed: set.eyeOpen)
self.isBlinking = false
self.scheduleNextBlink()
}
let sequence = SKAction.sequence([
halfClose,
SKAction.wait(forDuration: 0.03),
fullClose,
SKAction.wait(forDuration: 0.06),
halfClose,
SKAction.wait(forDuration: 0.03),
fullOpen,
])
eyesSprite.run(sequence)
}
// MARK: - Gaze
func setGazeDirection(_ direction: CGPoint) {
gazeOffset = direction
}
}
// MARK: - SwiftUI Integration
struct SpriteAvatarView: View {
let scene: AvatarSpriteScene
init() {
let s = AvatarSpriteScene(size: CGSize(width: 100, height: 120))
s.scaleMode = .aspectFill
scene = s
}
var body: some View {
SpriteView(scene: scene, options: [.allowsTransparency])
.frame(maxWidth: .infinity, maxHeight: .infinity)
}
}
Asset production for the sprite approach:
For a layered 2D avatar, you need these assets per expression:
| Layer | Assets per expression | Total across 7 expressions |
|---|---|---|
| Base face | 1 texture | 7 |
| Mouth | 15 viseme textures | 105 |
| Eyes | 3 states (open/half/closed) | 21 |
| Brows | 1 texture | 7 |
| Total | 20 textures | 140 |
At 100x120px (Retina), each PNG is about 5-15KB. Total asset size: ~1-2MB. Trivial.
Key insight: The layered sprite approach separates the combinatorial problem. Instead of 7 x 15 x 3 = 315 pre-composed frames, you have 140 layer textures that combine at runtime. The SpriteKit compositor handles the layering with zero effort.
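The arithmetic behind that comparison, as a small sketch using the counts from this article (7 expressions, 15 visemes, 3 eye states, plus one base face and one brow texture per expression). The type and function names are illustrative.

```typescript
interface AvatarAssetPlan {
  expressions: number;
  visemes: number;
  eyeStates: number;
}

// Every expression x viseme x eye-state combination drawn as its own frame.
function precomposedFrames(p: AvatarAssetPlan): number {
  return p.expressions * p.visemes * p.eyeStates;
}

// Per expression: 1 base face + one mouth per viseme + one eye texture per state + 1 brow.
function layeredTextures(p: AvatarAssetPlan): number {
  return p.expressions * (1 + p.visemes + p.eyeStates + 1);
}

const jane: AvatarAssetPlan = { expressions: 7, visemes: 15, eyeStates: 3 };
console.log(precomposedFrames(jane)); // 315
console.log(layeredTextures(jane));   // 140
```

If mouth and eye textures were shared across expressions, so that only the base face and brows vary per expression, the layered count would drop further to 15 + 3 + 7 + 7 = 32, at the cost of less expression-specific mouths.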
If you want a full 3D avatar with minimal native code, embed a web-based renderer. The TalkingHead library is one of the most complete open-source solutions: it handles Ready Player Me avatars, lip sync, expressions, and idle animations out of the box.
import SwiftUI
import WebKit
class AvatarWebViewController: NSObject, ObservableObject, WKScriptMessageHandler {
let webView: WKWebView
init(frame: CGRect = .zero) {
let config = WKWebViewConfiguration()
config.preferences.setValue(true, forKey: "developerExtrasEnabled")
// Allow page JavaScript (WKWebView supports WebGL without extra configuration)
let prefs = WKWebpagePreferences()
prefs.allowsContentJavaScript = true
config.defaultWebpagePreferences = prefs
let contentController = WKUserContentController()
config.userContentController = contentController
webView = WKWebView(frame: frame, configuration: config)
webView.setValue(false, forKey: "drawsBackground") // Transparent background
super.init()
contentController.add(self, name: "avatarBridge")
loadAvatarPage()
}
private func loadAvatarPage() {
let html = Self.avatarHTML
webView.loadHTMLString(html, baseURL: Bundle.main.resourceURL)
}
// MARK: - Control API
func setExpression(_ expression: String, duration: Double = 0.3) {
callJS("setExpression('\(expression)', \(duration))")
}
func setViseme(_ viseme: String, weight: Double = 1.0) {
callJS("setViseme('\(viseme)', \(weight))")
}
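func speak(_ text: String) {
// JSON-encode the text so quotes, backslashes, and newlines survive as a JS string literal
guard let data = try? JSONSerialization.data(withJSONObject: [text]),
let json = String(data: data, encoding: .utf8) else { return }
callJS("speak(\(json)[0])")
}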
func setMood(_ mood: String, intensity: Double = 0.5) {
callJS("setMood('\(mood)', \(intensity))")
}
private func callJS(_ script: String) {
webView.evaluateJavaScript(script) { _, error in
if let error = error {
print("JS error: \(error.localizedDescription)")
}
}
}
// MARK: - WKScriptMessageHandler
func userContentController(
_ userContentController: WKUserContentController,
didReceive message: WKScriptMessage
) {
// Receive events from the web avatar (e.g., animation complete, error)
guard let body = message.body as? [String: Any] else { return }
if let event = body["event"] as? String {
print("Avatar event: \(event)")
}
}
// MARK: - HTML Template
static let avatarHTML = """
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<style>
* { margin: 0; padding: 0; }
body { background: transparent; overflow: hidden; }
canvas { width: 100%; height: 100%; }
</style>
</head>
<body>
<script type="importmap">
{
"imports": {
"three": "https://cdn.jsdelivr.net/npm/three@0.170.0/build/three.module.js",
"three/addons/": "https://cdn.jsdelivr.net/npm/three@0.170.0/examples/jsm/",
"talkinghead": "https://cdn.jsdelivr.net/npm/talkinghead@latest/modules/talkinghead.mjs"
}
}
</script>
<script type="module">
import { TalkingHead } from 'talkinghead';
let head;
async function init() {
const container = document.body;
head = new TalkingHead(container, {
ttsEndpoint: null, // We'll drive TTS from Swift
cameraView: 'head',
cameraDistance: 0.4,
cameraRotateX: 0,
cameraRotateY: 0,
lightAmbientColor: 0x404040,
lightDirectColor: 0xcccccc,
background: { color: 'transparent' },
});
// Load a Ready Player Me avatar
await head.showAvatar({
url: 'https://models.readyplayer.me/64bfa15f0e72c63d7c3934a6.glb?morphTargets=ARKit&textureAtlas=1024',
body: 'F',
avatarMood: 'neutral',
lipsyncLang: 'en',
});
// Notify Swift that avatar is ready
window.webkit.messageHandlers.avatarBridge.postMessage({
event: 'ready'
});
}
// API exposed to Swift via evaluateJavaScript
window.setExpression = (name, duration) => {
if (!head) return;
// Assumption: these mood names must match moods defined by the loaded TalkingHead version
const moods = {
calm: { mood: 'neutral', intensity: 0.3 },
thinking: { mood: 'concern', intensity: 0.4 },
alert: { mood: 'surprise', intensity: 0.5 },
alarmed: { mood: 'anger', intensity: 0.6 },
happy: { mood: 'joy', intensity: 0.7 },
listening: { mood: 'interest', intensity: 0.4 },
};
const m = moods[name] || moods.calm;
head.setMood(m.mood);
};
window.setViseme = (viseme, weight) => {
if (!head) return;
// TalkingHead handles viseme interpolation internally
// For manual control, directly set morph targets
head.setMorphTarget(viseme, weight);
};
window.speak = (text) => {
if (!head) return;
head.speakText(text);
};
window.setMood = (mood, intensity) => {
if (!head) return;
head.setMood(mood);
};
init().catch(console.error);
</script>
</body>
</html>
""";
}
// MARK: - SwiftUI Wrapper
struct WebAvatarView: NSViewRepresentable {
@ObservedObject var controller: AvatarWebViewController
func makeNSView(context: Context) -> WKWebView {
controller.webView
}
func updateNSView(_ nsView: WKWebView, context: Context) {}
}
Key insight: The WKWebView approach trades performance for ecosystem access. TalkingHead gives you Ready Player Me compatibility, built-in lip sync for multiple languages, emotion-to-blend-shape mapping, and Mixamo animation support, all for free. The cost is roughly 50MB of memory overhead and potential WebGL rendering inconsistencies. For a notch HUD, this may be too heavy. For a floating panel or full-screen mode, it's excellent.
For the absolute simplest approach (no SpriteKit, no SceneKit, no WebView), you can build a face entirely in SwiftUI with shapes and animations.
import SwiftUI
// MARK: - Stylized Face Components
struct StylizedEye: View {
let openness: CGFloat // 0 = closed, 1 = fully open
let lookDirection: CGPoint // -1...1
var body: some View {
ZStack {
// Eye socket
Ellipse()
.fill(.white.opacity(0.9))
.frame(width: 14, height: 10 * openness)
.animation(.easeInOut(duration: 0.05), value: openness)
// Iris
Circle()
.fill(
RadialGradient(
colors: [.cyan, .blue.opacity(0.8)],
center: .center,
startRadius: 0,
endRadius: 4
)
)
.frame(width: 6, height: 6)
.offset(
x: lookDirection.x * 2,
y: lookDirection.y * 1.5
)
.opacity(openness > 0.3 ? 1 : 0)
.animation(.easeOut(duration: 0.1), value: lookDirection)
// Pupil
Circle()
.fill(.black)
.frame(width: 3, height: 3)
.offset(
x: lookDirection.x * 2,
y: lookDirection.y * 1.5
)
.opacity(openness > 0.3 ? 1 : 0)
}
}
}
struct StylizedMouth: View {
let viseme: Viseme
let expressionSmile: CGFloat // -1 (frown) to 1 (smile)
private var mouthShape: some Shape {
MouthPath(viseme: viseme, smile: expressionSmile)
}
var body: some View {
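// Fill covers the open shapes; the stroke keeps the line-only shapes (.sil, .pp) visible,
// since filling an open line renders nothing
mouthShape
.fill(.white.opacity(0.7))
.overlay(mouthShape.stroke(.white.opacity(0.7), lineWidth: 1))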
.frame(width: 16, height: 10)
.animation(.easeOut(duration: 0.05), value: viseme)
.animation(.easeInOut(duration: 0.2), value: expressionSmile)
}
}
struct MouthPath: Shape {
var viseme: Viseme
var smile: CGFloat
var animatableData: CGFloat {
get { smile }
set { smile = newValue }
}
func path(in rect: CGRect) -> Path {
var path = Path()
let w = rect.width
let h = rect.height
let cx = rect.midX
let cy = rect.midY
switch viseme {
case .sil:
// Closed mouth: slight line
let smileY = -smile * h * 0.1
path.move(to: CGPoint(x: cx - w * 0.3, y: cy + smileY))
path.addQuadCurve(
to: CGPoint(x: cx + w * 0.3, y: cy + smileY),
control: CGPoint(x: cx, y: cy - smileY * 2)
)
case .aa:
// Wide open mouth
path.addEllipse(in: CGRect(
x: cx - w * 0.25, y: cy - h * 0.3,
width: w * 0.5, height: h * 0.6
))
case .oh:
// Rounded O shape
path.addEllipse(in: CGRect(
x: cx - w * 0.15, y: cy - h * 0.25,
width: w * 0.3, height: h * 0.5
))
case .ee:
// Wide smile
path.addEllipse(in: CGRect(
x: cx - w * 0.3, y: cy - h * 0.1,
width: w * 0.6, height: h * 0.25
))
case .pp:
// Lips pressed together
path.move(to: CGPoint(x: cx - w * 0.2, y: cy))
path.addLine(to: CGPoint(x: cx + w * 0.2, y: cy))
default:
// Generic slightly open mouth for other visemes
path.addEllipse(in: CGRect(
x: cx - w * 0.2, y: cy - h * 0.15,
width: w * 0.4, height: h * 0.3
))
}
return path
}
}
// MARK: - Holographic Face Composite
struct HolographicFace: View {
@StateObject private var animator = FaceAnimator()
var body: some View {
ZStack {
// Scanline overlay (VIKI aesthetic)
ScanlineOverlay()
.opacity(0.15)
// Face outline: holographic glow
FaceOutline()
.stroke(
LinearGradient(
colors: [.cyan.opacity(0.6), .blue.opacity(0.3)],
startPoint: .top,
endPoint: .bottom
),
lineWidth: 1
)
// Eyes
HStack(spacing: 10) {
StylizedEye(
openness: animator.eyeOpenness,
lookDirection: animator.gazeDirection
)
StylizedEye(
openness: animator.eyeOpenness,
lookDirection: animator.gazeDirection
)
}
.offset(y: -8)
// Brows
HStack(spacing: 14) {
BrowShape(raise: animator.browRaise, furrow: animator.browFurrow)
.rotation3DEffect(.degrees(180), axis: (x: 0, y: 1, z: 0))
BrowShape(raise: animator.browRaise, furrow: animator.browFurrow)
}
.offset(y: -16)
// Mouth
StylizedMouth(
viseme: animator.currentViseme,
expressionSmile: animator.mouthSmile
)
.offset(y: 10)
}
.frame(width: 50, height: 60)
.background(
RoundedRectangle(cornerRadius: 8)
.fill(.black.opacity(0.8))
)
.onAppear {
animator.startIdleLoop()
}
}
}
struct ScanlineOverlay: View {
@State private var offset: CGFloat = 0
var body: some View {
GeometryReader { geo in
VStack(spacing: 2) {
ForEach(0..<30, id: \.self) { _ in
Rectangle()
.fill(.white.opacity(0.05))
.frame(height: 1)
}
}
.offset(y: offset)
.onAppear {
withAnimation(.linear(duration: 3).repeatForever(autoreverses: false)) {
offset = 4
}
}
}
}
}
struct FaceOutline: Shape {
func path(in rect: CGRect) -> Path {
Path(ellipseIn: rect.insetBy(dx: 6, dy: 3))
}
}
struct BrowShape: View {
let raise: CGFloat // 0 to 1
let furrow: CGFloat // 0 to 1
var body: some View {
Path { path in
path.move(to: CGPoint(x: 0, y: 4 - raise * 3 + furrow * 2))
path.addQuadCurve(
to: CGPoint(x: 10, y: 2 - raise * 2),
control: CGPoint(x: 5, y: 0 - raise * 4 + furrow * 3)
)
}
.stroke(.white.opacity(0.6), lineWidth: 1.5)
.frame(width: 10, height: 8)
}
}
// MARK: - Face Animator
@MainActor
class FaceAnimator: ObservableObject {
@Published var currentViseme: Viseme = .sil
@Published var eyeOpenness: CGFloat = 1.0
@Published var gazeDirection: CGPoint = .zero
@Published var browRaise: CGFloat = 0
@Published var browFurrow: CGFloat = 0
@Published var mouthSmile: CGFloat = 0.1
private var blinkTimer: Timer?
private var idleTimer: Timer?
func startIdleLoop() {
// Random blinks
scheduleNextBlink()
// Subtle gaze drift
idleTimer = Timer.scheduledTimer(withTimeInterval: 2.0, repeats: true) { [weak self] _ in
Task { @MainActor in
guard let self = self else { return }
withAnimation(.easeInOut(duration: 1.5)) {
self.gazeDirection = CGPoint(
x: CGFloat.random(in: -0.3...0.3),
y: CGFloat.random(in: -0.2...0.2)
)
}
}
}
}
private func scheduleNextBlink() {
let delay = Double.random(in: 2.0...5.0)
blinkTimer = Timer.scheduledTimer(withTimeInterval: delay, repeats: false) { [weak self] _ in
Task { @MainActor in
self?.performBlink()
}
}
}
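private func performBlink() {
withAnimation(.easeIn(duration: 0.05)) {
eyeOpenness = 0.0
}
// Reopen after 80 ms; a main-actor Task keeps the state changes isolated to the main actor
Task { @MainActor [weak self] in
try? await Task.sleep(nanoseconds: 80_000_000)
guard let self = self else { return }
withAnimation(.easeOut(duration: 0.1)) {
self.eyeOpenness = 1.0
}
self.scheduleNextBlink()
}
}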
func setExpression(_ expression: Expression) {
withAnimation(.easeInOut(duration: 0.3)) {
switch expression {
case .calm:
browRaise = 0; browFurrow = 0; mouthSmile = 0.1
case .thinking:
browRaise = 0.3; browFurrow = 0.2; mouthSmile = 0
case .alert:
browRaise = 0.5; browFurrow = 0; mouthSmile = -0.1
case .alarmed:
browRaise = 0.7; browFurrow = 0.5; mouthSmile = -0.3
case .happy:
browRaise = 0.2; browFurrow = 0; mouthSmile = 0.5
case .speaking:
browRaise = 0.1; browFurrow = 0; mouthSmile = 0
case .listening:
browRaise = 0.2; browFurrow = 0.1; mouthSmile = 0.05
}
}
}
}
Key insight: The pure SwiftUI approach has a unique advantage for the VIKI aesthetic: you can draw a stylized, holographic face with scan lines, glow effects, and translucent geometry that would be harder to achieve with a pre-made 3D model. The downside is that complex path animations at 60fps can cause SwiftUI body recomputation overhead. Keep the animated paths simple.
The lip sync pipeline is independent of the rendering approach. Whether you're using SceneKit, SpriteKit, or SwiftUI, you need to convert speech audio into a stream of viseme events.
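One way to keep that independence explicit is to put a small protocol boundary between whatever produces viseme events and whatever draws them. The sketch below is illustrative only: VisemeSource, VisemeSink, VisemeEvent, and SketchViseme are hypothetical names, and SketchViseme is a minimal stand-in for the article's Viseme enum.

```swift
import Foundation

// Minimal stand-in for the article's Viseme enum (hypothetical subset).
enum SketchViseme: String { case sil, pp, aa, ee, oh }

/// One timed mouth-shape event, independent of any renderer.
struct VisemeEvent: Equatable {
    let viseme: SketchViseme
    let time: TimeInterval      // seconds since the utterance started
    let duration: TimeInterval  // how long the shape should be held
}

/// Anything that turns speech into viseme events
/// (TTS callbacks, a Rhubarb timeline, a cloud API, live audio analysis).
protocol VisemeSource: AnyObject {
    var onEvent: ((VisemeEvent) -> Void)? { get set }
}

/// Anything that can display a viseme (SceneKit, SpriteKit, SwiftUI).
protocol VisemeSink: AnyObject {
    func show(_ viseme: SketchViseme)
}

/// Connects one source to one sink without either knowing the other's type.
func connect(_ source: VisemeSource, to sink: VisemeSink) {
    source.onEvent = { [weak sink] event in sink?.show(event.viseme) }
}

// Simple doubles used to exercise the wiring.
final class ScriptedSource: VisemeSource {
    var onEvent: ((VisemeEvent) -> Void)?
    func emit(_ events: [VisemeEvent]) { events.forEach { onEvent?($0) } }
}

final class RecordingSink: VisemeSink {
    private(set) var shown: [SketchViseme] = []
    func show(_ viseme: SketchViseme) { shown.append(viseme) }
}
```

With this shape, swapping the lip sync approach or the rendering backend only changes which concrete source or sink is passed to connect.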
Approach A: Apple AVSpeechSynthesizer with Word-Level Timing
Apple's TTS gives you word-level callbacks but not phoneme-level ones. You need to estimate visemes from the text.
import AVFoundation
class TTSVisemeDriver: NSObject, AVSpeechSynthesizerDelegate, ObservableObject {
private let synthesizer = AVSpeechSynthesizer()
private var currentUtterance: AVSpeechUtterance?
private var fullText: String = ""
// Callback for each viseme
var onViseme: ((Viseme, TimeInterval) -> Void)?
var onExpressionHint: ((Expression) -> Void)?
override init() {
super.init()
synthesizer.delegate = self
}
func speak(_ text: String) {
fullText = text
let utterance = AVSpeechUtterance(string: text)
utterance.voice = AVSpeechSynthesisVoice(language: "en-US")
utterance.rate = AVSpeechUtteranceDefaultSpeechRate
utterance.pitchMultiplier = 1.0
currentUtterance = utterance
synthesizer.speak(utterance)
}
// MARK: - Delegate
func speechSynthesizer(
_ synthesizer: AVSpeechSynthesizer,
willSpeakRangeOfSpeechString characterRange: NSRange,
utterance: AVSpeechUtterance
) {
let nsString = fullText as NSString
let word = nsString.substring(with: characterRange)
// Convert word to phoneme sequence (simplified)
let visemes = Self.wordToVisemes(word)
// Schedule visemes across the estimated word duration
let estimatedWordDuration = 0.07 * Double(word.count) // rough guess (~70 ms per character); tune for the voice and rate
let visemeDuration = estimatedWordDuration / Double(max(visemes.count, 1))
for (index, viseme) in visemes.enumerated() {
let delay = Double(index) * visemeDuration
DispatchQueue.main.asyncAfter(deadline: .now() + delay) { [weak self] in
self?.onViseme?(viseme, visemeDuration)
}
}
}
func speechSynthesizer(
_ synthesizer: AVSpeechSynthesizer,
didFinish utterance: AVSpeechUtterance
) {
onViseme?(.sil, 0.1)
}
// MARK: - Phoneme-to-Viseme Mapping
/// Simplified English text-to-viseme mapping
/// For production use, integrate Rhubarb Lip Sync or a phoneme dictionary
static func wordToVisemes(_ word: String) -> [Viseme] {
var visemes: [Viseme] = []
let lower = word.lowercased()
var i = lower.startIndex
while i < lower.endIndex {
let char = lower[i]
let next = lower.index(after: i) < lower.endIndex
? lower[lower.index(after: i)]
: nil
switch char {
case "a": visemes.append(.aa)
case "e": visemes.append(.ee)
case "i": visemes.append(.ih)
case "o": visemes.append(.oh)
case "u": visemes.append(.ou)
case "p", "b", "m": visemes.append(.pp)
case "f", "v": visemes.append(.ff)
case "t", "d":
if next == "h" {
visemes.append(.th)
i = lower.index(after: i)
} else {
visemes.append(.dd)
}
case "k", "g": visemes.append(.kk)
case "s", "z": visemes.append(.ss)
case "n", "l": visemes.append(.nn)
case "r": visemes.append(.rr)
case "c":
if next == "h" {
visemes.append(.ch)
i = lower.index(after: i)
} else {
visemes.append(.kk)
}
case "j": visemes.append(.ch)
case "w": visemes.append(.ou)
case "y": visemes.append(.ih)
default: break // Skip non-letter characters
}
i = lower.index(after: i)
}
return visemes.isEmpty ? [.sil] : visemes
}
}
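The delegate above spreads a word's visemes evenly over a per-character duration estimate. A possible refinement, sketched here under assumptions, derives the word duration from a speaking rate and lets some visemes (for example vowels) take a larger share of it. The wordsPerMinute value and the weights are hypothetical tuning inputs, not values reported by AVSpeechSynthesizer.

```swift
import Foundation

/// Start offset and length of one viseme inside a word, in seconds.
struct VisemeSlot {
    let start: TimeInterval
    let duration: TimeInterval
}

/// Rough duration of one word at an assumed speaking rate.
func estimatedWordDuration(wordsPerMinute: Double) -> TimeInterval {
    60.0 / wordsPerMinute
}

/// Splits `wordDuration` across visemes in proportion to `weights`.
/// A weight of 2 holds a viseme twice as long as a weight of 1.
func scheduleVisemes(weights: [Double], wordDuration: TimeInterval) -> [VisemeSlot] {
    let total = weights.reduce(0, +)
    guard total > 0, wordDuration > 0 else { return [] }
    var slots: [VisemeSlot] = []
    var cursor: TimeInterval = 0
    for weight in weights {
        let duration = wordDuration * weight / total
        slots.append(VisemeSlot(start: cursor, duration: duration))
        cursor += duration
    }
    return slots
}
```

Each slot's start can be used as the delay for the scheduled callback and its duration as the hold time. The estimate is still a guess, so visemes will drift on long words.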
Approach B: Rhubarb Lip Sync (Offline Pre-Processing)
Rhubarb Lip Sync is a command-line tool that analyzes audio files and produces frame-accurate viseme data. It uses speech recognition for better accuracy than frequency analysis alone.
import Foundation
struct RhubarbVisemeEvent: Decodable {
let start: Double
let end: Double
let value: String // A-H mouth shape codes
}
struct RhubarbOutput: Decodable {
let mouthCues: [RhubarbVisemeEvent]
}
class RhubarbLipSync {
private let rhubarbPath: String
init(rhubarbPath: String = "/usr/local/bin/rhubarb") {
self.rhubarbPath = rhubarbPath
}
/// Process an audio file and return viseme timeline
func process(audioFile: URL, dialogText: String? = nil) async throws -> [RhubarbVisemeEvent] {
var args = [
audioFile.path,
"--machineReadable",
"-f", "json",
"--extendedShapes", "GHX" // Include extended mouth shapes
]
// Providing dialog text dramatically improves accuracy
if let text = dialogText {
let textFile = FileManager.default.temporaryDirectory
.appendingPathComponent("dialog.txt")
try text.write(to: textFile, atomically: true, encoding: .utf8)
args.append(contentsOf: ["-d", textFile.path])
}
let process = Process()
process.executableURL = URL(fileURLWithPath: rhubarbPath)
process.arguments = args
let pipe = Pipe()
process.standardOutput = pipe
try process.run()
// Drain stdout before waiting: if the pipe buffer fills up, Rhubarb blocks on write
// and waitUntilExit() would never return
let data = pipe.fileHandleForReading.readDataToEndOfFile()
process.waitUntilExit()
let output = try JSONDecoder().decode(RhubarbOutput.self, from: data)
return output.mouthCues
}
/// Map Rhubarb mouth shape codes to OVR visemes
static func rhubarbToViseme(_ code: String) -> Viseme {
switch code {
case "A": return .pp // Closed mouth (m, b, p)
case "B": return .ee // Slightly open (most consonants)
case "C": return .ee // Open (e, ae)
case "D": return .aa // Wide open (a)
case "E": return .oh // Rounded (o)
case "F": return .ou // Tight rounded (u, w)
case "G": return .ff // Upper teeth on lower lip (f, v)
case "H": return .nn // Tongue raised behind upper teeth (long "L" sounds)
case "X": return .sil // Idle / silence
default: return .sil
}
}
}
import QuartzCore // CACurrentMediaTime
/// Drives viseme playback from a Rhubarb timeline
class VisemePlayer {
private var timeline: [RhubarbVisemeEvent] = []
private var timer: Timer?
private var startTime: Double = 0
private var currentIndex = 0
private var lastEmittedIndex = -1
var onViseme: ((Viseme) -> Void)?
func play(timeline: [RhubarbVisemeEvent]) {
timer?.invalidate()
self.timeline = timeline
currentIndex = 0
lastEmittedIndex = -1
startTime = CACurrentMediaTime()
// A 60 Hz timer keeps the example simple; a display link would track the screen refresh more closely
timer = Timer.scheduledTimer(withTimeInterval: 1.0/60.0, repeats: true) { [weak self] timer in
guard let self = self else { timer.invalidate(); return }
let elapsed = CACurrentMediaTime() - self.startTime
// Skip cues that have already ended
while self.currentIndex < self.timeline.count,
elapsed >= self.timeline[self.currentIndex].end {
self.currentIndex += 1
}
// Past the last cue: stop and return to the idle mouth
guard self.currentIndex < self.timeline.count else {
timer.invalidate()
self.onViseme?(.sil)
return
}
// Emit each cue once, when it becomes active, instead of on every tick
let event = self.timeline[self.currentIndex]
if elapsed >= event.start && self.currentIndex != self.lastEmittedIndex {
self.lastEmittedIndex = self.currentIndex
self.onViseme?(RhubarbLipSync.rhubarbToViseme(event.value))
}
}
}
}
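The player above walks the timeline forward from a wall-clock start time. If playback can seek or pause, or if the clock comes from the audio player instead, it can be simpler to look up the active cue for an arbitrary time. The sketch below does that with a binary search. MouthCue, RhubarbResult, and activeCue are hypothetical names, and the sample JSON is made-up data in the shape of Rhubarb's JSON export.

```swift
import Foundation

struct MouthCue: Decodable, Equatable {
    let start: Double
    let end: Double
    let value: String   // Rhubarb mouth shape code (A-H, X)
}

struct RhubarbResult: Decodable {
    let mouthCues: [MouthCue]
}

/// Returns the cue active at `time`, or nil in gaps and past the end.
/// Assumes cues are sorted by `start` and do not overlap.
func activeCue(in cues: [MouthCue], at time: Double) -> MouthCue? {
    var low = 0
    var high = cues.count - 1
    while low <= high {
        let mid = (low + high) / 2
        let cue = cues[mid]
        if time < cue.start {
            high = mid - 1          // target lies in an earlier cue
        } else if time >= cue.end {
            low = mid + 1           // target lies in a later cue
        } else {
            return cue              // start <= time < end
        }
    }
    return nil
}

// Illustrative data only; real output comes from the rhubarb CLI.
let sampleJSON = """
{"metadata":{"soundFile":"hello.wav","duration":0.47},
 "mouthCues":[
  {"start":0.00,"end":0.05,"value":"X"},
  {"start":0.05,"end":0.27,"value":"D"},
  {"start":0.27,"end":0.47,"value":"B"}]}
"""
let sampleCues = try! JSONDecoder()
    .decode(RhubarbResult.self, from: Data(sampleJSON.utf8))
    .mouthCues
```

Feeding the audio player's current time into activeCue each frame keeps the mouth tied to what is actually audible, even after a seek.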
Approach C: Azure Speech Service Viseme API
For the highest quality lip sync, Azure Speech Service provides real-time blend shape output: 55 facial position values per frame, streamed alongside the TTS audio.
// Azure Speech SDK: TypeScript example for the pipeline concept
import * as SpeechSDK from 'microsoft-cognitiveservices-speech-sdk';
interface AzureVisemeFrame {
frameIndex: number;
blendShapes: number[]; // 55 blend shape values for this one frame
}
class AzureVisemeDriver {
private synthesizer: SpeechSDK.SpeechSynthesizer;
private visemeQueue: AzureVisemeFrame[] = [];
constructor(subscriptionKey: string, region: string) {
const speechConfig = SpeechSDK.SpeechConfig.fromSubscription(
subscriptionKey,
region
);
speechConfig.speechSynthesisVoiceName = 'en-US-JennyNeural';
const audioConfig = SpeechSDK.AudioConfig.fromDefaultSpeakerOutput();
this.synthesizer = new SpeechSDK.SpeechSynthesizer(speechConfig, audioConfig);
// Subscribe to viseme events
this.synthesizer.visemeReceived = (sender, event) => {
if (event.animation) {
const animationData = JSON.parse(event.animation);
// animationData.BlendShapes is an array of frames
// Each frame is an array of 55 float values
for (const frame of animationData.BlendShapes) {
this.visemeQueue.push({
frameIndex: this.visemeQueue.length,
blendShapes: frame,
});
}
}
};
}
async speak(text: string): Promise<void> {
this.visemeQueue = [];
// Use SSML to request blend shapes
const ssml = `
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-US-JennyNeural">
<mstts:viseme type="FacialExpression"/>
${text.replace(/&/g, '&amp;').replace(/</g, '&lt;')}
</voice>
</speak>
`;
return new Promise((resolve, reject) => {
this.synthesizer.speakSsmlAsync(
ssml,
result => {
if (result.reason === SpeechSDK.ResultReason.SynthesizingAudioCompleted) {
resolve();
} else {
reject(new Error(`Speech synthesis failed: ${result.errorDetails}`));
}
},
error => reject(error)
);
});
}
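// Indices 0-51 follow the ARKit blend shape names. Indices 52-54 (headRoll,
// leftEyeRoll, rightEyeRoll) are Azure additions with no ARKit equivalent.
// See: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-speech-synthesis-viseme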
static readonly AZURE_TO_ARKIT: Record<number, string> = {
0: 'eyeBlinkLeft',
1: 'eyeLookDownLeft',
2: 'eyeLookInLeft',
3: 'eyeLookOutLeft',
4: 'eyeLookUpLeft',
5: 'eyeSquintLeft',
6: 'eyeWideLeft',
7: 'eyeBlinkRight',
8: 'eyeLookDownRight',
9: 'eyeLookInRight',
10: 'eyeLookOutRight',
11: 'eyeLookUpRight',
12: 'eyeSquintRight',
13: 'eyeWideRight',
14: 'jawForward',
15: 'jawLeft',
16: 'jawRight',
17: 'jawOpen',
18: 'mouthClose',
19: 'mouthFunnel',
20: 'mouthPucker',
21: 'mouthLeft',
22: 'mouthRight',
23: 'mouthSmileLeft',
24: 'mouthSmileRight',
25: 'mouthFrownLeft',
26: 'mouthFrownRight',
27: 'mouthDimpleLeft',
28: 'mouthDimpleRight',
29: 'mouthStretchLeft',
30: 'mouthStretchRight',
31: 'mouthRollLower',
32: 'mouthRollUpper',
33: 'mouthShrugLower',
34: 'mouthShrugUpper',
35: 'mouthPressLeft',
36: 'mouthPressRight',
37: 'mouthLowerDownLeft',
38: 'mouthLowerDownRight',
39: 'mouthUpperUpLeft',
40: 'mouthUpperUpRight',
41: 'browDownLeft',
42: 'browDownRight',
43: 'browInnerUp',
44: 'browOuterUpLeft',
45: 'browOuterUpRight',
46: 'cheekPuff',
47: 'cheekSquintLeft',
48: 'cheekSquintRight',
49: 'noseSneerLeft',
50: 'noseSneerRight',
51: 'tongueOut',
52: 'headRoll',
53: 'leftEyeRoll',
54: 'rightEyeRoll',
};
}
Key insight: Azure's viseme API outputs 55 blend shape values per frame. The first 52 follow the ARKit blend shape names; the last three (headRoll, leftEyeRoll, rightEyeRoll) are Azure-specific. Of the options covered here, this is the highest-fidelity lip sync: it handles coarticulation (how sounds blend into each other), emotional modulation, and timing that no rule-based system can match. The cost is API dependency and ~200ms latency.
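On the Swift side, each viseme event's animation payload has to be turned into frames with timestamps before it can drive a face. The sketch below decodes one payload and assigns each frame a time, assuming the documented payload shape (a FrameIndex offset plus a BlendShapes matrix) and a 60 frames-per-second rate. BlendShapeChunk, TimedFrame, and timedFrames are hypothetical names.

```swift
import Foundation

/// One `animation` payload from a viseme event.
struct BlendShapeChunk: Decodable {
    let frameIndex: Int          // index of the first frame in this chunk
    let blendShapes: [[Double]]  // one row of weights per frame

    enum CodingKeys: String, CodingKey {
        case frameIndex = "FrameIndex"
        case blendShapes = "BlendShapes"
    }
}

/// A single frame with its playback time relative to the start of the audio.
struct TimedFrame {
    let time: TimeInterval
    let weights: [Double]
}

/// Decodes a chunk and stamps each frame with its playback time.
/// `framesPerSecond` defaults to 60, which is an assumption about the service output.
func timedFrames(fromAnimationJSON json: String,
                 framesPerSecond: Double = 60) throws -> [TimedFrame] {
    let chunk = try JSONDecoder().decode(BlendShapeChunk.self, from: Data(json.utf8))
    return chunk.blendShapes.enumerated().map { offset, weights in
        TimedFrame(
            time: Double(chunk.frameIndex + offset) / framesPerSecond,
            weights: weights
        )
    }
}
```

Scheduling frames by timestamp rather than by arrival order lets the renderer stay aligned with audio playback even when chunks arrive in bursts.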
Approach D: Real-Time Audio Analysis (wawa-lipsync Style)
For local lip sync with no network round trip, analyze the audio stream's frequency spectrum in real time.
import AVFoundation
import Accelerate
class AudioVisemeAnalyzer {
private let engine = AVAudioEngine()
private let fftSize = 1024
private var fftSetup: vDSP_DFT_Setup?
var onViseme: ((Viseme, CGFloat) -> Void)? // viseme + intensity
init() {
fftSetup = vDSP_DFT_zop_CreateSetup(
nil,
vDSP_Length(fftSize),
.FORWARD
)
}
func startAnalyzing(audioNode: AVAudioNode) {
let format = audioNode.outputFormat(forBus: 0)
audioNode.installTap(onBus: 0, bufferSize: AVAudioFrameCount(fftSize), format: format) {
[weak self] buffer, time in
self?.analyzeBuffer(buffer)
}
try? engine.start()
}
private func analyzeBuffer(_ buffer: AVAudioPCMBuffer) {
guard let channelData = buffer.floatChannelData?[0] else { return }
let frameCount = Int(buffer.frameLength)
// Calculate RMS energy
var rms: Float = 0
vDSP_rmsqv(channelData, 1, &rms, vDSP_Length(frameCount))
// If silence, emit neutral viseme
guard rms > 0.01 else {
DispatchQueue.main.async { self.onViseme?(.sil, 0) }
return
}
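// Simple frequency band analysis for viseme estimation
guard let fftSetup = fftSetup else { return }
var inReal = [Float](repeating: 0, count: fftSize)
let inImag = [Float](repeating: 0, count: fftSize)
var outReal = [Float](repeating: 0, count: fftSize)
var outImag = [Float](repeating: 0, count: fftSize)
// Copy input data (zero-padded when the buffer is shorter than fftSize)
for i in 0..<min(frameCount, fftSize) {
inReal[i] = channelData[i]
}
// Perform the DFT out of place: Swift rejects passing the same array as input and output
vDSP_DFT_Execute(fftSetup, inReal, inImag, &outReal, &outImag)
// Calculate the squared-magnitude spectrum of the positive-frequency half
var magnitudes = [Float](repeating: 0, count: fftSize / 2)
outReal.withUnsafeMutableBufferPointer { realPtr in
outImag.withUnsafeMutableBufferPointer { imagPtr in
var split = DSPSplitComplex(realp: realPtr.baseAddress!, imagp: imagPtr.baseAddress!)
vDSP_zvmags(&split, 1, &magnitudes, 1, vDSP_Length(fftSize / 2))
}
}
// Frequency band energy (simplified); use the buffer's real sample rate, not a hard-coded one
let sampleRate = buffer.format.sampleRate
let binWidth = sampleRate / Double(fftSize)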
// Vowel formant ranges (Hz)
let f1Low = bandEnergy(magnitudes, binWidth: binWidth, lowHz: 200, highHz: 500)
let f1High = bandEnergy(magnitudes, binWidth: binWidth, lowHz: 500, highHz: 1000)
let f2Low = bandEnergy(magnitudes, binWidth: binWidth, lowHz: 1000, highHz: 2000)
let f2High = bandEnergy(magnitudes, binWidth: binWidth, lowHz: 2000, highHz: 4000)
let highFreq = bandEnergy(magnitudes, binWidth: binWidth, lowHz: 4000, highHz: 8000)
// Map formant energies to visemes (simplified heuristic)
let viseme: Viseme
let intensity = CGFloat(min(rms * 10, 1.0))
if highFreq > f1Low && highFreq > f2Low {
viseme = .ss // Sibilants
} else if f1High > f1Low && f2High > f2Low {
viseme = .ee // Front vowels
} else if f1Low > f1High && f2Low > f2High {
viseme = .oh // Back rounded vowels
} else if f1Low > 0.5 {
viseme = .aa // Open vowels
} else if rms > 0.05 {
viseme = .dd // Generic consonant
} else {
viseme = .sil
}
DispatchQueue.main.async {
self.onViseme?(viseme, intensity)
}
}
private func bandEnergy(_ magnitudes: [Float], binWidth: Double, lowHz: Double, highHz: Double) -> Float {
let lowBin = Int(lowHz / binWidth)
let highBin = min(Int(highHz / binWidth), magnitudes.count - 1)
guard lowBin < highBin else { return 0 }
var energy: Float = 0
vDSP_sve(Array(magnitudes[lowBin...highBin]), 1, &energy, vDSP_Length(highBin - lowBin + 1))
return energy / Float(highBin - lowBin + 1)
}
func stop() {
engine.stop()
}
deinit {
if let setup = fftSetup {
vDSP_DFT_DestroySetup(setup)
}
}
}
Key insight: Real-time audio analysis gives you near-zero-latency lip sync (one audio buffer, about 23 ms at 1024 frames and 44.1 kHz) but lower accuracy. It can't distinguish "p" from "b" from "m": they all look the same to a coarse band-energy analysis. For a 50x60pt display, this inaccuracy is invisible. For a large display, use Rhubarb or Azure.
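The band analysis in the analyzer rests on two pieces of arithmetic: how a frequency range maps to FFT bins, and how long one analysis buffer takes to fill. The sketch below works both out as standalone helpers. The function names are hypothetical, and the bin math mirrors the truncating division used in bandEnergy.

```swift
import Foundation

/// Width of one FFT bin in Hz.
func binWidth(sampleRate: Double, fftSize: Int) -> Double {
    sampleRate / Double(fftSize)
}

/// Inclusive range of bins covering lowHz...highHz, clamped to the
/// positive-frequency half of the spectrum. Returns nil if the range is empty.
func binRange(lowHz: Double, highHz: Double,
              sampleRate: Double, fftSize: Int) -> ClosedRange<Int>? {
    let width = binWidth(sampleRate: sampleRate, fftSize: fftSize)
    let low = Int(lowHz / width)
    let high = min(Int(highHz / width), fftSize / 2 - 1)
    return low <= high ? low...high : nil
}

/// Time needed to fill one analysis buffer, in milliseconds.
func bufferLatencyMs(sampleRate: Double, frames: Int) -> Double {
    Double(frames) / sampleRate * 1000
}
```

At 44.1 kHz with a 1024-point transform, each bin is about 43 Hz wide, so the 200-500 Hz band covers only bins 4 through 11. That coarse resolution is one reason the heuristic is limited to broad vowel and sibilant classes.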
Example 1: Minimal SceneKit Face in SwiftUI
import SwiftUI
import SceneKit
struct MinimalFaceView: View {
var body: some View {
SceneView(
scene: makeFaceScene(),
options: [.allowsCameraControl]
)
.frame(width: 200, height: 200)
}
func makeFaceScene() -> SCNScene {
let scene = SCNScene()
// Create a simple face from primitives
let head = SCNSphere(radius: 0.3)
head.firstMaterial?.diffuse.contents = NSColor.systemTeal.withAlphaComponent(0.7)
let headNode = SCNNode(geometry: head)
scene.rootNode.addChildNode(headNode)
// Eyes
for xOffset: Float in [-0.1, 0.1] {
let eye = SCNSphere(radius: 0.04)
eye.firstMaterial?.diffuse.contents = NSColor.white
let eyeNode = SCNNode(geometry: eye)
eyeNode.position = SCNVector3(xOffset, 0.05, 0.27)
headNode.addChildNode(eyeNode)
let pupil = SCNSphere(radius: 0.02)
pupil.firstMaterial?.diffuse.contents = NSColor.black
let pupilNode = SCNNode(geometry: pupil)
pupilNode.position = SCNVector3(0, 0, 0.025)
eyeNode.addChildNode(pupilNode)
}
// Ambient light
let light = SCNLight()
light.type = .ambient
light.color = NSColor(white: 0.6, alpha: 1)
let lightNode = SCNNode()
lightNode.light = light
scene.rootNode.addChildNode(lightNode)
return scene
}
}
Example 2: Expression Enum with Blend Shape Presets
enum Expression: String, CaseIterable {
case calm, thinking, alert, alarmed, happy, speaking, listening
static let expressionShapeNames: Set<String> = [
"browInnerUp", "browDownLeft", "browDownRight",
"browOuterUpLeft", "browOuterUpRight",
"eyeSquintLeft", "eyeSquintRight",
"eyeWideLeft", "eyeWideRight",
"mouthSmileLeft", "mouthSmileRight",
"mouthFrownLeft", "mouthFrownRight",
]
static let mouthShapeNames: Set<String> = [
"jawOpen", "mouthFunnel", "mouthPucker",
"mouthSmileLeft", "mouthSmileRight",
"mouthFrownLeft", "mouthFrownRight",
]
var blendShapeTargets: [String: CGFloat] {
switch self {
case .calm:
return ["mouthSmileLeft": 0.1, "mouthSmileRight": 0.1]
case .thinking:
return [
"browInnerUp": 0.3, "eyeSquintLeft": 0.1,
"eyeSquintRight": 0.1, "mouthPucker": 0.1
]
case .alert:
return [
"eyeWideLeft": 0.3, "eyeWideRight": 0.3,
"browOuterUpLeft": 0.3, "browOuterUpRight": 0.3
]
case .alarmed:
return [
"eyeWideLeft": 0.6, "eyeWideRight": 0.6,
"browInnerUp": 0.5, "browOuterUpLeft": 0.4,
"browOuterUpRight": 0.4, "mouthFrownLeft": 0.3,
"mouthFrownRight": 0.3
]
case .happy:
return [
"mouthSmileLeft": 0.5, "mouthSmileRight": 0.5,
"eyeSquintLeft": 0.2, "eyeSquintRight": 0.2
]
case .speaking:
return ["browInnerUp": 0.1]
case .listening:
return [
"browInnerUp": 0.15, "mouthSmileLeft": 0.05,
"mouthSmileRight": 0.05
]
}
}
}
Example 3: Viseme Interpolation with Smoothing
class VisemeInterpolator {
private var currentWeights: [String: CGFloat] = [:]
private var targetWeights: [String: CGFloat] = [:]
private let smoothingFactor: CGFloat = 0.3 // 0 = instant, 1 = never reach target
func setTargetViseme(_ viseme: Viseme) {
targetWeights = viseme.blendShapeTargets
}
/// Call this every frame to get smoothed blend shape values
func tick() -> [String: CGFloat] {
// Lerp current toward target
let allKeys = Set(currentWeights.keys).union(targetWeights.keys)
for key in allKeys {
let current = currentWeights[key] ?? 0
let target = targetWeights[key] ?? 0
currentWeights[key] = current + (target - current) * (1 - smoothingFactor)
// Snap to zero if close enough
if abs(currentWeights[key]!) < 0.001 {
currentWeights[key] = 0
}
}
return currentWeights
}
}
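The interpolator above moves a fixed fraction toward the target on every call, so its feel changes with the frame rate. A frame-rate-independent variant, sketched here as an alternative rather than a replacement, scales the step by the elapsed time using a half-life. WeightSmoother and halfLife are hypothetical names; a half-life of about 0.01 s roughly matches smoothingFactor = 0.3 at 60 fps.

```swift
import Foundation

/// Exponential smoothing of blend shape weights that behaves the same
/// whether tick() is called at 30, 60, or 120 frames per second.
struct WeightSmoother {
    /// Seconds for the remaining distance to the target to halve (tuning value).
    var halfLife: Double
    var target: [String: Double] = [:]
    private(set) var current: [String: Double] = [:]

    init(halfLife: Double = 0.01) {
        self.halfLife = halfLife
    }

    /// Advances the smoothed weights by `deltaTime` seconds and returns them.
    mutating func tick(deltaTime: Double) -> [String: Double] {
        // Fraction of the remaining distance covered during this step.
        let blend = 1 - pow(0.5, deltaTime / halfLife)
        for key in Set(current.keys).union(target.keys) {
            let from = current[key] ?? 0
            let to = target[key] ?? 0
            current[key] = from + (to - from) * blend
        }
        return current
    }
}
```

Two 1/60 s steps land on the same weights as a single 1/30 s step, which is the property the fixed per-frame factor lacks.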
Example 4: VIKI Holographic Shader (Metal)
// HolographicFace.metal: Metal shader for VIKI-style translucent face effect
#include <metal_stdlib>
using namespace metal;
struct VertexOut {
float4 position [[position]];
float3 worldNormal;
float3 worldPosition;
float2 texCoord;
};
fragment float4 holographicFragment(
VertexOut in [[stage_in]],
constant float &time [[buffer(0)]],
constant float3 &cameraPosition [[buffer(1)]]
) {
// Fresnel rim lighting: glows at edges
float3 viewDir = normalize(cameraPosition - in.worldPosition);
// Interpolated normals are not unit length, so renormalize before the dot product
float fresnel = 1.0 - abs(dot(viewDir, normalize(in.worldNormal)));
fresnel = pow(fresnel, 2.0);
// Base color: translucent cyan
float3 baseColor = float3(0.0, 0.8, 1.0);
// Scanline effect
float scanline = sin(in.worldPosition.y * 200.0 + time * 2.0) * 0.5 + 0.5;
scanline = step(0.4, scanline); // Hard-edged scanlines
// Grid pattern
float gridX = step(0.95, fract(in.worldPosition.x * 50.0));
float gridY = step(0.95, fract(in.worldPosition.y * 50.0));
float grid = max(gridX, gridY) * 0.3;
// Combine
float alpha = fresnel * 0.6 + grid + scanline * 0.1;
alpha = clamp(alpha, 0.0, 0.8);
float3 color = baseColor * (fresnel + 0.3) + float3(grid * 0.5);
// Pulse effect
float pulse = sin(time * 1.5) * 0.1 + 0.9;
color *= pulse;
return float4(color, alpha);
}
Example 5: Connecting Status Watcher to Expression
import Combine
class JaneAvatarBridge: ObservableObject {
private let statusWatcher: StatusWatcher
private let faceController: FaceSceneController
private let ttsDriver: TTSVisemeDriver
private var cancellables = Set<AnyCancellable>()
init(statusWatcher: StatusWatcher, faceController: FaceSceneController) {
self.statusWatcher = statusWatcher
self.faceController = faceController
self.ttsDriver = TTSVisemeDriver()
// Map system status to expressions
statusWatcher.$currentStatus
.removeDuplicates()
.sink { [weak self] status in
self?.updateExpression(for: status)
}
.store(in: &cancellables)
// Wire TTS visemes to face controller
ttsDriver.onViseme = { [weak faceController] viseme, duration in
faceController?.setViseme(viseme)
}
}
private func updateExpression(for status: SystemStatus) {
let expression: Expression
switch status {
case .idle:
expression = .calm
case .processing:
expression = .thinking
case .alert(let severity):
expression = severity > 0.7 ? .alarmed : .alert
case .speaking:
expression = .speaking
case .listening:
expression = .listening
case .success:
expression = .happy
}
faceController.setExpression(expression)
}
func speak(_ text: String) {
faceController.setExpression(.speaking)
ttsDriver.speak(text)
}
}
Example 6: Loading a Ready Player Me Avatar as USDZ
import Foundation
import SceneKit
import SceneKit.ModelIO // MDLAsset and SCNScene(mdlAsset:)
class ReadyPlayerMeLoader {
/// Download a Ready Player Me avatar and try to load it into SceneKit.
/// RPM avatars are delivered as GLB; SceneKit natively loads formats such as USDZ and DAE.
static func loadAvatar(
avatarId: String,
morphTargets: Bool = true,
completion: @escaping (Result<SCNScene, Error>) -> Void
) {
var urlString = "https://models.readyplayer.me/\(avatarId).glb"
if morphTargets {
urlString += "?morphTargets=ARKit&textureAtlas=512"
}
guard let url = URL(string: urlString) else {
completion(.failure(URLError(.badURL)))
return
}
URLSession.shared.dataTask(with: url) { data, response, error in
if let error = error {
completion(.failure(error))
return
}
guard let data = data else {
completion(.failure(URLError(.cannotDecodeContentData)))
return
}
do {
// Save to temp file
let tempFile = FileManager.default.temporaryDirectory
.appendingPathComponent("\(avatarId).glb")
try data.write(to: tempFile)
// Model I/O does not list glTF/GLB among its documented import formats, so check first.
// If this check fails, convert the GLB to USDZ beforehand or use a glTF loader such as GLTFKit.
guard MDLAsset.canImportFileExtension("glb") else {
throw URLError(.cannotDecodeContentData)
}
let asset = MDLAsset(url: tempFile)
let scene = SCNScene(mdlAsset: asset)
DispatchQueue.main.async {
completion(.success(scene))
}
} catch {
DispatchQueue.main.async {
completion(.failure(error))
}
}
}.resume()
}
}
Example 7: NVIDIA Audio2Face-3D Integration Concept
// Conceptual integration with NVIDIA's open-source Audio2Face-3D
// The SDK outputs ARKit-compatible blend shapes from audio input
// https://github.com/NVIDIA/Audio2Face-3D
interface Audio2FaceConfig {
modelPath: string; // Path to the ONNX model
sampleRate: number; // Audio sample rate (16000 Hz typical)
outputFps: number; // Target output frame rate
emotionStrength: number; // 0-1, how much emotion to add
}
interface Audio2FaceOutput {
timestamp: number;
blendShapes: Record<string, number>; // ARKit blend shape names
emotion: {
amazement: number;
anger: number;
cheekiness: number;
disgust: number;
fear: number;
grief: number;
joy: number;
outOfBreath: number;
pain: number;
sadness: number;
};
}
// The Audio2Face-3D SDK is C++ with CUDA dependency
// For Apple Silicon, you'd need to:
// 1. Export the ONNX model
// 2. Convert to Core ML using coremltools
// 3. Run inference via Core ML on the Neural Engine
//
// This is non-trivial but possible for the audio-to-blendshape model.
// The emotion model would also need conversion.
class Audio2FaceCoreMLBridge {
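// Pseudocode for the Core ML conversion path.
// Caveat: ct.converters.onnx was removed in coremltools 6, so the call below
// needs an older coremltools release. With current releases, convert from a
// framework that ct.convert supports (for example PyTorch) instead.
async convertModel(onnxPath: string): Promise<string> {
// python3 -c "
// import coremltools as ct
// import onnx
// model = onnx.load('audio2face.onnx')
// mlmodel = ct.converters.onnx.convert(model)
// mlmodel.save('audio2face.mlpackage')
// "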
return 'audio2face.mlpackage';
}
}
Example 8: Complete Integration (Jane Avatar Controller)
import SwiftUI
import Combine
/// The main orchestrator that ties everything together
@MainActor
class JaneController: ObservableObject {
// Rendering backend: swap between approaches
enum RenderingBackend {
case sceneKit // Full 3D with blend shapes
case spriteKit // 2D layered sprites
case swiftUI // Pure SwiftUI shapes
}
@Published var backend: RenderingBackend = .swiftUI
// Sub-controllers
let spriteScene = AvatarSpriteScene(size: CGSize(width: 100, height: 120))
let faceController = FaceSceneController()
let swiftUIAnimator = FaceAnimator()
// Lip sync
private let ttsDriver = TTSVisemeDriver()
private let audioAnalyzer = AudioVisemeAnalyzer()
// State
@Published var currentExpression: Expression = .calm
@Published var isSpeaking = false
init() {
// Wire up TTS viseme callbacks
ttsDriver.onViseme = { [weak self] viseme, duration in
Task { @MainActor in
self?.applyViseme(viseme)
}
}
}
func setExpression(_ expression: Expression) {
currentExpression = expression
switch backend {
case .sceneKit:
faceController.setExpression(expression)
case .spriteKit:
spriteScene.setExpression(expression)
case .swiftUI:
swiftUIAnimator.setExpression(expression)
}
}
func speak(_ text: String) {
isSpeaking = true
setExpression(.speaking)
ttsDriver.speak(text)
}
private func applyViseme(_ viseme: Viseme) {
switch backend {
case .sceneKit:
faceController.setViseme(viseme)
case .spriteKit:
spriteScene.setViseme(viseme)
case .swiftUI:
swiftUIAnimator.currentViseme = viseme
}
}
}
VIKI (Virtual Interactive Kinetic Intelligence) from I, Robot (2004) was created by Digital Domain using a combination of CGI techniques. The face is characterized by:
- Translucent, holographic appearance: the face seems made of light, not solid matter
- Particle/grid substrate: visible geometric structure beneath the face surface
- Edge glow / fresnel lighting: edges of the face glow brighter than the center
- Scanline artifacts: horizontal lines suggesting a display or projection
- Minimal features: eyes, mouth, and face outline are the primary readable elements
- Cool color palette: cyan, blue, white against dark backgrounds
To recreate this in a native macOS app, you don't need a complex 3D pipeline. The aesthetic is actually easier to achieve with 2D techniques:
struct VIKIFaceView: View {
@StateObject private var animator = FaceAnimator()
var body: some View {
TimelineView(.animation) { timeline in
let time = timeline.date.timeIntervalSinceReferenceDate
Canvas { context, size in
let cx = size.width / 2
let cy = size.height / 2
// Background glow
let glowGradient = Gradient(colors: [
.cyan.opacity(0.05),
.clear
])
context.fill(
Path(ellipseIn: CGRect(
x: cx - 30, y: cy - 35,
width: 60, height: 70
)),
with: .radialGradient(
glowGradient,
center: CGPoint(x: cx, y: cy),
startRadius: 10,
endRadius: 40
)
)
// Scanlines
for y in stride(from: 0.0, to: size.height, by: 3) {
let opacity = 0.03 + 0.02 * sin(y * 0.5 + time * 2)
var path = Path()
path.move(to: CGPoint(x: 0, y: y))
path.addLine(to: CGPoint(x: size.width, y: y))
context.stroke(
path,
with: .color(.cyan.opacity(opacity)),
lineWidth: 0.5
)
}
// Face outline: fresnel glow
let faceRect = CGRect(x: cx - 18, y: cy - 22, width: 36, height: 44)
let faceOutline = Path(ellipseIn: faceRect)
context.stroke(
faceOutline,
with: .color(.cyan.opacity(0.4 + 0.1 * sin(time * 1.5))),
lineWidth: 1.5
)
// Grid pattern over face
for gridY in stride(from: faceRect.minY, to: faceRect.maxY, by: 4) {
var line = Path()
line.move(to: CGPoint(x: faceRect.minX + 5, y: gridY))
line.addLine(to: CGPoint(x: faceRect.maxX - 5, y: gridY))
context.stroke(
line,
with: .color(.cyan.opacity(0.08)),
lineWidth: 0.5
)
}
// Eyes
let eyeY = cy - 6
for eyeX in [cx - 7.0, cx + 7.0] {
let openness = animator.eyeOpenness
let eyeWidth: CGFloat = 7
let eyeHeight: CGFloat = 4 * openness
// Eye glow
let eyeRect = CGRect(
x: eyeX - eyeWidth/2,
y: eyeY - eyeHeight/2,
width: eyeWidth,
height: eyeHeight
)
context.fill(
Path(ellipseIn: eyeRect),
with: .color(.cyan.opacity(0.6))
)
// Pupil
if openness > 0.3 {
let pupilSize: CGFloat = 2
let pupilRect = CGRect(
x: eyeX - pupilSize/2 + animator.gazeDirection.x * 1.5,
y: eyeY - pupilSize/2 + animator.gazeDirection.y * 1,
width: pupilSize,
height: pupilSize
)
context.fill(
Path(ellipseIn: pupilRect),
with: .color(.white.opacity(0.9))
)
}
}
// Mouth (viseme-dependent)
drawMouth(
context: &context,
center: CGPoint(x: cx, y: cy + 10),
viseme: animator.currentViseme,
time: time
)
}
.frame(width: 50, height: 60)
}
.onAppear {
animator.startIdleLoop()
}
}
private func drawMouth(
context: inout GraphicsContext,
center: CGPoint,
viseme: Viseme,
time: Double
) {
let (width, height) = viseme.mouthDimensions
var path = Path()
if height < 1 {
// Closed mouth: just a line
path.move(to: CGPoint(x: center.x - width, y: center.y))
path.addLine(to: CGPoint(x: center.x + width, y: center.y))
context.stroke(
path,
with: .color(.cyan.opacity(0.5)),
lineWidth: 1
)
} else {
// Open mouth: ellipse
let rect = CGRect(
x: center.x - width,
y: center.y - height/2,
width: width * 2,
height: height
)
path.addEllipse(in: rect)
context.fill(
path,
with: .color(.cyan.opacity(0.2))
)
context.stroke(
path,
with: .color(.cyan.opacity(0.5)),
lineWidth: 0.8
)
}
}
}
extension Viseme {
/// Mouth dimensions for 2D rendering (width, height in points)
var mouthDimensions: (CGFloat, CGFloat) {
switch self {
case .sil: return (4, 0)
case .pp: return (3, 0)
case .ff: return (4, 1)
case .th: return (4, 1.5)
case .dd: return (4, 2)
case .kk: return (3.5, 2.5)
case .ch: return (5, 2)
case .ss: return (4, 1)
case .nn: return (4, 1.5)
case .rr: return (3, 2)
case .aa: return (5, 4)
case .ee: return (6, 2)
case .ih: return (4.5, 2)
case .oh: return (3.5, 3.5)
case .ou: return (3, 3)
}
}
}
The VIKI aesthetic actually works better at small sizes than photorealistic 3D. The simplified, high-contrast, emissive style reads clearly even at 50x60pt. You lose nothing by going stylized, and you gain readability.
Face Rendering Engines
| Engine | Native macOS | Programmatic Control | Lip Sync | Small Size Performance | License | Best For |
|---|---|---|---|---|---|---|
| SceneKit (Apple) | Yes | Full blend shape API | Manual (you map visemes) | Excellent | Free (Apple SDK) | Production macOS apps with 3D faces |
| RealityKit (Apple) | Partial (visionOS focus) | Limited blend shape API | No built-in | Good | Free (Apple SDK) | AR/VR; overkill for a 2D HUD |
| Three.js via WKWebView | Via WebView | Full morph target API | Via TalkingHead/wawa-lipsync | Moderate (WebGL overhead) | MIT | Rapid prototyping, web-first |
| SpriteKit (Apple) | Yes | Frame-based | Manual sprite swapping | Excellent | Free (Apple SDK) | 2D animated avatars, lowest overhead |
| Live2D Cubism SDK | C++ (no Swift bindings) | Full parameter API | Manual (parameter mapping) | Good | Proprietary ($) | VTuber-style 2D avatars |
| Unity (as library) | Via Unity as a Library | Full API | Via plugins | Heavy (~200MB runtime) | Per-seat license | When you're already using Unity |
| Metal (custom) | Yes | Whatever you build | Whatever you build | Best possible | Free (Apple SDK) | Maximum control, VIKI-style effects |
Lip Sync Solutions
| Solution | Real-Time | Accuracy | Platform | License | Integration Effort |
|---|---|---|---|---|---|
| Rhubarb Lip Sync | No (offline) | High | CLI (macOS/Linux/Win) | MIT | Medium: parse JSON output |
| Azure Speech Viseme API | Yes (streaming) | Highest | Cloud API | Pay-per-use | Low: SDK available |
| AVSpeechSynthesizer | Yes (word-level) | Low (word-level only) | macOS/iOS native | Free | Low: delegate callbacks |
| wawa-lipsync | Yes | Medium | Browser (JS/TS) | MIT | Low (web) / Medium (native) |
| NVIDIA Audio2Face-3D | Yes | Very High | CUDA (no Apple Silicon) | Open source (MIT) | High: CUDA dependency |
| Audio FFT analysis | Yes | Low | Any | N/A (build yourself) | Medium: signal processing |
| Oculus OVR LipSync | Yes | High | Quest/PC (no macOS) | Meta SDK license | N/A for macOS |
Cloud Avatar APIs
| Provider | Real-Time Streaming | Embed Widget | Programmatic Expression | Pricing | Status |
|---|---|---|---|---|---|
| HeyGen | Yes (WebRTC) | Yes (SDK) | Yes (API) | $0.10/min streaming | Active |
| Simli | Yes (WebSocket) | Yes (React SDK) | Yes (audio-to-video) | Pay-per-use | Active |
| D-ID | Yes | Yes (web) | Limited | $0.025/sec | Active |
| Synthesia | No (batch) | No | Via API | Enterprise pricing | Active |
| Soul Machines | Was yes | Was yes | Was full API | N/A | Receivership (Feb 2026) |
| Avaturn | No | Yes (iframe SDK) | Avatar creation only | Paid tiers | Active |
| Ready Player Me | No (model only) | Yes | Avatar creation only | Free tier + paid | Active |
Open Source Avatar Projects
| Project | What It Does | Language | macOS Support | Active |
|---|---|---|---|---|
| TalkingHead | Full 3D avatar with lip sync in browser | JS/TS | Via WebView | Yes |
| Open-LLM-VTuber | AI assistant with Live2D face | Python | Yes (desktop pet mode) | Yes |
| openFACS | FACS-based 3D face animation | Python | Partial | Limited |
| FACSvatar | FACS to avatar pipeline | Python/Unity | Unity only | Limited |
| wawa-lipsync | Real-time browser lip sync | TS | Via WebView | Yes |
| Audio2Face-3D | Audio to blend shapes | C++/CUDA | No (CUDA required) | Yes |
| GLTFKit | GLTF loader for SceneKit/Metal | Obj-C/Swift | Yes | Yes |
| Rhubarb Lip Sync | Offline audio to viseme | C++ | Yes (CLI) | Yes |
| Don't | Do Instead | Why |
|---|---|---|
| Embed Unity for a 50x60pt face | Use SceneKit or SpriteKit | Unity adds ~200MB of runtime, separate rendering pipeline, and app startup overhead for rendering a thumbnail-sized face |
| Use ARKit face tracking to drive a synthetic face | Use blend shapes directly via SCNMorpher | ARKit tracking requires a TrueDepth camera pointed at a human face, but you're generating expressions from code, not tracking them |
| Create 315 pre-composed sprite frames (7 expressions x 15 visemes x 3 blink states) | Use layered sprites with independent mouth/eye/brow layers | Layer composition at runtime reduces asset count from 315 to ~140 and makes adding expressions trivial |
| Set visemes synchronously with TTS word callbacks | Buffer visemes and interpolate between them with a smoothing function | Direct viseme switching creates a "flapping jaw" effect; smoothing makes mouth movement look natural |
| Use photorealistic 3D rendering at small sizes | Use stylized, high-contrast rendering (VIKI aesthetic) | At 100x120 pixels, photorealistic detail is invisible. Stylized faces with emissive edges and clean shapes read better at low resolution |
| Run lip sync at 60fps | Run lip sync at 15-24fps, render at 30fps | Human speech produces about 10-15 visemes per second. 60fps viseme updates waste compute with no visual benefit |
| Put the avatar renderer in the SwiftUI body diff cycle | Use NSViewRepresentable or SpriteView with imperative updates | SwiftUI will recompute the body on every state change. SceneKit and SpriteKit have their own render loops β let them run independently |
| Use WKWebView for the primary notch-sized avatar | Use WKWebView only for larger avatar displays (panels, full-screen) | WKWebView runs in a separate process with IPC overhead. For a 50x60pt view thatβs always visible, native rendering is dramatically more efficient |
| Apply expression and lip sync to the same blend shapes without layering | Use a priority-based layer system where lip sync overrides only mouth shapes | Without layering, setting "alarmed" (furrowed brow + frown) while speaking will fight with viseme mouth shapes, causing jitter |
| Generate blend shape models from scratch | Start with Ready Player Me GLB export with ?morphTargets=ARKit | RPM gives you a properly rigged model with ARKit-compatible blend shapes for free. You can customize the appearance later |
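The layering row in the table above can be sketched in a few lines of Swift. This is a minimal illustration, not the article's implementation: `FaceLayer`, `LayerCompositor`, and the `mouthKeys` subset are hypothetical names, and the key strings are a small sample of ARKit blend shape names. The idea is that lip sync is only allowed to write mouth keys, so brows and eyes set by an expression survive while speaking.

```swift
import Foundation

/// Hypothetical layer priorities: a higher raw value is applied later.
enum FaceLayer: Int, CaseIterable {
    case idle = 0, expression = 1, lipSync = 2
}

struct LayerCompositor {
    /// Keys the lip sync layer owns (illustrative subset of ARKit names).
    static let mouthKeys: Set<String> = [
        "jawOpen", "mouthFunnel", "mouthPucker",
        "mouthFrownLeft", "mouthFrownRight",
        "mouthSmileLeft", "mouthSmileRight"
    ]

    var layers: [FaceLayer: [String: Double]] = [:]

    mutating func set(_ weights: [String: Double], for layer: FaceLayer) {
        layers[layer] = weights
    }

    /// Merge layers from lowest to highest priority.
    func composite() -> [String: Double] {
        var result: [String: Double] = [:]
        for layer in FaceLayer.allCases.sorted(by: { $0.rawValue < $1.rawValue }) {
            guard let weights = layers[layer] else { continue }
            if layer == .lipSync {
                // While speaking, lip sync owns every mouth key: zero them
                // first so an expression's frown cannot fight a viseme.
                for key in Self.mouthKeys { result[key] = 0 }
            }
            for (key, value) in weights {
                // Lip sync is not allowed to touch brows, eyes, etc.
                if layer == .lipSync && !Self.mouthKeys.contains(key) { continue }
                result[key] = min(max(value, 0), 1)
            }
        }
        return result
    }
}

var compositor = LayerCompositor()
compositor.set(["browDownLeft": 0.8, "mouthFrownLeft": 0.6], for: .expression)
compositor.set(["jawOpen": 0.5, "browDownLeft": 0.0], for: .lipSync)
let composed = compositor.composite()
```

In this example the furrowed brow from the expression layer is kept, the frown is cleared, and `jawOpen` comes from lip sync. Removing the lip sync layer when speech ends would let the expression's mouth shapes return.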
Based on the analysis above, here is the recommended implementation path for a notch-HUD AI assistant avatar:
Phase 1: Pure SwiftUI (Ship Fast)
Use the Canvas-based VIKI aesthetic approach. No external dependencies, no 3D models needed, renders in the existing SwiftUI HUD with zero integration friction.
┌───────────────────────────────────────────────┐
│  StatusWatcher ──→ JaneController             │
│                          │                    │
│                ┌─────────┴─────────┐          │
│                │                   │          │
│           Expression        TTSVisemeDriver   │
│          State Machine             │          │
│                │                   │          │
│                └─────────┬─────────┘          │
│                          │                    │
│                    FaceAnimator               │
│                          │                    │
│                VIKIFaceView (Canvas)          │
│                  50x60pt in notch             │
└───────────────────────────────────────────────┘
Effort: 2-3 days. Ship it.
Phase 2: SceneKit Upgrade (When Avatar Goes Larger)
When you add a floating panel or full-screen mode, upgrade to SceneKit with a Ready Player Me model. The existing FaceAnimator state feeds into FaceSceneController blend shapes.
Phase 3: TalkingHead WebView (Rich Interactions)
For a future chat/conversation mode where Jane appears full-screen, embed TalkingHead via WKWebView. This gives you the full Ready Player Me ecosystem, Mixamo animations, and real-time lip sync without building the 3D pipeline yourself.
The Control Surface (Shared Across All Phases)
protocol JaneFace {
func setExpression(_ expression: Expression, duration: TimeInterval)
func setViseme(_ viseme: Viseme)
func setGaze(_ direction: CGPoint)
func blink()
func startIdle()
func stopIdle()
}
// Each rendering backend conforms to JaneFace
// The JaneController dispatches to whichever backend is active
This protocol remains stable across all three phases. The rendering backend is a swappable implementation detail.
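To make the swappable-backend claim concrete, here is a small sketch of a backend that records calls instead of drawing, plus a dispatcher that forwards to whichever backend is active. `RecordingFace` and `FaceDispatcher` are hypothetical names, and the `Expression` and `Viseme` enums are cut-down stand-ins so the block compiles on its own. The protocol is repeated with an `AnyObject` constraint, which is an assumption of this sketch.

```swift
import Foundation

// Stand-in types so the sketch compiles alone; the article's real
// Expression and Viseme enums have more cases.
enum Expression { case neutral, speaking, alarmed }
enum Viseme { case sil, aa, oh }

protocol JaneFace: AnyObject {
    func setExpression(_ expression: Expression, duration: TimeInterval)
    func setViseme(_ viseme: Viseme)
    func setGaze(_ direction: CGPoint)
    func blink()
    func startIdle()
    func stopIdle()
}

/// Hypothetical test backend: logs every call instead of rendering.
final class RecordingFace: JaneFace {
    private(set) var log: [String] = []
    func setExpression(_ expression: Expression, duration: TimeInterval) {
        log.append("expression:\(expression)")
    }
    func setViseme(_ viseme: Viseme) { log.append("viseme:\(viseme)") }
    func setGaze(_ direction: CGPoint) { log.append("gaze") }
    func blink() { log.append("blink") }
    func startIdle() { log.append("startIdle") }
    func stopIdle() { log.append("stopIdle") }
}

/// Hypothetical controller: forwards to the active backend and hands
/// the idle loop over when the backend is swapped.
final class FaceDispatcher {
    var backend: JaneFace {
        didSet { oldValue.stopIdle(); backend.startIdle() }
    }

    init(backend: JaneFace) {
        self.backend = backend
        backend.startIdle()
    }

    func say(_ visemes: [Viseme]) {
        backend.stopIdle()
        backend.setExpression(.speaking, duration: 0.2)
        for viseme in visemes { backend.setViseme(viseme) }
        backend.setViseme(.sil)   // close the mouth when speech ends
        backend.startIdle()
    }
}

let canvasFace = RecordingFace()
let dispatcher = FaceDispatcher(backend: canvasFace)
dispatcher.say([.aa, .oh])

let sceneKitFace = RecordingFace()
dispatcher.backend = sceneKitFace   // swap backends at runtime
```

A recording backend like this is also handy for unit-testing the expression state machine without any rendering.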
These are ML models that generate video of a talking face from audio input. They're fundamentally different from the real-time rendering approaches above: they produce video frames, not controllable 3D state.
SadTalker
SadTalker generates talking head video from a single image + audio. It produces 3DMM motion coefficients (head pose + expression) and renders them into video frames.
- Can it run on Apple Silicon? Yes, via PyTorch with MPS backend. Inference is not real-time (~5-10 seconds per second of video on M1/M2).
- Can it render in SwiftUI? Not directly: it produces video files, not real-time frames.
- Is it useful for Jane? No for real-time. Potentially yes for pre-generating avatar video clips that play during long responses.
Wav2Lip
Wav2Lip modifies existing video to match new audio. It's a lip-sync correction tool, not a face generator.
- Real-time? No. Produces video files.
- Useful for Jane? No.
NVIDIA Audio2Face-3D
Audio2Face-3D is the most relevant ML model. It takes audio input and outputs blend shape weights (not video frames), which can drive any 3D face. It was open-sourced in September 2025.
- Can it run on Apple Silicon? Not natively: the SDK requires CUDA/TensorRT. However, the ONNX models could theoretically be converted to Core ML using coremltools. This is untested but architecturally feasible.
- Output format: 52 ARKit-compatible blend shape weights per frame + emotion labels.
- Is it useful for Jane? Potentially the best lip sync solution if the Core ML conversion works. Whether the converted model is small and fast enough for real-time inference on the Neural Engine would have to be measured.
Key insight: Audio2Face-3D is the bridge between ML lip sync and real-time 3D rendering. Unlike SadTalker/Wav2Lip (which produce video), Audio2Face outputs blend shape weights that feed directly into SceneKit. If someone ports it to Core ML, it becomes the definitive lip sync solution for macOS avatars.
ARKit Blend Shapes (52 locations)
The de facto standard. Full documentation. Every major avatar platform supports these names.
Categories:
- Eyes (14): blink, look direction, squint, wide for each eye
- Brows (5): inner up (a single shape), plus outer up and down for each brow
- Jaw (4): open, forward, left, right
- Mouth (23): close, funnel, pucker, left, right, smile, frown, dimple, stretch, roll, shrug, press, lower down, upper up
- Cheek (3): puff, squint left, squint right
- Nose (2): sneer left, sneer right
- Tongue (1): out
- Azure extension (3, in addition to the 52): head roll, left eye roll, right eye roll
FACS (Facial Action Coding System)
The scientific standard: 46 Action Units (AUs) describing individual muscle movements. More granular than ARKit but less directly supported by avatar platforms. The FACS references listed at the end of this article give the full AU catalogue.
Mapping FACS AUs to ARKit blend shapes:
| FACS AU | Name | ARKit Blend Shape |
|---|---|---|
| AU1 | Inner Brow Raise | browInnerUp |
| AU2 | Outer Brow Raise | browOuterUpLeft/Right |
| AU4 | Brow Lowerer | browDownLeft/Right |
| AU5 | Upper Lid Raise | eyeWideLeft/Right |
| AU6 | Cheek Raise | cheekSquintLeft/Right |
| AU7 | Lid Tightener | eyeSquintLeft/Right |
| AU9 | Nose Wrinkle | noseSneerLeft/Right |
| AU10 | Upper Lip Raise | mouthUpperUpLeft/Right |
| AU12 | Lip Corner Pull (smile) | mouthSmileLeft/Right |
| AU15 | Lip Corner Depressor (frown) | mouthFrownLeft/Right |
| AU17 | Chin Raise | mouthShrugLower |
| AU20 | Lip Stretch | mouthStretchLeft/Right |
| AU23 | Lip Tightener | mouthPressLeft/Right |
| AU25 | Lips Part | jawOpen (partial) |
| AU26 | Jaw Drop | jawOpen |
| AU28 | Lip Suck | mouthRollLower/Upper |
| AU45 | Blink | eyeBlinkLeft/Right |
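The table above translates directly into a lookup. The sketch below covers a subset of rows and assumes AU intensities are already normalized to 0...1. `facsToARKit`, `partialScale`, and `blendShapeWeights(fromActionUnits:)` are hypothetical names, and the 0.3 scale for AU25 ("jawOpen (partial)") is an arbitrary illustrative value.

```swift
import Foundation

/// Subset of the AU-to-ARKit table above, keyed by AU number.
let facsToARKit: [Int: [String]] = [
    1: ["browInnerUp"],
    2: ["browOuterUpLeft", "browOuterUpRight"],
    4: ["browDownLeft", "browDownRight"],
    12: ["mouthSmileLeft", "mouthSmileRight"],
    15: ["mouthFrownLeft", "mouthFrownRight"],
    25: ["jawOpen"],
    26: ["jawOpen"],
    45: ["eyeBlinkLeft", "eyeBlinkRight"]
]

/// AUs that only partially drive their target shape (assumed value).
let partialScale: [Int: Double] = [25: 0.3]

/// Convert AU intensities (assumed 0...1) into blend shape weights.
/// When two AUs drive the same shape, the larger weight wins.
func blendShapeWeights(fromActionUnits aus: [Int: Double]) -> [String: Double] {
    var weights: [String: Double] = [:]
    for (au, intensity) in aus {
        guard let shapes = facsToARKit[au] else { continue }  // unmapped AU
        let weight = min(max(intensity, 0), 1) * (partialScale[au] ?? 1)
        for shape in shapes {
            weights[shape] = max(weights[shape] ?? 0, weight)
        }
    }
    return weights
}

let smileAndFrownBrow = blendShapeWeights(fromActionUnits: [12: 0.5, 4: 1.5])
let jaw = blendShapeWeights(fromActionUnits: [25: 1.0, 26: 0.2])
```

Taking the maximum when AUs collide is one simple policy; summing and clamping is another reasonable choice.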
OVR Visemes (15 mouth shapes)
Defined by Meta/Oculus, these are the standard for lip sync in VR/game engines. Ready Player Me supports both ARKit and OVR viseme morph targets.
VTube Studio API Parameters
VTube Studio uses a WebSocket API on ws://localhost:8001. Parameters map to Live2D model parameters rather than blend shapes. You can mix face-tracking values with API-driven values using a weight parameter (0-1).
// VTube Studio WebSocket message to set a parameter
interface VTSParameterMessage {
apiName: 'VTubeStudioPublicAPI';
apiVersion: '1.0';
requestID: string;
messageType: 'InjectParameterDataRequest';
data: {
faceFound: boolean;
mode: 'set' | 'add';
parameterValues: Array<{
id: string;
weight?: number; // 0-1, blend with face tracking
value: number;
}>;
};
}
// Example: inject MouthOpen = 0.8 at weight 0.5, so the injected value and face tracking are blended equally
const message: VTSParameterMessage = {
apiName: 'VTubeStudioPublicAPI',
apiVersion: '1.0',
requestID: 'req-001',
messageType: 'InjectParameterDataRequest',
data: {
faceFound: false,
mode: 'set',
parameterValues: [
{ id: 'MouthOpen', weight: 0.5, value: 0.8 },
{ id: 'MouthSmile', value: 0.3 },
{ id: 'EyeOpenLeft', value: 1.0 },
{ id: 'EyeOpenRight', value: 1.0 },
],
},
};
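Since the rest of the app is Swift, the same message can be modelled with `Codable`. This is a sketch that mirrors only the fields shown in the TypeScript interface above. The struct names and `makeInjectRequest` are hypothetical, and the field names should be checked against the VTube Studio API documentation before relying on them.

```swift
import Foundation

struct VTSParameterValue: Codable, Equatable {
    let id: String
    var weight: Double? = nil   // left out of the JSON when nil
    let value: Double
}

struct VTSInjectData: Codable, Equatable {
    let faceFound: Bool
    let mode: String            // "set" or "add"
    let parameterValues: [VTSParameterValue]
}

struct VTSInjectRequest: Codable, Equatable {
    var apiName = "VTubeStudioPublicAPI"
    var apiVersion = "1.0"
    let requestID: String
    var messageType = "InjectParameterDataRequest"
    let data: VTSInjectData
}

/// Build the JSON payload for one parameter injection.
func makeInjectRequest(id: String, values: [VTSParameterValue]) throws -> Data {
    let request = VTSInjectRequest(
        requestID: id,
        data: VTSInjectData(faceFound: false, mode: "set", parameterValues: values)
    )
    return try JSONEncoder().encode(request)
}

let payload = try! makeInjectRequest(id: "req-001", values: [
    VTSParameterValue(id: "MouthOpen", weight: 0.5, value: 0.8),
    VTSParameterValue(id: "MouthSmile", value: 0.3)
])
```

The resulting `Data` can be converted to a string and sent as a text frame with `URLSessionWebSocketTask`.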
The Live2D Cubism SDK for Native is a C++ library that renders Live2D models. It's the engine behind VTuber avatars and VTube Studio.
macOS Support
The SDK supports macOS with OpenGL rendering. As of Cubism 5 SDK, Apple Silicon is supported. However, there are no official Swift bindings, so you'd need to create an Objective-C++ bridge.
Programmatic Control
Live2D models are controlled via named parameters (not blend shapes). Each model defines its own parameter set, but common ones include:
ParamAngleX: Head rotation X (-30 to 30 degrees)
ParamAngleY: Head rotation Y (-30 to 30 degrees)
ParamAngleZ: Head rotation Z (-30 to 30 degrees)
ParamEyeLOpen: Left eye open (0 to 1)
ParamEyeROpen: Right eye open (0 to 1)
ParamEyeBallX: Eye gaze X (-1 to 1)
ParamEyeBallY: Eye gaze Y (-1 to 1)
ParamBrowLY: Left brow position (-1 to 1)
ParamBrowRY: Right brow position (-1 to 1)
ParamMouthOpenY: Mouth open (0 to 1)
ParamMouthForm: Mouth shape (-1 to 1, frown to smile)
Integration Approach for macOS
// Conceptual C++ integration; would need an Obj-C++ bridge to Swift
#include "CubismFramework.hpp"
#include "Model/CubismUserModel.hpp"
class JaneLive2DModel : public Csm::CubismUserModel {
public:
void SetExpression(const char* expressionName) {
// Load and apply a .exp3.json expression file
auto expression = LoadExpression(expressionName);
if (expression) {
GetExpressionManager()->StartMotion(expression, false, 1.0f);
}
}
void SetMouthOpen(float value) {
// Direct parameter manipulation
auto model = GetModel();
auto paramIndex = model->GetParameterIndex(
Csm::CubismFramework::GetIdManager()->GetId("ParamMouthOpenY")
);
model->SetParameterValue(paramIndex, value);
}
void SetEyeBlink(float leftOpen, float rightOpen) {
auto model = GetModel();
auto leftIndex = model->GetParameterIndex(
Csm::CubismFramework::GetIdManager()->GetId("ParamEyeLOpen")
);
auto rightIndex = model->GetParameterIndex(
Csm::CubismFramework::GetIdManager()->GetId("ParamEyeROpen")
);
model->SetParameterValue(leftIndex, leftOpen);
model->SetParameterValue(rightIndex, rightOpen);
}
};
Should You Use Live2D for Jane?
Probably not. Live2D's strengths (deformable 2D meshes, anime-style art, VTuber ecosystem) don't align with the VIKI aesthetic. The C++ SDK requires significant bridging work for Swift integration. If you wanted an anime-style avatar, Live2D + VTube Studio API would be the right choice, but for a holographic AI face, native Apple frameworks give you more control with less friction.
Open-LLM-VTuber is the most complete open-source AI assistant with a visual avatar. It supports:
- Any LLM backend (Ollama, OpenAI, Claude, Gemini, etc.)
- Live2D avatar with lip sync
- Voice interaction with interruption
- Desktop pet mode (floating avatar on screen)
- Vision capability (can see your screen)
- Runs on macOS with Apple Silicon
It's written in Python with a web frontend, so it's not directly embeddable in a Swift app. But it demonstrates the full pipeline:
Audio Input → Speech Recognition → LLM → TTS → Lip Sync → Live2D Render
If you're prototyping the Jane concept and want to validate the user experience before building native, running Open-LLM-VTuber in desktop pet mode is the fastest path to a working demo.
CPU/GPU Budget
The notch HUD runs continuously as an always-visible overlay. Avatar rendering must be nearly invisible in Activity Monitor.
| Approach | CPU Impact | GPU Impact | Memory | Verdict |
|---|---|---|---|---|
| SwiftUI Canvas at 30fps | ~0.5% | ~0.2% | ~5MB | Best |
| SpriteKit at 30fps | ~0.3% | ~0.3% | ~8MB (textures) | Excellent |
| SceneKit at 30fps | ~1% | ~1% | ~15MB (3D model) | Good |
| WKWebView + Three.js | ~3-5% | ~2-3% | ~80-120MB | Acceptable for panels |
| Metal custom shader | ~0.2% | ~0.5% | ~5MB | Best (most work) |
Rendering Strategy
// Adaptive frame rate based on activity
import SceneKit

final class AdaptiveRenderer {
    enum ActivityLevel {
        case idle        // Blinks only: render at 10fps
        case active      // Expression changes: render at 24fps
        case speaking    // Lip sync: render at 30fps
        case transition  // Expression transition: render at 60fps
    }

    /// The view whose frame rate is adjusted; weak to avoid a retain cycle.
    weak var scnView: SCNView?

    var activityLevel: ActivityLevel = .idle {
        didSet { updateFrameRate() }
    }

    init(scnView: SCNView? = nil) {
        self.scnView = scnView
        updateFrameRate()  // didSet does not fire for the initial value
    }

    private func updateFrameRate() {
        let fps: Int
        switch activityLevel {
        case .idle: fps = 10
        case .active: fps = 24
        case .speaking: fps = 30
        case .transition: fps = 60
        }
        scnView?.preferredFramesPerSecond = fps
    }
}
At .idle, the avatar only needs to render when a blink happens (~once every 3-5 seconds). Between blinks, you can skip rendering entirely. This drops the CPU/GPU cost to effectively zero during idle periods.
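One way to express "render only while a blink is happening" is a small scheduler that the view queries each tick. The sketch below is an assumption-laden illustration: `IdleBlinkScheduler` is a hypothetical type, the 0.15 s blink length is a guess, and the 3-5 s interval comes from the paragraph above.

```swift
import Foundation

/// Hypothetical idle scheduler: the view asks whether a frame is needed.
struct IdleBlinkScheduler {
    let blinkDuration: TimeInterval = 0.15          // assumed blink length
    let intervalRange: ClosedRange<TimeInterval> = 3...5
    private(set) var nextBlinkStart: TimeInterval

    init(now: TimeInterval) {
        nextBlinkStart = now + 3
    }

    /// True only while a blink is in progress. Otherwise the last frame
    /// can stay on screen and no redraw is needed.
    func needsRender(at time: TimeInterval) -> Bool {
        time >= nextBlinkStart && time < nextBlinkStart + blinkDuration
    }

    /// Eye openness in 0...1: closes over the first half of the blink,
    /// reopens over the second half.
    func eyeOpenness(at time: TimeInterval) -> Double {
        guard needsRender(at: time) else { return 1 }
        let phase = (time - nextBlinkStart) / blinkDuration
        return abs(2 * phase - 1)
    }

    /// Call once a blink has finished to schedule the next one.
    mutating func scheduleNext<G: RandomNumberGenerator>(
        after time: TimeInterval, using rng: inout G
    ) {
        nextBlinkStart = time + TimeInterval.random(in: intervalRange, using: &rng)
    }
}

var scheduler = IdleBlinkScheduler(now: 0)
var rng = SystemRandomNumberGenerator()
let idleFrameNeeded = scheduler.needsRender(at: 1.0)     // between blinks
let blinkFrameNeeded = scheduler.needsRender(at: 3.05)   // mid-blink
scheduler.scheduleNext(after: 3.15, using: &rng)
```

Passing the random number generator in keeps the scheduler deterministic under test, where a seeded generator can replace `SystemRandomNumberGenerator`.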
For scenarios where you want the richness of the Three.js ecosystem without building everything natively, a hybrid approach works well:
class HybridAvatarView: NSView {
private let nativeOverlay = VIKIFaceLayer() // Native 2D for notch
private lazy var webAvatar: WKWebView = { // Web 3D for expanded view
// Only created when needed
return createWebView()
}()
enum DisplayMode {
case notch // 50x60pt: use native rendering
case expanded // 300x400pt: use web rendering
case fullscreen // Full screen: use web rendering
}
var displayMode: DisplayMode = .notch {
didSet { updateRendering() }
}
private func updateRendering() {
switch displayMode {
case .notch:
webAvatar.isHidden = true
nativeOverlay.isHidden = false
case .expanded, .fullscreen:
nativeOverlay.isHidden = true
webAvatar.isHidden = false
// Ensure web avatar is loaded
if webAvatar.url == nil {
loadWebAvatar()
}
}
}
}
This gives you the best of both worlds:
- Native rendering for the always-visible notch (minimal resource usage)
- Full Three.js + TalkingHead for expanded/fullscreen modes (maximum capability)
- A shared control protocol (JaneFace) that works with both backends
The cloud APIs (HeyGen, Simli, D-ID) are designed for a different use case: generating avatar video at scale for marketing, training, and customer service. They're not ideal for a desktop AI assistant because:
- Latency: Even streaming APIs add 200-500ms of latency. For a face that reacts to system state, this delay is noticeable.
- Cost: At $0.10/minute (HeyGen), continuous rendering costs $6/hour, or about $144 for a face that stays on screen 24 hours a day.
- Dependency: Your avatar doesn't work offline or when the API is down.
- Control: You can't set arbitrary blend shapes. You send text or audio and get video back.
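The cost bullet is plain multiplication, and it is worth running for your own usage pattern. The sketch below uses the $0.10/minute HeyGen rate quoted above and works in whole cents to avoid floating-point rounding. `streamingCostCents` is a hypothetical helper, and the 8-hour case is an added illustration.

```swift
import Foundation

/// Streaming cost in whole cents for a per-minute rate and a duration.
func streamingCostCents(centsPerMinute: Int, hours: Int) -> Int {
    centsPerMinute * 60 * hours
}

let alwaysOnCents = streamingCostCents(centsPerMinute: 10, hours: 24)
let workdayCents = streamingCostCents(centsPerMinute: 10, hours: 8)
print("24h: $\(alwaysOnCents / 100), 8h: $\(workdayCents / 100)")
```

Even limited to a working day, an always-visible cloud avatar costs tens of dollars per day per user, which is why the notch use case favors local rendering.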
When they ARE useful:
- Pre-generating video clips for specific responses ("Here's what I found...")
- Building a web-based demo of the assistant concept
- When photorealistic quality is essential and cost isn't a concern
HeyGen Streaming Avatar (For Reference)
// HeyGen Streaming Avatar SDK, embeddable in a web page
// https://docs.heygen.com/docs/streaming-avatar-sdk
import StreamingAvatar, {
AvatarQuality,
StreamingEvents,
TaskType,
} from '@heygen/streaming-avatar';
const avatar = new StreamingAvatar({
token: 'your-access-token',
});
// Start a streaming session
const session = await avatar.createStartAvatar({
quality: AvatarQuality.Medium,
avatarName: 'default',
language: 'en',
});
// Make the avatar speak
await avatar.speak({
text: 'Hello, I am Jane, your AI assistant.',
taskType: TaskType.TALK,
});
// Listen for events
avatar.on(StreamingEvents.STREAM_READY, (event) => {
// event.detail contains the MediaStream
// Attach to a <video> element
const videoEl = document.getElementById('avatar-video');
videoEl.srcObject = event.detail;
});
// Change avatar expression (limited control)
await avatar.speak({
text: '<break time="500ms"/> I noticed something concerning.',
taskType: TaskType.TALK,
});
Simli Real-Time Avatar
// Simli WebSocket API: audio in, video out
// https://docs.simli.com/
const ws = new WebSocket('wss://api.simli.com/ws');
// Initialize session once the socket is open; calling send() before that throws
ws.onopen = () => {
  ws.send(JSON.stringify({
    type: 'init',
    apiKey: 'your-api-key',
    faceId: 'default-female',
    audioFormat: 'pcm_16000',
  }));
};
// Send audio chunks, receive video frames
ws.onmessage = (event) => {
if (event.data instanceof Blob) {
// Video frame: render to canvas
renderFrame(event.data);
} else {
const msg = JSON.parse(event.data);
if (msg.type === 'ready') {
// Start sending audio
startAudioStream(ws);
}
}
};
function startAudioStream(ws: WebSocket) {
// Capture or generate audio and send as PCM chunks
const audioContext = new AudioContext({ sampleRate: 16000 });
// ... audio processing pipeline
}
Here's how all the pieces connect for the complete lip sync pipeline:
┌──────────────────────────────────────────────────────────────┐
│         Text Input ("Hello, how are you?")                   │
│                          │                                   │
│                ┌─────────┴─────────┐                         │
│                │                   │                         │
│          ┌─────▼──────┐    ┌───────▼────────┐                │
│          │    TTS     │    │ Text-to-Phoneme│                │
│          │   Engine   │    │ (for viseme    │                │
│          │            │    │  pre-compute)  │                │
│          └─────┬──────┘    └───────┬────────┘                │
│                │                   │                         │
│          Audio Stream      Phoneme Timeline                  │
│                │                   │                         │
│          ┌─────▼──────┐    ┌───────▼────────┐                │
│          │   Audio    │    │  Phoneme-to-   │                │
│          │  Analysis  │    │  Viseme Map    │                │
│          │   (FFT)    │    │  (OVR 15-set)  │                │
│          └─────┬──────┘    └───────┬────────┘                │
│                │                   │                         │
│             Energy/          Timed Viseme                    │
│            Frequency           Sequence                      │
│                │                   │                         │
│          ┌─────▼───────────────────▼────────┐                │
│          │       Viseme Interpolator        │                │
│          │   (smoothing + coarticulation)   │                │
│          └────────────────┬─────────────────┘                │
│                           │                                  │
│                  Blend Shape Weights                         │
│                      (per frame)                             │
│                           │                                  │
│          ┌────────────────▼─────────────────┐                │
│          │         Layer Compositor         │                │
│          │    Expression + Viseme + Idle    │                │
│          └────────────────┬─────────────────┘                │
│                           │                                  │
│                  Final Blend Shapes                          │
│                           │                                  │
│          ┌────────────────▼─────────────────┐                │
│          │             Renderer             │                │
│          │ (SceneKit / SpriteKit / Canvas)  │                │
│          └──────────────────────────────────┘                │
└──────────────────────────────────────────────────────────────┘
The dual-path approach (audio analysis + text-to-phoneme) provides both real-time responsiveness and accuracy:
- Audio FFT gives you immediate mouth movement that tracks speech energy: the mouth opens and closes in sync with sound, even if the exact shape is wrong.
- Text-to-phoneme gives you the correct viseme sequence: the mouth makes the right shapes, but with a slight delay since you need to pre-process the text.
Blending both paths gives you natural-looking lip sync: the overall mouth movement is driven by audio energy (immediate), while the specific mouth shapes are refined by the phoneme timeline (delayed by ~100ms).
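A minimal sketch of that blend, under stated assumptions: the phoneme path supplies a target mouth shape (width and height, as in `mouthDimensions` earlier), the audio path supplies a normalized energy value, and exponential smoothing stands in for a full coarticulation model. `MouthShape`, `DualPathMouth`, and the 0.5 smoothing factor are hypothetical.

```swift
import Foundation

/// Mouth shape as (width, height) in points.
struct MouthShape: Equatable {
    var width: Double
    var height: Double
}

struct DualPathMouth {
    var current = MouthShape(width: 4, height: 0)   // start closed
    /// Fraction of the remaining distance covered per update (assumed).
    var smoothing = 0.5

    /// - Parameters:
    ///   - target: shape from the phoneme timeline (correct but delayed)
    ///   - energy: audio energy normalized to 0...1 (immediate)
    @discardableResult
    mutating func update(target: MouthShape, energy: Double) -> MouthShape {
        let e = min(max(energy, 0), 1)
        // Audio energy gates the opening: silence closes the mouth even
        // if the phoneme timeline still holds an open viseme.
        let gatedHeight = target.height * e
        // Exponential smoothing avoids hard switches between shapes.
        current.width += (target.width - current.width) * smoothing
        current.height += (gatedHeight - current.height) * smoothing
        return current
    }
}

var mouth = DualPathMouth()
let aa = MouthShape(width: 5, height: 4)               // ".aa" from the viseme table
let first = mouth.update(target: aa, energy: 1.0)      // (4.5, 2.0)
let second = mouth.update(target: aa, energy: 1.0)     // (4.75, 3.0)
let silent = mouth.update(target: aa, energy: 0.0)     // height decays to 1.5
```

Calling `update` at the lip sync rate (15-24 times per second, per the earlier table) and letting the renderer read `current` keeps the two rates decoupled.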
If you want a custom 3D face (not a Ready Player Me avatar), here's the pipeline:
Blender Workflow
1. Model the base mesh: a simple face with ~2000-5000 vertices is sufficient for a small display. More geometry gives smoother blend shapes but costs more to render.
2. Create blend shape targets: in Blender, these are called "Shape Keys." You need at minimum:
   - 15 OVR viseme shapes (mouth positions)
   - 2 eye blink shapes (left + right)
   - 4 eye look shapes (up/down/left/right or in/out per eye)
   - 4-6 brow shapes (inner up, outer up, down per side)
   - That's ~27 shapes minimum
3. Name them with ARKit conventions: use the names from ARFaceAnchor.BlendShapeLocation exactly. When you export as GLTF/GLB, these names carry through.
4. Export as GLTF or USDZ: SceneKit loads USDZ natively, while GLTF/GLB needs a loader such as GLTFKit. USDZ is Apple's native format and loads fastest.
Using the ARKit Blendshape Helper
The ARKit Blendshape Helper Blender addon generates all 52 ARKit blend shapes as shape keys. Install it in Blender, select your face mesh, and it creates target shapes that you can then sculpt to match your face design.
Minimum Viable Face Model
For the VIKI aesthetic, you don't need realistic skin, hair, or teeth. A stylized geometric face works better:
Vertex count: ~1500-3000
Blend shapes: 27 minimum (visemes + blink + gaze + brows)
Materials: 1 (emissive/translucent shader)
Textures: 0 (procedural in Metal shader)
File size: ~200KB as USDZ
Compare this to a Ready Player Me avatar (~5-15MB) or a MetaHuman (~500MB+). For a 50x60pt display, the minimal model is the right choice.
| Approach | Development Time | Quality at 50x60pt | Quality at Full Screen | Resource Usage | Offline | Customization |
|---|---|---|---|---|---|---|
| Pure SwiftUI Canvas (VIKI style) | 2-3 days | Excellent | Good (stylized) | Negligible | Yes | Full |
| SpriteKit layered | 3-5 days + art | Excellent | Limited | Very Low | Yes | Need artist |
| SceneKit + custom model | 1-2 weeks | Good | Excellent | Low | Yes | Full |
| SceneKit + RPM avatar | 3-5 days | Good | Good | Low | Yes | RPM ecosystem |
| WKWebView + TalkingHead | 2-3 days | Moderate | Excellent | Moderate | Yes | Full |
| HeyGen Streaming | 1 day | N/A | Photorealistic | API call | No | Limited |
| Open-LLM-VTuber | 1 day (config) | Good (Live2D) | Good (Live2D) | Moderate | Yes | Community models |
For the Jane notch HUD avatar:
1. Start with the VIKI Canvas approach (Pattern 4 + VIKI section). It ships fast, looks great at small sizes, and has zero dependencies. The holographic aesthetic is distinctive and reads well at 100x120 pixels.
2. Add the TTS viseme pipeline (Pattern 5, Approach A or D). Start with the simple text-to-viseme mapper for AVSpeechSynthesizer. Upgrade to Rhubarb for pre-recorded audio or audio FFT for real-time analysis.
3. Define the JaneFace protocol now, even if you only implement one backend. This keeps your options open for SceneKit, TalkingHead, or any future rendering approach.
4. Consider Open-LLM-VTuber for rapid prototyping of the overall assistant experience, even if the final avatar will be native Swift.
5. Ignore Unity, Unreal, cloud avatar APIs, and talking head ML models for the notch use case. They solve different problems at different scales.
The face is the easy part. The hard part is what it expresses and when. Build the expression state machine first, render it second.
Apple Frameworks & Documentation
- SCNMorpher - Apple Developer Documentation - SceneKit blend shape controller for morph target animation
- ARFaceAnchor.BlendShapeLocation - Apple Developer Documentation - The 52 ARKit blend shape definitions (de facto industry standard)
- SCNView.preferredFramesPerSecond - Apple Developer Documentation - Frame rate control for SceneKit views
- Animating SceneKit Content - Apple Developer Documentation - Official guide to SceneKit animation
- AVSpeechSynthesizer - Apple Developer Documentation - Apple's text-to-speech engine
- AVSpeechSynthesizer.MarkerCallback - Apple Developer Documentation - Callbacks for speech markers during synthesis
- speechSynthesizer(_:willSpeakRangeOfSpeechString:utterance:) - Apple Developer Documentation - Word-level timing callbacks
- SpriteView - Apple Developer Documentation - SwiftUI integration for SpriteKit scenes
- Extend Speech Synthesis with personal and custom voices - WWDC23 - Apple's speech synthesis extensions including SSML and Personal Voice
Blend Shapes & Face Animation Standards
- ARKit 52 Facial Blendshapes: The Ultimate Guide - Anatomy reference for artists creating ARKit-compatible blend shapes
- ARKit Face Blendshapes Interactive Reference - Interactive visualization of all 52 ARKit blend shape locations
- Facial Action Coding System - Wikipedia - Overview of FACS, the scientific standard for facial expression encoding
- Facial Action Coding System - Paul Ekman Group - Official FACS reference from its creator
- FACS - Carnegie Mellon University - Academic reference for the Facial Action Coding System
Lip Sync Tools & Libraries
- Rhubarb Lip Sync - GitHub - Command-line tool for generating 2D mouth animation from voice recordings (MIT license)
- Oculus OVR LipSync Viseme Reference - Meta Developer Docs - The 15-viseme standard used across VR and game engines
- VRChat Visemes Reference - Practical viseme documentation for avatar animation
- Oculus OVR LipSync Morph Targets - Ready Player Me - How RPM avatars expose OVR viseme blend shapes
- Azure Speech Service - Get Facial Position with Viseme - Microsoft's viseme API with 55 blend shape output
- wawa-lipsync - GitHub - Open-source real-time lip sync library for web (MIT license)
- Viseme Cheat Sheet & Interactive IPA Chart - Visual reference for phoneme-to-viseme mapping
3D Avatar Platforms & Tools
- Ready Player Me Developer Docs - Avatar creation platform with SDK for Unity, Unreal, and web
- TalkingHead - GitHub - JavaScript class for real-time lip-sync with full-body 3D avatars (RPM compatible)
- GLTFKit - GitHub - Objective-C glTF 2.0 loader and Metal-based renderer for SceneKit
- ARKit Blendshape Helper - GitHub - Blender addon that generates ARKit blend shapes for facial motion capture
- SkinAndMorph - GitHub - SceneKit sample for blend shape (morph) animation
- openFACS - GitHub - Open source FACS-based 3D face animation system
- FACSvatar - GitHub - Open source modular framework from face to FACS-based avatar animation
- Avaturn Developer Docs - Realistic 3D avatar creator with web SDK
AI Avatar & Lip Sync ML Models
- NVIDIA Audio2Face-3D - GitHub - Open-source audio-to-blend-shape animation models and tools
- NVIDIA Audio2Face-3D SDK - GitHub - High-performance C++ SDK for Audio2Face inference
- NVIDIA Open Sources Audio2Face - Blog Post - Announcement and technical overview of the open-source release
- Audio2Face-3D v3.0 - Hugging Face - Model weights on Hugging Face
- SadTalker - GitHub - Audio-driven single image talking face animation (CVPR 2023)
- Audio2Face for iClone - Reallusion - Reallusion's integration of NVIDIA Audio2Face
VTuber & Live2D
- Live2D Cubism SDK for Native - Documentation - Official SDK documentation for C++ integration
- Live2D Cubism SDK Download - SDK download page
- CubismNativeSamples - GitHub - Official Live2D sample implementations
- VTube Studio API - GitHub - WebSocket API for controlling VTuber models
- VTubeStudioJS - GitHub - JavaScript implementation of the VTube Studio API
- Open-LLM-VTuber - GitHub - Open-source AI assistant with Live2D avatar, voice interaction, and LLM integration
Cloud Avatar APIs
- HeyGen Streaming Avatar SDK Documentation - Real-time AI avatar streaming via WebRTC
- HeyGen API Documentation - Full API reference for video generation and streaming avatars
- Simli Documentation - Real-time lip-synced AI avatar API
- Simli Auto API Reference - Simplified avatar session management
- Top 5 Best Avatar APIs 2025 - A2E - Comparison of major avatar API providers
- Soul Machines Developer Docs - Documentation for Digital Person components (company in receivership as of Feb 2026)
Shader & Visual Effects
- Hologram-Material - GitHub - GLSL hologram shader for Three.js with fresnel rim lighting
- Shader Journey: Holograms - Medium - Tutorial on creating holographic shader effects
- GLSL Hologram Shader - jMonkeyEngine - Discussion and implementation of holographic shaders
- Custom Metal Drawing in SceneKit - Medium - How to use custom Metal shaders with SceneKit
Tutorials & Integration Guides
- Build a Three.js 3D Avatar with Real-Time AI - Gabber.dev - End-to-end guide for Three.js avatar with lip sync in Next.js
- Integrating Ready Player Me 3D Model with Lipsyncing in React - Medium - Beginner guide to RPM avatar lip sync
- Lip Sync Tutorial - Wawa Sensei - React Three Fiber lip sync with blend shapes
- Building a Talking Avatar on the Web - Infosys - Comprehensive guide to open-source digital human avatars
- Creating Face-Based AR Experiences - Apple Sample Code - Apple's official ARKit face tracking sample
- Using SpriteKit to create animations in Swift - Swift by Sundell - Guide to SpriteKit animation in Swift
- How to integrate SpriteKit using SpriteView - Hacking with Swift - SwiftUI integration for SpriteKit
Film & Visual Reference
- "I, Robot" and the Future of Digital Effects - Animation World Network - Behind-the-scenes look at the VFX of I, Robot including VIKI
- VIKI - I, Robot Wiki - Character reference for VIKI's visual design
- Reallusion iClone Python API - Programmatic control of iClone characters
- Character Creator - Reallusion - Professional 3D character design software