What kind of stack do you think this uses?

The real-time lip sync and avatar expressions must require a lot of compute, right? Also, does the pipeline go something like human speech (ffmpeg) => text (Whisper) => LLM => response => TTS (Dia, ElevenLabs, Sesame), with the avatar somehow tied into it?
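To make the guess concrete, here is a rough Python sketch of that pipeline. Every stage is a hypothetical stub standing in for a real model (speech-to-text, LLM, TTS, lip sync); none of these function names are real library APIs, the "viseme" logic is a toy, and I have no idea whether the demo is actually built this way.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AvatarFrame:
    """One animation step: a mouth shape (viseme) and how long to hold it."""
    viseme: str
    duration_ms: int


def transcribe(audio: bytes) -> str:
    # Hypothetical stand-in for speech-to-text (e.g. a Whisper model).
    # Here the "audio" is just UTF-8 text so the sketch runs without a model.
    return audio.decode("utf-8")


def generate_reply(prompt: str) -> str:
    # Hypothetical stand-in for the LLM call.
    return f"You said: {prompt}"


def synthesize(text: str) -> bytes:
    # Hypothetical stand-in for TTS; a real engine would return audio samples.
    return text.encode("utf-8")


def lipsync(text: str, ms_per_char: int = 60) -> List[AvatarFrame]:
    # Toy assumption: vowels open the mouth, other letters close it,
    # anything else (spaces, punctuation) is a rest pose.
    frames = []
    for ch in text.lower():
        if ch in "aeiou":
            viseme = "open"
        elif ch.isalpha():
            viseme = "closed"
        else:
            viseme = "rest"
        frames.append(AvatarFrame(viseme, ms_per_char))
    return frames


def run_turn(audio: bytes) -> Tuple[str, bytes, List[AvatarFrame]]:
    """Run one conversational turn: speech -> text -> LLM -> TTS + avatar frames."""
    user_text = transcribe(audio)
    reply_text = generate_reply(user_text)
    speech = synthesize(reply_text)
    frames = lipsync(reply_text)
    return reply_text, speech, frames


if __name__ == "__main__":
    reply, speech, frames = run_turn(b"hi")
    print(reply, len(speech), [f.viseme for f in frames])
```

In this sketch the avatar frames are derived from the reply text right alongside the TTS step, which is one place the avatar could plug in. Running the stages strictly one after another like this would add up latency, so I'd guess a real-time system overlaps or streams them, but that is speculation on my part.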

https://www.linkedin.com/posts/vrishanksaini_every-single-demo-weve-done-someone-asks-ugcPost-7356467729278619650-GQPH?utm_source=share&utm_medium=member_ios&rcm=ACoAAEOYEDoBbG2O5-zOauJWFR0-TILY8U9Hbkg

submitted by /u/agamer60