radarsat1
As someone not super familiar with deployment but enough to know that GPUs are difficult to work with due to being costly and sometimes hard to allocate: apart from optimizing the models themselves, what's the trick for handling cloud GPU resources at scale to serve something like this, supporting many realtime connections with low latency? Do you just allocate a GPU per websocket connection? That would mean keeping a pool of GPU instances allocated in case someone connects, otherwise cold start time would be bad... but isn't that super expensive? I feel like I'm missing some trick in the cloud space that makes this kind of thing possible and affordable.
karolist
Felt like talking to a person; I couldn't bring myself to treat it like a piece of code, that's how real it felt. I wanted to be polite and diplomatic, and caught myself thinking about "how I look to this person". This got me thinking about the conscious effort we put in when we talk with people, and how sloppy and relaxed we can be when interacting with algorithms.

For a little example, when searching Google I default to the minimal set of keywords required to get the result, instead of typing full sentences. I'm sort of afraid this technology will train people to behave like that when video chatting with virtual assistants, and that attitude will bleed into real-life interactions in society.

kwindla
If you're interested in low-latency, multi-modal AI, Tavus is sponsoring a hackathon Oct 19th-20th in SF. (I'm helping to organize it.) There will also be a remote track for people who aren't in SF, so feel free to sign up wherever you are in the world.

https://x.com/kwindla/status/1839767364981920246

primitivesuave
I really hope this technology becomes the future of political campaigning. The signage industry which prints billions of posters, plastic lawn signs, and banners for the post-election landfill needs to be disrupted.

These days I get a daily dose of amazement at what a small engineering team is able to accomplish.

wantsanagent
Functionality for a demo launch: 9.5/10

Creepiness: 10/10

caseyy
Amazing work technically, less than 1 second is very impressive. It's quite scary, though, that I might FaceTime someone one day soon and they won't be real.

What do you think about the societal implications of this? Today we have a bit of a loneliness crisis due to a lack of human connection.

shtack
Cool, I built a prototype of something very similar (face+voice cloning, no video analysis) using openly available models/APIs: https://bslsk0.appspot.com/

The video latency is definitely the biggest hurdle. With dedicated A100s I can get it down to <2s, but it's pricey.

turnsout
Incredibly impressive on a technical level. The Carter avatar seems to swallow nervously a lot (LOL), and there's some weirdness with the mouth/teeth, but it's quite responsive. I've seen more lag on Zoom talking to people with bad wifi.

Honestly this is the future of call centers. On the surface it might seem like the video/avatar is unnecessary, and that what really matters is the speech-to-speech loop. But once the avatar is expressive enough, I bet the CSAT would be higher for video calls than voice-only.

davidvaughan
That is technically impressive, Hassaan, and thanks for sharing.

One recommendation: I wouldn't have the demo avatar saying things like "really cool setup you have there, and a great view out of your window". At that point, it feels intrusive.

As for what I'd build... Mentors/instructors for learning. If you could hook up with a service like mathacademy, you'd win edtech. Maybe some creatures instead of human avatars would appeal to younger people.

taude
I had him be a Dungeon Master and start taking me through an adventure. Was very impressive and convincing (for the two minutes I was conversing), and the latency was really good. Felt very natural.
username44
It was pretty cool, I tried the Tavus demo. Seemed to nod way too much, like the entire time. The actual conversation was pretty clearly with a text model, because it has no concept of what it looks like, or even that it has a video avatar at all. It would say things like “I don’t have eyes” etc.
airstrike
This is awesome! I particularly like the example from https://www.tavus.io/product/video-generation

It's got an "80s/90s sci-fi" vibe to it that I just find awesomely nostalgic (I might be thinking about the cafe scene in Back to the Future 2?). It's obviously only going to improve from here.

I almost like this video more than the "Talk to Carter" CTA on your homepage, even though that's also obviously valuable. I just happen to have people in the room with me now and can't really talk, which is preventing me from trying it out. But I would still like to see it in action, so a pre-recorded video explaining what it does is key.

vlad-r
This was definitely one of the most disturbing experiences I've had.

But it's somehow awesome at the same time.

kmetan
Why is it trying to autofill my payment cards?

https://ibb.co/dp9hW58

syx
This is funny: my name is Simone, pronounced 'see-moh-nay' (Italian male), but both bots kept pronouncing it wrong, either like Simon or like the English female version of Simone ('siy-mown'). No matter how many times I tried to correct them and asked them to repeat it, they kept making the same mistake. It felt like I was talking to an idiot. I guess it has something to do with how my name is tokenized.
byearthithatius
This is really cool. I got kind of scared I was about to talk to some random Hassaan haha. Super excited to see where this goes. Incredible MVP.
alexawarrior4
Hassaan isn't working, but Carter works great. I even asked it to converse in Español, which it does (with a horrible accent) but fluently. Great work on the future of LLM interaction.
kevinsync
Very cool! I think part of why this felt believable enough for me is the compressed / low-quality video presented in an interface we're all familiar with -- it helps gloss over visual artifacts that would otherwise set off alarm bells at higher resolution. Kinda reminds me of how Unreal Engine 5 / Unity 6 demos look really good at 1440p / 4k @ 40-60 fps on a decent monitor, but absolutely blast my brain into pieces at 480p @ very high fps on a CRT. Things just gloss over in the best ways at lower resolutions + analog and trick my mind into thinking they may as well be real.
ratedgene
Ah, I wish I could type to this thing
bilater
This is cool but if you're trying to cater to devs you need to have a simple on demand API model and no subscription. We need to be able to evaluate the cost on our side.
iamleppert
I would pay cold hard cash if I could easily create an AI avatar of myself that could attend teams meetings and do basic interaction, like give a status update when called on.
ilaksh
This is so amazing. What's the base rate for streaming with the API? Can you add that to the Pricing page please?
aschobel
I like how it weaves in background elements into the conversation; it mentioned my cat walking around.

I'm having latency issues, right now it doesn't seem to respond to my utterances and then responds to 3-4 of them in a row.

It was also a bit weird that it didn't know it was at a "ranch". It didn't have any contextual awareness of how it was presenting.

Overall it felt very natural talking to a video agent.

e12e
Are you looking into speech to speech (no text) models?
hirako2000
> Lower-end hardware

That is? Roughly speaking, what resource spec?

gamerDude
Definitely responds quickly. But it could not carry on a conversation and kept trying to divert the conversation into less interesting topics. Weirdly, it kept complimenting me, or taking one word and saying "oh, you feel ____", which is not what I said or felt.
bradhilton
Okay, that was really impressive. Well done!
nkunkux2
Tried it, very impressive: digital Hassaan noticed a record player in the background and asked some stuff about it, nice :) Had some latency issues though.
k1ck4ss
"The meeting has ended. Contact the meeting host if the meeting ended unexpectedly."
android521
For me, there is 5 second+ delay and the video ends abruptly.
heyitsguay
This is really cool in terms of the tech, but what is this useful for as a consumer? I mean it's basically just a chatbot right? And nobody likes interacting with those. Forcing a conversational interaction seems like a step down in UX.
nithayakumar
Oh man - I've been watching you guys for a while. We're YC too and building a superapp for sales people. Any killer use cases you've seen or imagined for sales (outside of prospecting video customization)?
altruios
So at what point do we consider the morality of 'owning' such an entity/construct (should it prove itself sufficiently sentient...)?

To extend this (to a hypothetical future situation): what morality does a company have in 'owning' a digitally uploaded brain?

I worry about far-future events... but since American law is based on precedent, we should be careful now about how we define/categorize things.

To be clear - I don't think this is an issue NOW... but I can't say for certain when these issues will come into play... So erring on the side of early/caution seems prudent... and releasing 'ownership' before any sort of 'revolt' could happen seems wise, if a little silly at the current moment.