Most language-learning marketing oversells AI. It will speak to you. It will be your friend. It will generate the perfect lesson at exactly your level. The brochures are good and the screenshots are clean and the animations are slick.
Then you sit down with a real app and ask it real questions, and the seams show. The pronunciation score is a single number with no explanation. The generated story has a grammar error in sentence three. The illustration of a bird above a desk has the words "It garing the bard" written on a poster. Wait, what?
We use AI in Hidden Dragon, but not as much as the marketing of comparable apps would have you believe. Most of the intelligence in the app is homegrown algorithms running on open data. They are deterministic, instant, and free to operate. AI is the layer we add only where the algorithm cannot reach. We also know exactly which parts of that AI layer are broken. This post walks through it: where AI is good, where it is bad, and where it is outrageously wrong in a way that turns out to be useful.
The Good
Before we get to AI specifically, the boring truth: most of what makes Hidden Dragon useful is not AI at all. Grammar pattern detection runs on a homegrown rule engine backed by HSK and ZeroToHero data, instantly, offline. Character decomposition trees come from the open-source CJKVI IDS dataset, the same source linguists use. Phonetic family lookup comes from etymology databases compiled by humans. HSK colour coding is a static table. Pitch curve comparison is a custom DSP pipeline, not a machine learning model. Stroke order tutoring uses the open-source hanzi-writer library plus our own narration logic. None of these are AI in any meaningful sense, and that is exactly why they are reliable, fast, and free to ship.
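To make the "static table" point concrete, here is a minimal sketch of what an HSK colour lookup could look like. Everything in it is illustrative: the colour values, the sample vocabulary, and the function name are hypothetical, not Hidden Dragon's actual data or API.

```python
# Hypothetical sketch of a static HSK colour table: a plain dict lookup,
# no model call, no network -- which is why it is instant and free.
HSK_COLOURS = {1: "#4caf50", 2: "#8bc34a", 3: "#ffc107",
               4: "#ff9800", 5: "#f44336", 6: "#9c27b0"}

# A tiny slice of word -> HSK level data (illustrative, not the app's table).
HSK_LEVELS = {"你好": 1, "学习": 1, "虽然": 2, "经济": 4}

def colour_for(word: str) -> str:
    """Return the highlight colour for a word, grey if it is off-list."""
    level = HSK_LEVELS.get(word)
    return HSK_COLOURS.get(level, "#9e9e9e")
```

The whole feature is two dictionary lookups, which is the point: no per-use cost, no latency, no failure mode beyond "word not in table".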
What AI adds, on top of that backbone, is a narrow set of capabilities that genuinely could not exist without it.
Generating fresh content at any level. Pre-AI, getting a graded reader at exactly HSK 3 about exactly the topic you wanted required a human editor and a budget. Now you can ask for a Wuxia chapter at HSK 4 with no idiomatic vocabulary and get one in twenty seconds. The Pro story builder is built on this. Cold-start problem solved.
Per-syllable pronunciation scoring. There are two ways to score pronunciation. The simple way is to run speech recognition and check whether the transcript matches the expected text. If you tried to say 你好 and the recogniser returned "ni hao", you passed. That works without AI and is what the free tier does for basic pass/fail, alongside the pitch curve for tones. The harder way is to analyse the actual audio against the expected sounds, syllable by syllable, and score articulation, accuracy, and prosody independently. That is real acoustic analysis, that is what Azure Speech Assessment does, and that is what Pro per-syllable scoring uses. Telling a learner "your pronunciation needs work" is unhelpful. Telling them "you nailed sh- but the -ang vowel was off" is teaching. The pitch curve sits alongside both layers, handling the tonal dimension that no engine scores reliably yet.
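The simple, non-AI layer described above can be sketched in a few lines. This is an illustrative reconstruction, not the app's implementation: strip tone diacritics from the expected pinyin, then check whether the recogniser's transcript matches syllable for syllable.

```python
import unicodedata

def strip_tones(pinyin: str) -> str:
    """Remove tone diacritics so 'nǐ hǎo' compares equal to 'ni hao'."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    return "".join(c for c in decomposed
                   if unicodedata.category(c) != "Mn")

def transcript_pass(expected: str, recognised: str) -> bool:
    """Transcript-match pass/fail: did the recogniser hear the right
    syllables? Says nothing about articulation quality or tones."""
    return (strip_tones(expected).lower().split()
            == strip_tones(recognised).lower().split())
```

Note what this cannot do: it gives one bit of feedback per attempt, and it is blind to tones by construction, which is exactly why the pitch curve and the acoustic per-syllable layer exist alongside it.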
Roleplay conversations. A Dragon that can hold a five-minute conversation in Chinese, in a Sichuan accent, about ordering hotpot was science fiction five years ago. The Pro Scenarios game and Dragon Tutors run on this. The conversation is not perfect but it is realistic enough to be useful, which is the bar that matters.
Homework and exam grading. A native teacher reviews homework once a week. A Pro AI grader reviews it instantly, catches mistakes a static rule engine cannot, and explains them in plain English. We do not pretend it is as good as a teacher, but it is the difference between getting feedback now and getting feedback never.
This is the part that justifies the AI bill, and it is paid for by Pro subscribers. The free tier is mostly the homegrown layer. Pro is mostly the AI layer. The split is not arbitrary: it tracks compute cost, because AI calls cost money per use and homegrown algorithms do not.
The Bad
This is where the technology runs into walls. We will not pretend otherwise here either.
Tones. The big one. Azure Speech Assessment, the most accurate consumer-grade pronunciation engine, was designed for English and European languages. In those languages pitch carries emotional and emphasis information. It is not phonemic. In Mandarin, pitch is what makes 妈 (mā, mother) different from 马 (mǎ, horse). The assessment engine does not score tones reliably because it was not built to. Our workaround is to bypass it entirely for tones and show you a pitch curve comparison. The teacher's pitch is in blue. Your attempt is in red. You see exactly where your tone rose too early or fell too flat. Different signal, same goal.
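The comparison behind the blue-and-red curve display can be sketched roughly as follows. This is illustrative pure Python, not the app's DSP pipeline: resample both contours to a common length, centre each on its own mean so speaker register cancels out, then look at the pointwise gap.

```python
def normalise(curve):
    """Centre a pitch contour on its mean so register differences cancel."""
    mean = sum(curve) / len(curve)
    return [p - mean for p in curve]

def resample(curve, n):
    """Linear resample so contours of different lengths align point by point."""
    if len(curve) == 1:
        return curve * n
    out = []
    for i in range(n):
        t = i * (len(curve) - 1) / (n - 1)
        lo = int(t)
        hi = min(lo + 1, len(curve) - 1)
        frac = t - lo
        out.append(curve[lo] * (1 - frac) + curve[hi] * frac)
    return out

def tone_deviation(teacher, learner, n=50):
    """Pointwise gap between the teacher's curve and the learner's.
    Large values mark where the tone rose too early or fell too flat."""
    t = normalise(resample(teacher, n))
    l = normalise(resample(learner, n))
    return [abs(a - b) for a, b in zip(t, l)]
```

A matching curve produces near-zero deviation everywhere; a falling tone attempted against a rising target lights up at both ends. That per-point shape is the feedback the single-number engine score cannot give.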
Grammar invention. Large language models confidently produce wrong grammar rules. They will tell you 把 is used "when emphasising the subject" when in fact it disposes of the object. They will generate example sentences that are technically grammatical but no native speaker would say. Our approach is layered: pattern-based grammar detection (deterministic, runs locally, links to ZeroToHero videos and Chinese Grammar Wiki for the explanation) is the free-tier baseline. AI grammar feedback is a Pro layer that adds nuance, but it is not the only signal. The human-authored explanations are the ground truth.
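The deterministic baseline layer can be sketched like this. The patterns, names, and explanations below are illustrative stand-ins, not the app's actual rule set; the real engine links out to ZeroToHero and the Chinese Grammar Wiki rather than embedding notes.

```python
import re

# Illustrative patterns, not the app's rules. Each maps a regex over the
# sentence to a grammar point with a human-authored explanation.
PATTERNS = [
    (re.compile(r"把.+"), "ba-construction",
     "把 moves the object before the verb to show what is done to it."),
    (re.compile(r"虽然.+?但是"), "suiran-danshi",
     "虽然…但是 pairs 'although' with 'but' in the same sentence."),
]

def detect_grammar(sentence: str):
    """Deterministic, offline pattern matching -- no model call involved,
    so it cannot hallucinate a rule that does not exist."""
    return [(name, note) for rx, name, note in PATTERNS if rx.search(sentence)]
```

Because the rule set is finite and human-curated, this layer can miss a pattern but cannot invent one, which is the property the AI layer lacks.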
Translation flatness. AI translation lands the surface meaning and loses the rest. Register is gone. Idiom becomes literal. Cultural reference becomes cultural confusion. Our approach is to always show the original alongside the translation, and to nudge learners toward the Chinese first. The translation is a safety net, not a destination.
Speech recognition for learner accents. Speech recognition was tuned on native speakers. Strong learner accents trip it. The result in many apps is that beginners say a perfectly intelligible phrase, get marked wrong, and start to believe they cannot pronounce Chinese. We do not use raw speech-recognition pass/fail as the score. We use pitch comparison plus encouragement-tuned thresholds. The goal is to track improvement, not to gatekeep production.
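The "encouragement-tuned thresholds" idea can be sketched as follows. The numbers and the function are hypothetical, chosen only to illustrate the shape of the policy, not the app's actual tuning.

```python
def passes(similarity: float, attempts: int,
           base_threshold: float = 0.6) -> bool:
    """Encouragement-tuned pass/fail (illustrative numbers).
    Early attempts get a lower bar; the threshold tightens as the
    learner logs attempts, so the score tracks improvement rather
    than gatekeeping production."""
    # Lower the bar by 5 points for each of the first three attempts.
    leniency = max(0, 3 - attempts) * 0.05
    return similarity >= base_threshold - leniency
```

The design choice is that a beginner with a strong accent should hear "close, keep going" rather than "wrong", because the failure mode being avoided is learned helplessness, not grade inflation.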
These are real limits. They are not solved. We work around them. The user gets a useful experience because we did, not because the AI did.
The Outrageous
This is the section with the pictures, and the pictures are why this post exists.
AI image generation is, depending on the day, jaw-droppingly good or jaw-droppingly bad. The same model that can paint a Tang Dynasty street scene at sunset can also produce a family portrait where one of the children is just a head with one arm grafted onto the back of his brother. We have seen all of these in production while generating illustrations for Hidden Dragon stories.
A boy at his desk drawing a bird, with a poster on the wall that reads "It garing the bard!"

A Chinese classroom, with a teacher pointing at a whiteboard full of characters that look like Chinese but are not. They have the right number of strokes in roughly the right places. They are not real hanzi. The poster on the back wall is the same problem at smaller scale.

A family portrait, four people on a couch. Two children, except they share one body and have two heads. Two adults, except the grandfather has no legs. And one clearly artificial cat.

A subway station that has tracks on both sides of the platform, a train extending into a tunnel, and a reflected figure that does not match anyone in frame. The geometry is impossible.

These are not edge cases. AI image generators are bad at hands. They are bad at fingers and toes. They are bad at multiple humans in proximity. They are bad at trains. They are extra bad at subways, because subways add tunnels and signage and reflections to a scene that was already hard. They are bad at English text. They are also bad at Chinese characters, although here we made an explicit choice. We use Qwen, the Chinese-trained image model, precisely because it handles hanzi better than Western models do. But we use the free tier of Qwen rather than the paid tier that is trained more thoroughly on character generation. The trade-off shows up exactly as you would expect: charming illustrations, sometimes garbled signage. They have no internal world model either way, which is why no current image model can check whether a station can have tracks on both sides of a single platform.
Now: this would be a serious problem for a banking app, or a doctor's office, or a children's encyclopedia. Wrong information, presented confidently, in a context where the user trusts the picture.
For a Chinese learning app, the calculus is different. Surprising, impossible, absurd images encode better in memory than realistic ones. This is not opinion. The peg-word system, the method of loci, every memory-palace trick that the Greeks wrote down, all of them rely on the fact that the brain remembers the bizarre and forgets the mundane. A train with doors that do not align is more memorable than a train with doors that do. A boy holding his brother's arm is more memorable than two ordinary brothers. A poster that says "It garing the bard" is more memorable than a poster that says "Read more books."
If you are trying to remember the word 火车 (huǒ chē, train) and the illustration that came with it has impossible station geometry, the picture worked for you. The breakage helped.
We do not ask the AI to produce broken images. We ask it for charming, level-appropriate illustrations of Chinese stories. It produces broken images anyway. We learned to let some of the broken ones through, because they teach better than the polished ones.
What We Do With All of This
The shape of the answer is the same in all three sections. Build the deterministic backbone first (grammar patterns, pitch curves, decomposition trees, stroke order, HSK colouring) because it is reliable, fast, and free. Use AI where the backbone cannot reach (story generation, syllable-level pronunciation scoring, roleplay conversation, homework grading). Work around AI where it does not work (tones, grammar invention, translation register, learner-accent recognition). Exploit AI where the breakage helps (memorable illustrations).
Apps that pretend AI just works will fail their users in subtle ways. Apps that pretend AI is useless will fall behind. The middle path is to know exactly which part is doing what, and to design the product around the truth.
The technology will improve. Some of these limits will shift. Tones will eventually be assessable, grammar hallucinations will diminish, image models will figure out trains. We will update this post when they do. Until then, the maps above are the best we have, and we will keep them honest.
If this kind of transparency is the kind of thing you want from a Chinese learning app, open Hidden Dragon and look around. The free tier runs almost entirely on the homegrown algorithms above, no signup required to explore. Pro adds the AI layer for users who want story generation, conversational practice, homework grading, and richer pronunciation feedback. Either way, you now know what is doing what under the hood.
Frequently Asked Questions
Why publish this rather than hide the limits?
Because every learner discovers the limits within their first week. Pretending they do not exist makes the discovery feel like a betrayal. Naming them upfront makes the discovery feel like confirmation that the team knows what it is doing.
Are the broken images really used in the app?
Yes. We do not censor every imperfect illustration. We do filter out images with truly bad failures (something visually disturbing or culturally insensitive). The middle band of "obviously AI but charming" stays in. Memory research backs the choice.
Is the bird image still in the app?
Yes. It illustrates a story about a child who learns to draw birds. The poster says "It garing the bard" instead of whatever the AI was reaching for. The story is about practice, not posters, and the absurd poster is the most memorable thing about the page. We left it.
What about pronunciation in the meantime?
Pitch curve comparison works. Read the pronunciation trainer post for how the visual signal substitutes for the missing tone score, and the tones guide for the underlying theory.
Will AI eventually solve all of this?
Some of it, yes. Tones, English text in images, and basic grammar consistency are likely to improve in the next few model generations. Hands, complex multi-person scenes, and internal world models for architecture are harder. The post will be updated as the picture changes.
Hero photo by Rudi Endresen on Unsplash.

