
A Roman statue comes to life in an ancient plaza, drawing a phone-ready crowd. A yellow car speeds down a racetrack suspended in clouds. A Chinese girl unfurls a banner reading “we will open source.”
Clips created with StepFun's 'Step-Video-T2V' AI text-to-video model. Credit: StepFun T2V
None of these videos were shot using cameras. Each, though, is the product of user-delivered prompts to Chinese artificial intelligence firm StepFun, which has made waves this year with a slew of image and video-generation tools.
Many of StepFun’s latest models are multimodal, meaning they can process text, audio, image, and video alike to create new content. “StepFun has made a name for itself by focusing on multimodality,” says Adina Yakefu, a Paris-based AI researcher at open-source repository Hugging Face.
The focus demonstrates how China’s AI firms are differentiating their products as competition mounts in the rapidly growing sector.
The Wire is chronicling these differences with periodic highlights of the Chinese AI start-ups worth more than $1 billion. We’ve previously profiled Baichuan and Zhipu AI.
STEPFUN’S FIRST STEPS
Jiang Daxin, a 16-year veteran of Microsoft, founded Shanghai-based StepFun in 2023. He had joined the American tech giant in 2007, shortly after graduating with a doctorate in computer science from the University at Buffalo. Jiang would go on to become a global vice president and chief scientist of Microsoft’s Software Technology Center Asia — making him one of several leading figures in China’s AI world with a background working for the American company.
Jiang told Chinese media last March that he chose to start his own business after U.S. firm OpenAI released ChatGPT in November 2022. “I thought, I can do it myself, maybe even better,” he said. He recruited fellow Microsoft alums Jiao Binxing and Zhu Yibo to run search and systems, respectively. By April the company was ready to launch.

StepFun has made multimodal models a priority from the start, unlike AI companies that focus on text-based models. “We believe that [AI] must go from unimodal to multimodal, to embodied intelligence, and finally to AGI, and we have put this roadmap on the wall of our Shanghai office,” Jiang told Chinese media. The company’s name is a reference to the “step function,” a mathematical function that changes abruptly rather than gradually.
Last year, StepFun became the first Chinese company to release an AI model containing one trillion parameters. Models with more parameters can generally process more information. By comparison, one estimate suggests that OpenAI’s GPT-4 has 1.8 trillion parameters.
StepFun has released more than 20 models in total over the past two years, with 11 currently available to download on Hugging Face, a repository of so-called open-source models that make their code freely accessible. However, StepFun’s top models rank lower than those made by industry giants on China’s AI leaderboard.

StepFun began pitching itself outside of China after DeepSeek, another Chinese open-source firm, drew global attention in January. The smaller company activated accounts on social media platforms like LinkedIn and X, posting on February 17. Neither site is available in China.

StepFun has also begun pursuing corporate partnerships with major Chinese companies, including a February tie-up with Hangzhou-based auto firm Geely. The carmaker will integrate StepFun’s open-source video and audio models into its own AI systems to “promote the popularization of AI technology in the field of smart cars,” Geely executive Gan Jiayue said.
StepFun also made an arrangement in March with Shanghai robotics company Agibot. Agibot will share data from its robots with StepFun, which will in turn provide its models to improve the robots’ decision-making, according to a company press release.

Jiang has said, however, that StepFun will prioritize selling directly to consumers. The company charges users by the million tokens, the common unit by which AI companies measure input and output, with around four Latin characters or one Chinese character to a token. StepFun’s prices range from 3 yuan ($0.42) per million tokens for its cheapest models to 395 yuan ($55) per million for its most expensive.
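To make the pricing concrete, the arithmetic can be sketched in a few lines of Python. This is a rough illustration, not StepFun’s billing logic: it uses only the two rates and the characters-per-token rule of thumb quoted above, and the function names are hypothetical. Real tokenizers vary, so actual counts and charges would differ.

```python
# Rates quoted in the article, in yuan per million tokens.
CHEAPEST_YUAN_PER_MILLION = 3.0
PRICIEST_YUAN_PER_MILLION = 395.0

# Rule of thumb from the article; an approximation, not a tokenizer.
LATIN_CHARS_PER_TOKEN = 4
CHINESE_CHARS_PER_TOKEN = 1


def estimate_tokens(latin_chars: int = 0, chinese_chars: int = 0) -> float:
    """Approximate a token count from character counts (hypothetical helper)."""
    return (latin_chars / LATIN_CHARS_PER_TOKEN
            + chinese_chars / CHINESE_CHARS_PER_TOKEN)


def cost_in_yuan(tokens: float, yuan_per_million: float) -> float:
    """Cost of processing `tokens` tokens at a per-million-token rate."""
    return tokens / 1_000_000 * yuan_per_million


if __name__ == "__main__":
    tokens = estimate_tokens(latin_chars=2000)  # a 2,000-character English prompt
    print(tokens)
    print(cost_in_yuan(tokens, CHEAPEST_YUAN_PER_MILLION))
    print(cost_in_yuan(tokens, PRICIEST_YUAN_PER_MILLION))
```

Under these assumptions, a 2,000-character English prompt comes to about 500 tokens, which would cost a fraction of a yuan even at the highest quoted rate.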
STEPFUNDING
The firm reached unicorn status within China’s AI ecosystem last December, when it raised hundreds of millions of dollars in a Series B financing round rumored to have valued the company at $2 billion, according to Chinese media. Exact figures are not available, but Fortera Capital, a private equity firm backed by the Shanghai government, posted on social media app WeChat that it led the raise. Tech conglomerate Tencent and venture capital funds Qiming Venture Partners and 5Y Capital also participated, according to data provider S&P Capital IQ. The platform does not show any direct foreign investment in the company, though both VCs have counted U.S. pension funds and endowments among their limited partners.
StepFun has also tried its hand at investing, scooping up shares in firms alongside state-backed bodies. A review of corporate ownership records in WireScreen shows that StepFun owns 10 percent of Shanghai Smart Computing Technology Company, a subsidiary of state-owned electronics manufacturer Inesa Group. It is the only non-state investor. The subsidiary’s official business scope includes “supporting the accelerated development of AI large model technology in China.”

StepFun and Jiang did not respond to requests for comment.

Noah Berman is a staff writer for The Wire based in New York. He previously wrote about economics and technology at the Council on Foreign Relations. His work has appeared in the Boston Globe and PBS News. He graduated from Georgetown University.


