Building an AI Agent from Scratch: My 3-Month Warehouse Automation Journey

Around last year's Singles' Day, my warehouse nearly collapsed under a flood of returns. I gritted my teeth, taught myself how to build AI agents, and spent 3 months building an automated decision system that handles everything from return sorting to inventory alerts. Here I share the pitfalls I hit and how small and medium-sized enterprises (SMEs) can build an AI agent system from scratch.

2026-05-30

FlashWare Team

Last year, right around Singles' Day, my warehouse was buried under a mountain of returned packages. Three temp workers were frantically unpacking, inspecting, and sorting, but they couldn't keep up. On the monitor, I watched a young worker run to the wrong zone and toss a down jacket into the scrap pile, even though its only flaw was a missing tag. That night I ran the numbers: return delays had driven customer complaints up 40% and cost us nearly 20,000 RMB in refunds. I thought, 'No more. I need automation.'

TL;DR: I spent 3 months teaching myself to build AI agents and built an automated return-sorting system from scratch. I hit pitfalls like dirty data, overhyped models, and employee resistance, but finally got it running with a rule engine plus a lightweight model. Below is how to make AI work at minimal cost.

When Returns Piled Up, I Decided to Let AI Handle It

Three days after Singles' Day, returns peaked at 800 packages a day. Each one needed manual inspection: check its condition, then decide whether to restock or scrap it. Standing in the sorting area, I watched Old Zhang toss a barely worn sweater into the donation bin, and I nearly fainted. That night, staring at a messy Excel sheet, I realized the problem wasn't the people; it was the process.

Instead of hiring more people, I would let AI learn to make those calls. So I decided to build an AI agent for return sorting.

Step 1: Data Cleaning Nearly Made Me Quit

The first pitfall was data. I dug out a year's worth of return records: fields were missing, categories were inconsistent, and notes were full of vague words like 'customer said' and 'maybe.' Three interns and I spent a week cleaning and standardizing 3,000 records.

Raw Data	Cleaned Data
Customer said shirt too small	Size too small, reason: size
Maybe has stain	Has stain, reason: quality
Probably didn't like it	Customer preference, reason: no reason

I almost gave up: the cleanup felt like more work than sorting the returns by hand. But once I pushed through, model training went smoothly. Anyone who's been there knows that dirty data makes AI useless.
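The cleanup in the table above boils down to stripping hedge words and mapping what remains onto a fixed set of reasons. Here is a minimal Python sketch of that idea, assuming a hand-built keyword table; the patterns, labels, and the `needs_review` fallback are hypothetical illustrations, not the project's actual cleaning rules.

```python
import re

# Hypothetical keyword table: regex over the cleaned note -> (issue, reason).
# A real table would be built from your own return records.
KEYWORD_RULES = [
    (r"too small|too big|size", ("Size issue", "size")),
    (r"stain|dirty", ("Has stain", "quality")),
    (r"didn'?t like|changed mind", ("Customer preference", "no reason")),
]

# Hedge words that carry no information and are stripped before matching.
HEDGES = re.compile(r"\b(customer said|maybe|probably)\b", re.IGNORECASE)


def clean_note(raw: str) -> dict:
    """Normalize one free-text return note into structured fields."""
    text = HEDGES.sub("", raw).strip().lower()
    for pattern, (issue, reason) in KEYWORD_RULES:
        if re.search(pattern, text):
            return {"raw": raw, "issue": issue, "reason": reason}
    # Unmatched notes are flagged for a human instead of guessed at.
    return {"raw": raw, "issue": None, "reason": "needs_review"}
```

Sending unmatched notes to a person keeps the cleaned dataset honest: a wrong label in training data does more harm than a missing one.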

Model Selection: Don't Be Fooled by Big Models

Step two was model selection. At first, I jumped at using a large language model (LLM) for full automation. Within a week the approach fell apart: the model labeled items with a 'minor scratch' as 'severe damage,' so restockable goods went to waste. According to Gartner's supply chain technology report^[1], many companies overestimate what AI can do.

So I went pragmatic: a rule engine plus a lightweight classification model. The 80% of returns that are common cases (size issues, no-reason returns) are handled by predefined rules. The remaining 20% of fuzzy cases (stain severity, missing accessories) go to a small fine-tuned model.

Rule Engine: Simple but Effective

I encoded the decision tree in a simple Python rule engine. For example:

If return reason = 'size too small' and item is new → restock
If return reason = 'stain' and stain area < 5% → clean then restock
If return reason = 'missing accessory' → manual review
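The three rules above can be written as an ordered list of condition/action pairs, where the first match wins. This is a minimal sketch rather than the production engine: the `ReturnItem` fields and action names are hypothetical, and stain area is assumed to be measured as a fraction of the garment's surface (so 5% is `0.05`). Returning `None` for uncovered cases is what leaves room for the model described in the next section.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ReturnItem:
    reason: str               # standardized reason, e.g. "size too small"
    is_new: bool = False      # unworn, tags attached (hypothetical field)
    stain_area: float = 0.0   # stained fraction of surface, 0.0 to 1.0 (assumed)


# Each rule is (condition, action); order matters because the first match wins.
Rule = tuple[Callable[[ReturnItem], bool], str]

RULES: list[Rule] = [
    (lambda r: r.reason == "size too small" and r.is_new, "restock"),
    (lambda r: r.reason == "stain" and r.stain_area < 0.05, "clean_then_restock"),
    (lambda r: r.reason == "missing accessory", "manual_review"),
]


def apply_rules(item: ReturnItem) -> Optional[str]:
    """Return the first matching rule's action, or None if no rule covers the item."""
    for condition, action in RULES:
        if condition(item):
            return action
    return None
```

Keeping rules as data rather than nested `if` statements means adding a rule later is just appending one more tuple.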

After a month in production, the engine was running at 85% accuracy and 10x faster than manual sorting.

Lightweight Model: Handling Fuzzy Cases

For cases the rules didn't cover, I fine-tuned an open-source BERT model on just 500 labeled records. Surprisingly, it reached 92% accuracy at distinguishing 'minor wear' from 'severe wear.' Here is how the approaches compared:

Method	Accuracy	Time per item	Cost
Pure manual	95%	3 min	High
Rule engine	85%	10 sec	Very low
Rules + model	92%	15 sec	Low
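Fine-tuning BERT needs a full deep-learning stack, so the sketch below deliberately swaps in a much simpler technique, a multinomial naive Bayes text classifier with add-one smoothing, purely to show the shape of the "train on a few hundred labeled notes, predict a label plus a confidence" step. It is a stand-in, not the BERT model from the text, it says nothing about the 92% figure, and the training examples are made up.

```python
import math
from collections import Counter, defaultdict


class TinyTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing.

    A stand-in for the fine-tuned BERT model: same fit/predict shape,
    far weaker at understanding text.
    """

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        """Return (label, probability) for the most likely label."""
        words = text.lower().split()
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for w in words:
                score += math.log((counts[w] + 1) / denom)  # smoothed likelihood
            log_scores[label] = score
        # Normalize log scores into probabilities (softmax, shifted for stability).
        best = max(log_scores.values())
        exp = {k: math.exp(v - best) for k, v in log_scores.items()}
        z = sum(exp.values())
        label = max(exp, key=exp.get)
        return label, exp[label] / z


# Usage with made-up examples:
clf = TinyTextClassifier().fit(
    ["light pilling on cuff", "faint scuff on sole", "small loose thread",
     "torn seam and hole", "heavy pilling large hole", "sole detached badly worn"],
    ["minor wear"] * 3 + ["severe wear"] * 3,
)
print(clf.predict("hole in seam"))
```

Whatever model sits here, returning a confidence alongside the label is what lets the rest of the pipeline decide how much to trust each call.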

The final solution: the rule engine handles the 80% of simple returns, the model handles the 20% of complex ones, and humans only do the final review. This balances accuracy against cost.
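That split amounts to a small routing function: try the rules, fall back to the model, and hand every result to a human for final review. The sketch assumes the rule engine returns `None` when no rule matches and the model returns a `(label, confidence)` pair; the 0.8 threshold and the `low_confidence` flag (marking which calls a reviewer should check most carefully) are hypothetical additions, not part of the system described above.

```python
from typing import Callable, Optional

# Hypothetical cut-off: model answers below this confidence get flagged
# so the human reviewer looks at them first.
CONFIDENCE_THRESHOLD = 0.8

RuleFn = Callable[[dict], Optional[str]]
ModelFn = Callable[[dict], tuple[str, float]]


def route(item: dict, rules: RuleFn, model: ModelFn) -> dict:
    """Rules first; the model handles whatever no rule covers.

    Every result still goes to a human for final review; the flag only
    sets priority.
    """
    action = rules(item)
    if action is not None:
        return {"action": action, "source": "rules", "low_confidence": False}
    label, confidence = model(item)
    return {
        "action": label,
        "source": "model",
        "low_confidence": confidence < CONFIDENCE_THRESHOLD,
    }
```

Passing the rules and model in as plain functions keeps the router independent of how either one is implemented, so a retrained model can be swapped in without touching this code.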

Employee Resistance: Harder Than Tech

On launch day, Old Zhang downed tools on the spot: 'Can a computer judge better than my ten years of experience?' He refused to use the system and manually overrode its results. I argued with him, but later realized he wasn't being lazy; he was afraid of being replaced.

I spent two weeks doing three things:

Held all-hands training, using real cases to demonstrate the AI's accuracy
Set up a 'human-machine review' process: AI suggestions must be confirmed by team leads
Turned the time savings into higher wages: throughput rose from 200 to 300 items per person per day, with piece-rate pay for the extra items
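The human-machine review step can be modeled as a record that holds the AI's suggestion until a team lead signs off. This is a minimal sketch with hypothetical field names: confirming without a decision accepts the suggestion, while passing a different decision records an override. Tracking overrides explicitly is an assumption about how review results could later feed retraining.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ReviewRecord:
    """One AI suggestion waiting for (or carrying) a team lead's sign-off."""
    item_id: str
    ai_suggestion: str                    # e.g. "restock", "scrap"
    created_on: date
    final_decision: Optional[str] = None  # set by the team lead
    reviewer: Optional[str] = None

    def confirm(self, reviewer: str, decision: Optional[str] = None) -> None:
        """Sign off; with no decision given, the AI's suggestion is accepted as-is."""
        self.reviewer = reviewer
        self.final_decision = decision if decision is not None else self.ai_suggestion

    @property
    def pending(self) -> bool:
        """True until a lead has signed off."""
        return self.final_decision is None

    @property
    def overridden(self) -> bool:
        """True when the lead's final decision differs from the AI's suggestion."""
        return not self.pending and self.final_decision != self.ai_suggestion
```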

A month later, Old Zhang had become the system's biggest advocate. He found the AI took over 80% of his repetitive work, leaving him only the cases that truly need judgment.

Continuous Improvement: AI Needs Constant Feeding

After three months, accuracy dropped from 92% to 88%. Digging in, I found the mix of returns had shifted: winter had arrived, down jacket returns had increased, and the model had seen few down jacket examples in training.
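A drop like this is easier to catch when accuracy is tracked per product category rather than as a single overall number. The sketch assumes each reviewed item is logged as a `(category, predicted, actual)` tuple, with the human's final call as ground truth; the 3-point tolerance is an arbitrary illustrative choice.

```python
from collections import defaultdict
from typing import Iterable


def accuracy_by_category(records: Iterable[tuple[str, str, str]]) -> dict[str, float]:
    """(category, predicted, actual) tuples -> {category: accuracy}."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for category, predicted, actual in records:
        totals[category] += 1
        hits[category] += int(predicted == actual)
    return {c: hits[c] / totals[c] for c in totals}


def drifting_categories(baseline: dict[str, float], current: dict[str, float],
                        tolerance: float = 0.03) -> list[str]:
    """Categories that are new, or whose accuracy fell more than `tolerance` below baseline."""
    return sorted(
        c for c, acc in current.items()
        if c not in baseline or acc < baseline[c] - tolerance
    )
```

Flagging categories the model has never been measured on matters as much as flagging drops: a new seasonal category is exactly where a model trained on last season's data is weakest.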

I built a continuous feedback loop:

Weekly export of misclassifications, manually annotated, added to training set
Monthly model fine-tuning
Quarterly rule engine updates (e.g., new rule for down jacket 'feather leakage')
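The weekly export step might look like the sketch below. It assumes decisions are logged as dicts with hypothetical keys `item_id`, `decided_on`, `ai_label`, `final_label`, and `note`, and it treats any case where the human's final call differs from the AI's label as a misclassification. The output is a CSV with an empty `annotation` column for the annotator to fill in.

```python
import csv
from datetime import date, timedelta
from typing import Iterable, Optional, TextIO

FIELDS = ["item_id", "note", "ai_label", "final_label", "annotation"]


def select_misclassified(decisions: Iterable[dict],
                         today: Optional[date] = None) -> list[dict]:
    """Keep the last 7 days' decisions where the human overrode the AI."""
    today = today or date.today()
    since = today - timedelta(days=7)
    return [
        d for d in decisions
        if d["decided_on"] >= since and d["ai_label"] != d["final_label"]
    ]


def write_annotation_csv(rows: list[dict], out: TextIO) -> None:
    """Write rows to an open text stream, with a blank column for annotators."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for d in rows:
        writer.writerow({
            "item_id": d["item_id"],
            "note": d["note"],
            "ai_label": d["ai_label"],
            "final_label": d["final_label"],
            "annotation": "",  # filled in by a human annotator
        })
```

In use, you would open a file with `open(path, "w", newline="", encoding="utf-8")` and pass it as `out`; the `newline=""` argument is what the `csv` module expects for files it writes.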

Per McKinsey's operations insights^[2], continuous learning is key to deploying AI. The system has now run stably for six months, cutting return processing time by 70% and customer complaints by 50%.

Summary

Looking back on building an AI agent from scratch, the hardest part wasn't the technology but deciding what to let AI do and what to leave to humans. My takeaways:

Don't be greedy: Solve one pain point first (like return sorting), then expand

Data first: Spend 70% of time cleaning data; model training is the easy part^[3]

Human-AI collaboration: Let AI do the 80% of work that is repetitive and humans the 20% that takes judgment; that split is the most efficient

Iterate constantly: AI isn't a one-time project; keep feeding it new data

If you're considering an AI agent, don't be intimidated by big companies' full automation. Start small with rules plus simple models, and you can get it running in three months. Trust me: watching the system handle a day's returns while you sip tea and review its calls feels better than a record-breaking Singles' Day.

References

[1] Gartner, Supply Chain Technology Report. Cited for the trend of companies overestimating AI capabilities.
[2] McKinsey, Operations Insights: Continuous Learning in AI. Cited for the importance of continuous learning in AI deployment.
[3] Fortune Business Insights, WMS Market Report. Cited for the share of AI project time spent on data preparation.