Supplier onboarding takes 30-45 days at most distributors. Almost all of that time is people staring at spreadsheets. An LLM can infer the schema mapping, normalization runs on pattern rules, and validation gates catch errors before anything goes live. You don't skip steps. You just stop doing them by hand.
A German circuit breaker manufacturer sends 2,000 SKUs in Excel. Column headers are in German: Nennspannung, Bemessungsstrom, Schutzart. Descriptions mix three languages in the same column. Some dimensions are mm, others cm. 43 records have no UPC. Images sit on an FTP server with filenames that don't match SKUs.
Every supplier file looks like this. The difference between 6 weeks and 24 hours is whether you throw people at it or run it through a pipeline.
Hours 1-3: Infer the schema mapping
Feed the supplier file into an LLM with your catalog schema as the structured output format. It reads column headers, sample values, and data types, then returns a typed mapping: each supplier column paired with your field, plus a confidence level for each pair.
| Supplier column | Your field | ETIM attribute | Confidence |
|---|---|---|---|
| Nennspannung | rated_voltage | EF000002 | High |
| Bemessungsstrom | rated_current | EF000004 | High |
| Schutzart | IP_rating | EF000012 | High |
High-confidence mappings go through automatically. Medium and low land in a review queue with the model's reasoning attached. In practice, most columns map without anyone touching them. A person checks 4-5 ambiguous cases. Save the mapping template. Next update from this supplier takes under 60 seconds.
- High confidence: auto-map, log the decision for audit
- Medium confidence: route to review queue with the model's reasoning
- Column is internal supplier metadata: discard, document in mapping template
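The routing rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `ColumnMapping` shape, the `route_mappings` name, and the convention that a `None` target marks supplier-internal metadata are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnMapping:
    supplier_column: str
    target_field: Optional[str]  # assumption: None marks supplier-internal metadata
    confidence: str              # "high", "medium", or "low"
    reasoning: str               # the model's explanation, shown to reviewers


def route_mappings(mappings):
    """Split inferred mappings into auto-applied, review, and discarded lists."""
    auto, review, discarded = [], [], []
    for m in mappings:
        if m.target_field is None:
            discarded.append(m)   # document it in the mapping template
        elif m.confidence == "high":
            auto.append(m)        # apply and log the decision for audit
        else:
            review.append(m)      # medium and low both go to a person
    return auto, review, discarded
```

Keeping the model's reasoning on the mapping object means the review queue can show why a column was uncertain without a second model call.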
Hours 4-6: Normalize units automatically
Voltage values arrive in four formats from the same file. A normalization pipeline with pattern rules handles this without human involvement.
Raw supplier data
- 240V
- 240 VAC
- 240V AC
- 240 V~
After normalization
- 240 V AC (all four collapsed)
- Validation flag: verify AC/DC from datasheet if tilde notation detected
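One way to express a pattern rule like this is a single regular expression plus a small function. The sketch below is illustrative only: the `normalize_voltage` name, the flag wording, and the choice to treat a bare "240V" as AC (as the example above does) are assumptions, not a fixed rule set.

```python
import re

# Number, optional space, "V", optional space, optional AC / DC / tilde marker.
VOLTAGE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*V\s*(AC|DC|~)?\s*$", re.IGNORECASE)


def normalize_voltage(raw, default_type="AC"):
    """Return (canonical_string, flags); canonical_string is None if unparsed."""
    match = VOLTAGE_RE.match(raw)
    if match is None:
        return None, ["unparsed voltage value"]
    value = match.group(1)
    marker = (match.group(2) or "").upper()
    flags = []
    if marker == "~":
        current_type = "AC"
        flags.append("verify AC/DC from datasheet (tilde notation)")
    elif marker in ("AC", "DC"):
        current_type = marker
    else:
        # Assumption: values without a marker default to AC for this supplier.
        current_type = default_type
    return f"{value} V {current_type}", flags
```

All four raw formats listed above come back as `240 V AC`, and only the tilde variant carries a validation flag.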
Dimensions arrive in a mix of mm and cm. Convert them to one canonical metric unit, and to inches where US buyers expect imperial. Define the canonical format once and apply it to every file. For enclosures and conduit, just output both units in the description. For coded fields, normalize against the standard instead of free text. Validate the unit codes, check IP ratings, and spot-check GTINs before the file reaches staging.
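As a rough sketch of two of these steps, the snippet below converts a dimension to a dual-unit string and verifies a GTIN check digit. The function names and the output format are assumptions for illustration. The check-digit rule is the standard GS1 one: weight the digits 3, 1, 3, 1... starting from the digit next to the check digit, and the check digit brings the total up to a multiple of 10.

```python
MM_PER_INCH = 25.4
TO_MM = {"mm": 1.0, "cm": 10.0}  # supplier units seen in this file


def to_dual_units(value, unit):
    """Format a dimension as canonical millimetres plus inches."""
    mm = value * TO_MM[unit]
    inches = mm / MM_PER_INCH
    return f"{mm:g} mm ({inches:.2f} in)"


def gtin_check_digit_ok(code):
    """Validate the check digit of a GTIN-8, GTIN-12 (UPC-A), GTIN-13, or GTIN-14."""
    if not (code.isascii() and code.isdigit()) or len(code) not in (8, 12, 13, 14):
        return False
    digits = [int(c) for c in code]
    body, check = digits[:-1], digits[-1]
    # Weights alternate 3, 1, 3, ... starting from the rightmost body digit.
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check
```

A check-digit pass only shows the number is well formed. It does not prove the GTIN belongs to that product, which is why it works as a spot check rather than a final gate.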
Hours 7-10: Match images to SKUs
Pull images from the supplier FTP. Filename CB-2401.jpg needs to match SKU MCB-240-10A. Run a matching pipeline that tries exact match on manufacturer part number, then strips prefixes and suffixes, then falls back to fuzzy numeric matching.
Out of 2,000 products, 1,847 images match automatically. The other 153 get a category placeholder and a flag for follow-up. Don't hold the publish waiting for perfect images.
- Minimum resolution 800x800px
- File size under 2MB
- Format is JPG or PNG (convert any TIFFs)
- Filename contains SKU or manufacturer part number
- Image shows product only, no marketing overlays
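The three-stage matching cascade described earlier might look like the sketch below. Everything here is an assumption made for illustration: the `match_image` name, the rule that "stripping" removes separators plus leading and trailing letters, the use of `difflib.SequenceMatcher` on the digits for the fuzzy stage, and the 0.85 threshold. Tune all of them against your own catalog.

```python
import re
from difflib import SequenceMatcher
from pathlib import PurePosixPath


def strip_affixes(code):
    """Drop separators, then leading and trailing letters: MCB-240-10A -> 24010."""
    compact = re.sub(r"[^A-Z0-9]", "", code.upper())
    return re.sub(r"^[A-Z]+|[A-Z]+$", "", compact)


def match_image(filename, products, fuzzy_threshold=0.85):
    """products maps SKU -> manufacturer part number. Returns (sku, stage)."""
    stem = PurePosixPath(filename).stem.upper()

    # Stage 1: exact match on manufacturer part number.
    for sku, mpn in products.items():
        if stem == mpn.upper():
            return sku, "exact"

    # Stage 2: compare after stripping prefixes and suffixes; accept only a unique hit.
    core = strip_affixes(stem)
    if core:
        hits = [sku for sku, mpn in products.items() if strip_affixes(mpn) == core]
        if len(hits) == 1:
            return hits[0], "stripped"

    # Stage 3: fuzzy comparison on the digits alone.
    digits = re.sub(r"\D", "", stem)
    if not digits:
        return None, "unmatched"
    best_sku, best_score = None, 0.0
    for sku, mpn in products.items():
        score = SequenceMatcher(None, digits, re.sub(r"\D", "", mpn)).ratio()
        if score > best_score:
            best_sku, best_score = sku, score
    if best_score >= fuzzy_threshold:
        return best_sku, "fuzzy"
    return None, "unmatched"
```

Returning the stage alongside the SKU lets you audit fuzzy matches separately, since those are the ones most likely to attach the wrong picture.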
Hours 11-15: Validate with automated gates
Every record runs through validation gates before it reaches staging. Not manual checks. Functions that return pass/fail with a reason.
Block publish if UPC, manufacturer, or category is missing. Result: 43 records blocked (no UPC). These go to a follow-up queue, not a manual spreadsheet review.
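A gate of this kind can be a plain function that returns pass/fail with a reason. The sketch below assumes records are dictionaries with `upc`, `manufacturer`, and `category` keys; those key names and the `GateResult` type are illustrative, not part of any particular platform.

```python
from typing import NamedTuple


class GateResult(NamedTuple):
    passed: bool
    reason: str


REQUIRED_FIELDS = ("upc", "manufacturer", "category")  # assumed field names


def required_fields_gate(record):
    """Block publish when any required field is absent or blank."""
    missing = [f for f in REQUIRED_FIELDS if not str(record.get(f) or "").strip()]
    if missing:
        return GateResult(False, "missing: " + ", ".join(missing))
    return GateResult(True, "ok")
```

Because the reason names the missing fields, the follow-up queue can group the blocked records by what the supplier still needs to send.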
Flag records where voltage falls outside 0-1000V or current exceeds 6300A. Result: 12 records flagged. 10 are typos (2400V instead of 240V), 2 are legitimate high-voltage products. Auto-correct the obvious ones, route the rest for review.
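The range check can be sketched the same way. The field names `rated_voltage_v` and `rated_current_a` are assumptions; the limits are the ones stated above. This version only flags. It deliberately leaves auto-correction out, because deciding what counts as an obvious typo depends on rules the pipeline owner has to define.

```python
def range_flags(record, max_voltage=1000.0, max_current=6300.0):
    """Return a list of human-readable flags; an empty list means in range."""
    flags = []
    voltage = record.get("rated_voltage_v")
    current = record.get("rated_current_a")
    if voltage is not None and not (0 <= voltage <= max_voltage):
        flags.append(f"voltage {voltage:g} V outside 0-{max_voltage:g} V")
    if current is not None and current > max_current:
        flags.append(f"current {current:g} A exceeds {max_current:g} A")
    return flags
```

A flag is not a block: legitimate high-voltage products will trip it too, so flagged records go to review instead of being rejected.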
Every product gets an ETIM class assignment with a confidence level. High-confidence assignments are accepted automatically. Medium and low get routed to a review queue with the top 3 candidate classes. Result: 1,891 products classified automatically, 109 routed for review.
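The accept-or-review decision for classification could be sketched as below. It assumes the classifier returns candidate classes with numeric scores and that 0.9 is the cut-off for "high confidence"; both are illustrative assumptions, since the text only speaks of high, medium, and low.

```python
def route_classification(candidates, accept_threshold=0.9):
    """candidates is a list of (class_code, score) pairs from the classifier."""
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    if ranked and ranked[0][1] >= accept_threshold:
        return {"status": "accepted", "etim_class": ranked[0][0]}
    # Below the threshold: hand the reviewer the top 3 candidates.
    return {"status": "review", "candidates": ranked[:3]}
```

Showing the reviewer three ranked options turns the task into picking from a short list, which is faster than classifying from scratch.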
Hours 16-20: Enrich from datasheets with structured extraction
Eight records have stub descriptions, and 109 classifications are sitting in the review queue. Run structured extraction against the manufacturer PDFs. The pipeline returns typed fields (voltage, current, mounting type, protection class) with source page and bounding box so you can trace every value.
Before extraction: "Circuit breaker"
After structured extraction: "Miniature circuit breaker, 10kA breaking capacity at 240V AC, thermal-magnetic trip, DIN rail mount, IP20"
Confidence: high. Source: manufacturer datasheet p.3, table 2. Cross-referenced against IEC 60898.
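To make "typed fields with source page and bounding box" concrete, here is one possible shape for an extracted value and a merge step that keeps provenance next to the data. The `ExtractedField` type, the `apply_extraction` function, and the bounding-box coordinates in the usage note are hypothetical, meant only to show the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedField:
    name: str         # e.g. "mounting_type"
    value: str
    page: int         # source page in the datasheet PDF
    bbox: tuple       # (x0, y0, x1, y1) on that page
    confidence: str   # "high", "medium", or "low"


def apply_extraction(record, fields, accept=("high",)):
    """Merge accepted fields into a copy of the record; return (record, skipped)."""
    updated = dict(record)
    provenance = dict(record.get("provenance", {}))
    skipped = []
    for field in fields:
        if field.confidence in accept:
            updated[field.name] = field.value
            provenance[field.name] = {"page": field.page, "bbox": field.bbox}
        else:
            skipped.append(field.name)  # leave for manual follow-up
    updated["provenance"] = provenance
    return updated, skipped
```

For example, a field such as `ExtractedField("mounting_type", "DIN rail", 3, (72, 410, 310, 428), "high")` would be written to the record together with its page and box, so anyone questioning the value can open the datasheet at the exact spot.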
After hour 20, diminishing returns. The remaining records either have no datasheet or the PDF is a scanned image where extraction confidence is low. Flag them and move on.
Hours 21-24: Publish with confidence
Push 1,957 validated products to staging. Quick smoke test: do category filters work, does search return results, do product pages render. Everything here passed all gates.
Publish to production. Watch the first hour for category mismatches. You'll typically see 1-2 edge cases (MCBs that sit on the boundary of two ETIM classes), not 20.
| Metric | Count |
|---|---|
| Products live | 1,957 |
| Classified automatically (high confidence) | 1,891 |
| Flagged for UPC follow-up | 43 |
| Edge cases reclassified after publish | 2 |
So what actually makes this fast? Schema inference that reads German column headers without a translator. Normalization rules you define once and never touch again. And a classifier that knows when it's confident and when it isn't, so a person only looks at the 109 products that actually need eyes on them. That's it. No magic, just fewer spreadsheets.
