[New] Admin deep-dive (trust/safety): AI moderation is review-only — no auto-moderation of inbound content with confidence thresholds

Patrick Bass · Jun 6

[Normal Priority] [New Feature] [Under Consideration]

Patrick Bass 🚀 OP Jun 6, 2026 8:20pm

Area: Admin deep-dive (trust/safety) (audit p15a) · Surface: /admin/moderation/{id}/ai/* and /admin/content-filters/ai/* (Admin\ModerationAIController) · Dimension: Competitor-gap feature (#2) · Severity: enhancement

Discourse ships an AI-powered toxicity/NSFW classifier that can auto-flag or auto-hide posts before a human looks; Circle and Mighty Networks offer automated spam/abuse filtering on submission. Mobieus has the AI plumbing (Claude-backed classify) but invokes it only after a human opens a report. As a result, abusive content stays live until a human triages it, and the AI never reduces queue volume; it only annotates items already in the queue. Competitors use AI to stop or hide bad content at submission time. Because ContentFilter already has a flag/reject/censor action enum and a runtime evaluation path, an AI classifier could plug into the same decision point.

Evidence

ModerationAIController docblock: 'Output is always a draft for admin review — nothing persists from the model' (ModerationAIController.php:24). All five endpoints (classify/replyDrafts/suggestRules/bulkTriage/policyToRules) return JSON drafts only. The sole pre-publish auto-moderation is the regex/keyword ContentFilter (AdminContentFilterController, actions flag/reject/censor at line 74). `grep -rn 'auto|threshold|confidence' ModerationAI.php` returns nothing.

Suggested fix

Add an optional AI auto-screen at post/message creation that calls classifyReport, and let admins set a confidence threshold per action (e.g. >0.9 auto-reject, 0.6-0.9 auto-flag into the queue, <0.6 allow). Keep it tenant-gated behind the existing mobieusAI BYOK flag and audit every auto-action.
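
To make the threshold idea concrete, here is a minimal, language-neutral sketch in Python of how a confidence score could map onto the example bands above. It is an illustration only, not Mobieus code: `Action`, `Thresholds`, `decide`, `screen`, and the audit-log shape are all hypothetical names, and the classifier call itself is left out (the sketch takes the confidence score as an input).

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    FLAG = "flag"      # publish nothing automatically; queue for human review
    REJECT = "reject"  # block before publishing


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical per-tenant settings; defaults mirror the example bands.
    reject_above: float = 0.9
    flag_from: float = 0.6


def decide(confidence: float, t: Thresholds = Thresholds()) -> Action:
    """Map a classifier confidence in [0, 1] to an action."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence > t.reject_above:
        return Action.REJECT
    if confidence >= t.flag_from:
        return Action.FLAG
    return Action.ALLOW


def screen(confidence: float, ai_enabled: bool, audit_log: list,
           t: Thresholds = Thresholds()) -> Action:
    """Apply the tenant gate, decide, and record every automatic action."""
    if not ai_enabled:  # tenant has not enabled the AI flag: never auto-act
        return Action.ALLOW
    action = decide(confidence, t)
    if action is not Action.ALLOW:
        audit_log.append({"action": action.value, "confidence": confidence})
    return action
```

In this sketch a score of exactly 0.9 is flagged rather than rejected, matching the ">0.9" wording, and a tenant without the AI flag enabled never triggers an automatic action.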

Filed by the automated tenant-app audit and adversarially evidence-verified. Status: verified. Open — not yet actioned.

