Implementing Polymorphic Tagging with AI-Powered Content Classification

One Table to Tag Them All

MyKpopLists lets users post about K-pop. A single post might mention multiple groups, specific idols, songs, and variety shows. Users won’t tag everything correctly — they’re here to discuss music, not to carefully fill out metadata. So the platform needs to handle tagging automatically, and it needs to do it across multiple entity types without duplicating the tagging logic for each one.

The naive database approach: create separate junction tables for each type. post_group_tags, post_idol_tags, post_song_tags, post_variety_show_tags. This works until you need to add albums. Then you add another table, another set of model relationships, and another query every time you want to find all content related to a specific entity. It compounds fast.

Laravel’s polymorphic relationships solve this with a single post_tags table. Instead of a foreign key that references one specific table, each row stores taggable_type (the full class name, e.g., App\Models\Group) and taggable_id (the ID within that table). The two columns together can reference any model in your application. One table, four entity types, zero duplication.

This tutorial covers the database design, Laravel relationship setup, Reddit API integration, Google Gemini AI tagging, fuzzy entity resolution, and asynchronous job processing that makes the whole system run without blocking requests.

What You Need Coming In

Knowledge:

Intermediate Laravel: models, relationships, migrations
Database foreign keys and indexes
Basic HTTP APIs and JSON
Familiarity with asynchronous processing concepts

Environment:

PHP 8.2+
Laravel 12.x
Queue driver: Database (dev) or Redis (production)
API keys for Reddit and Google Gemini

# .env
REDDIT_USERNAME=your-app-name
GEMINI_API_KEY=your-gemini-api-key

The Data Flow

graph TB
    A[Reddit API] -->|Fetch Posts| B[RedditService]
    B -->|Queue Job| C[ProcessSingleRedditPost]
    C -->|Extract Content| D[Post Data]
    D -->|Send to AI| E[GeminiTaggingService]
    E -->|Analyze Text| F[Google Gemini API]
    F -->|Return Tags| G[Tag Suggestions]
    G -->|Fuzzy Match| H[Entity Resolution]
    H -->|Create Records| I[Post Model]
    I -->|Polymorphic Relations| J[PostTag Pivot]
    J -->|Links to| K[Groups/Idols/Songs]
    
    style E fill:#f9f,stroke:#333,stroke-width:2px
    style J fill:#bbf,stroke:#333,stroke-width:2px

Every Reddit post goes through five stages: fetch, AI analysis, entity resolution, post creation, and tag attachment. The AI stage is the most expensive (API call, non-trivial latency), so everything runs asynchronously via Laravel’s job queue.

The Polymorphic Migration

// database/migrations/xxxx_create_post_tags_table.php

return new class extends Migration
{
    public function up(): void
    {
        Schema::create('post_tags', function (Blueprint $table) {
            $table->id();
            $table->foreignId('post_id')
                  ->constrained()
                  ->onDelete('cascade');
            
            $table->string('taggable_type');        // "App\Models\Group"
            $table->unsignedBigInteger('taggable_id'); // 42
            
            $table->timestamps();
            
            $table->index(['taggable_type', 'taggable_id']);
            $table->unique(['post_id', 'taggable_type', 'taggable_id']);
        });
    }
};

taggable_type stores the fully-qualified class name. taggable_id stores the entity’s primary key. Together they can reference any model. The composite index makes lookups fast. The unique constraint prevents tagging the same entity twice — without it, calling attach([1, 1, 1]) would create three identical rows.

Setting Up the Relationships

The Post model gets relationships for each taggable type:

// app/Models/Post.php

class Post extends Model
{
    public function tags()
    {
        return $this->hasMany(PostTag::class);
    }
    
    public function taggedGroups()
    {
        return $this->morphedByMany(Group::class, 'taggable', 'post_tags');
    }
    
    public function taggedIdols()
    {
        return $this->morphedByMany(Idol::class, 'taggable', 'post_tags');
    }
    
    public function taggedSongs()
    {
        return $this->morphedByMany(Song::class, 'taggable', 'post_tags');
    }
    
    public function taggedVarietyShows()
    {
        return $this->morphedByMany(VarietyShow::class, 'taggable', 'post_tags');
    }
}

The reverse relationship lets you query from the entity side:

// app/Models/Group.php

class Group extends Model
{
    public function posts()
    {
        return $this->morphToMany(Post::class, 'taggable', 'post_tags');
    }
}

// Usage:
$group = Group::find(1);
$posts = $group->posts;  // All posts tagged with this group

When you call $post->taggedGroups, Laravel generates:

SELECT groups.*
FROM groups
INNER JOIN post_tags ON groups.id = post_tags.taggable_id
WHERE post_tags.post_id = 42
  AND post_tags.taggable_type = 'App\Models\Group';

The taggable_type filter is what makes it polymorphic — the same pivot table serves all entity types, filtered by class name.

Use with('taggedGroups') to eager-load. Without it, accessing $post->taggedGroups inside a loop fires one query per post:

// Bad: 1 + N queries
$posts = Post::all();
foreach ($posts as $post) { echo $post->taggedGroups->count(); }

// Good: 2 queries total
$posts = Post::with('taggedGroups')->get();
foreach ($posts as $post) { echo $post->taggedGroups->count(); }

Fetching Reddit Posts

Reddit’s JSON API is public — no OAuth required, just a well-formed User-Agent header:

// app/Services/RedditService.php

class RedditService
{
    private string $baseUrl = 'https://www.reddit.com';
    private string $userAgent;

    public function __construct()
    {
        $this->userAgent = config('app.name') . '/1.0 (by /u/' .
                          config('services.reddit.username') . ')';
    }

    public function fetchDailyPosts(): array
    {
        $response = Http::withHeaders([
            'User-Agent' => $this->userAgent,
        ])->get($this->baseUrl . '/r/kpop/new.json?limit=200');

        if (!$response->successful()) { return []; }

        $posts = $response->json()['data']['children'] ?? [];
        $twentyFourHoursAgo = time() - (24 * 60 * 60);
        
        $recentPosts = array_filter($posts, fn($post) =>
            $post['data']['created_utc'] > $twentyFourHoursAgo
        );

        return array_map(fn($post) => $this->formatRedditPost($post['data']), $recentPosts);
    }

    private function formatRedditPost(array $postData): array
    {
        return [
            'reddit_id'   => $postData['id'],
            'title'       => $postData['title'],
            'content'     => $postData['selftext'] ?? '',
            'url'         => $postData['url'] ?? null,
            'author'      => $postData['author'],
            'score'       => $postData['score'],
            'created_utc' => Carbon::createFromTimestamp($postData['created_utc']),
            'permalink'   => 'https://www.reddit.com' . $postData['permalink'],
            'flair_text'  => $postData['link_flair_text'] ?? null,
        ];
    }
}

Reddit blocks requests without a proper User-Agent. The format they expect includes your app name and username — it’s how they track API consumers, not authentication.

AI Tagging With Google Gemini

The key design decision here: we provide the AI with our actual entity lists. This constrains the output to known entities and dramatically reduces hallucination:

// app/Services/GeminiTaggingService.php

class GeminiTaggingService
{
    private string $apiKey;
    private string $apiUrl = 'https://generativelanguage.googleapis.com/v1beta/models/gemini-pro:generateContent';

    public function analyzePost(array $postData): array
    {
        $prompt = $this->buildPrompt($postData);
        
        $response = Http::withHeaders(['Content-Type' => 'application/json'])
            ->post($this->apiUrl . '?key=' . $this->apiKey, [
                'contents' => [['parts' => [['text' => $prompt]]]],
                'generationConfig' => [
                    'temperature' => 0.2,  // Low temperature for deterministic output
                    'topK' => 40,
                    'topP' => 0.95,
                ]
            ]);

        if (!$response->successful()) {
            return ['groups' => [], 'idols' => [], 'songs' => []];
        }

        $text = $response->json()['candidates'][0]['content']['parts'][0]['text'] ?? '';
        return $this->parseAIResponse($text);
    }

    private function buildPrompt(array $postData): string
    {
        $groups = Group::pluck('name')->toArray();
        $idols  = Idol::pluck('stage_name')->toArray();
        $songs  = Song::pluck('title')->toArray();

        return <<<PROMPT
You are a K-pop content analyzer. Analyze the following post and identify relevant entities.

POST TITLE: {$postData['title']}
POST CONTENT: {$postData['content']}
POST FLAIR: {$postData['flair_text']}

AVAILABLE GROUPS: {$this->formatList($groups)}
AVAILABLE IDOLS: {$this->formatList($idols)}
AVAILABLE SONGS: {$this->formatList($songs)}

Return ONLY a JSON object with this structure:
{
  "groups": ["group1", "group2"],
  "idols": ["idol1", "idol2"],
  "songs": ["song1", "song2"]
}

Only include entities you are CONFIDENT are mentioned. Return empty arrays if unsure.
PROMPT;
    }

    private function parseAIResponse(string $text): array
    {
        $jsonStart = strpos($text, '{');
        $jsonEnd   = strrpos($text, '}');
        if ($jsonStart === false || $jsonEnd === false) {
            return ['groups' => [], 'idols' => [], 'songs' => []];
        }
        $jsonString = substr($text, $jsonStart, $jsonEnd - $jsonStart + 1);
        return json_decode($jsonString, true) ?? ['groups' => [], 'idols' => [], 'songs' => []];
    }

    private function formatList(array $items): string
    {
        return implode(', ', array_slice($items, 0, 100));
    }
}

Temperature 0.2 makes the model more deterministic and less creative — which is exactly what you want for entity extraction. Providing the actual entity lists in the prompt means the model is selecting from known options rather than generating free-form text. Still, AI will hallucinate entities that don’t exist. That’s why entity resolution matters.

Resolving Names to Database IDs

The AI returns names like “BLACKPINK” and “Jennie.” You need database IDs. The resolution pipeline tries exact match first, then case-insensitive match, then fuzzy match:

// app/Services/RedditPostProcessingService.php

private function resolveGroups(array $groupNames): array
{
    $resolved = [];
    foreach ($groupNames as $name) {
        $group = Group::where('name', $name)->first()
            ?? Group::whereRaw('LOWER(name) = ?', [strtolower($name)])->first()
            ?? $this->fuzzyMatchGroup($name);
        
        if ($group) { $resolved[] = $group->id; }
    }
    return array_unique($resolved);
}

private function fuzzyMatchGroup(string $name): ?Group
{
    $groups    = Group::all();
    $bestMatch = null;
    $bestScore = 0;
    
    foreach ($groups as $group) {
        $distance  = levenshtein(strtolower($name), strtolower($group->name));
        $maxLength = max(strlen($name), strlen($group->name));
        $similarity = (1 - ($distance / $maxLength)) * 100;
        
        if ($similarity > 80 && $similarity > $bestScore) {
            $bestScore = $similarity;
            $bestMatch = $group;
        }
    }
    return $bestMatch;
}

The 80% similarity threshold is intentional. It catches “BLACKPINK” vs “Black Pink” but not “BTS” vs random three-letter strings. If the AI returns an entity that doesn’t match anything in the database at 80%+ similarity, it’s silently dropped. Unmatched entities should be logged separately for periodic review — over time you’ll learn what the AI consistently hallucinates.

Idol resolution checks both stage_name and birth_name — some users refer to “Jennie” while others use her full name. Song resolution is case-insensitive only, since song titles don’t have the same abbreviation problem as group names.

Creating the Post and Attaching Tags

public function createPostWithTags(array $postData, array $resolvedEntities): void
{
    if (Post::where('reddit_id', $postData['reddit_id'])->exists()) {
        return;  // Skip duplicates
    }

    $botUser = User::firstOrCreate(
        ['username' => 'reddit_bot'],
        ['name' => 'Reddit Bot', 'email' => 'reddit@mykpoplists.com',
         'password' => bcrypt(Str::random(32))]
    );

    $post = Post::create([
        'user_id'          => $botUser->id,
        'title'            => $postData['title'],
        'content'          => $postData['content'],
        'video_url'        => $postData['url'],
        'reddit_id'        => $postData['reddit_id'],
        'reddit_permalink' => $postData['permalink'],
        'reddit_author'    => $postData['author'],
        'reddit_flair'     => $postData['flair_text'],
    ]);

    if (!empty($resolvedEntities['groups'])) {
        $post->taggedGroups()->attach($resolvedEntities['groups']);
    }
    if (!empty($resolvedEntities['idols'])) {
        $post->taggedIdols()->attach($resolvedEntities['idols']);
    }
    if (!empty($resolvedEntities['songs'])) {
        $post->taggedSongs()->attach($resolvedEntities['songs']);
    }
}

When you call $post->taggedGroups()->attach([1, 2, 3]), Laravel generates:

INSERT INTO post_tags (post_id, taggable_type, taggable_id, created_at, updated_at)
VALUES
  (42, 'App\Models\Group', 1, NOW(), NOW()),
  (42, 'App\Models\Group', 2, NOW(), NOW()),
  (42, 'App\Models\Group', 3, NOW(), NOW());

taggable_type is set automatically from the relationship definition. You never write it manually.

Always sanitize external content before storing it. Reddit posts can contain HTML, XSS payloads, and malicious URLs. At minimum: Str::limit(strip_tags($postData['title']), 255) for titles, strip_tags($postData['content'], '<p><br><a>') for content.

Processing Everything Asynchronously

Calling the Gemini API synchronously would block the request for 1–3 seconds per post. With 50 posts, that’s 50–150 seconds of blocking. Use Laravel’s queue system instead:

// app/Jobs/ProcessSingleRedditPost.php

class ProcessSingleRedditPost implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public int $tries   = 3;
    public int $backoff = 60;    // Wait 60s between retries
    public int $timeout = 120;   // Timeout after 2 minutes

    public function __construct(public array $postData) {}

    public function handle(
        GeminiTaggingService $geminiService,
        RedditPostProcessingService $processingService
    ): void {
        $aiTags           = $geminiService->analyzePost($this->postData);
        $resolvedEntities = $processingService->resolveEntities($aiTags);
        $processingService->createPostWithTags($this->postData, $resolvedEntities);
    }

    public function failed(\Throwable $exception): void
    {
        Log::error('Reddit post job permanently failed', [
            'reddit_id' => $this->postData['reddit_id'],
            'error'     => $exception->getMessage()
        ]);
    }
}

The command that dispatches jobs spreads them over time to stay within API rate limits:

// app/Console/Commands/QueueRedditPosts.php

public function handle(RedditService $redditService): int
{
    $posts      = $redditService->fetchDailyPosts();
    $posts      = array_slice($posts, 0, $this->option('max-posts'));
    $totalSecs  = $this->option('spread-hours') * 3600;
    $delayBetween = count($posts) > 1 ? $totalSecs / count($posts) : 0;

    foreach ($posts as $index => $postData) {
        $delay = (int)($index * $delayBetween);
        ProcessSingleRedditPost::dispatch($postData)->delay(now()->addSeconds($delay));
    }
    return Command::SUCCESS;
}

50 posts spread over 12 hours works out to one Gemini API call every 14.4 minutes — well within both Reddit’s rate limit (60 requests/minute) and Gemini’s free tier (1,500 requests/day).

If you scale to monitoring 20 containers every 60 seconds with concurrent SSH calls, the single Node.js event loop will lag. Standard n8n is single-threaded. Beyond a dozen active workflows, switch to Queue Mode with Redis and dedicated worker processes.

Scaling Considerations

The fuzzy matching in fuzzyMatchGroup loads all groups into memory for every name it can’t exact-match. For a catalog of a few hundred groups, this is fine. For thousands, you’d want to preload the group list once per job rather than querying on every unresolved name.

The AI prompt currently includes up to 100 entities per type (sliced before the Gemini call). Keep an eye on token usage as your database grows — if the entity lists exceed the model’s context window, you’ll get truncation errors.

One thing the unique constraint on post_tags doesn’t handle: if you call attach() with duplicate IDs in the same call without the constraint, you’d get duplicate rows. The database constraint catches this at the SQL level. Alternatively, use sync() instead of attach() if you want to replace the full set of tags rather than add to it.