CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Supermarket scraper system for Russian supermarket chains: Magnit (Магнит), with Pyaterochka (Пятёрочка/5ka) planned. The system scrapes product data via API (the primary method), with planned support for web scraping (Playwright) and Android app scraping (Appium). Data is stored in PostgreSQL via Prisma ORM, with planned integration of pgvector embeddings and LangChain for LLM-based queries.

Current Status: MVP phase - API scraping for Magnit is functional. Database and basic scraping infrastructure are complete.

Development Commands

Package Manager

This project uses pnpm (not npm or yarn).

Essential Commands

# Install dependencies
pnpm install

# Install Playwright browsers (required for scraping)
pnpm exec playwright install chromium

# Type checking (no build output)
pnpm type-check

# Build TypeScript to dist/
pnpm build

# Run Magnit API scraper
pnpm dev

# Test database connection
pnpm test-db

Prisma Commands

# Generate Prisma Client (required after schema changes)
pnpm prisma:generate

# Create and apply migrations
pnpm prisma:migrate

# Open Prisma Studio (database GUI)
pnpm prisma:studio

# Format schema.prisma file
pnpm prisma:format

Database Setup

# Start PostgreSQL via Docker
docker-compose up -d

# Stop PostgreSQL
docker-compose down

Running Scripts Directly

# Run scraper with tsx
tsx src/scripts/scrape-magnit-products.ts

# Run with specific store code
MAGNIT_STORE_CODE=992301 tsx src/scripts/scrape-magnit-products.ts

Architecture Overview

Core Design Pattern

Layered architecture with clear separation of concerns:

Scripts (orchestration)
    ↓
Scrapers (data acquisition)
    ↓
Parser (transformation)
    ↓
Services (business logic)
    ↓
Database (persistence via Prisma)

Key Architectural Components

1. Scrapers (src/scrapers/api/magnit/)

  • MagnitApiScraper: Main scraper class implementing hybrid Playwright + Axios approach
  • Lifecycle management: initialize() → scrapeAllProducts() → saveToDatabase() → close()
  • Authentication pattern: Uses Playwright to obtain session cookies and device ID, then makes API requests via Axios
  • Pagination: Offset-based with 100 items per request, 300ms rate limiting between requests
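The pagination loop can be sketched as follows — a minimal, self-contained version of the offset-based approach described above, assuming a hypothetical fetchPage(offset, limit) callback in place of the real Axios request:

```typescript
// Sketch of offset-based pagination with rate limiting (constants from
// this document: 100 items per request, 300 ms between requests).
const PAGE_LIMIT = 100;
const RATE_LIMIT_MS = 300;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function scrapeAll<T>(
  fetchPage: (offset: number, limit: number) => Promise<T[]>,
): Promise<T[]> {
  const all: T[] = [];
  for (let offset = 0; ; offset += PAGE_LIMIT) {
    const page = await fetchPage(offset, PAGE_LIMIT);
    all.push(...page);
    if (page.length < PAGE_LIMIT) break; // short page = last page
    await sleep(RATE_LIMIT_MS);         // rate limit between requests
  }
  return all;
}
```

A short page (fewer than 100 items) signals the end of the catalog, so no separate total-count request is needed.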

2. Services (src/services/)

  • ProductService: Handles all database persistence with batch operations (50 items per batch)
  • ProductParser: Transforms API responses to database schema format (price conversion, date parsing, etc.)
  • Pattern: All services accept PrismaClient via dependency injection
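The dependency-injection pattern can be illustrated with a sketch (ExampleProductService and the PrismaLike interface are hypothetical names for illustration; the real ProductService takes a full PrismaClient):

```typescript
// The service receives its database client via the constructor rather
// than constructing one itself, which makes it trivial to stub in tests.
interface PrismaLike {
  product: { upsert: (args: unknown) => Promise<unknown> };
}

class ExampleProductService {
  constructor(private readonly prisma: PrismaLike) {}

  // Batch persistence: 50 items per batch, as described above.
  async saveBatch(items: unknown[], batchSize = 50): Promise<number> {
    let saved = 0;
    for (let i = 0; i < items.length; i += batchSize) {
      const batch = items.slice(i, i + batchSize);
      await Promise.all(batch.map((item) => this.prisma.product.upsert(item)));
      saved += batch.length;
    }
    return saved;
  }
}
```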

3. Database (src/database/)

  • Prisma ORM with PostgreSQL adapter (@prisma/adapter-pg)
  • Schema location: src/database/prisma/schema.prisma
  • Models: Store, Category (hierarchical), Product, ScrapingSession
  • Key constraint: (externalId, storeId) unique for upsert operations

4. Scripts (src/scripts/)

  • scrape-magnit-products.ts: Main entry point demonstrating full flow
  • test-db-connection.ts: Database connection verification

Data Flow

1. MagnitApiScraper.initialize()
   - Launch Chromium via Playwright
   - Navigate to magnit.ru
   - Extract mg_udi cookie as deviceId
   - Configure Axios with headers + cookies

2. MagnitApiScraper.scrapeAllProducts()
   - POST to /webgate/v2/goods/search
   - Paginate with limit=100, offset increments
   - Return ProductItem[] array

3. MagnitApiScraper.saveToDatabase()
   - Get/create Store via ProductService
   - Extract unique categories from products
   - Get/create Categories via ProductService
   - Parse products via ProductParser
   - Batch save via ProductService.saveProducts()

4. Database persistence
   - Upsert based on (externalId, storeId)
   - Related entities created idempotently
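The four steps above can be sketched as a single orchestration function, roughly what scrape-magnit-products.ts presumably does (method names come from this document; the saveToDatabase signature is an assumption):

```typescript
// End-to-end flow: initialize -> scrape -> save, with close() guaranteed
// so the Playwright browser is always released, even on failure.
interface ScraperLike {
  initialize(): Promise<void>;
  scrapeAllProducts(): Promise<unknown[]>;
  saveToDatabase(items: unknown[]): Promise<void>;
  close(): Promise<void>;
}

async function run(scraper: ScraperLike): Promise<void> {
  await scraper.initialize();
  try {
    const items = await scraper.scrapeAllProducts();
    await scraper.saveToDatabase(items);
  } finally {
    await scraper.close();
  }
}
```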

Magnit API Authentication & Anti-Bot Bypass

Critical Implementation Detail

The Magnit API (https://magnit.ru/webgate/v2/goods/search) is protected by anti-bot measures and requires a hybrid approach:

Hybrid Pattern: Playwright (session init) + Axios (API requests)

  1. Use Playwright to visit magnit.ru and obtain:
     • mg_udi cookie (Device ID in UUID format)
     • Other session cookies (oxxfgh, uwyii, etc.)
  2. Extract the required values:
     • x-device-id header = mg_udi cookie value
     • All cookies joined into a Cookie header string
  3. Make API requests via Axios with:
     • Required headers: x-device-id, x-client-name: magnit, x-device-platform: Web, x-app-version, x-new-magnit: true
     • Cookie header from the Playwright session
     • Standard browser headers (User-Agent, Referer, etc.)
  4. Handle 403 errors: Automatically re-initialize the Playwright session
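Steps 2–3 can be sketched as a pure helper that turns the cookies Playwright collected into the Axios headers (cookie and header names come from this document; x-app-version is required but its value is not specified here, so it is omitted from the sketch):

```typescript
// Build the Magnit API headers from a Playwright cookie list.
interface SessionCookie {
  name: string;
  value: string;
}

function buildApiHeaders(cookies: SessionCookie[]): Record<string, string> {
  const deviceId = cookies.find((c) => c.name === "mg_udi")?.value;
  if (!deviceId) {
    throw new Error("mg_udi cookie missing - session initialization failed");
  }
  return {
    "x-device-id": deviceId, // mg_udi cookie doubles as the device ID
    "x-client-name": "magnit",
    "x-device-platform": "Web",
    "x-new-magnit": "true",
    // All session cookies serialized into a single Cookie header.
    Cookie: cookies.map((c) => `${c.name}=${c.value}`).join("; "),
  };
}
```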

API Details

Endpoint: POST https://magnit.ru/webgate/v2/goods/search

Request payload:

{
  sort: { order: "desc", type: "popularity" },
  pagination: { limit: 100, offset: 0 },
  categories: [],              // Empty = all products
  includeAdultGoods: false,
  storeCode: "992301",         // From env var
  storeType: "6",
  catalogType: "1"
}

Response structure:

  • items[] - array of products
  • Prices in kopecks (24999 = 249.99 rubles) - must convert to rubles
  • Promotion data: promotion.oldPrice, promotion.discountPercent, promotion.endDate
  • Ratings: ratings.rating, ratings.scoresCount, ratings.commentsCount

Rate limiting: 300ms delay between requests (implemented in scraper)
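Since prices arrive in kopecks, the conversion the parser presumably applies looks like this (a minimal sketch; the actual ProductParser implementation is not reproduced here):

```typescript
// Convert an API price in kopecks to rubles: 24999 -> 249.99.
function kopecksToRubles(kopecks: number): number {
  return kopecks / 100;
}
```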

Database Schema Patterns

Upsert Strategy

Products use composite unique constraint (externalId, storeId) for idempotent updates:

// ProductService automatically handles upsert via Prisma
await prisma.product.upsert({
  where: { externalId_storeId: { externalId, storeId } },
  update: { /* latest data */ },
  create: { /* new product */ }
})

Category Hierarchy

Self-referential relationship via parentId:

model Category {
  parent   Category?  @relation("CategoryHierarchy", fields: [parentId], references: [id])
  children Category[] @relation("CategoryHierarchy")
}

Store Types

Store type field indicates data source:

  • "web" - from web scraping
  • "app" - from Android app scraping
  • API-based stores use code field (e.g., "992301")

Configuration & Environment

Required Environment Variables

DATABASE_URL=postgresql://user:password@localhost:5432/supermarket
MAGNIT_STORE_CODE=992301  # Store code for scraping

TypeScript Configuration

  • Target: ES2023
  • Module: ESNext
  • Strict mode enabled
  • Output: dist/ directory
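The settings above correspond to a tsconfig.json along these lines (a sketch; the actual file may set additional options):

```json
{
  "compilerOptions": {
    "target": "ES2023",
    "module": "ESNext",
    "strict": true,
    "outDir": "dist"
  },
  "include": ["src"]
}
```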

Planned Features (Not Yet Implemented)

Reference .cursor/plans/supermarket_scraper_system_1af4ed29.plan.md for full roadmap.

Phase 1 (MVP - mostly complete):

  • Database setup with Prisma
  • Magnit API scraping with authentication bypass
  • pgvector embeddings (planned)

Phase 2 (Future):

  • Web scraping via Playwright (fallback method)
  • Android app scraping via Appium
  • Pyaterochka/5ka scraper
  • LangChain integration for LLM queries
  • REST API server for external integrations (n8n, etc.)
  • Scheduler service with cron-like functionality
  • Price history tracking and analytics

Requestly Integration

The project includes Requestly HTTP request testing integration. API test files are stored in .requestly-supermarket/ directory.

Key testing patterns (from .cursor/rules/requestly-test-rules.mdc):

  • Use rq.test() for test definitions
  • Use rq.expect() for Chai.js-style assertions
  • Access response via rq.response.body (parse as JSON before use)
  • Status checks: rq.response.to.be.ok, rq.response.to.be.success
  • JSON validation: rq.response.to.have.jsonBody(path, value)
  • Prices stored in kopecks (24999 = 249.99 rubles)

Important Development Notes

When Adding New Scrapers

  1. Extend base patterns from src/scrapers/api/magnit/
  2. Implement initialize(), scrapeAllProducts(), saveToDatabase(), close() lifecycle
  3. Use ProductParser for data transformation
  4. Use ProductService for database operations (never call Prisma directly from scrapers)

When Modifying Database Schema

  1. Update src/database/prisma/schema.prisma
  2. Run pnpm prisma:generate to update Prisma Client
  3. Run pnpm prisma:migrate to create and apply migrations
  4. Update related TypeScript types in src/scrapers/*/types.ts

Error Handling Pattern

Custom error classes in src/utils/errors.ts:

  • ScraperError - scraping failures
  • DatabaseError - database operations
  • APIError - HTTP/API failures (includes statusCode and response body)
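A sketch of what these classes likely look like (the exact fields in src/utils/errors.ts may differ; only the class names, and that APIError carries the status code and response body, come from this document):

```typescript
// Custom error classes for distinguishing failure domains.
class ScraperError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ScraperError";
  }
}

class DatabaseError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "DatabaseError";
  }
}

class APIError extends Error {
  constructor(
    message: string,
    public readonly statusCode: number,
    public readonly responseBody?: unknown, // e.g. the 403 error payload
  ) {
    super(message);
    this.name = "APIError";
  }
}
```

Catching APIError with statusCode === 403 is what would trigger the session re-initialization described in the anti-bot section.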

Logging Pattern

Static logger in src/utils/logger.ts:

  • Logger.info(), Logger.error(), Logger.warn(), Logger.debug()
  • Debug messages gated by DEBUG environment variable
  • ISO timestamp formatting
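A minimal sketch of this pattern (the real src/utils/logger.ts may format differently; only the method names, the DEBUG gating, and ISO timestamps come from this document):

```typescript
// Static logger: every line gets an ISO-8601 timestamp and a level tag;
// debug output is emitted only when the DEBUG environment variable is set.
class Logger {
  private static write(level: string, message: string): void {
    console.log(`[${new Date().toISOString()}] [${level}] ${message}`);
  }

  static info(message: string): void { Logger.write("INFO", message); }
  static warn(message: string): void { Logger.write("WARN", message); }
  static error(message: string): void { Logger.write("ERROR", message); }
  static debug(message: string): void {
    if (process.env.DEBUG) Logger.write("DEBUG", message);
  }
}
```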

Testing

No test framework currently configured. Manual testing via:

  • pnpm test-db - database connection
  • pnpm dev - full scraping run
  • Prisma Studio - data inspection