# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview

Supermarket scraper system for Russian supermarkets (Магнит, with Пятёрочка/5ka planned). The system scrapes product data via API (primary method), with planned support for web scraping (Playwright) and Android app scraping (Appium). Data is stored in PostgreSQL via Prisma ORM, with planned integration of pgvector embeddings and LangChain for LLM-based queries.

**Current Status:** MVP phase. API scraping for Magnit is functional; the database and basic scraping infrastructure are complete.
## Development Commands

### Package Manager

This project uses **pnpm** (not npm or yarn).

### Essential Commands

```bash
# Install dependencies
pnpm install

# Install Playwright browsers (required for scraping)
pnpm exec playwright install chromium

# Type checking (no build output)
pnpm type-check

# Build TypeScript to dist/
pnpm build

# Run Magnit API scraper
pnpm dev

# Test database connection
pnpm test-db
```
### Prisma Commands

```bash
# Generate Prisma Client (required after schema changes)
pnpm prisma:generate

# Create and apply migrations
pnpm prisma:migrate

# Open Prisma Studio (database GUI)
pnpm prisma:studio

# Format schema.prisma file
pnpm prisma:format
```
### Database Setup

```bash
# Start PostgreSQL via Docker
docker-compose up -d

# Stop PostgreSQL
docker-compose down
```
### Running Scripts Directly

```bash
# Run scraper with tsx
tsx src/scripts/scrape-magnit-products.ts

# Run with a specific store code
MAGNIT_STORE_CODE=992301 tsx src/scripts/scrape-magnit-products.ts
```
## Architecture Overview

### Core Design Pattern

Layered architecture with clear separation of concerns:

```
Scripts (orchestration)
    ↓
Scrapers (data acquisition)
    ↓
Parser (transformation)
    ↓
Services (business logic)
    ↓
Database (persistence via Prisma)
```
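A minimal sketch of how these layers compose, with hypothetical names and an in-memory stand-in for the database (the real code lives in `src/`; nothing below is the actual implementation):

```typescript
// Hypothetical layer contracts: each layer only talks to the one below it.
interface RawApiItem { id: string; name: string; price: number }
interface ParsedProduct { externalId: string; name: string; priceRub: number }

// Scraper layer: acquires raw data (stubbed here).
const scrape = async (): Promise<RawApiItem[]> => [
  { id: "42", name: "Молоко", price: 24999 }, // API price in kopecks
];

// Parser layer: pure transformation, no I/O.
const parse = (item: RawApiItem): ParsedProduct => ({
  externalId: item.id,
  name: item.name,
  priceRub: item.price / 100, // kopecks -> rubles
});

// Service layer: persistence (an in-memory array standing in for Prisma).
const db: ParsedProduct[] = [];
const save = async (products: ParsedProduct[]): Promise<void> => {
  db.push(...products);
};

// Script layer: orchestration only.
const run = async (): Promise<void> => save((await scrape()).map(parse));
```

The point of the layering is that the parser stays pure (easy to test) and the scraper never touches the database directly.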
### Key Architectural Components

#### 1. Scrapers (`src/scrapers/api/magnit/`)

- `MagnitApiScraper`: Main scraper class implementing a hybrid Playwright + Axios approach
- Lifecycle: `initialize()` → `scrapeAllProducts()` → `saveToDatabase()` → `close()`
- Authentication pattern: Uses Playwright to obtain session cookies and a device ID, then makes API requests via Axios
- Pagination: Offset-based, 100 items per request, 300 ms rate limiting between requests
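The pagination pattern can be sketched as follows (a hedged sketch, not the actual source: `fetchPage` is a stand-in for the real API call, and the loop terminates on the first short page):

```typescript
type Page<T> = { items: T[] };

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Offset-based pagination with a fixed delay between requests.
async function scrapeAll<T>(
  fetchPage: (limit: number, offset: number) => Promise<Page<T>>,
  limit = 100,    // items per request, mirroring the doc above
  delayMs = 300,  // rate limiting between requests
): Promise<T[]> {
  const all: T[] = [];
  for (let offset = 0; ; offset += limit) {
    const { items } = await fetchPage(limit, offset);
    all.push(...items);
    if (items.length < limit) break; // short page => last page
    await sleep(delayMs);
  }
  return all;
}
```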
#### 2. Services (`src/services/`)

- `ProductService`: Handles all database persistence with batch operations (50 items per batch)
- `ProductParser`: Transforms API responses into the database schema format (price conversion, date parsing, etc.)
- Pattern: All services accept `PrismaClient` via dependency injection
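The injection and batching pattern might look like this (an illustrative sketch: `Client` is a minimal stand-in for `PrismaClient`, and `ProductServiceSketch` is not the real class):

```typescript
interface Client {
  saveMany(rows: { externalId: string }[]): Promise<number>;
}

class ProductServiceSketch {
  // The database client is injected, never constructed by the service.
  constructor(
    private readonly client: Client,
    private readonly batchSize = 50, // mirrors the 50-item batches above
  ) {}

  async saveProducts(rows: { externalId: string }[]): Promise<number> {
    let saved = 0;
    for (let i = 0; i < rows.length; i += this.batchSize) {
      saved += await this.client.saveMany(rows.slice(i, i + this.batchSize));
    }
    return saved;
  }
}
```

Injecting the client keeps services trivially testable with a mock and lets all services share one connection pool.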
#### 3. Database (`src/database/`)

- Prisma ORM with the PostgreSQL adapter (`@prisma/adapter-pg`)
- Schema location: `src/database/prisma/schema.prisma`
- Models: `Store`, `Category` (hierarchical), `Product`, `ScrapingSession`
- Key constraint: `(externalId, storeId)` unique, used for upsert operations
#### 4. Scripts (`src/scripts/`)

- `scrape-magnit-products.ts`: Main entry point demonstrating the full flow
- `test-db-connection.ts`: Database connection verification
### Data Flow

1. `MagnitApiScraper.initialize()`
   - Launch Chromium via Playwright
   - Navigate to magnit.ru
   - Extract the `mg_udi` cookie as `deviceId`
   - Configure Axios with headers + cookies
2. `MagnitApiScraper.scrapeAllProducts()`
   - POST to `/webgate/v2/goods/search`
   - Paginate with `limit=100`, incrementing `offset`
   - Return a `ProductItem[]` array
3. `MagnitApiScraper.saveToDatabase()`
   - Get/create the `Store` via `ProductService`
   - Extract unique categories from the products
   - Get/create `Category` records via `ProductService`
   - Parse products via `ProductParser`
   - Batch-save via `ProductService.saveProducts()`
4. Database persistence
   - Upsert based on `(externalId, storeId)`
   - Related entities created idempotently
## Magnit API Authentication & Anti-Bot Bypass

### Critical Implementation Detail

The Magnit API (`https://magnit.ru/webgate/v2/goods/search`) is protected and requires a hybrid approach:

**Hybrid pattern:** Playwright (session init) + Axios (API requests)

1. Use Playwright to visit magnit.ru and obtain:
   - The `mg_udi` cookie (device ID in UUID format)
   - Other session cookies (`oxxfgh`, `uwyii`, etc.)
2. Extract the required values:
   - `x-device-id` header = the `mg_udi` cookie value
   - All cookies concatenated into a `Cookie` header string
3. Make API requests via Axios with:
   - Required headers: `x-device-id`, `x-client-name: magnit`, `x-device-platform: Web`, `x-app-version`, `x-new-magnit: true`
   - The `Cookie` header from the Playwright session
   - Standard browser headers (User-Agent, Referer, etc.)
4. Handle 403 errors: automatically re-initialize the Playwright session
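The 403-recovery step can be sketched as a small retry wrapper (a hedged sketch under assumed names; `withSessionRetry` and the `status` shape are illustrative, not the actual source):

```typescript
// On an auth failure (403), re-run session initialization once and retry.
async function withSessionRetry<T>(
  request: () => Promise<T>,
  reinitialize: () => Promise<void>,
): Promise<T> {
  try {
    return await request();
  } catch (err) {
    const status = (err as { status?: number }).status;
    if (status !== 403) throw err; // only recover from auth failures
    await reinitialize();          // fresh Playwright session => new cookies
    return request();              // single retry with the new session
  }
}
```

Limiting the wrapper to one retry avoids hammering the endpoint if the session initialization itself is being blocked.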
### API Details

**Endpoint:** `POST https://magnit.ru/webgate/v2/goods/search`

**Request payload:**

```typescript
{
  sort: { order: "desc", type: "popularity" },
  pagination: { limit: 100, offset: 0 },
  categories: [],           // Empty = all products
  includeAdultGoods: false,
  storeCode: "992301",      // From env var
  storeType: "6",
  catalogType: "1"
}
```

**Response structure:**

- `items[]`: array of products
- Prices are in kopecks (24999 = 249.99 rubles) and must be converted to rubles
- Promotion data: `promotion.oldPrice`, `promotion.discountPercent`, `promotion.endDate`
- Ratings: `ratings.rating`, `ratings.scoresCount`, `ratings.commentsCount`

**Rate limiting:** 300 ms delay between requests (implemented in the scraper)
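The kopeck conversion noted above can be sketched as (hypothetical helpers; the real conversion lives in `ProductParser`):

```typescript
// API prices arrive as integer kopecks; the database stores rubles.
const kopecksToRubles = (kopecks: number): number => kopecks / 100;

// Hypothetical display helper, two decimal places plus the ruble sign.
const rublesToDisplay = (rubles: number): string => `${rubles.toFixed(2)} ₽`;
```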
## Database Schema Patterns

### Upsert Strategy

Products use the composite unique constraint `(externalId, storeId)` for idempotent updates:

```typescript
// ProductService automatically handles upsert via Prisma
await prisma.product.upsert({
  where: { externalId_storeId: { externalId, storeId } },
  update: { /* latest data */ },
  create: { /* new product */ },
})
```
### Category Hierarchy

Self-referential relationship via `parentId`:

```prisma
model Category {
  parent   Category?  @relation("CategoryHierarchy", fields: [parentId], references: [id])
  children Category[] @relation("CategoryHierarchy")
}
```
### Store Types

The store `type` field indicates the data source:

- `"web"`: from web scraping
- `"app"`: from Android app scraping
- API-based stores use the `code` field (e.g., `"992301"`)
## Configuration & Environment

### Required Environment Variables

```bash
DATABASE_URL=postgresql://user:password@localhost:5432/supermarket
MAGNIT_STORE_CODE=992301  # Store code for scraping
```

### TypeScript Configuration

- Target: ES2023
- Module: ESNext
- Strict mode enabled
- Output: `dist/` directory
## Planned Features (Not Yet Implemented)

See `.cursor/plans/supermarket_scraper_system_1af4ed29.plan.md` for the full roadmap.

**Phase 1 (MVP - mostly complete):**

- ✅ Database setup with Prisma
- ✅ Magnit API scraping with authentication bypass
- ⏳ pgvector embeddings (planned)

**Phase 2 (Future):**

- Web scraping via Playwright (fallback method)
- Android app scraping via Appium
- Pyaterochka/5ka scraper
- LangChain integration for LLM queries
- REST API server for external integrations (n8n, etc.)
- Scheduler service with cron-like functionality
- Price history tracking and analytics
## Requestly Integration

The project includes Requestly HTTP request testing. API test files are stored in the `.requestly-supermarket/` directory.

Key testing patterns (from `.cursor/rules/requestly-test-rules.mdc`):

- Use `rq.test()` for test definitions
- Use `rq.expect()` for Chai.js-style assertions
- Access the response via `rq.response.body` (parse as JSON before use)
- Status checks: `rq.response.to.be.ok`, `rq.response.to.be.success`
- JSON validation: `rq.response.to.have.jsonBody(path, value)`
- Prices are stored in kopecks (24999 = 249.99 rubles)
## Important Development Notes

### When Adding New Scrapers

- Extend base patterns from `src/scrapers/api/magnit/`
- Implement the `initialize()`, `scrapeAllProducts()`, `saveToDatabase()`, `close()` lifecycle
- Use `ProductParser` for data transformation
- Use `ProductService` for database operations (never call Prisma directly from scrapers)

### When Modifying the Database Schema

- Update `src/database/prisma/schema.prisma`
- Run `pnpm prisma:generate` to update the Prisma Client
- Run `pnpm prisma:migrate` to create and apply migrations
- Update related TypeScript types in `src/scrapers/*/types.ts`
### Error Handling Pattern

Custom error classes live in `src/utils/errors.ts`:

- `ScraperError`: scraping failures
- `DatabaseError`: database operations
- `APIError`: HTTP/API failures (includes `statusCode` and the response body)
### Logging Pattern

Static logger in `src/utils/logger.ts`:

- `Logger.info()`, `Logger.error()`, `Logger.warn()`, `Logger.debug()`
- Debug messages are gated by the `DEBUG` environment variable
- ISO timestamp formatting
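The gating behavior can be sketched as follows (an illustrative sketch only; `LoggerSketch` is not the class in `src/utils/logger.ts`):

```typescript
// Static, DEBUG-gated logger with ISO timestamps.
class LoggerSketch {
  private static format(level: string, msg: string): string {
    return `[${new Date().toISOString()}] [${level}] ${msg}`;
  }
  static info(msg: string): void {
    console.log(LoggerSketch.format("INFO", msg));
  }
  static debug(msg: string): void {
    // Debug output only appears when the DEBUG env var is set.
    if (process.env.DEBUG) console.log(LoggerSketch.format("DEBUG", msg));
  }
}
```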
## Testing

No test framework is currently configured. Manual testing via:

- `pnpm test-db`: database connection check
- `pnpm dev`: full scraping run
- Prisma Studio: data inspection