# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

A scraper system for Russian supermarkets: Magnit (Магнит) is supported, and Pyaterochka (Пятёрочка/5ka) is planned. The system scrapes product data via API (the primary method), with planned support for web scraping (Playwright) and Android app scraping (Appium). Data is stored in PostgreSQL through Prisma ORM, with planned integration of pgvector embeddings and LangChain for LLM-based queries.

**Current Status**: MVP phase - API scraping for Magnit is functional. Database and basic scraping infrastructure are complete.

## Development Commands

### Package Manager
This project uses **pnpm** (not npm or yarn).

### Essential Commands
```bash
# Install dependencies
pnpm install

# Install Playwright browsers (required for scraping)
pnpm exec playwright install chromium

# Type checking (no build output)
pnpm type-check

# Build TypeScript to dist/
pnpm build

# Run Magnit API scraper
pnpm dev

# Test database connection
pnpm test-db
```

### Prisma Commands
```bash
# Generate Prisma Client (required after schema changes)
pnpm prisma:generate

# Create and apply migrations
pnpm prisma:migrate

# Open Prisma Studio (database GUI)
pnpm prisma:studio

# Format schema.prisma file
pnpm prisma:format
```

### Database Setup
```bash
# Start PostgreSQL via Docker
docker-compose up -d

# Stop PostgreSQL
docker-compose down
```

### Running Scripts Directly
```bash
# Run scraper with tsx
tsx src/scripts/scrape-magnit-products.ts

# Run with specific store code
MAGNIT_STORE_CODE=992301 tsx src/scripts/scrape-magnit-products.ts
```

## Architecture Overview

### Core Design Pattern
**Layered architecture** with clear separation of concerns:

```
Scripts (orchestration)
    ↓
Scrapers (data acquisition)
    ↓
Parser (transformation)
    ↓
Services (business logic)
    ↓
Database (persistence via Prisma)
```

### Key Architectural Components

**1. Scrapers** (`src/scrapers/api/magnit/`)
- **MagnitApiScraper**: Main scraper class; implements a hybrid Playwright + Axios approach
- **Lifecycle management**: `initialize()` → `scrapeAllProducts()` → `saveToDatabase()` → `close()`
- **Authentication pattern**: Uses Playwright to obtain session cookies and device ID, then makes API requests via Axios
- **Pagination**: Offset-based with 100 items per request and a 300ms delay between requests for rate limiting

**2. Services** (`src/services/`)
- **ProductService**: Handles all database persistence with batch operations (50 items per batch)
- **ProductParser**: Transforms API responses to database schema format (price conversion, date parsing, etc.)
- **Pattern**: All services accept PrismaClient via dependency injection
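
The batch size of 50 mentioned for ProductService can be illustrated with a small chunking helper. This is a minimal sketch, not the actual ProductService code: `chunk`, `saveInBatches`, and the injected `saveBatch` callback are hypothetical names chosen for illustration.

```typescript
// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size <= 0) throw new Error("size must be positive");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical driver: persists batches one after another through an
// injected callback (in the real service this would wrap Prisma calls).
async function saveInBatches<T>(
  items: T[],
  saveBatch: (batch: T[]) => Promise<void>,
  batchSize = 50,
): Promise<number> {
  const batches = chunk(items, batchSize);
  for (const batch of batches) {
    await saveBatch(batch);
  }
  return batches.length;
}
```

Saving batches sequentially keeps the number of concurrent database operations bounded, which is one plausible reason to batch at all.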

**3. Database** (`src/database/`)
- **Prisma ORM** with PostgreSQL adapter (`@prisma/adapter-pg`)
- **Schema location**: `src/database/prisma/schema.prisma`
- **Models**: Store, Category (hierarchical), Product, ScrapingSession
- **Key constraint**: `(externalId, storeId)` unique for upsert operations

**4. Scripts** (`src/scripts/`)
- **scrape-magnit-products.ts**: Main entry point demonstrating full flow
- **test-db-connection.ts**: Database connection verification

### Data Flow

```
1. MagnitApiScraper.initialize()
   - Launch Chromium via Playwright
   - Navigate to magnit.ru
   - Extract mg_udi cookie as deviceId
   - Configure Axios with headers + cookies

2. MagnitApiScraper.scrapeAllProducts()
   - POST to /webgate/v2/goods/search
   - Paginate with limit=100, offset increments
   - Return ProductItem[] array

3. MagnitApiScraper.saveToDatabase()
   - Get/create Store via ProductService
   - Extract unique categories from products
   - Get/create Categories via ProductService
   - Parse products via ProductParser
   - Batch save via ProductService.saveProducts()

4. Database persistence
   - Upsert based on (externalId, storeId)
   - Related entities created idempotently
```
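
The "extract unique categories" step in the flow above could look roughly like the following. This is a sketch under assumptions: the `ProductItem` and `CategoryRef` shapes here are hypothetical stand-ins, and the real types in the scraper may differ.

```typescript
// Hypothetical shapes for illustration only.
interface CategoryRef {
  id: string;
  name: string;
}

interface ProductItem {
  id: string;
  name: string;
  category: CategoryRef;
}

// Collect each category once, keyed by its id, preserving first-seen order.
function extractUniqueCategories(products: ProductItem[]): CategoryRef[] {
  const byId = new Map<string, CategoryRef>();
  for (const product of products) {
    if (!byId.has(product.category.id)) {
      byId.set(product.category.id, product.category);
    }
  }
  return Array.from(byId.values());
}
```

Deduplicating before touching the database means each category needs only one get/create call per scraping run.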

## Magnit API Authentication & Anti-Bot Bypass

### Critical Implementation Detail
The Magnit API (`https://magnit.ru/webgate/v2/goods/search`) is **protected** by anti-bot checks, so the scraper uses a hybrid approach:

**Hybrid Pattern: Playwright (session init) + Axios (API requests)**

1. **Use Playwright** to visit magnit.ru and obtain:
   - `mg_udi` cookie (Device ID in UUID format)
   - Other session cookies (`oxxfgh`, `uwyii`, etc.)

2. **Extract required values**:
   - `x-device-id` header = `mg_udi` cookie value
   - All cookies as Cookie header string

3. **Make API requests via Axios** with:
   - Required headers: `x-device-id`, `x-client-name: magnit`, `x-device-platform: Web`, `x-app-version`, `x-new-magnit: true`
   - Cookie header from Playwright session
   - Standard browser headers (User-Agent, Referer, etc.)

4. **Handle 403 errors**: Automatically re-initialize the Playwright session
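
Steps 2 and 3 above amount to turning the browser's cookie list into request headers. The sketch below shows one way to do that; `buildMagnitHeaders` and `BrowserCookie` are hypothetical names, and the `appVersion` value is a caller-supplied assumption because this document does not state which version string the API expects. The standard browser headers (User-Agent, Referer) are left out for brevity.

```typescript
// Minimal cookie shape; Playwright cookie objects carry more fields,
// but only name and value are needed here.
interface BrowserCookie {
  name: string;
  value: string;
}

function buildMagnitHeaders(
  cookies: BrowserCookie[],
  appVersion: string,
): Record<string, string> {
  // The device ID is the value of the mg_udi cookie.
  const deviceCookie = cookies.find((c) => c.name === "mg_udi");
  if (!deviceCookie) {
    throw new Error("mg_udi cookie not found; session was not initialized");
  }
  return {
    "x-device-id": deviceCookie.value,
    "x-client-name": "magnit",
    "x-device-platform": "Web",
    "x-app-version": appVersion,
    "x-new-magnit": "true",
    // Serialize all cookies into a single Cookie header string.
    Cookie: cookies.map((c) => `${c.name}=${c.value}`).join("; "),
  };
}
```

Failing fast when `mg_udi` is missing makes a broken session visible before any API request is sent.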

### API Details

**Endpoint**: `POST https://magnit.ru/webgate/v2/goods/search`

**Request payload**:
```typescript
const payload = {
  sort: { order: "desc", type: "popularity" },
  pagination: { limit: 100, offset: 0 },
  categories: [], // Empty = all products
  includeAdultGoods: false,
  storeCode: "992301", // From env var
  storeType: "6",
  catalogType: "1"
};
```

**Response structure**:
- `items[]` - array of products
- Prices in **kopecks** (24999 = 249.99 rubles) - must convert to rubles
- Promotion data: `promotion.oldPrice`, `promotion.discountPercent`, `promotion.endDate`
- Ratings: `ratings.rating`, `ratings.scoresCount`, `ratings.commentsCount`
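
The kopeck-to-ruble conversion is a division by 100. A minimal sketch follows; the function names are hypothetical and this is not the actual ProductParser code.

```typescript
// Convert an integer kopeck amount to rubles as a number.
function kopecksToRubles(kopecks: number): number {
  return kopecks / 100;
}

// Format a kopeck amount as a ruble string with two decimal places.
function formatRubles(kopecks: number): string {
  return (kopecks / 100).toFixed(2);
}

console.log(kopecksToRubles(24999)); // 249.99
console.log(formatRubles(5000)); // "50.00"
```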

**Rate limiting**: 300ms delay between requests (implemented in scraper)
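
Offset pagination with a fixed delay can be sketched as a loop around an injected page fetcher. This is an illustration only: `fetchAllPages` and `FetchPage` are hypothetical names, and the stop condition (a page shorter than `limit`) is an assumption, since this document does not say how the scraper detects the last page.

```typescript
// Hypothetical page fetcher; the real scraper would POST the search
// payload with the given limit and offset.
type FetchPage<T> = (limit: number, offset: number) => Promise<T[]>;

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function fetchAllPages<T>(
  fetchPage: FetchPage<T>,
  limit = 100,
  delayMs = 300,
): Promise<T[]> {
  const all: T[] = [];
  for (let offset = 0; ; offset += limit) {
    const page = await fetchPage(limit, offset);
    all.push(...page);
    // Assumption: a short (or empty) page means the catalog is exhausted.
    if (page.length < limit) break;
    // Pause between requests to respect the rate limit.
    await sleep(delayMs);
  }
  return all;
}
```

Injecting the fetcher keeps the loop testable without network access; a test can pass `delayMs = 0` to avoid waiting.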

## Database Schema Patterns

### Upsert Strategy
Products use composite unique constraint `(externalId, storeId)` for idempotent updates:
```typescript
// ProductService automatically handles upsert via Prisma
await prisma.product.upsert({
  where: { externalId_storeId: { externalId, storeId } },
  update: { /* latest data */ },
  create: { /* new product */ }
})
```
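
To show why a composite key makes repeated scraping runs idempotent, here is an in-memory model of the same semantics. It is a teaching sketch, not Prisma code: `InMemoryProductTable` and the `ProductRecord` fields (including the numeric `storeId`) are assumptions made for illustration.

```typescript
interface ProductRecord {
  externalId: string;
  storeId: number;
  name: string;
  price: number;
}

class InMemoryProductTable {
  private rows = new Map<string, ProductRecord>();

  // Mirrors the composite unique key (externalId, storeId).
  private static key(externalId: string, storeId: number): string {
    return `${storeId}:${externalId}`;
  }

  // Insert the record if the key is new, otherwise overwrite the old row.
  upsert(record: ProductRecord): "created" | "updated" {
    const key = InMemoryProductTable.key(record.externalId, record.storeId);
    const outcome = this.rows.has(key) ? "updated" : "created";
    this.rows.set(key, record);
    return outcome;
  }

  get size(): number {
    return this.rows.size;
  }
}
```

Upserting the same product twice leaves one row holding the latest data, while the same `externalId` in a different store creates a separate row.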

### Category Hierarchy
Self-referential relationship via `parentId`:
```prisma
model Category {
  parent   Category?  @relation("CategoryHierarchy", fields: [parentId], references: [id])
  children Category[] @relation("CategoryHierarchy")
}
```
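
A flat list of rows with `parentId` can be turned into a tree in application code. The sketch below assumes numeric ids and a nullable `parentId`; `CategoryRow`, `CategoryNode`, and `buildCategoryTree` are hypothetical names, and the actual column types in the schema may differ.

```typescript
interface CategoryRow {
  id: number;
  name: string;
  parentId: number | null;
}

interface CategoryNode extends CategoryRow {
  children: CategoryNode[];
}

function buildCategoryTree(rows: CategoryRow[]): CategoryNode[] {
  // First pass: create a node for every row.
  const nodes = new Map<number, CategoryNode>();
  rows.forEach((row) => {
    nodes.set(row.id, { ...row, children: [] });
  });

  // Second pass: attach each node to its parent, or treat it as a root.
  const roots: CategoryNode[] = [];
  nodes.forEach((node) => {
    const parent =
      node.parentId === null ? undefined : nodes.get(node.parentId);
    if (parent) {
      parent.children.push(node);
    } else {
      // Rows without a parent, or with a missing parent, become roots.
      roots.push(node);
    }
  });
  return roots;
}
```

Two passes let children appear before their parents in the input without special handling.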

### Store Types
Store `type` field indicates data source:
- `"web"` - from web scraping
- `"app"` - from Android app scraping
- API-based stores use `code` field (e.g., "992301")

## Configuration & Environment

### Required Environment Variables
```bash
DATABASE_URL=postgresql://user:password@localhost:5432/supermarket
MAGNIT_STORE_CODE=992301 # Store code for scraping
```
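
Validating these variables once at startup gives a clear error instead of a failure deep inside the scraper. The following is a minimal sketch: `loadConfig` and `AppConfig` are hypothetical names, and the environment is passed in as a plain object so the function stays easy to test.

```typescript
interface AppConfig {
  databaseUrl: string;
  magnitStoreCode: string;
}

// Read and validate the required variables from an environment-like object.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const databaseUrl = env["DATABASE_URL"];
  const magnitStoreCode = env["MAGNIT_STORE_CODE"];
  if (!databaseUrl) {
    throw new Error("DATABASE_URL is required");
  }
  if (!magnitStoreCode) {
    throw new Error("MAGNIT_STORE_CODE is required");
  }
  return { databaseUrl, magnitStoreCode };
}
```

In a Node.js entry script this would typically be called as `loadConfig(process.env)`.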

### TypeScript Configuration
- Target: ES2023
- Module: ESNext
- Strict mode enabled
- Output: `dist/` directory

## Planned Features (Not Yet Implemented)

See `.cursor/plans/supermarket_scraper_system_1af4ed29.plan.md` for the full roadmap.

**Phase 1 (MVP - mostly complete)**:
- ✅ Database setup with Prisma
- ✅ Magnit API scraping with authentication bypass
- ⏳ pgvector embeddings (planned)

**Phase 2 (Future)**:
- Web scraping via Playwright (fallback method)
- Android app scraping via Appium
- Pyaterochka/5ka scraper
- LangChain integration for LLM queries
- REST API server for external integrations (n8n, etc.)
- Scheduler service with cron-like functionality
- Price history tracking and analytics

## Requestly Integration

The project includes Requestly integration for HTTP request testing. API test files are stored in the `.requestly-supermarket/` directory.

**Key testing patterns** (from `.cursor/rules/requestly-test-rules.mdc`):
- Use `rq.test()` for test definitions
- Use `rq.expect()` for Chai.js-style assertions
- Access response via `rq.response.body` (parse as JSON before use)
- Status checks: `rq.response.to.be.ok`, `rq.response.to.be.success`
- JSON validation: `rq.response.to.have.jsonBody(path, value)`
- Prices stored in kopecks (24999 = 249.99 rubles)

## Important Development Notes

### When Adding New Scrapers
1. Extend base patterns from `src/scrapers/api/magnit/`
2. Implement `initialize()`, `scrapeAllProducts()`, `saveToDatabase()`, `close()` lifecycle
3. Use ProductParser for data transformation
4. Use ProductService for database operations (never call Prisma directly from scrapers)
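
The lifecycle in step 2 can be expressed as an interface plus a small runner. This is a sketch under assumptions: the method names come from this document, but the `Scraper` interface, the method signatures (for example, `saveToDatabase` taking the scraped products), and `runScraper` are hypothetical. It also assumes `close()` is safe to call after a failed `initialize()`.

```typescript
// Hypothetical contract; the real scraper's signatures may differ.
interface Scraper<T> {
  initialize(): Promise<void>;
  scrapeAllProducts(): Promise<T[]>;
  saveToDatabase(products: T[]): Promise<void>;
  close(): Promise<void>;
}

// Run the full lifecycle and return the number of scraped products.
async function runScraper<T>(scraper: Scraper<T>): Promise<number> {
  try {
    await scraper.initialize();
    const products = await scraper.scrapeAllProducts();
    await scraper.saveToDatabase(products);
    return products.length;
  } finally {
    // Always release the browser and other resources, even on failure.
    await scraper.close();
  }
}
```

Putting `close()` in a `finally` block prevents a failed run from leaving a browser process behind.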

### When Modifying Database Schema
1. Update `src/database/prisma/schema.prisma`
2. Run `pnpm prisma:generate` to update Prisma Client
3. Run `pnpm prisma:migrate` to create and apply migrations
4. Update related TypeScript types in `src/scrapers/*/types.ts`

### Error Handling Pattern
Custom error classes in `src/utils/errors.ts`:
- `ScraperError` - scraping failures
- `DatabaseError` - database operations
- `APIError` - HTTP/API failures (includes statusCode and response body)
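
A plausible shape for these classes is sketched below. The class names come from this document; the constructor signatures, the `responseBody` field name, and the `shouldReinitializeSession` helper are assumptions for illustration, not the contents of `src/utils/errors.ts`.

```typescript
class ScraperError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ScraperError";
    // Keep instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

class DatabaseError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "DatabaseError";
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

class APIError extends Error {
  readonly statusCode: number;
  readonly responseBody: unknown;

  constructor(message: string, statusCode: number, responseBody?: unknown) {
    super(message);
    this.name = "APIError";
    this.statusCode = statusCode;
    this.responseBody = responseBody;
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Hypothetical helper: a 403 from the API signals an expired session,
// which the scraper handles by re-initializing Playwright.
function shouldReinitializeSession(err: unknown): boolean {
  return err instanceof APIError && err.statusCode === 403;
}
```

Carrying the status code on the error lets callers tell a recoverable 403 apart from other failures without parsing messages.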

### Logging Pattern
Static logger in `src/utils/logger.ts`:
- `Logger.info()`, `Logger.error()`, `Logger.warn()`, `Logger.debug()`
- Debug messages gated by `DEBUG` environment variable
- ISO timestamp formatting
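
A static logger with these properties might look like the sketch below. The method names come from this document; everything else is an assumption. In particular, `debugEnabled` is a plain flag standing in for the `DEBUG` environment variable check, `sink` is a hypothetical hook that makes the output testable, and the line format is illustrative.

```typescript
type LogLevel = "INFO" | "WARN" | "ERROR" | "DEBUG";

class Logger {
  // Assumption: the real logger derives this from the DEBUG env variable.
  static debugEnabled = false;

  // Hypothetical output hook; defaults to the console.
  static sink: (line: string) => void = (line) => console.log(line);

  private static write(level: LogLevel, message: string): void {
    // Prefix every line with an ISO 8601 timestamp and the level.
    Logger.sink(`[${new Date().toISOString()}] [${level}] ${message}`);
  }

  static info(message: string): void {
    Logger.write("INFO", message);
  }

  static warn(message: string): void {
    Logger.write("WARN", message);
  }

  static error(message: string): void {
    Logger.write("ERROR", message);
  }

  static debug(message: string): void {
    // Debug output is dropped unless explicitly enabled.
    if (Logger.debugEnabled) {
      Logger.write("DEBUG", message);
    }
  }
}
```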

## Testing

No test framework is currently configured. Test manually via:
- `pnpm test-db` - database connection
- `pnpm dev` - full scraping run
- Prisma Studio - data inspection