Development Guide¶
1. Architecture Overview¶
CiteVerifier has two execution surfaces:
- CLI verification pipeline (
verifier.py) - Runs strategy-chain verification (DBLP/cache/online/LLM reparse).
- Exports results and statistics.
- Web lookup service (
web_app.py) - Serves single and batch title lookup.
- Persists runtime metrics into
runtime.sqlite.
Supporting modules:
unified_database.py: verification cache DB (scholar_results/search_results).runtime_store.py: Web runtime telemetry and batch run records.dblp_match.py: local DBLP match functions (indexed path + brute-force fallback).
2. CLI Verification Flow (with cache)¶
2.1 Strategy Chain Assembly Rules¶
Actual behavior in ScrapingDogVerifier._build_verification_chain():
- Default (
--enable-onlinedisabled): DblpVerificationStrategyCacheVerificationStrategy- With
--enable-onlineenabled, append: APIVerificationStrategyGoogleFallbackStrategyLLMReparseStrategy
If online clients fail to import/initialize, runtime degrades automatically to DBLP + cache only.
2.2 CLI Cache Read/Write Semantics¶
Cache DB is scholar_results.db (UnifiedDatabase):
- Read path:
CacheVerificationStrategycallssearch_scholar_by_title(title). - SQL is
WHERE title LIKE '%{title}%' ORDER BY created_at DESC LIMIT 1. - Hit result is still validated by the unified validator, not blindly accepted as VALID.
- Write path: only writes when final verification is
VALIDandbest_matchexists. - Dedup rule: unique index
UNIQUE(title, authors, year); duplicate writes are ignored by default.
So CLI cache is a result cache (not request cache), and LIKE-based retrieval may introduce near-match risk.
3. Web Lookup Flow and Caching¶
3.1 Single Lookup (POST /api/search/title)¶
- Normalize title (trim + collapse whitespace).
- Verify DB file exists, otherwise
404. - Search strategy:
- use indexed search when word index exists;
- fallback to brute-force otherwise.
- If matched, enrich with publication metadata (
year/venue/pub_type). - Record runtime counters and single-search event.
3.2 Batch Lookup (POST /api/search/title/batch)¶
- Normalize + de-duplicate titles (
casefold). - Empty list ->
400; overMAX_BATCH_TITLES=200->400. - Process each title in a request-local serial loop (
for), writing onebatch_itemsrow per title. - Finalize
batch_runssummary and counters.
3.3 Web In-Memory Cache Semantics¶
web_app.py keeps _brute_cache: dict[db_path -> all_titles]:
- used only for brute-force mode;
- first load is protected by
_brute_cache_lock, then reused; - no TTL, no size cap, no active eviction;
- cache is process-local and lost on process restart.
4. Concurrency Model and Overload Behavior (Wait / Reject / Degrade)¶
| Scenario | Constraint | Behavior when exceeded | Result |
|---|---|---|---|
| CLI batch concurrency | asyncio.Semaphore(max_concurrent), default --concurrent=10 |
wait on semaphore | queued tasks continue, no reject |
CLI --concurrent value |
no hard cap in code | no automatic reject | can amplify IO pressure, limited by downstream systems |
| Web batch size | max 200 titles |
immediate reject | 400 Batch size exceeds limit |
Web max_candidates |
1..500000 (Pydantic) |
immediate reject | 422 |
| First brute-cache load | _brute_cache_lock mutex |
wait for lock | continue after load |
| Per-request Web batch processing | serial for loop |
no reject path, only longer processing | increased latency |
| Web runtime SQLite contention | sqlite timeout=30s + WAL |
wait for lock first | timeout raises exception (typically 500) |
| CLI online clients unavailable | import/init failure | automatic degradation | DBLP + cache mode continues |
Direct answer to the acceptance question:
- CLI over-concurrency means wait (not forbidden).
- Web batch-size overflow means forbidden with 400.
- SQLite lock contention means wait first, then fail on timeout.
5. Validation Constraints and Error Codes¶
5.1 Web API¶
POST /api/search/titletitle: minimum length 1; empty yields422.max_candidates:1..500000; out-of-range yields422.POST /api/search/title/batch- normalized title list must contain at least 1 title, else
400. - normalized list above 200 returns
400. GET /api/health- when DB is missing, returns
{"status":"error"}with HTTP 200.
5.2 Key CLI options¶
--dblp-db: path to local DBLP sqlite.--dblp-threshold: DBLP title similarity threshold (default0.9).--dblp-max-candidates: DBLP candidate cap (default100000).--disable-dblp: skip DBLP pre-match.--enable-online: enable online strategy chain.
6. Web Runtime Data Model¶
runtime_store.py tables:
runtime_counterssingle_search_eventsbatch_runsbatch_itemsevent_logs
Recommended operational metrics:
single_search_requests / single_search_errorsbatch_search_requestsbatch_search_items_total / batch_search_found_totalbatch_runs.statusanderror_message
7. Implementation Guidance (Cache + Concurrency)¶
- If cross-process cache is required, replace
_brute_cachewith an explicit shared cache and define TTL/capacity. - If you add intra-request parallelism for Web batch, design DB connection and timeout policy first.
- For CLI online reliability, add global LLM/API rate limiter and backoff retry policies.
- Every new endpoint should document overload semantics explicitly: wait, reject, or degrade.