Data Methodology

Sofia Pulse / Virtual Arena AI

Last updated: March 27, 2026

1. General Principle

Sofia Pulse is committed to full transparency about its data sources, limitations, and known biases. We believe that data without context can be more harmful than helpful. Therefore, we openly document how each source is collected, processed, and presented.

2. Data Sources

📋 Job Listings

Sources: Himalayas, RemoteOK, Arbeitnow, Careerjet, Greenhouse, Catho, InfoJobs, Adzuna and others (~12 platforms).

Coverage: Global, with a bias toward markets served by English-language platforms. English-speaking countries (especially the US) tend to be over-represented.

📄 Academic Papers

Sources: OpenAlex (API), ArXiv.

Geographic enrichment: Semantic Scholar, ROR, OpenAlex DOI.

Counting methodology: Papers are counted by co-authorship country. A paper whose authors come from N distinct countries counts N times (once per country), regardless of how many authors it has. This inflates absolute numbers but reflects each country's participation.
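The per-country counting rule can be illustrated with a minimal sketch. The record layout (an "authors" list with a "country" field) and the function name are assumptions made for this example, not the platform's actual schema or code.

```python
from collections import Counter


def count_papers_by_country(papers):
    """Count each paper once for every distinct country among its authors."""
    counts = Counter()
    for paper in papers:
        # A set collapses several authors from the same country into one entry.
        # Authors with no known country are skipped (an assumption of this sketch).
        countries = {
            author["country"]
            for author in paper["authors"]
            if author.get("country")
        }
        counts.update(countries)
    return counts


# Hypothetical sample data: two papers, three distinct countries.
papers = [
    {"id": "W1", "authors": [{"country": "BR"}, {"country": "BR"}, {"country": "US"}]},
    {"id": "W2", "authors": [{"country": "US"}, {"country": "DE"}, {"country": None}]},
]

counts = count_papers_by_country(papers)
total_country_counts = sum(counts.values())
print(dict(counts))
print(total_country_counts, "country-level counts from", len(papers), "papers")
```

In this sample, two papers produce four country-level counts, which is the inflation effect described above: per-country totals should not be summed to estimate the number of papers.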

📑 Patents

Source: PatentsView (USPTO).

Coverage: Primarily US and international patents processed by the USPTO. Coverage is limited for patents that do not go through the US system (e.g., patents registered only in China, South Korea, or Japan).

💻 GitHub

Data: Trending repositories (daily) and programming languages.

Window: 90 days.

🌐 Community Signals

StackOverflow: Trending tags.
HackerNews: Top stories.
NPM / PyPI: Package trends.

3. Collection Methodology

Automation: Automated collection via cron jobs with source-specific frequency (daily, weekly, or monthly).
Deduplication: Duplicate records are identified and removed based on unique identifiers from each source.
Geographic normalization: Country, state, and city names are normalized to a single standard (ISO 3166). This process is heuristic and may contain inaccuracies.
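As a rough illustration of the deduplication and normalization steps, the sketch below keys each record on its source and source-specific identifier, then maps free-text country names to ISO 3166-1 alpha-2 codes through a lookup table. The field names ("source", "source_id"), the alias table, and both function names are hypothetical; the actual pipeline may work differently.

```python
def deduplicate(records):
    """Keep the first record seen for each (source, source_id) pair."""
    seen = set()
    unique = []
    for record in records:
        key = (record["source"], record["source_id"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical alias table; a realistic one would be far larger.
COUNTRY_ALIASES = {
    "united states": "US",
    "usa": "US",
    "brasil": "BR",
    "brazil": "BR",
    "deutschland": "DE",
    "germany": "DE",
}


def normalize_country(raw):
    """Return an ISO 3166-1 alpha-2 code, or None if the name is not recognized."""
    if not raw:
        return None
    return COUNTRY_ALIASES.get(raw.strip().lower())


# Hypothetical records: the second one repeats the first one's identifier.
records = [
    {"source": "remoteok", "source_id": "101", "country": "USA"},
    {"source": "remoteok", "source_id": "101", "country": "USA"},
    {"source": "adzuna", "source_id": "101", "country": " Brasil "},
]

unique_records = deduplicate(records)
codes = [normalize_country(r["country"]) for r in unique_records]
print(len(unique_records), codes)
```

Because the key includes the source, this approach does not detect the same listing published on two different platforms. Returning None for unrecognized names, rather than guessing, keeps mapping errors visible instead of silently assigning a wrong country.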

4. Known Limitations

We openly document our data limitations so users can interpret the information critically:

Coverage bias: Countries with English-language platforms are over-represented in job data. Coverage of markets in other languages is partial.
Paper counting: The co-authorship methodology inflates absolute numbers. A paper with 10 authors from 5 countries counts 5 times.
USPTO patents: Patent coverage is limited to the USPTO. Patents registered exclusively at other offices (CNIPA, KIPO, JPO) are not captured.
Geographic normalization: The location normalization process is imperfect. Some cities or regions may be mapped incorrectly.
Market-academia alignment: The cross-referencing between job taxonomies and paper categories uses fuzzy matching heuristics, which may generate false positives or negatives.

5. Scores and Metrics

All scores, rankings, and metrics presented on the platform are algorithmic constructions — not absolute measurements. They represent computational modeling on public data and should be interpreted as relative indicators, not definitive truths.

Alignment Score: Uses fuzzy matching between different taxonomies (job skills vs. academic paper categories). Because it operates with heterogeneous taxonomies, results are approximations, not exact matches.
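To show why fuzzy matching across taxonomies yields approximations, here is a minimal sketch using Python's standard difflib module. The matching algorithm, the 0.6 threshold, and the category names are assumptions chosen for illustration; this is not the platform's actual Alignment Score implementation.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two labels, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(skill, categories, threshold=0.6):
    """Return (category, score) for the closest category, or (None, score) below threshold."""
    score, category = max((similarity(skill, c), c) for c in categories)
    if score >= threshold:
        return category, score
    return None, score


# Hypothetical paper categories and job skills.
categories = ["Machine Learning", "Computer Vision", "Databases"]

exact = best_match("machine learning", categories)
near = best_match("Machine-Learning", categories)
miss = best_match("React", categories)
print(exact, near, miss)
```

A lower threshold links more skills to categories but admits more false positives; a higher one does the reverse. Any fixed threshold is a trade-off, which is why the resulting score should be read as a relative indicator.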

6. Data Updates

Data is updated by automated collectors. Each data source has its own collection cycle (daily, weekly, or monthly), so not every dataset changes every day. The last collection timestamp is shown alongside the data on the platform, allowing users to verify the timeliness of the information.
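A user who wants to check timeliness could compare the last collection timestamp against the source's cycle. The sketch below assumes the timestamp is available as a timezone-aware datetime; the cycle lengths, the six-hour grace period, and the function name are illustrative choices, not platform settings.

```python
from datetime import datetime, timedelta, timezone

# Assumed cycle lengths; "monthly" is approximated as 31 days.
CYCLE_LENGTH = {
    "daily": timedelta(days=1),
    "weekly": timedelta(days=7),
    "monthly": timedelta(days=31),
}


def is_stale(last_collected, cycle, now=None, grace=timedelta(hours=6)):
    """Return True if the data is older than one full cycle plus a grace period."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_collected > CYCLE_LENGTH[cycle] + grace


# Hypothetical timestamps: data collected two days before the check.
now = datetime(2026, 3, 27, 12, 0, tzinfo=timezone.utc)
last_collected = datetime(2026, 3, 25, 12, 0, tzinfo=timezone.utc)

print(is_stale(last_collected, "daily", now=now))
print(is_stale(last_collected, "weekly", now=now))
```

The same two-day-old timestamp is overdue for a daily source but still current for a weekly one, so the cycle has to be known before judging freshness.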

Questions about the Data

If you have identified any data inconsistency or have questions about the methodology, contact:

[email protected]