Data Methodology
Sofia Pulse / Virtual Arena AI
Last updated: March 27, 2026
1. General Principle
Sofia Pulse is committed to full transparency about its data sources, limitations, and known biases. We believe that data without context can be more harmful than helpful. Therefore, we openly document how each source is collected, processed, and presented.
2. Data Sources
📋 Job Listings
Sources: Himalayas, RemoteOK, Arbeitnow, Careerjet, Greenhouse, Catho, InfoJobs, Adzuna and others (~12 platforms).
Coverage: Global, with a bias toward markets served by English-language platforms. English-speaking countries (especially the US) tend to be over-represented.
📄 Academic Papers
Sources: OpenAlex (API), ArXiv.
Geographic enrichment: Semantic Scholar, ROR, OpenAlex DOI.
Counting methodology: Papers are counted by co-authorship country. A paper whose authors come from N distinct countries is counted N times, once per country. This inflates absolute totals but reflects each country's participation.
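The counting rule above can be illustrated with a minimal sketch. The record layout (`author_countries`) and the function name are assumptions made for this example and do not describe the platform's actual pipeline.

```python
from collections import Counter


def count_papers_by_country(papers):
    """Count each paper once per distinct author country (hypothetical helper)."""
    counts = Counter()
    for paper in papers:
        # A set collapses repeated countries, so several authors from the
        # same country add only one count for that country.
        countries = {c for c in paper["author_countries"] if c}
        counts.update(countries)
    return counts


# Illustrative records only; the field names are assumed, not the real schema.
papers = [
    {"id": "p1", "author_countries": ["BR", "BR", "US"]},
    {"id": "p2", "author_countries": ["US", "DE", "JP"]},
]

counts = count_papers_by_country(papers)
total_counted = sum(counts.values())
print(dict(counts), "papers:", len(papers), "country counts:", total_counted)
```

In this example two papers produce five country counts, which is why totals summed across countries exceed the number of distinct papers.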
📑 Patents
Source: PatentsView (USPTO).
Coverage: Primarily US and international patents processed by the USPTO. Coverage is limited for patents that do not go through the US system (e.g., patents registered only in China, South Korea, or Japan).
💻 GitHub
Data: Trending repositories (daily) and programming languages.
Window: 90 days.
🌐 Community Signals
- StackOverflow: Trending tags.
- HackerNews: Top stories.
- NPM / PyPI: Package trends.
3. Collection Methodology
- Automation: Automated collection via cron jobs with source-specific frequency (daily, weekly, or monthly).
- Deduplication: Duplicate records are identified and removed based on unique identifiers from each source.
- Geographic normalization: Country, state, and city names are normalized to a single standard (ISO 3166). This process is heuristic and may contain inaccuracies.
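The deduplication and normalization steps above could look roughly like the following sketch. The field names (`source`, `source_id`), the alias table, and both helper functions are hypothetical. A real alias table would be much larger, and, as noted above, any such heuristic can still map some inputs incorrectly.

```python
def deduplicate(records):
    """Keep the first record seen for each (source, source_id) pair."""
    seen = set()
    unique = []
    for record in records:
        key = (record["source"], record["source_id"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique


# Tiny illustrative alias table mapping free-text names to ISO 3166-1 alpha-2 codes.
COUNTRY_ALIASES = {
    "usa": "US",
    "united states": "US",
    "brasil": "BR",
    "brazil": "BR",
    "deutschland": "DE",
    "germany": "DE",
}
KNOWN_CODES = set(COUNTRY_ALIASES.values())


def normalize_country(raw):
    """Return an alpha-2 code, or None when the heuristic cannot decide."""
    if not raw:
        return None
    cleaned = raw.strip().lower()
    if cleaned.upper() in KNOWN_CODES:
        return cleaned.upper()
    return COUNTRY_ALIASES.get(cleaned)


records = [
    {"source": "remoteok", "source_id": "123", "country": " Brasil "},
    {"source": "remoteok", "source_id": "123", "country": "Brazil"},
    {"source": "adzuna", "source_id": "123", "country": "USA"},
]

unique_records = deduplicate(records)
codes = [normalize_country(r["country"]) for r in unique_records]
print(len(unique_records), codes)
```

Keying on the pair (source, identifier) rather than the identifier alone avoids dropping records from different sources that happen to share an ID. Returning `None` for unknown names keeps unmapped locations visible instead of guessing.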
4. Known Limitations
We openly document our data limitations so users can interpret the information critically:
- Coverage bias: Countries with English-language platforms are over-represented in job data. Coverage of markets in other languages is partial.
- Paper counting: The co-authorship methodology inflates absolute numbers. A paper with 10 authors from 5 countries counts 5 times.
- USPTO patents: Patent coverage is limited to the USPTO. Patents registered exclusively at other offices (CNIPA, KIPO, JPO) are not captured.
- Geographic normalization: The location normalization process is imperfect. Some cities or regions may be mapped incorrectly.
- Market-academia alignment: The cross-referencing between job taxonomies and paper categories uses fuzzy matching heuristics, which may generate false positives or negatives.
5. Scores and Metrics
All scores, rankings, and metrics presented on the platform are algorithmic constructions — not absolute measurements. They represent computational modeling on public data and should be interpreted as relative indicators, not definitive truths.
Alignment Score: Uses fuzzy matching between different taxonomies (job skills vs. academic paper categories). Because it operates with heterogeneous taxonomies, results are approximations, not exact matches.
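To make the idea of fuzzy matching between taxonomies concrete, here is a minimal sketch that uses Python's standard `difflib.SequenceMatcher`. It is not the platform's Alignment Score formula: the functions, the 0.6 threshold, and the "share of skills with a match" definition are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(skill, categories, threshold=0.6):
    """Return the most similar category, or None if nothing clears the threshold."""
    best = max(categories, key=lambda c: similarity(skill, c), default=None)
    if best is None or similarity(skill, best) < threshold:
        return None
    return best


def alignment_score(skills, categories, threshold=0.6):
    """Share of job skills that fuzzily match at least one paper category (assumed definition)."""
    if not skills:
        return 0.0
    matched = sum(
        1 for skill in skills if best_match(skill, categories, threshold) is not None
    )
    return matched / len(skills)


# Illustrative labels only, not taken from the real taxonomies.
skills = ["machine learning", "kubernetes"]
categories = ["Machine Learning", "Quantum Physics"]

score = alignment_score(skills, categories)
print(best_match("machine learning", categories), score)
```

The threshold controls the trade-off described above: lowering it produces more false positives, and raising it produces more false negatives.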
6. Data Updates
Automated collectors run every day, but each data source is refreshed on its own collection cycle (daily, weekly, or monthly). The last collection timestamp is shown alongside the data on the platform, so users can verify how current the information is.
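A reader who wants to act on the collection timestamp could check freshness along these lines. This is a sketch under stated assumptions: the tolerance values in `MAX_AGE` and the `is_stale` helper are hypothetical and are not published thresholds.

```python
from datetime import datetime, timedelta, timezone

# Assumed tolerances: one collection cycle plus a small grace period.
MAX_AGE = {
    "daily": timedelta(days=2),
    "weekly": timedelta(days=8),
    "monthly": timedelta(days=32),
}


def is_stale(last_collected, cycle, now=None):
    """Return True when the last collection is older than the cycle allows."""
    now = now or datetime.now(timezone.utc)
    return now - last_collected > MAX_AGE[cycle]


# Fixed reference time so the example is reproducible.
reference = datetime(2026, 3, 27, 12, 0, tzinfo=timezone.utc)
collected = datetime(2026, 3, 20, 12, 0, tzinfo=timezone.utc)

stale_for_daily = is_stale(collected, "daily", now=reference)
stale_for_weekly = is_stale(collected, "weekly", now=reference)
print(stale_for_daily, stale_for_weekly)
```

The same seven-day-old timestamp is stale for a daily source but acceptable for a weekly one, so the timestamp should always be read against the source's own cycle.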
Questions about the Data
If you have identified any data inconsistency or have questions about the methodology, contact: