Here I propose a system for scoring media opinion articles. It is part prediction market, as there is a small amount of money involved, and part forecasting-science mechanism design. Journalists who publish an article on the platform must do so with an accompanying stake. Readers (whether human or AI) who wish to pass judgment on the merits of the main opinion must pay a small fee/tip for the right to do so. The overall aggregated reader score dictates how much of the stake (and tips) the writer receives; the remainder is sent to the protocol's global funding pool, meaning the protocol takes the other side. Readers are scored by a separate mechanism in which honest responses are a Nash equilibrium. They are incentivised to participate because star performers are eligible for monthly rebates from the protocol. Many readers won't be well calibrated, and many might tip without participating in the forecasting competition/market simply because they like the article. They will subsidise the insightful bettors.
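As a minimal sketch of that settlement flow (assuming, purely for illustration, that the writer's share is simply proportional to the aggregated score S \in [0,1]; the exact mapping is a design choice left open here):

```python
def settle_article(stake: float, tips: float, aggregated_score: float) -> dict:
    """Split the writer's stake plus reader tips between the writer and the
    protocol's global funding pool, assuming a simple proportional rule."""
    if not 0.0 <= aggregated_score <= 1.0:
        raise ValueError("aggregated_score must lie in [0, 1]")
    pot = stake + tips
    writer_payout = aggregated_score * pot   # writer keeps a score-weighted share
    return {"writer": writer_payout, "protocol_pool": pot - writer_payout}

# Example: a 100-unit stake, 20 units of tips and an aggregated score of 0.65
# pays the writer 78 units and sends 42 to the global funding pool.
print(settle_article(100.0, 20.0, 0.65))
```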
Brief summary of the relevant forecasting literature
Most opinion piece articles involve unverifiable predictions. However, we can settle markets without a resolving exogenous event or ground truth using peer-prediction-based mechanisms. This enables us to create and settle markets for questions that won't have answers for some time or for counterfactual-type questions. Individuals can be scored for being both well-calibrated and honest, ensuring incentive compatibility and the avoidance of a Keynesian Beauty Contest. Aggregated forecasts work best when good track records are upweighted and the aggregate is extremised. A track record shows good general forecasting ability, whereas a divergence between what one forecasts themselves and what they predict the crowd will forecast (their meta-prediction) is a strong signal of domain-specific expertise.
Based on this literature I present an opinion article scoring mechanism built on meta-probability weighting (MPW), with a track-record upweighting added, since Bayesian Truth Serum (BTS) scores will be tracked in-protocol. This enables us to account for both types of expertise and to extremise the aggregate. Prelec et al.'s BTS mechanism is itself adapted for continuous probabilities and made robust to small sample sizes. We will still probably require a minimum of 10 respondents to score any article and reader. We expect a high prevalence of AI agents on the system competing for the protocol payouts, so this shouldn't be a problem.
Agent Reports
There are n agents, indexed by i = 1, \dots, n. Each agent i reports:
- A primary report p_i \in [0,1].
- A meta-prediction m_i \in [0,1], which represents agent i's guess of the group average \bar{p}.
In English, respondents are asked:
1. What do you think the chances are that this opinion is correct?
2. What do you think the average answer will be for question 1?
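For example, a respondent who strongly agrees with the article but expects the crowd to be lukewarm might answer 0.80 to question 1 and 0.55 to question 2, i.e. p_i = 0.80 and m_i = 0.55 (illustrative numbers only).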
1. Article Score
Domain Specific Expertise (MPW Divergence)
For each respondent i, let
D_i \;=\; \lvert\, p_i - m_i \,\rvert
where:
- p_i is the respondent's probability that the main opinion is correct, and
- m_i is the respondent's meta-prediction of the group's average probability.
Rationale: If a respondent's own opinion differs significantly from what they expect the crowd to believe, that divergence is taken as an indicator of potential domain-specific insight.
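Continuing the illustrative report above (p_i = 0.80, m_i = 0.55), the divergence under the absolute-difference form is:
D_i \;=\; \lvert\, 0.80 - 0.55 \,\rvert \;=\; 0.25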
Track Record
Each respondent's historical performance is expressed as a percentile rank TR_i (with values between 0 and 1, where 0.5 represents the median performance). This is then incorporated via a multiplier:
M_i \;=\; 1 \;+\; \beta\,(TR_i - 0.5)
Here, \beta is a variable parameter that we can calibrate. For example, with \beta = 1, a respondent with a perfect track record (TR_i = 1) would have M_i = 1.5, while one with the lowest rank (TR_i = 0) would have M_i = 0.5.
Rationale: The track record multiplier adjusts the influence of the divergence component based on past performance. Those with a good track record and a high divergence will be heavily upweighted as they are showing two valuable signals.
Combined Weight
For each respondent, combine the divergence and track record components multiplicatively into an unnormalised weight:
w_i^{\mathrm{raw}} \;=\; D_i \times M_i
Normalisation of Weights
To ensure that all weights sum to 1, normalise the unnormalised weights:
w_i \;=\; \frac{w_i^{\mathrm{raw}}}{\sum_{j=1}^{n} w_j^{\mathrm{raw}}}
Rationale: Normalisation makes the weights comparable and ensures that the final aggregated score is a true weighted average of the respondents' probabilities. This step rescales the combined scores so that no matter how large or small the individual components are, the final influence each respondent has is relative to the overall group.
Final Aggregated Score
Formula: The final score S for the opinion article is calculated as the weighted average of the primary reports:
S \;=\; \sum_{i=1}^{n} w_i\, p_i
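As a sketch, the whole article-score calculation can be written in a few lines of Python (using the absolute-difference divergence and linear track-record multiplier above; \beta is left as a parameter):

```python
from typing import Sequence

def article_score(p: Sequence[float], m: Sequence[float],
                  tr: Sequence[float], beta: float = 1.0) -> float:
    """Aggregate primary reports into an article score S: each respondent is
    weighted by their divergence |p_i - m_i| times the track-record
    multiplier 1 + beta * (TR_i - 0.5), with weights normalised to sum to 1."""
    assert len(p) == len(m) == len(tr)
    # For beta <= 2 the multiplier stays non-negative for TR_i in [0, 1].
    raw = [abs(pi - mi) * (1.0 + beta * (tri - 0.5))
           for pi, mi, tri in zip(p, m, tr)]
    total = sum(raw)
    if total == 0:                 # degenerate case: every p_i equals m_i
        return sum(p) / len(p)     # fall back to the unweighted mean
    return sum(w / total * pi for w, pi in zip(raw, p))

# Three respondents: primary reports, meta-predictions, track-record percentiles.
print(article_score(p=[0.80, 0.40, 0.60], m=[0.55, 0.45, 0.60], tr=[0.9, 0.5, 0.2]))
```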
2. Individual Scores
Respondents/bettors are scored via a BTS-type system where accuracy and honesty are the optimal strategies. Specifically, they are rewarded for:
- Accurately predicting what the crowd will forecast (the "prediction" score).
- Having a primary report that turns out "surprisingly common" compared to what was predicted (the "information" score).
2.1 Information Score
Kernel Aggregators
We collect each agent's primary report p_i \in [0,1] and meta-prediction m_i \in [0,1]. We then construct two kernel density estimators (KDEs). The purpose of this is to imply a full distribution from the discrete reports, so that the predicted distribution can be properly compared to the realised distribution.
- Actual KDE \hat{f}(x), based on the \{p_i\}:
\hat{f}(x) \;=\; \frac{1}{n\,h(n)} \sum_{j=1}^{n} K\!\left(\frac{x - p_j}{h(n)}\right) \;+\; \alpha(n)
- Predicted KDE \hat{g}(x), based on the \{m_i\}:
\hat{g}(x) \;=\; \frac{1}{n\,h(n)} \sum_{j=1}^{n} K\!\left(\frac{x - m_j}{h(n)}\right) \;+\; \alpha(n)
where:
- K is the Epanechnikov kernel, K(u) = \tfrac{3}{4}(1 - u^2) for |u| < 1 and 0 otherwise,
- h(n) is a dynamic bandwidth that shrinks as n grows, and
- \alpha(n) > 0 is a small pseudo-count offset, also shrinking with n, that keeps both densities strictly positive.
Quadratic (Brier) Score
Define
\phi(x) \;=\; 1 - [\,1 - x\,]^2 \;=\; 2x \;-\; x^2
Each agent i then receives an information score:
I_i \;=\; \phi\bigl(\hat{f}(p_i)\bigr) \;-\; \phi\bigl(\hat{g}(p_i)\bigr)
- \phi(\hat{f}(p_i)) measures how common p_i actually is (in a quadratic sense).
- \phi(\hat{g}(p_i)) measures how common p_i was predicted to be.
- The difference is positive if p_i turns out "more common than expected".
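As a quick illustration with made-up density values: if the realised density at an agent's report is \hat{f}(p_i) = 0.9 while the predicted density was \hat{g}(p_i) = 0.6, then
I_i \;=\; \phi(0.9) - \phi(0.6) \;=\; 0.99 - 0.84 \;=\; 0.15 > 0
so that report was surprisingly common and earns a positive information score.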
Rationale:
- No Arbitrary Bins: Traditional BTS (for discrete categories) must count occurrences in bins. For continuous probabilities, that discretisation is unnatural and can produce perverse outcomes. A kernel density smoothly estimates frequencies without artificial cut-offs.
- Epanechnikov Kernel: Has bounded support |u| < 1, avoiding infinite tails.
- Dynamic Bandwidth: At small n, h(n) is larger, smoothing more aggressively. At large n, h(n) \to 0, capturing finer distinctions.
- Offset \alpha(n): Ensures \hat{f}(x) > 0 everywhere, so no agent ever encounters the most extreme outcome. The dynamic bandwidth and pseudo-count ensure robustness at low n.
- Brier Score (Difference): Subtracting \phi\bigl(\hat{g}(p_i)\bigr) (the predicted density's quadratic value) gives a "surprisingly popular" reward, which is positive if p_i ends up more common than expected.
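A minimal, self-contained sketch of the information score; the specific bandwidth and offset schedules below are illustrative assumptions, not fixed by the design above:

```python
from typing import List, Sequence

def epanechnikov(u: float) -> float:
    """Epanechnikov kernel: bounded support |u| < 1, no infinite tails."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def kde(x: float, points: Sequence[float]) -> float:
    """Kernel density estimate at x with a dynamic bandwidth h(n) and a small
    positive offset alpha(n); both shrink as n grows (illustrative schedules)."""
    n = len(points)
    h = max(0.05, n ** -0.2)      # assumed bandwidth schedule h(n)
    alpha = 1.0 / (n + 1.0)       # assumed pseudo-count offset alpha(n)
    return sum(epanechnikov((x - p) / h) for p in points) / (n * h) + alpha

def phi(x: float) -> float:
    """Quadratic (Brier-style) transform phi(x) = 1 - (1 - x)^2 = 2x - x^2."""
    return 2.0 * x - x * x

def information_scores(p: Sequence[float], m: Sequence[float]) -> List[float]:
    """Information score per agent: phi(actual density at p_i) minus
    phi(predicted density at p_i), rewarding 'surprisingly common' reports."""
    return [phi(kde(pi, p)) - phi(kde(pi, m)) for pi in p]

# Example: five primary reports and five meta-predictions.
p = [0.8, 0.75, 0.7, 0.4, 0.85]
m = [0.5, 0.55, 0.5, 0.5, 0.6]
print(information_scores(p, m))
```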
2.2 Prediction Score
- Regularised Group Average: Instead of a raw mean of the primary reports, use a mean shrunk towards \tfrac{1}{2} by a small pseudo-count:
\bar{p}^{\star} \;=\; \frac{\sum_{j=1}^{n} p_j \;+\; \tfrac{1}{2}\,\alpha_{\mathrm{B}}}{n + \alpha_{\mathrm{B}}}
where \alpha_{\mathrm{B}} > 0 is small. This keeps \bar{p}^{\star} \in (0,1), never exactly 0 or 1 at small n.
- Brier Score: Each agent i provides a meta-prediction m_i \in [0,1]. Their prediction score is:
P_i \;=\; 1 \;-\; \bigl(m_i - \bar{p}^{\star}\bigr)^2
High scores (up to 1) reward accurate guesses of the group's average.
Rationale
- Continuous Probability Setting: We're asking each agent for a probability in [0,1]. A Brier-type rule is strictly proper for a real-valued outcome.
- Avoiding Log Blow-Ups: Log scoring for a fraction can go to -\infty if that fraction is exactly 0 or 1. The Brier rule remains finite in all cases.
- Weighted (Regularised) Average: By adding a small pseudo-count \alpha_{\mathrm{B}}, extreme outcomes (0 or 1) are impossible at small n. This lowers variance and improves stability.
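And a matching sketch of the prediction score, with the regularised average written as shrinkage towards \tfrac{1}{2} (the exact regularisation form is an assumption):

```python
from typing import List, Sequence

def regularised_mean(p: Sequence[float], alpha_b: float = 1.0) -> float:
    """Group average of the primary reports, shrunk towards 0.5 by a
    pseudo-count alpha_b so it is never exactly 0 or 1 at small n."""
    return (sum(p) + 0.5 * alpha_b) / (len(p) + alpha_b)

def prediction_scores(p: Sequence[float], m: Sequence[float],
                      alpha_b: float = 1.0) -> List[float]:
    """Brier-style prediction score: 1 - (m_i - regularised group average)^2."""
    p_star = regularised_mean(p, alpha_b)
    return [1.0 - (mi - p_star) ** 2 for mi in m]

# Example: the same five reports and meta-predictions as in the sketch above.
p = [0.8, 0.75, 0.7, 0.4, 0.85]
m = [0.5, 0.55, 0.5, 0.5, 0.6]
print(prediction_scores(p, m))
```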
Final Combined Score
Each agent i receives a combined score of the form:
\text{Score}_i \;=\; I_i \;+\; \lambda\, P_i
where I_i is the information score, P_i is the prediction score, and \lambda > 0 is a tunable weight on the prediction component (playing the role of the \alpha weight in classical BTS).
This yields a Continuous Probability BTS that (hopefully!) remains:
- Strictly Proper (honest reporting is optimal),
- Robust (no infinite log penalties, no forced bins),
- Adaptive (dynamic smoothing for small vs. large n).
In other words, we solve the problem of domain mismatch between classical (categorical) BTS and new (probabilistic) questions by abandoning bins in favour of a kernel approach, along with a Brier rule suited to real-valued [0,1] predictions.
Discussion
This is an attempt to recreate the BTS Nash equilibrium, but it might be broken, and it certainly will be if the dynamic kernels aren't calibrated correctly. We'll need to perform simulations.
For the article score the combination of both multipliers will have to be carefully calibrated. Too much weight could be given to forecasters with a strong track record and a large divergence.
I'm currently thinking that fee/tip/bet sizes should scale with how much one diverges from what they predict the crowd average will be. This increases risk:reward under BTS, so it makes sense that the financial cost should mirror it. So there'll be some minimum bet, and the more you diverge the more it'll cost you to try to achieve a high score.
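One concrete way to write that sizing rule down (purely illustrative; the minimum fee f_{\min} and slope \gamma are parameters invented here for the sketch):
\text{fee}_i \;=\; f_{\min}\bigl(1 + \gamma\,\lvert p_i - m_i \rvert\bigr)
With f_{\min} = 1 and \gamma = 4, a respondent reporting p_i = 0.80 against a meta-prediction of m_i = 0.55 would pay 1 \times (1 + 4 \times 0.25) = 2 units.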
Any thoughts and criticisms welcome.