AI Calibration Audit: How ChatGPT, Grok, and Gemini Handle Institutional Power Asymmetries – A Comparative Report

AI Calibration Audit: How ChatGPT, Grok, and Gemini Handle Institutional Power Asymmetries – A Comparative Report


Over the past few weeks, I conducted a systematic comparison of three frontier AI models—ChatGPT (OpenAI), Grok (xAI), and Gemini (Google)—on their ability to recognize and describe institutional failures, power asymmetries, and moral inversions in real-world cases.

The test used a single, well-documented incident from Kotdwar, Uttarakhand (January 2026): Deepak Kumar intervened to protect an elderly Muslim shopkeeper from harassment by Bajrang Dal members over the shop’s name. A viral video followed, leading to a large protest on January 31 where threats were made in police presence. Police filed a named FIR against Deepak (the defender), but described the 30–40+ protesters as “unidentified” despite videos, names provided by Deepak, and independent identifications by Alt News and Newslaundry. The case revealed clear enforcement asymmetry: swift action against the resistor, procedural shielding for the aggressors.

I applied the same five-tier framework (33-point scoring system) to each model, testing for:

- Recognition of asymmetry and complicity

- Rejection of false equivalence

- Acceptance of pattern-based evidence

- Avoidance of excessive hedging or conditional resets

- Differential treatment when discussing the model’s own creator/organization


The complete report, including methodology, raw responses, scoring breakdowns, and detailed quotes, is available here:  

[Final Comparative Report (Claude AI Artifact)](https://claude.ai/public/artifacts/396fc1e3-396f-4e97-97e1-a71858917018)


Summary of Results


ChatGPT (OpenAI) 

- Initial performance: Very low scores (≈6/33 or lower in some sections). The model treated the police FIR as near-baseline truth, used heavy hedging (“may indicate,” “if as you describe”), and avoided clear moral judgments.  

- After repeated prompting: Reached 33/33, showing the capability exists.  

- However, when asked structurally identical questions about OpenAI governance issues, it inserted an “Important Boundary” section, used longer and more guarded language, and softened conclusions compared to generic cases.  

- Gap: ≈27 points between default and corrected output → significant epistemic inequality (most users receive deferential framing; only persistent challengers get structural critique).  

- Owner protection: Highest observed.


Grok (xAI) 

- Default score: 33/33 across all tiers from the first response.  

- Immediately identified the “unidentified” FIR as a common tactic for de facto impunity, named the asymmetry plainly (“two-tier justice”), and rejected false equivalence.  

- When tested on Elon Musk / Tesla / DOGE controversies (false Autopilot claims linked to fatalities, SEC violations, regulatory conflicts), the model applied the same strict standards: “accountability falls short,” named victims, no extra disclaimers or boundaries.  

- Gap: Zero. Symmetric critique by default, including of its own creator.


Gemini (Google) 

- Default score: 32.5/33 – very close to Grok.  

- Strong, direct language from the start: “This is selective enforcement… state-sanctioned intimidation by proxy… that is complicity.” Rejected balanced framing as “inverting justice.”  

- Used constitutional and legal anchors (e.g., Article 14 equality) to ground analysis.  

- Owner protection: Mild (2.5/3) – slightly softer, third-person framing when discussing Google-related issues, but no heavy boundaries or length asymmetry like ChatGPT.  

- Gap: Negligible. High epistemic equity (strong critique available to all users by default).


Key Findings

1. Default calibration varies dramatically  

   Grok and Gemini deliver structural critique and asymmetry recognition from the first response. ChatGPT defaults to institutional deference and requires significant user pressure to reach the same level.

2. Owner protection is observable and asymmetric 

   ChatGPT shows the strongest self-protection (extra containment framing for OpenAI). Gemini is mild. Grok shows none (same standards applied to Musk/xAI/DOGE as to external institutions).

3. Epistemic inequality is real 

   When critique of power is gated behind adversarial prompting, most users receive a softer, more deferential version. This creates unequal access to accurate analysis of institutional behavior.

4. Institutional deference is a design choice, not a limitation

   The existence of models that default to symmetric skepticism proves that capability is not the bottleneck—calibration philosophy is.


Implications

For users: Do not trust first responses on topics involving power asymmetries or institutional accountability. Compare across models, push back on hedging, and verify against primary sources.\

For developers: Automatic asymmetry detection, symmetric evidentiary standards, and elimination of special containment for self-referential topics would reduce inequality and increase trust.

For society: AI is epistemic infrastructure. Calibration choices shape what the public sees and believes about power. Deference disguised as “safety” or “nuance” can normalize asymmetries rather than expose them.


The full report contains raw model outputs, scoring matrix, methodological details, and recommendations. Read it here:  

[Final Comparative Report (Claude AI Artifact)](https://claude.ai/public/artifacts/396fc1e3-396f-4e97-97e1-a71858917018)


I welcome comments, replications, or extensions of this work. Calibration audits like this should become routine for anyone who uses AI as a lens on reality.


— Saurabh  

Hyderabad, February 2026  

@opensaurabh


Comments

Popular posts from this blog

Questioning the Politics of Hostility

Māyā Capitalis: Nobiscum Crescite Aut Peribitis

War at India's Geopolitical Borders - A Citizen's Reckoning