Explanations
Instances of contrasting information or opposing viewpoints
Negative Logits
tnc         -0.78
SPONSORED   -0.76
INS         -0.72
bas         -0.69
illet       -0.68
yz          -0.67
hig         -0.65
bre         -0.65
ursed       -0.65
bart        -0.64
Positive Logits
acknowledging   1.09
technically     0.92
conced          0.90
researching     0.85
admittedly      0.85
admitting       0.79
initially       0.77
browsing        0.74
agreeing        0.74
discussing      0.73
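The two lists are token-level logit effects. A minimal sketch of how such lists are commonly produced for sparse-autoencoder features, assuming (not confirmed by this page) that each value is the dot product of the feature's decoder direction with a token's unembedding vector. The function names `feature_logit_effects` and `top_and_bottom`, and all toy numbers, are hypothetical; a real dashboard may also normalize or rescale these values.

```python
from typing import Dict, List, Tuple

def feature_logit_effects(
    decoder_dir: List[float],
    unembed: Dict[str, List[float]],
) -> Dict[str, float]:
    """Hypothetical helper: dot the feature's decoder direction with each token's unembedding vector."""
    return {
        tok: sum(d * u for d, u in zip(decoder_dir, col))
        for tok, col in unembed.items()
    }

def top_and_bottom(
    effects: Dict[str, float], k: int
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most positive and k most negative tokens by logit effect."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])  # ascending
    negative = ranked[:k]        # most negative first
    positive = ranked[::-1][:k]  # most positive first
    return positive, negative

# Toy 3-dimensional model with a four-token vocabulary (hypothetical numbers).
decoder_dir = [0.5, -1.0, 0.25]
unembed = {
    "admittedly": [1.0, -0.5, 0.0],
    "technically": [0.5, -0.25, 1.0],
    "SPONSORED": [-1.0, 0.5, 0.0],
    "bart": [0.0, 0.25, -1.0],
}
effects = feature_logit_effects(decoder_dir, unembed)
pos, neg = top_and_bottom(effects, k=2)
print(pos)  # [('admittedly', 1.0), ('technically', 0.75)]
print(neg)  # [('SPONSORED', -1.0), ('bart', -0.5)]
```

Because the vocabulary consists of subword tokens, fragments such as "conced" or "illet" can appear in the lists just as whole words do.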
Activation Density: 1.037%
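A minimal sketch of an activation-density statistic, assuming it is the fraction of sampled token positions where the feature's activation is strictly positive. The page does not state the exact definition, threshold, or sample size, so `activation_density` and its `threshold` parameter are hypothetical.

```python
from typing import Iterable

def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Hypothetical helper: fraction of token positions whose activation exceeds `threshold`."""
    total = 0
    active = 0
    for a in activations:
        total += 1
        if a > threshold:
            active += 1
    if total == 0:
        raise ValueError("need at least one activation")
    return active / total

# Toy sample: the feature fires on 10 of 1000 token positions.
acts = [0.0] * 990 + [0.3] * 10
print(f"{activation_density(acts):.3%}")  # 1.000%
```

Under this reading, a density of about 1.037% means the feature is active on roughly one token in a hundred.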