INDEX
Explanations
instances of contradiction or doublespeak in statements
Head Attr Weights
0:0.05
1:0.04
2:0.01
3:0.09
4:0.05
5:0.17
6:0.04
7:0.02
8:0.07
9:0.38
10:0.01
11:0.01
Negative Logits
apo       -2.49
ieu       -2.47
ema       -2.10
akable    -2.04
Submit    -2.03
depth     -2.00
utor      -1.95
ivities   -1.94
icter     -1.90
rition    -1.83
Positive Logits
mentions      2.39
ALSO          1.96
conspic       1.92
Phelps        1.92
prominently   1.91
also          1.91
.)            1.88
coincided     1.86
Rowling       1.86
fared         1.84
Activations Density 0.129%