INDEX
Explanations
words and phrases related to serious accusations or investigations
Head Attr Weights
0:0.02
1:0.02
2:0.11
3:0.15
4:0.02
5:0.03
6:0.04
7:0.08
8:0.16
9:0.11
10:0.07
11:0.13
Negative Logits
certific    -1.06
Chambers    -0.93
Sr          -0.92
Got         -0.90
Hartford    -0.90
Emb         -0.90
Gall        -0.90
examines    -0.90
UCLA        -0.89
Copyright   -0.88
Positive Logits
iHUD        1.25
ouf         1.07
>[          1.03
dden        1.01
terday      1.00
alkyrie     0.98
uph         0.97
inaction    0.96
amaru       0.96
iling       0.95
Activations Density 0.016%