Explanations
phrases related to accountability and public scrutiny
Head Attr Weights
0:0.02
1:0.02
2:0.15
3:0.40
4:0.07
5:0.03
6:0.03
7:0.04
8:0.03
9:0.05
10:0.06
11:0.06
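As a reading aid, the listing above can be parsed to find which head carries the most attribution. This is only a minimal sketch: it assumes each line has the form "head index:weight", and the names parse_head_weights, top_head, and total are hypothetical helpers introduced for illustration, not part of the dashboard.

```python
# Head attribution weights copied verbatim from the listing ("head:weight").
RAW_WEIGHTS = """0:0.02
1:0.02
2:0.15
3:0.40
4:0.07
5:0.03
6:0.03
7:0.04
8:0.03
9:0.05
10:0.06
11:0.06"""


def parse_head_weights(text):
    """Parse 'head:weight' lines into a {head_index: weight} dict."""
    weights = {}
    for line in text.splitlines():
        head, _, value = line.partition(":")
        weights[int(head)] = float(value)
    return weights


weights = parse_head_weights(RAW_WEIGHTS)

# Head with the largest attribution weight.
top_head = max(weights, key=weights.get)

# Sum of the displayed weights (the listed values are rounded to two decimals).
total = sum(weights.values())

print(top_head, weights[top_head], round(total, 2))
```

Under these assumptions, head 3 stands out with a weight of 0.40, followed by head 2 at 0.15. The displayed weights need not sum to exactly 1, since they are shown rounded.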
Negative Logits
)."     -1.79
igi     -1.72
)"      -1.67
)</     -1.66
iHUD    -1.49
acia    -1.49
obi     -1.46
),"     -1.46
OV      -1.46
ofi     -1.44
Positive Logits
awoken          1.74
worse           1.70
somew           1.69
subconscious    1.65
might           1.58
enough          1.54
Shit            1.53
somehow         1.52
better          1.50
doomed          1.50
Activations Density: 0.171%