INDEX
Explanations
words related to censorship
Head Attr Weights
0:0.07
1:0.08
2:0.07
3:0.09
4:0.08
5:0.08
6:0.09
7:0.08
8:0.07
9:0.07
10:0.08
11:0.06
Negative Logits
livest:-2.63
colle:-2.45
defe:-2.36
ewater:-2.36
Dunn:-2.30
Rove:-2.28
Closure:-2.28
utorial:-2.26
oult:-2.25
Capt:-2.24
Positive Logits
NEO:3.17
NK:2.91
IST:2.85
Soy:2.81
Airbus:2.76
Nissan:2.76
Mazda:2.73
Musk:2.72
Yar:2.71
ismo:2.70
Activations Density 0.000%