Explanations
words related to societal concerns and safety issues
Negative Logits
awah       -0.16
.Glide     -0.15
ehler      -0.15
arias      -0.15
disag      -0.15
lage       -0.14
Animalia   -0.14
avig       -0.14
/REC       -0.14
mada       -0.14
Positive Logits
iscard     0.15
igy        0.15
T          0.14
ince       0.14
Zar        0.14
DLC        0.14
BAT        0.14
ić         0.14
azz        0.14
Gust       0.14
Activation Density: 0.022%