INDEX
Explanations
concepts related to safety and harm in various contexts
Negative Logits
sons      -0.17
isko      -0.15
borg      -0.15
isci      -0.14
implify   -0.14
al        -0.14
kre       -0.14
malı      -0.14
adoo      -0.14
olv       -0.13
Positive Logits
enu         0.17
onta        0.17
oct         0.16
Directions  0.16
ENU         0.16
æ¦          0.15
ÃŃž         0.15
inet        0.15
ffa         0.14
onu         0.14
Activations Density 0.183%