Explanations
words related to safety
Negative Logits
eldorf      -0.17
Stanley     -0.15
ases        -0.14
isku        -0.14
ÑĪиб        -0.14
丶          -0.14
FTA         -0.14
fmt         -0.14
ible        -0.14
asters      -0.14

Positive Logits
eguard      0.33
ETY         0.31
avid        0.20
eties       0.19
eway        0.19
AreaView    0.19
Saf         0.18
aris        0.18
ARI         0.17
saf         0.17
Activations Density 0.009%