INDEX
Explanations
words related to the concept of "safety" or "security."
Negative Logits
erah        -0.15
slack       -0.15
atings      -0.15
aneously    -0.15
REA         -0.15
atee        -0.15
spy         -0.15
refl        -0.14
askell      -0.14
setattr     -0.14
Positive Logits
osten       0.28
ott         0.27
vil         0.27
osp         0.26
ottom       0.26
vol         0.26
otto        0.25
periment    0.25
ulla        0.25
ulle        0.25
Activations Density 0.006%