INDEX
Explanations
words related to control, regulation, or enforcement
Negative Logits
lm      -0.20
lp      -0.19
latex   -0.19
l       -0.19
egr     -0.18
lx      -0.18
lv      -0.18
ri      -0.17
ls      -0.17
ono     -0.17
Positive Logits
̧       0.27
raft    0.23
heck    0.21
chio    0.21
chia    0.21
s       0.20
ourt    0.19
eneg    0.19
ussion  0.19
avity   0.19
Activations Density: 0.216%