Explanations
phrases and terms related to principles
Negative Logits
Token         Logit
oken          -0.18
gie           -0.17
emez          -0.17
encias        -0.15
dens          -0.15
ceph          -0.15
rait          -0.15
akan          -0.15
emp           -0.15
ney           -0.14
Positive Logits
Token         Logit
-agent        0.28
ities         0.19
-Agent        0.19
ps            0.18
investigator  0.18
/pr           0.18
stown         0.17
Investig      0.16
pal           0.16
etro          0.15
Activations Density: 0.016%