INDEX
Explanations
terms related to hierarchy and superiority
Negative Logits
Token        Logit
afari        -0.19
sworth       -0.16
suppress     -0.15
nest         -0.15
/how         -0.14
voor         -0.14
pray         -0.14
ACC          -0.14
doors        -0.14
soever       -0.14
Positive Logits
Token        Logit
ior          0.28
iors         0.27
intendent    0.24
IOR          0.24
iore         0.23
charged      0.23
cil          0.23
ordinate     0.22
stit         0.22
iores        0.21
Activations Density: 0.039%