Explanations
concepts related to dominance and power dynamics
Negative Logits
aci
-0.16
ade
-0.15
Savaş
-0.15
lich
-0.14
tre
-0.14
rou
-0.14
het
-0.14
antee
-0.14
inspir
-0.14
Mir
-0.14
Positive Logits
easy
0.22
容易
0.21
easily
0.21
easy
0.21
Easily
0.19
relatively
0.19
Easy
0.19
易
0.19
easiest
0.18
fácil
0.18
Activations Density 0.276%