INDEX
Explanations
references to liberation or liberating actions
Negative Logits
Token      Logit
par        -0.64
Hale       -0.60
colo       -0.59
listing    -0.59
Sil        -0.58
xus        -0.58
KE         -0.58
WAYS       -0.58
Grimm      -0.57
enegger    -0.57
Positive Logits
Token          Logit
liberated      1.05
liberate       1.02
liberating     0.86
raint          0.83
liberation     0.81
selves         0.81
emancipation   0.80
veland         0.77
nesday         0.75
piration       0.74
Activations Density: 0.012%