Explanations
references to freedom and related concepts
Negative Logits
Token     Logit
antan     -0.17
ryan      -0.16
aul       -0.15
.sheet    -0.14
fu        -0.14
ored      -0.14
fn        -0.14
c         -0.14
sel       -0.14
ardi      -0.14
Positive Logits
Token     Logit
quent     0.26
udent     0.23
ddie      0.23
estyle    0.22
inkel     0.22
QUENCY    0.21
edom      0.20
itas      0.19
unds      0.19
fre       0.19
Activations Density: 0.009%