Explanations
Mentions of the concept of freedom, or references to the word "free."
Negative Logits
older     -0.81
sidx      -0.81
ulous     -0.77
amel      -0.72
therap    -0.69
URRENT    -0.67
IPS       -0.65
Takeru    -0.64
ENTS      -0.64
ENTION    -0.61
Positive Logits
bies      1.37
bie       1.11
zing      1.08
zers      1.08
boot      0.98
zes       0.97
roam      0.96
edom      0.94
zer       0.83
ze        0.83
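The logit lists above rank vocabulary tokens by how strongly this feature pushes the model's output toward (positive) or away from (negative) them. A minimal sketch of one common way such lists are produced, assuming the score for each token is the dot product of the feature's decoder direction with that token's column of the unembedding matrix. The helper names `logit_effects` and `top_and_bottom` are hypothetical, and the vocabulary, direction, and matrix values are toy data, not taken from this feature.

```python
# Hypothetical sketch: ranking tokens by a feature's direct logit effect.
# Assumption: score(token) = decoder_dir . unembed[:, token]. All values are toy data.

def logit_effects(decoder_dir, unembed):
    """Return one score per vocabulary token: decoder_dir dotted with unembed[:, t]."""
    d_model = len(decoder_dir)
    vocab_size = len(unembed[0])
    return [
        sum(decoder_dir[i] * unembed[i][t] for i in range(d_model))
        for t in range(vocab_size)
    ]

def top_and_bottom(vocab, scores, k):
    """Return the k highest-scoring (positive) and k lowest-scoring (negative) tokens."""
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1])
    negative = ranked[:k]                      # most negative first
    positive = list(reversed(ranked[-k:]))     # most positive first
    return positive, negative

vocab = ["bies", "zing", "older", "sidx"]
decoder_dir = [1.0, -0.5]               # toy 2-d residual-stream direction
unembed = [[1.2, 0.9, -0.6, -0.4],      # toy d_model x vocab unembedding matrix
           [-0.3, -0.2, 0.4, 1.0]]

scores = logit_effects(decoder_dir, unembed)
positive, negative = top_and_bottom(vocab, scores, k=2)
print(positive)
print(negative)
```

With a real model the vocabulary has tens of thousands of entries, so only the top and bottom few tokens are shown, as in the lists above.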
Activation Density: 0.056%
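Activation density is the share of tokens on which the feature fires. A minimal sketch, assuming "fires" means an activation strictly above zero over a corpus of already-collected activations. The helpers `activation_density` and `as_percent` are hypothetical, and the activation list is toy data built so that 56 of 100,000 tokens fire, matching the 0.056% figure.

```python
# Hypothetical sketch: activation density as the fraction of firing tokens.
# Assumption: a token "fires" when its activation exceeds the threshold (default 0).

def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)

def as_percent(fraction, digits=3):
    """Format a fraction as a percentage string, e.g. 0.00056 -> '0.056%'."""
    return f"{fraction * 100:.{digits}f}%"

# Toy data: 56 nonzero activations spread across 100,000 tokens.
acts = [0.0] * 100_000
for i in range(0, 56 * 1000, 1000):
    acts[i] = 2.5

density = activation_density(acts)
print(as_percent(density))
```

A density this low means the feature is silent on the vast majority of tokens, which is typical of a narrowly targeted feature.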