INDEX
Explanations
references to the concept of freedom or phrases associated with free will
Negative Logits
amm         -0.16
Baron       -0.15
ivo         -0.15
py          -0.15
atham       -0.15
sec         -0.14
pic         -0.14
ella        -0.14
arium       -0.14
uss         -0.14
Positive Logits
fall        0.22
zing        0.21
/free       0.20
zers        0.20
enterprise  0.19
lance       0.19
zes         0.19
free        0.19
-wheel      0.19
Enterprise  0.19
Activations Density 0.030%