INDEX
Explanations
statements of belief, opinion, or claims about events or attributes
Negative Logits
898           -0.16
yre           -0.15
usta          -0.14
ftware        -0.14
ettle         -0.14
alet          -0.14
/start        -0.14
δα            -0.13
iki           -0.13
-consuming    -0.13
Positive Logits
to            0.22
capable       0.20
ly            0.19
anced         0.18
edly          0.17
responsible   0.16
likely        0.16
worthy        0.16
likely        0.16
Likely        0.15
Activations Density: 0.066%