INDEX
Explanations
expressions of belief or perceptions of truth
Negative Logits
cta      -0.19
otte     -0.17
ocol     -0.17
ilir     -0.15
ody      -0.14
dana     -0.14
occo     -0.14
å¸Ń      -0.13
bones    -0.13
ban      -0.13
Positive Logits
ÃĹ↵↵     0.18
065      0.17
hare     0.15
à¸Ļà¸Ķ   0.15
wap      0.15
.cn      0.15
ihat     0.15
hi       0.14
hr       0.14
cages    0.14
Activations Density 0.008%