Explanations
expressions of skepticism or criticism towards authority figures or systems
Negative Logits
485     -0.16
raft    -0.15
agon    -0.15
ÃĸL     -0.15
.cgi    -0.14
abay    -0.14
御      -0.14
iar     -0.14
igne    -0.14
INI     -0.14
Positive Logits
when    0.24
when    0.23
When    0.19
khi     0.19
cuando  0.19
quando  0.18
When    0.18
when    0.18
WHEN    0.18
adian   0.17
Activations Density 0.159%