INDEX
Explanations
sentences that convey personal commitment or experience
Negative Logits
ruc          -0.18
acom         -0.15
ohn          -0.14
Han          -0.14
onas         -0.14
ocus         -0.14
mare         -0.14
elligence    -0.14
iminal       -0.14
zet          -0.14
Positive Logits
exist        0.19
exists       0.19
existed      0.19
Cat          0.18
existe       0.17
ué           0.15
exists       0.15
fram         0.15
elo          0.15
Cat          0.15
Activations Density: 0.013%