Explanations
phrases or terms indicating reasons and justifications for actions or opinions
Negative Logits
Token     Logit
aro       -0.14
forme     -0.14
BERT      -0.14
:^        -0.14
bert      -0.13
egg       -0.13
aan       -0.13
eg        -0.13
rts       -0.13
اÙĦØ©    -0.13
Positive Logits
Token     Logit
sake      0.58
purposes  0.52
purpose   0.27
reasons   0.26
pur       0.21
purpose   0.21
PURPOSE   0.20
Purpose   0.20
reason    0.18
_REASON   0.18
Activations Density: 1.274%
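
For readers unfamiliar with the metric, here is a minimal sketch of one common reading of activation density: the share of token positions on which the feature fires at all. The function name, the zero threshold, and the sample values are assumptions made for illustration; the page does not state how its figure was computed.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    `activations` is a hypothetical list of per-token feature activations.
    The zero threshold is an assumption, not something the page specifies.
    """
    if not activations:
        raise ValueError("activations must not be empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Made-up sample: the feature fires on 2 of 8 token positions.
sample = [0.0, 0.0, 1.7, 0.0, 0.3, 0.0, 0.0, 0.0]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 1.274% would mean the feature is active on roughly 13 of every 1,000 tokens sampled.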