INDEX
Explanations
phrases indicating justification or reasoning
Negative Logits
oux       -0.20
å¹ķ       -0.18
Pok       -0.17
earer     -0.14
):-       -0.14
ammen     -0.14
aku       -0.14
hol       -0.14
ushman    -0.14
931       -0.13
Positive Logits
reason    0.37
reason    0.27
reasons   0.26
Reason    0.24
Reason    0.23
purpose   0.23
_reason   0.21
.reason   0.20
_REASON   0.20
purpose   0.19
Activations Density: 0.041%