Explanations
patterns related to reasoning and justification
"[reason]" or "[reasons]" after certain words
for any reason
Negative Logits
SourceChecksum    -0.83
estekak           -0.82
Roskov            -0.73
referenties       -0.72
esternos          -0.65
nahilalakip       -0.65
CppMethod         -0.63
Diwedd            -0.62
sätzlich          -0.61
########.         -0.59
Positive Logits
reason            3.02
reasons           2.73
reason            2.41
Reason            2.17
Reason            2.16
reasons           2.12
Reasons           2.09
REASON            2.02
Reasons           1.95
REASONS           1.86
Activations Density: 0.411%