INDEX
Explanations
phrases indicating justification or lack of justification
instances of the word "reason" and its variations, indicating justifications or rationales
Negative Logits
Token       Logit
chin        -0.68
chron       -0.63
Carbuncle   -0.61
xon         -0.60
tein        -0.59
ModLoader   -0.57
ilation     -0.57
ophon       -0.56
Warcraft    -0.55
rodu        -0.54
Positive Logits
Token           Logit
why             1.51
why             1.32
WHY             1.23
abl             1.06
Why             0.99
Why             0.96
justifying      0.81
rationale       0.80
justification   0.78
Reviewer        0.74
Activations Density 0.044%