INDEX
Explanations
phrases related to rationale or reasoning for certain actions or decisions
terms related to explanations and justifications for actions or beliefs
Negative Logits
token     logit
semble    -0.88
redd      -0.75
oner      -0.69
chance    -0.69
rop       -0.67
zig       -0.66
plet      -0.65
hold      -0.65
chin      -0.65
omez      -0.65
Positive Logits
token           logit
rationale       1.13
why             1.02
justification   0.96
SourceFile      0.93
justifying      0.92
reasoning       0.87
behind          0.86
WHY             0.84
underpin        0.78
justify         0.78
Activations Density 0.025%
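
For readers unfamiliar with the density figure, the sketch below shows one common way such a number could be computed: the fraction of token positions on which the feature's activation is above a threshold. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical, and the source does not state how its 0.025% figure was actually derived.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of tokens on which the feature
    fires at all; the dashboard's exact definition is not given in the source.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: the feature fires on 1 of 4000 token positions.
sample = [0.0] * 3999 + [2.5]
density = activation_density(sample)
print(f"{density:.3%}")  # 0.025%
```

A density this low would mean the feature is sparse, firing on roughly one token in four thousand under this reading.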