Explanations
text related to explanations, justifications, or the underlying logic behind actions or decisions
phrases related to reasoning and justification
Negative Logits
Token       Logit
adal        -0.72
vette       -0.68
onies       -0.67
semble      -0.67
uckle       -0.66
stocking    -0.65
hold        -0.65
national    -0.64
borg        -0.64
vas         -0.63
Positive Logits
Token             Logit
reasoning         1.19
rationale         1.01
DragonMagazine    0.96
why               0.95
SourceFile        0.95
argument          0.88
justification     0.84
arguments         0.80
excuse            0.79
WHY               0.79
Activations Density 0.007%