INDEX
Explanations
phrases related to citing reasons or explanations
phrases that attribute reasons or justifications, particularly linked to actions or endorsements
Negative Logits
cair      -0.74
mop       -0.69
llular    -0.66
orr       -0.65
medium    -0.65
leaf      -0.65
rams      -0.65
rone      -0.64
roup      -0.63
roman     -0.62
Positive Logits
justification    1.25
evidence         1.09
examples         1.08
reasons          1.07
inspiration      1.06
proof            1.03
contributing     1.02
reason           1.01
evidence         1.00
culprit          0.97
Activations Density: 0.084%