INDEX
Explanations
phrases indicating the absence of something or negation
claims that lack evidence or credibility
Negative Logits
raid          -0.73
EEE           -0.72
NN            -0.72
NI            -0.71
GD            -0.68
Volunte       -0.68
ATS           -0.67
HAEL          -0.66
Alert         -0.64
ATT           -0.64
Positive Logits
merit         1.20
relevance     1.19
meaning       1.16
redeem        1.16
significance  1.11
resemblance   1.11
intrinsic     1.07
validity      1.06
rhy           0.99
merits        0.96
Activations Density: 0.139%