Explanations
words related to justifications or reasons for actions
Negative Logits
Token        Logit
icer         -0.78
sung         -0.73
apes         -0.69
quer         -0.69
Wend         -0.68
ascus        -0.68
estern       -0.67
resent       -0.66
chrom        -0.66
Patriarch    -0.65
Positive Logits
Token        Logit
basis        1.03
tenance      0.81
grounds      0.77
premise      0.76
footing      0.76
uation       0.73
plates       0.72
foundation   0.72
grounding    0.70
ifiers       0.70
Activations Density: 0.028%
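
The density figure can be read as the fraction of token positions on which the feature activates. The sketch below assumes that definition, which the page itself does not state. The function name `activation_density`, the `threshold` parameter, and the activation values are hypothetical, chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of tokens on which the feature fires.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, 3 of which activate the feature.
acts = [0.0] * 10_000
acts[17] = 1.2
acts[4242] = 0.4
acts[9000] = 2.5

density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.030%
```

Under this reading, a density of 0.028% corresponds to roughly 28 active positions per 100,000 tokens.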