INDEX
Explanations
the word "No" with high activation values
repeated occurrences of the word "No."
Negative Logits
Token       Logit
rn          -0.67
RAFT        -0.64
MpServer    -0.64
knit        -0.63
tein        -0.62
CLOSE       -0.60
RANT        -0.59
adobe       -0.59
ULAR        -0.59
iership     -0.59
Positive Logits
Token       Logit
kidding     1.08
doubt       1.05
wonder      1.00
vel         1.00
zzle        0.96
isy         0.96
matter      0.95
longer      0.92
emi         0.92
ct          0.90
Activations Density: 0.060%