INDEX
Explanations
percentages and statistics in the text
Negative Logits
CV          -0.65
nob         -0.65
tresp       -0.59
senal       -0.59
ensis       -0.59
initials    -0.58
umbn        -0.58
illusion    -0.57
whims       -0.57
boss        -0.57
Positive Logits
thirds      0.99
etheless    0.90
ixty        0.87
irteen      0.85
entimes     0.78
Ń·          0.77
ccording    0.77
roximately  0.76
fif         0.76
racuse      0.75
Activations Density 0.131%