INDEX
Explanations
proper nouns
the word "so" in various contexts
Negative Logits
DERR       -0.66
trespass   -0.62
deficit    -0.62
races      -0.61
appra      -0.60
prose      -0.60
curves     -0.60
separ      -0.59
shutter    -0.58
disbelief  -0.57
POSITIVE LOGITS
fter     1.09
aked     1.06
oner     1.05
FTWARE   1.04
bered    1.04
ppy      1.00
ooo      0.99
oooo     0.98
zzo      0.95
oths     0.95
Activations Density 0.039%