INDEX
Explanations
the word "sat" with varying activation values
instances of the word "sat."
Negative Logits
Token          Logit
ALLY           -0.75
escal          -0.74
credit         -0.73
enforcement    -0.70
ISO            -0.69
achev          -0.68
obs            -0.65
negative       -0.64
raid           -0.64
ever           -0.64
Positive Logits
Token          Logit
seiz           0.86
ivas           0.86
anic           0.82
chers          0.78
nav            0.74
lie            0.73
anism          0.73
chel           0.73
toget          0.71
urn            0.71
Activations Density: 0.009%