Explanations
the word "over" with increasingly strong activations
Negative Logits
Token        Logit
osity        -0.71
Forward      -0.64
associates   -0.63
partName     -0.62
Deity        -0.62
yssey        -0.60
olson        -0.60
resy         -0.58
oS           -0.57
iosity       -0.57
Positive Logits
Token        Logit
kill         1.20
blown        1.14
rated        1.10
stated       1.07
reaching     1.02
loading      0.99
priced       0.97
joy          0.95
drive        0.95
sold         0.92
Activations Density 0.023%