INDEX
Explanations
the word "Show" with varying activation strengths
instances of the word "Show"
Negative Logits
OAD       -0.72
ADE       -0.67
adem      -0.61
bsite     -0.59
Seeking   -0.59
Bere      -0.58
ngth      -0.58
auri      -0.57
eco       -0.57
Âł        -0.56
Positive Logits
Thumbnails  1.03
alter       0.74
cases       0.70
case        0.68
me          0.64
nested      0.63
boat        0.63
kat         0.61
downs       0.60
boats       0.59
Activations Density 0.028%