INDEX
Explanations
mentions of the term "junk" at different activation levels
occurrences of the term "unk," suggesting a focus on unspecified or unknown entities or terms
Negative Logits
voy         -0.76
APH         -0.64
expressive  -0.63
flare       -0.60
hor         -0.60
unintended  -0.59
flared      -0.58
effective   -0.58
orsi        -0.57
latitude    -0.56
Positive Logits
buster      1.12
geon        1.04
irk         0.97
rat         0.94
etsu        0.91
busters     0.91
ernel       0.90
regate      0.90
lift        0.89
ett         0.88
Activations Density: 0.020%