INDEX
Explanations
references to personal relationships and family connections
Negative Logits
asper      -0.16
åĩĿ        -0.16
ucken      -0.16
Unnamed    -0.15
IDGET      -0.15
ÅĻes       -0.15
imizer     -0.14
.prompt    -0.14
ediator    -0.14
ıcı        -0.14
Positive Logits
dangerous      0.17
danger         0.17
jeopard        0.16
secrets        0.16
dangerously    0.15
both           0.15
worse          0.15
deeper         0.15
deep           0.15
darkest        0.15
Activations Density 0.374%