INDEX
Explanations
phrases related to legal issues and safety concerns
punctuated phrases indicating lists or multiple ideas
Negative Logits
ophon         -0.76
abouts        -0.74
ibility       -0.72
Orig          -0.72
imb           -0.67
utral         -0.66
DX            -0.66
é¾            -0.65
Availability  -0.64
MQ            -0.63
Positive Logits
lest          1.17
forgetting    1.00
eh            0.98
thereby       0.94
huh           0.93
or            0.92
ruining       0.92
ignoring      0.89
knowing       0.88
ignores       0.88
Activations Density: 0.419%