Explanations
references to seriousness and safety concerns
Negative Logits
Token      Logit
enheim     -0.16
ích        -0.16
abby       -0.15
#          -0.14
iteli      -0.14
시행       -0.13
emey       -0.13
indsight   -0.13
reib       -0.13
quine      -0.13
Positive Logits
Token         Logit
serious       1.09
Serious       0.93
serious       0.91
seriousness   0.88
seriously     0.76
-ser          0.69
seri          0.67
Seriously     0.60
серьез        0.59
Ser           0.58
Activations Density 0.251%