Explanations
references to personal safety and threats
Negative Logits
è¥ (incomplete UTF-8 byte sequence)  -0.07
adar  -0.07
CHAT  -0.06
ёр  -0.06
염  -0.06
asic  -0.06
ascent  -0.06
อห  -0.06
okus  -0.06
itech  -0.06
Positive Logits
safety  0.14
Safety  0.13
protection  0.13
Safety  0.12
Protection  0.12
threats  0.11
-threat  0.11
安全 (Chinese, "safety, security")  0.11
security  0.11
Protection  0.11
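Feature dashboards commonly build the Positive and Negative Logits lists in one way. They project the feature's decoder direction onto the model's unembedding matrix, then report the largest and smallest entries. This page does not describe its exact procedure, for example whether a final layer norm is folded in first. The following is therefore only a minimal sketch under that assumption. It uses toy numbers rather than real model weights, and the function names and data are hypothetical.

```python
def logit_effects(decoder_dir, unembed):
    """Dot the feature's decoder direction with each token's unembedding column."""
    # unembed[i][t] is the weight from residual dimension i to vocabulary token t.
    vocab_size = len(unembed[0])
    return [
        sum(decoder_dir[i] * unembed[i][t] for i in range(len(decoder_dir)))
        for t in range(vocab_size)
    ]


def top_and_bottom(effects, vocab, k):
    """Return the k most positive and the k most negative (token, value) pairs."""
    ranked = sorted(zip(vocab, effects), key=lambda pair: pair[1], reverse=True)
    # ranked[::-1] puts the most negative entry first.
    return ranked[:k], ranked[::-1][:k]


# Toy example: 2-dimensional residual stream, 4-token vocabulary.
vocab = ["safety", "threat", "CHAT", "okus"]
decoder_dir = [1.0, 0.5]
unembed = [
    [0.2, 0.1, -0.05, 0.0],
    [0.1, 0.1, 0.0, -0.2],
]
effects = logit_effects(decoder_dir, unembed)
top, bottom = top_and_bottom(effects, vocab, k=2)
print([t for t, _ in top])  # ['safety', 'threat']
print([t for t, _ in bottom])  # ['okus', 'CHAT']
```

Because each vocabulary entry is ranked once, two identical-looking strings in such a list, like the repeated "Safety" and "Protection" above, must be distinct tokens.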
Activation density: 0.054%
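Activation density is usually defined as the fraction of sampled tokens on which the feature's activation is nonzero. Under that definition, 0.054% corresponds to roughly 54 activating tokens per 100,000. The sketch below assumes this definition and a flat list of per-token activations. The function name and threshold parameter are illustrative.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation is strictly above the threshold."""
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)


# Toy data: the feature fires on 54 of 100,000 sampled tokens.
acts = [0.0] * 100_000
for i in range(0, 54 * 1000, 1000):
    acts[i] = 1.5
print(f"{activation_density(acts):.3%}")  # 0.054%
```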