INDEX
Explanations
references to punishment or legal consequences
Negative Logits
(no visible token)   -0.60
Guys                 -0.49
scene                -0.49
(no visible token)   -0.48
$                    -0.48
American             -0.48
Guys                 -0.48
Dean                 -0.47
guys                 -0.47
mode                 -0.46
Positive Logits
autorytatywna   0.82
protoimpl       0.81
annica          0.76
sonno           0.75
للاسماء         0.74
poichè          0.72
imakasih        0.71
الرياضيه        0.71
'\\;'           0.70
للمعارف         0.70
Activations Density 0.578%