Explanations
harmful and dangerous statements
Negative Logits
but	0.88
nhưng	0.83
pero	0.77
אך	0.76
लेकिन	0.75
但不	0.75
ngunit	0.75
ولكن	0.74
αλλά	0.73
എന്നാൽ	0.73
Positive Logits
!!!!!!!!!!!!!!!!	0.73
!!!!	0.70
!!!!	0.69
!!!!!	0.67
!!!!!!!	0.66
!!!!!!	0.66
!!!!!!!!	0.64
!!!	0.62
!!!	0.57
PERIOD	0.57
Activations Density: 0.055%