Explanations
physical environment, action, or harm
Negative Logits
Token         Value
ình           0.74
8             0.72
”)            0.71
٦             0.71
)$.           0.70
minutes       0.70
thirty        0.70
versation     0.69
prisoners     0.69
loved         0.68
Positive Logits
Token         Value
physical      1.20
fisik         1.05
physically    1.02
físico        0.98
ع             0.96
(blank)       0.95
fís           0.88
física        0.80
физи          0.77
fysis         0.77
Activations Density: 0.019%