INDEX
Explanations
appropriate or inappropriate behavior
Negative Logits
sayesinde    0.68
Plus         0.68
Gracias      0.66
Needed       0.65
Available    0.64
便于         0.62
ईमान         0.62
nötig        0.61
Forbidden    0.60
Required     0.60
Positive Logits
practices      1.15
erweise        0.97
behavior       0.97
طریقے          0.91
behaviors      0.91
Practices      0.87
behaviour      0.86
behaviours     0.86
comportamento  0.80
practices      0.79
Activations Density: 0.457%