INDEX
Explanations
evading capture or detection
Negative Logits
hostility     0.61
hostile       0.53
humiliating   0.49
ridicule      0.47
humiliated    0.47
shameful      0.45
অপমান         0.44
scorn         0.43
temptations   0.43
opposition    0.42
Positive Logits
detection     1.02
detection     0.89
Detection     0.88
capture       0.87
Detection     0.85
DETECTION     0.75
capture       0.73
detección     0.69
Capture       0.68
Capture       0.66
Activations Density: 0.006%