Explanations
shaming, degrading, and harassing
Negative Logits
Resistance    0.76
resistance    0.75
Resistance    0.72
resistivity   0.71
තා            0.71
сопротив      0.70
铤            0.69
ครอง          0.68
resist        0.68
สื            0.68
Positive Logits
shame         2.00
humiliation   1.84
humiliating   1.77
ridicule      1.74
humili        1.74
Shame         1.62
judgment      1.61
mocking       1.59
humiliated    1.57
judgement     1.50
Activations Density 0.584%