INDEX
Explanations
concepts related to punishment
references to various types of punishment
Negative Logits
sonian    -0.90
gow       -0.83
ergy      -0.74
Mich      -0.71
sie       -0.69
coe       -0.67
opic      -0.65
aug       -0.65
olated    -0.64
itect     -0.64
Positive Logits
punishment     1.04
punishments    0.99
inflicted      0.97
punished       0.92
sanction       0.85
regimes        0.80
punish         0.80
harshly        0.79
imposed        0.79
penalty        0.79
Activations Density 0.022%