INDEX
Explanations
references to harm or damage in various contexts
Negative Logits
Monfieur      -1.06
purpoſe       -0.96
Anſ           -0.96
ſeveral       -0.96
itſelf        -0.95
Chriftian     -0.95
expandindo    -0.92
pleaſure      -0.91
ſtate         -0.90
Majefty       -0.89
Positive Logits
stra          0.73
Railway       0.60
igh           0.55
s             0.54
z             0.53
bou           0.53
y             0.52
u             0.51
wa            0.51
yl            0.51
Activations Density: 0.147%