INDEX
Explanations
warnings or negative implications related to actions that can lead to significant negative consequences
instances of the word "ruin" and its variations indicating negative consequences
Negative Logits
arij        -0.81
leground    -0.80
duino       -0.80
bors        -0.76
reluct      -0.71
appa        -0.70
soType      -0.68
rict        -0.68
rouch       -0.66
>>>>>>>>    -0.65
Positive Logits
havoc       1.15
ous         0.90
ously       0.86
OUS         0.81
spoil       0.79
spo         0.78
ruined      0.76
ifully      0.76
ruining     0.75
ruin        0.74
Activations Density 0.039%