Explanations
mentions of potential risks or negative consequences
Negative Logits
admirable    -0.65
ãģ®é         -0.64
ãģ®éŃĶ       -0.61
Avg          -0.61
çͰ           -0.61
ellen        -0.59
courage      -0.58
inho         -0.58
hest         -0.58
aples        -0.57
Positive Logits
someday          1.02
repercussions    0.96
urrence          0.91
retribution      0.88
angering         0.85
relapse          0.85
apocalypse       0.83
contag           0.81
repr             0.81
poisoning        0.80
Activations Density: 0.350%