INDEX
Explanations
principles, ethics, and morality-related phrases
phrases related to ethics and moral principles
Negative Logits
acas     -0.63
dor      -0.62
zhou     -0.61
é¾į      -0.59
apons    -0.58
ij       -0.58
使       -0.57
cffff    -0.55
cli      -0.55
scar     -0.55
Positive Logits
coincidence  0.82
nutshell     0.68
shenan       0.67
itch         0.63
brainer      0.63
pecul        0.62
kinda        0.58
creek        0.57
sill         0.57
!"           0.56
Activations Density: 0.988%