Explanations
phrases related to justice or moral judgment
Negative Logits
ĻĤ        -0.71
ason      -0.66
SPA       -0.65
lapt      -0.63
satell    -0.62
iaries    -0.61
advoc     -0.60
worldly   -0.59
acas      -0.58
gard      -0.58
Positive Logits
/"           1.15
referring    0.95
["           0.84
implying     0.79
[            0.75
referencing  0.75
refers       0.72
([           0.71
meaning      0.71
("           0.70
Activations Density 0.655%