INDEX
Explanations
descriptive phrases that compare actions or qualities, often emphasizing effectiveness and moral implications
Negative Logits
ucha           -0.16
alach          -0.16
krv            -0.14
->___          -0.14
auge           -0.14
tet            -0.14
rox            -0.14
ynthia         -0.14
иÑĤÑĥ           -0.14
functioning    -0.13
Positive Logits
justice        0.27
cket           0.24
things         0.23
damage         0.20
Justice        0.20
justice        0.20
thing          0.19
Damage         0.19
work           0.19
Justice        0.19
Activations Density 0.243%