Explanations
phrases related to moral warnings or consequences
Negative Logits
こんな  -0.15
này  -0.14
这个  -0.13
loh  -0.13
estas  -0.13
esta  -0.13
aji  -0.12
δέ  -0.12
oji  -0.12
rary  -0.12
Positive Logits
those  1.27
those  1.12
Those  1.03
Those  0.98
那些  0.90
ones  0.80
ceux  0.80
těch  0.62
celui  0.54
ones  0.52
Activations Density 0.578%
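
As a rough illustration of the density figure above, the sketch below assumes that density means the fraction of token positions on which the feature's activation is above zero. That definition, the function name `activation_density`, and the sample data are assumptions made for illustration; they are not taken from the dashboard itself.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumed definition: a position counts as "active" when its activation
    is strictly greater than `threshold` (0.0 by default).
    """
    acts = list(activations)
    if not acts:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in acts if a > threshold)
    return fired / len(acts)


# Hypothetical data: 100,000 token positions, 578 of which activate the feature.
sample = [0.0] * 99_422 + [1.5] * 578
print(f"{activation_density(sample):.3%}")
```

Under this assumption, a density of 0.578% would mean the feature fires on roughly 6 of every 1,000 tokens in the sampled text.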