INDEX
Explanations
expressions of love, purity, and moral goodness
Negative Logits
Token     Logit
ant       -0.16
lein      -0.16
owitz     -0.16
adar      -0.15
emer      -0.15
maf       -0.15
istem     -0.15
bew       -0.15
erg       -0.14
.ie       -0.14
Positive Logits
Token     Logit
ammen     0.16
APPED     0.15
ATEST     0.15
agini     0.15
@student  0.15
åĮ        0.15
á»ı       0.15
mey       0.14
Trev      0.14
ils       0.13
Activations Density: 0.119%