INDEX
Explanations
words related to negative actions or characteristics
expressions of astonishment or awe
Negative Logits
フォ       -0.76
Gemini     -0.74
姫         -0.70
Feld       -0.67
Luxem      -0.67
ーティ     -0.67
xual       -0.66
Schl       -0.65
士         -0.65
ゼ         -0.64
Positive Logits
akening    1.47
akens      1.31
kward      1.22
ashington  1.06
atche      1.05
adesh      1.02
aii        1.01
oln        0.98
yers       0.96
atson      0.95
Activations Density 0.010%