INDEX
Explanations
phrases indicating different perspectives or ways of thinking
Negative Logits
nad      -0.19
lsru     -0.16
ģn       -0.15
ën       -0.15
βε       -0.14
jac      -0.14
ogan     -0.14
elda     -0.14
alors    -0.14
lette    -0.13
Positive Logits
ward     0.17
oins     0.15
fv       0.15
cand     0.15
sigmoid  0.14
isphere  0.14
LTR      0.14
Russo    0.14
inton    0.14
wij      0.14
Activations Density: 0.032%