Explanations
phrases emphasizing associations or connections between different concepts or elements
Negative Logits
-0.64
p  -0.63
is  -0.59
v  -0.59
(  -0.57
↵↵  -0.54
on  -0.52
P  -0.52
-  -0.51
–  -0.51
Positive Logits
Efq  1.19
]='\  1.19
]--;  1.16
ſelves  1.14
Majefty  1.14
}}$}  1.13
leaſt  1.13
ſelf  1.11
་་  1.10
myſelf  1.06
Activations Density 0.505%