INDEX
Explanations
statements regarding decision-making and evaluation processes
Negative Logits
pong   -0.15
adero  -0.15
tere   -0.14
kat    -0.14
oda    -0.14
ycl    -0.14
owy    -0.13
inas   -0.13
arem   -0.13
tern   -0.13
Positive Logits
bear     0.25
attempt  0.21
odds     0.20
Listed   0.19
you      0.19
Attempt  0.19
listed   0.19
it       0.19
Bear     0.18
there    0.18
Activations Density 0.073%