INDEX
Explanations
articles that indicate a specific quality or importance
Negative Logits
uros      -0.17
idir      -0.17
hazi      -0.16
ral       -0.14
isman     -0.14
moil      -0.14
RATION    -0.14
rez       -0.13
rather    -0.13
ney       -0.13
Positive Logits
necessarily     0.25
anymore         0.23
particularly    0.21
coincidence     0.19
terribly        0.18
isolated        0.17
bad             0.17
particularly    0.16
endor           0.16
thing           0.16
Activations Density 0.057%
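
For readers unfamiliar with the density figure, the sketch below shows one common way such a number can be computed: the share of token positions on which the feature fires at all. This is an assumption about the definition, not something the page states. The function name `activation_density`, the `threshold` parameter, and the `sample` data are hypothetical and chosen only so the arithmetic reproduces a value of the same size.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the percentage of positions whose activation exceeds `threshold`.

    Assumed definition: density = (number of active positions) / (total positions).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: 57 active positions out of 100,000 tokens.
sample = [0.0] * 99_943 + [1.5] * 57
print(f"{activation_density(sample):.3f}%")  # prints 0.057%
```

Under this reading, a density of 0.057% would mean the feature is active on roughly 57 of every 100,000 tokens, which is typical of a sparse feature.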