INDEX
Explanations
numbered lists or rankings
Negative Logits
gypt: -0.91
ーティ: -0.83
ーテ: -0.79
alam: -0.73
inem: -0.67
ucl: -0.67
姫: -0.66
Rabb: -0.66
Ͻ: -0.63
achus: -0.63
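Several negative-logit tokens are Japanese fragments. Assuming a GPT-2-style byte-level BPE tokenizer (inferred from the token strings; the page does not name the model), the raw vocabulary stores each UTF-8 byte of such a token as one printable stand-in character, so ーティ appears as `ãĥ¼ãĥĨãĤ£` and 姫 as `å§«`. A minimal sketch that rebuilds that byte-to-character table and reverses it; `bytes_to_unicode` mirrors the table in OpenAI's GPT-2 encoder, while `decode_token` is a hypothetical helper name.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Rebuild GPT-2's byte -> printable-character table."""
    # Bytes that are already printable map to themselves.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # Remaining bytes (space, control characters, etc.) are shifted to 256 + n.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


# Inverse table: stand-in character -> original byte value.
BYTE_DECODER = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Hypothetical helper: turn a byte-level BPE token string into readable text."""
    raw = bytes(BYTE_DECODER[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


for tok in ["ãĥ¼ãĥĨãĤ£", "ãĥ¼ãĥĨ", "å§«", "gypt"]:
    print(tok, "->", decode_token(tok))
```

Plain ASCII tokens such as `gypt` pass through unchanged, because printable bytes map to themselves.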
Positive Logits
onsense: 0.75
brainer: 0.74
Shift: 0.70
WATCHED: 0.63
0001: 0.62
notice: 0.62
Fake: 0.61
whatsoever: 0.61
Chomsky: 0.61
Notice: 0.59
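Feature dashboards of this kind commonly derive both logit lists by projecting the feature's decoder direction through the model's unembedding matrix and keeping the most negative and most positive entries. That procedure is an assumption here: the page does not state how its values were computed or whether any normalization was applied. The toy sketch below uses NumPy, hypothetical names (`top_logits`, `w_dec_row`, `W_U`), and a made-up four-token vocabulary.

```python
import numpy as np


def top_logits(w_dec_row: np.ndarray, W_U: np.ndarray, vocab: list[str], k: int = 10):
    """Rank tokens by a feature's direct contribution to the output logits.

    w_dec_row: (d_model,) decoder direction of one SAE feature (assumed input).
    W_U:       (d_model, d_vocab) unembedding matrix.
    Returns (negative, positive) lists of (token, value) pairs, k each.
    """
    logits = w_dec_row @ W_U            # (d_vocab,) one value per token
    order = np.argsort(logits)          # token ids, most negative first
    negative = [(vocab[i], float(logits[i])) for i in order[:k]]
    positive = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return negative, positive


# Toy example: d_model = 2, vocabulary of four made-up tokens.
vocab = ["a", "b", "c", "d"]
W_U = np.array([[1.0, -1.0, 0.5, 0.0],
                [0.0,  0.0, 0.5, 1.0]])
w_dec_row = np.array([1.0, -0.5])

neg, pos = top_logits(w_dec_row, W_U, vocab, k=2)
print("Negative:", neg)
print("Positive:", pos)
```

Here the per-token values are 1.0, -1.0, 0.25 and -0.5, so `b` and `d` land in the negative list and `a` and `c` in the positive one.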
Activation Density: 1.057%
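Activation density is usually reported as the share of token positions on which the feature's activation is above zero. A minimal sketch under that assumption (the threshold and the token sample behind the 1.057% figure are not given on this page); `activation_density` is a hypothetical helper.

```python
import numpy as np


def activation_density(acts, threshold: float = 0.0) -> float:
    """Percentage of token positions where the feature fires.

    acts: 1-D sequence of this feature's activations, one per token.
    threshold: activations strictly above this count as firing (assumed 0).
    """
    acts = np.asarray(acts, dtype=float)
    return 100.0 * float(np.mean(acts > threshold))


# Toy sample: the feature fires on 2 of 8 tokens.
example = [0.0, 0.0, 0.5, 0.0, 1.2, 0.0, 0.0, 0.0]
density = activation_density(example)
print(f"{density:.3f}%")
```

Read the other way, a density of 1.057% means the feature fires on roughly one token in 95 (100 / 1.057 ≈ 94.6).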