INDEX
Explanations
words indicating strong opinions or obvious conclusions
Negative Logits
Token       Logit
istrat      -0.17
WA          -0.15
recent      -0.15
yesterday   -0.15
ðŁ          -0.14
canonical   -0.14
livest      -0.14
wherever    -0.14
recent      -0.14
Bren        -0.13
Positive Logits
Token       Logit
cave        0.17
prisoner    0.17
represent   0.16
代表        0.15
ceb         0.15
Prison      0.15
Cave        0.15
perce       0.15
represent   0.15
perceived   0.15
Activations Density 0.000%