INDEX
Explanations
phrases related to a specific organization or brand
references to specific organizations or entities, particularly related to academia or research
Negative Logits
panc       -0.84
Hillary    -0.72
soup       -0.72
Columbus   -0.68
Chop       -0.68
Bos        -0.67
Dunk       -0.65
Vers       -0.65
Tunis      -0.63
boiled     -0.63
Positive Logits
RL          4.68
rl          1.63
RL          1.62
RS          1.44
JR          1.33
LR          1.31
RC          1.30
SL          1.25
LL          1.22
ARA         1.18
Activations Density: 0.010%