INDEX
Explanations
mentions of specific entities or groups within broader topics
instances of the word "including."
Negative Logits
rait      -0.88
iny       -0.80
uters     -0.79
iri       -0.79
iet       -0.78
erb       -0.77
ules      -0.74
uay       -0.74
endant    -0.73
ifi       -0.73
Positive Logits
those       0.75
ours        0.69
yours       0.68
NJ          0.64
flashbacks  0.63
hasht       0.62
spoilers    0.61
hypoc       0.61
ones        0.61
worth       0.60
Activations Density: 0.069%