INDEX
Explanations
statements confirming or asserting facts
assertions of factual claims
Negative Logits
Flavoring      -0.73
bye            -0.69
Crown          -0.68
Sov            -0.67
Greenwood      -0.64
South          -0.60
Klux           -0.59
Corinthians    -0.58
incinn         -0.58
CK             -0.57

Positive Logits
ional           0.97
REP             0.78
uality          0.73
olkien          0.72
çī              0.71
netflix         0.71
uracy           0.71
opus            0.70
orial           0.69
###             0.69
Activations Density: 0.016%