INDEX
Explanations
phrases expressing emotions or opinions
details about evaluations, responses, and outcomes related to decisions and events
Negative Logits
iliated       -0.74
igate         -0.66
ornia         -0.63
interrupted   -0.63
uments        -0.62
emies         -0.59
ificantly     -0.59
itory         -0.58
orno          -0.56
aneously      -0.56
Positive Logits
antics          0.88
sincerity       0.73
honesty         0.69
eness           0.68
inconsistency   0.68
temptation      0.68
arrang          0.67
motives         0.67
portrayal       0.67
pitfalls        0.67
Activations Density 0.720%