INDEX
Explanations
phrases or terms within quotations
quoted phrases or expressions in the text
Negative Logits
"[          -0.84
afar        -0.83
whilst      -0.79
preceded    -0.78
Ubisoft     -0.76
accomp      -0.75
[…]         -0.75
while       -0.75
cited       -0.75
viewed      -0.75
POSITIVE LOGITS
clean       1.46
moral       1.43
death       1.43
safe        1.40
reset       1.39
smart       1.39
zero        1.38
Make        1.37
safety      1.37
comfort     1.36
Activations Density 0.093%