INDEX

Explanations

deception and lies

misrepresentation, falsehood, deception

Configuration

Prompts (Dashboard): 392,802 prompts, 256 tokens each

Dataset (Dashboard): monology/pile-uncopyrighted

Negative Logits

"맡"            0.38
"രം"            0.37
"揮"            0.36
"quinazoline"   0.36
" متجه"         0.36
"锢"            0.36
" FETCH"        0.36
"umine"         0.35
"傍"            0.35
"갇"            0.34

Positive Logits

" 거짓"         2.80
"谎"            2.64
" झूठ"          2.63
" deception"    2.58
"謊"            2.55
" deceit"       2.52
" lied"         2.50
" falsehood"    2.42
" deceptive"    2.41
" deceiving"    2.41

Activations Density: 0.218%