Getting started

Installation

vtjson is available via pip:

$ pip install vtjson

Tutorial

Here is a simple schema:

book_schema = {
  "title": str,
  "authors": [str, ...],
  "editor?": str,
  "year": int,
}

The following conventions were used:

  • As in typescript, a (string) key ending in ? represents an optional key. The corresponding schema (the item the key points to) will only be used for validation when the key is present in the object that should be validated. A key can also be made optional by wrapping it as vtjson.optional_key().

  • If in a list/tuple the last entry is (ellipsis) it means that the next to last entry will be repeated zero or more times. In this way generic types can be created. For example the schema [str, …] represents a list of strings.

Let’s try to validate some book objects:

good_book = {
  "title": "Gone with the Wind",
  "authors": ["Margaret Mitchell"],
  "year": 1936,
}

bad_book = {
  "title": "Gone with the Wind",
  "authors": ["Margaret Mitchell"],
  "year": "1936",
}

validate(book_schema, good_book, name="good_book")
validate(book_schema, bad_book, name="bad_book")

As expected vtjson throws an exception for the second object:

Traceback (most recent call last):
    ...
    raise ValidationError(message)
vtjson.vtjson.ValidationError: bad_book['year'] (value:'1936') is not of type 'int'

We can turn the book_schema into a genuine Python type.

Book = make_type(book_schema)

print(f"Is good_book an instance of Book? {isinstance(good_book, Book)}!")
print(f"Is bad_book an instance of Book? {isinstance(bad_book, Book)}!")
Is good_book an instance of Book? True!
Is bad_book an instance of Book? False!

We may also rewrite the book_schema as a valid Python type annotation.

from typing import NotRequired, TypedDict

class book_schema(TypedDict):
  title: str
  authors: list[str]
  editor: NotRequired[str]
  year: int

Attempting to validate the bad book raises a similar exception as before:

validate(book_schema, bad_book, name="bad_book")
Traceback (most recent call last):
    ...
    raise ValidationError(message)
vtjson.vtjson.ValidationError: bad_book is not of type 'book_schema': bad_book['year'] (value:'1936') is not of type 'int'

vtjson.safe_cast() functions exactly like cast except that it also verifies at run time that the given object matches the given schema.

book2 = safe_cast(book_schema, good_book)
book3 = safe_cast(book_schema, bad_book)

The exception thrown is similar.

Traceback (most recent call last):
    ...
    raise ValidationError(message)
vtjson.vtjson.ValidationError: object is not of type 'book_schema': object['year'] (value:'1936') is not of type 'int'

Schemas can of course be more complicated and in particular they can be nested

person_schema = {
  "name": regex("[a-zA-Z. ]*"),
  "email?": email,
  "website?": url,
}

book_schema = {
  "title": str,
  "authors": [person_schema, ...],
  "editor?": person_schema,
  "year": intersect(int, ge(1900)),
}

regex, email and url are built-in schemas. See Built-in schemas. intersect is a wrapper. See Wrappers. ge is a modifier. See Modifiers. It should be obvious that the schema

intersect(int, ge(1900))

represents an integer greater or equal than 1900.

Let’s validate an object not fitting the schema.

bad_book = {
  "title": "Gone with the Wind",
  "authors": [{"name": "Margaret Mitchell", "email":"margaret@gmailcom"}],
  "year": "1936",
}

validate(book_schema, bad_book, name="bad_book")
Traceback (most recent call last):
    ...
    raise ValidationError(message)
vtjson.vtjson.ValidationError: bad_book['authors'][0]['email'] (value:'margaret@gmailcom') is not of type 'email': The part after the @-sign is not valid. It should have a period.

As before we can rewrite the new book_schema as a valid type annotation

from typing import Annotated, NotRequired, TypedDict

class person_schema(TypedDict):
  name: Annotated[str, regex("[a-zA-Z. ]*")]
  email: NotRequired[Annotated[str, email]]
  website: NotRequired[Annotated[str, url]]

class book_schema(TypedDict):
  title: str
  authors: list[person_schema]
  editor: NotRequired[person_schema]
  year: Annotated[int, ge(1900)]

Many constraints expressible in vtjson schemas cannot be expressed in the language of type annotations. That’s where typing.Annotated comes in. Consider the following example:

Annotated[str, email]

Type checkers such as mypy only see the str part of this schema, but vtjson sees everything. For more information see Type annotations integration. There is a small caveat here: email in fact already checks that the object is a string. So as further explained in Type annotations integration, it is more efficient to write:

Annotated[str, email, skip_first]

Here it makes little difference, but the gain in efficiency may be important for larger schemas.

Let’s check that validation also works with type annotations:

validate(book_schema, bad_book, name="bad_book")
Traceback (most recent call last):
    ...
    raise ValidationError(message)
vtjson.vtjson.ValidationError: bad_book is not of type 'book_schema': bad_book['authors'][0] is not of type 'person_schema': bad_book['authors'][0]['email'] (value:'margaret@gmailcom') is not of type 'email': The part after the @-sign is not valid. It should have a period.

Real world examples

Example 1

Below we give the schema of a recent version of the run object in the mongodb database underlying the Fishtest web application https://tests.stockfishchess.org/tests. For the latest version see https://raw.githubusercontent.com/official-stockfish/fishtest/master/server/fishtest/schemas.py. See Example 2 for a version of this example that is compatible with Python type annotations.

import copy
import math
from datetime import datetime, timezone

from bson.objectid import ObjectId

from vtjson import (
    at_most_one_of,
    div,
    fields,
    ge,
    glob,
    gt,
    ifthen,
    intersect,
    ip_address,
    keys,
    lax,
    one_of,
    quote,
    regex,
    set_name,
    union,
    url,
)

username = regex(r"[!-~][ -~]{0,30}[!-~]", name="username")
net_name = regex("nn-[a-f0-9]{12}.nnue", name="net_name")
tc = regex(r"([1-9]\d*/)?\d+(\.\d+)?(\+\d+(\.\d+)?)?", name="tc")
str_int = regex(r"[1-9]\d*", name="str_int")
sha = regex(r"[a-f0-9]{40}", name="sha")
country_code = regex(r"[A-Z][A-Z]", name="country_code")
run_id = set_name(ObjectId.is_valid, "run_id")
uuid = regex(r"[0-9a-zA-Z]{2,}(-[a-f0-9]{4}){3}-[a-f0-9]{12}", name="uuid")
epd_file = glob("*.epd", name="epd_file")
pgn_file = glob("*.pgn", name="pgn_file")
even = div(2, name="even")
datetime_utc = intersect(datetime, fields({"tzinfo": timezone.utc}))

uint = intersect(int, ge(0))
suint = intersect(int, gt(0))
ufloat = intersect(float, ge(0))
sufloat = intersect(float, gt(0))


def valid_results(R):
    l, d, w = R["losses"], R["draws"], R["wins"]
    R = R["pentanomial"]
    return (
        l + d + w == 2 * sum(R)
        and w - l == 2 * R[4] + R[3] - R[1] - 2 * R[0]
        and R[3] + 2 * R[2] + R[1] >= d >= R[3] + R[1]
    )


zero_results = {
    "wins": 0,
    "draws": 0,
    "losses": 0,
    "crashes": 0,
    "time_losses": 0,
    "pentanomial": 5 * [0],
}

if_bad_then_zero_stats_and_not_active = ifthen(
    keys("bad"), lax({"active": False, "stats": quote(zero_results)})
)


def final_results_must_match(run):
    rr = copy.deepcopy(zero_results)
    for t in run["tasks"]:
        r = t["stats"]
        for k in r:
            if k != "pentanomial":
                rr[k] += r[k]
            else:
                for i, p in enumerate(r["pentanomial"]):
                    rr[k][i] += p
    if rr != run["results"]:
        raise Exception(
            f"The final results {run['results']} do not match the computed results {rr}"
        )
    else:
        return True


def cores_must_match(run):
    cores = 0
    for t in run["tasks"]:
        if t["active"]:
            cores += t["worker_info"]["concurrency"]
    if cores != run["cores"]:
        raise Exception(
            f"Cores mismatch. Cores from tasks: {cores}. Cores from "
            f"run: {run['cores']}"
        )

    return True


def workers_must_match(run):
    workers = 0
    for t in run["tasks"]:
        if t["active"]:
            workers += 1
    if workers != run["workers"]:
        raise Exception(
            f"Workers mismatch. Workers from tasks: {workers}. Workers from "
            f"run: {run['workers']}"
        )

    return True


valid_aggregated_data = intersect(
    final_results_must_match,
    cores_must_match,
    workers_must_match,
)

worker_info_schema = {
    "uname": str,
    "architecture": [str, str],
    "concurrency": suint,
    "max_memory": uint,
    "min_threads": suint,
    "username": str,
    "version": uint,
    "python_version": [uint, uint, uint],
    "gcc_version": [uint, uint, uint],
    "compiler": union("clang++", "g++"),
    "unique_key": uuid,
    "modified": bool,
    "ARCH": str,
    "nps": ufloat,
    "near_github_api_limit": bool,
    "remote_addr": ip_address,
    "country_code": union(country_code, "?"),
}

results_schema = intersect(
    {
        "wins": uint,
        "losses": uint,
        "draws": uint,
        "crashes": uint,
        "time_losses": uint,
        "pentanomial": [uint, uint, uint, uint, uint],
    },
    valid_results,
)

runs_schema = intersect(
    {
        "_id?": ObjectId,
        "version": uint,
        "start_time": datetime_utc,
        "last_updated": datetime_utc,
        "tc_base": ufloat,
        "base_same_as_master": bool,
        "rescheduled_from?": run_id,
        "approved": bool,
        "approver": union(username, ""),
        "finished": bool,
        "deleted": bool,
        "failed": bool,
        "is_green": bool,
        "is_yellow": bool,
        "workers": uint,
        "cores": uint,
        "results": results_schema,
        "results_info?": {
            "style": str,
            "info": [str, ...],
        },
        "args": intersect(
            {
                "base_tag": str,
                "new_tag": str,
                "base_nets": [net_name, ...],
                "new_nets": [net_name, ...],
                "num_games": intersect(uint, even),
                "tc": tc,
                "new_tc": tc,
                "book": union(epd_file, pgn_file),
                "book_depth": str_int,
                "threads": suint,
                "resolved_base": sha,
                "resolved_new": sha,
                "master_sha": sha,
                "official_master_sha": sha,
                "msg_base": str,
                "msg_new": str,
                "base_options": str,
                "new_options": str,
                "info": str,
                "base_signature": str_int,
                "new_signature": str_int,
                "username": username,
                "tests_repo": url,
                "auto_purge": bool,
                "throughput": ufloat,
                "itp": ufloat,
                "priority": float,
                "adjudication": bool,
                "sprt?": intersect(
                    {
                        "alpha": 0.05,
                        "beta": 0.05,
                        "elo0": float,
                        "elo1": float,
                        "elo_model": "normalized",
                        "state": union("", "accepted", "rejected"),
                        "llr": float,
                        "batch_size": suint,
                        "lower_bound": -math.log(19),
                        "upper_bound": math.log(19),
                        "lost_samples?": uint,
                        "illegal_update?": uint,
                        "overshoot?": {
                            "last_update": uint,
                            "skipped_updates": uint,
                            "ref0": float,
                            "m0": float,
                            "sq0": ufloat,
                            "ref1": float,
                            "m1": float,
                            "sq1": ufloat,
                        },
                    },
                    one_of("overshoot", "lost_samples"),
                ),
                "spsa?": {
                    "A": ufloat,
                    "alpha": ufloat,
                    "gamma": ufloat,
                    "raw_params": str,
                    "iter": uint,
                    "num_iter": uint,
                    "params": [
                        {
                            "name": str,
                            "start": float,
                            "min": float,
                            "max": float,
                            "c_end": sufloat,
                            "r_end": ufloat,
                            "c": sufloat,
                            "a_end": ufloat,
                            "a": ufloat,
                            "theta": float,
                        },
                        ...,
                    ],
                    "param_history?": [
                        [
                            {"theta": float, "R": ufloat, "c": ufloat},
                            ...,
                        ],
                        ...,
                    ],
                },
            },
            at_most_one_of("sprt", "spsa"),
        ),
        "tasks": [
            intersect(
                {
                    "num_games": intersect(uint, even),
                    "active": bool,
                    "last_updated": datetime_utc,
                    "start": uint,
                    "residual?": float,
                    "residual_color?": str,
                    "bad?": True,
                    "stats": results_schema,
                    "worker_info": worker_info_schema,
                },
                if_bad_then_zero_stats_and_not_active,
            ),
            ...,
        ],
        "bad_tasks?": [
            {
                "num_games": intersect(uint, even),
                "active": False,
                "last_updated": datetime_utc,
                "start": uint,
                "residual": float,
                "residual_color": str,
                "bad": True,
                "task_id": uint,
                "stats": results_schema,
                "worker_info": worker_info_schema,
            },
            ...,
        ],
    },
    lax(ifthen({"approved": True}, {"approver": username}, {"approver": ""})),
    lax(ifthen({"is_green": True}, {"is_yellow": False})),
    lax(ifthen({"is_yellow": True}, {"is_green": False})),
    lax(ifthen({"failed": True}, {"finished": True})),
    lax(ifthen({"deleted": True}, {"finished": True})),
    lax(ifthen({"finished": True}, {"workers": 0, "cores": 0})),
    lax(ifthen({"finished": True}, {"tasks": [{"active": False}, ...]})),
    valid_aggregated_data,
)

Example 2

This is a rewrite of Example 1 that is compatible with Python type annotations.

import copy
import math
from datetime import datetime, timezone
from typing import Annotated, Literal, NotRequired, TypedDict

from bson.objectid import ObjectId

from vtjson import (
    at_most_one_of,
    div,
    fields,
    ge,
    glob,
    gt,
    ifthen,
    intersect,
    ip_address,
    keys,
    lax,
    one_of,
    quote,
    regex,
    skip_first,
    url,
)

username = Annotated[str, regex(r"[!-~][ -~]{0,30}[!-~]", name="username"), skip_first]
net_name = Annotated[str, regex("nn-[a-f0-9]{12}.nnue", name="net_name"), skip_first]
tc = Annotated[
    str, regex(r"([1-9]\d*/)?\d+(\.\d+)?(\+\d+(\.\d+)?)?", name="tc"), skip_first
]
str_int = Annotated[str, regex(r"[1-9]\d*", name="str_int"), skip_first]
sha = Annotated[str, regex(r"[a-f0-9]{40}", name="sha"), skip_first]
country_code = Annotated[str, regex(r"[A-Z][A-Z]", name="country_code"), skip_first]
run_id = Annotated[str, ObjectId.is_valid]
uuid = Annotated[
    str,
    regex(r"[0-9a-zA-Z]{2,}(-[a-f0-9]{4}){3}-[a-f0-9]{12}", name="uuid"),
    skip_first,
]
epd_file = Annotated[str, glob("*.epd", name="epd_file"), skip_first]
pgn_file = Annotated[str, glob("*.pgn", name="pgn_file"), skip_first]
even = Annotated[int, div(2, name="even"), skip_first]
datetime_utc = Annotated[datetime, fields({"tzinfo": timezone.utc})]

uint = Annotated[int, ge(0)]
suint = Annotated[int, gt(0)]
ufloat = Annotated[float, ge(0)]
sufloat = Annotated[float, gt(0)]


class results_type(TypedDict):
    wins: uint
    losses: uint
    draws: uint
    crashes: uint
    time_losses: uint
    pentanomial: Annotated[list[int], [uint, uint, uint, uint, uint], skip_first]


def valid_results(R: results_type) -> bool:
    l, d, w = R["losses"], R["draws"], R["wins"]
    Rp = R["pentanomial"]
    return (
        l + d + w == 2 * sum(Rp)
        and w - l == 2 * Rp[4] + Rp[3] - Rp[1] - 2 * Rp[0]
        and Rp[3] + 2 * Rp[2] + Rp[1] >= d >= Rp[3] + Rp[1]
    )


results_schema = Annotated[
    results_type,
    valid_results,
]


class worker_info_schema(TypedDict):
    uname: str
    architecture: Annotated[list[str], [str, str], skip_first]
    concurrency: suint
    max_memory: uint
    min_threads: suint
    username: str
    version: uint
    python_version: Annotated[list[int], [uint, uint, uint], skip_first]
    gcc_version: Annotated[list[int], [uint, uint, uint], skip_first]
    compiler: Literal["clang++", "g++"]
    unique_key: uuid
    modified: bool
    ARCH: str
    nps: ufloat
    near_github_api_limit: bool
    remote_addr: Annotated[str, ip_address]
    country_code: country_code | Literal["?"]


class overshoot_type(TypedDict):
    last_update: uint
    skipped_updates: uint
    ref0: float
    m0: float
    sq0: ufloat
    ref1: float
    m1: float
    sq1: ufloat


class sprt_type(TypedDict):
    alpha: Annotated[float, 0.05, skip_first]
    beta: Annotated[float, 0.05, skip_first]
    elo0: float
    elo1: float
    elo_model: Literal["normalized"]
    state: Literal["", "accepted", "rejected"]
    llr: float
    batch_size: suint
    lower_bound: Annotated[float, -math.log(19), skip_first]
    upper_bound: Annotated[float, math.log(19), skip_first]
    lost_samples: NotRequired[uint]
    illegal_update: NotRequired[uint]
    overshoot: NotRequired[overshoot_type]


sprt_schema = Annotated[
    sprt_type,
    one_of("overshoot", "lost_samples"),
]


class param_schema(TypedDict):
    name: str
    start: float
    min: float
    max: float
    c_end: sufloat
    r_end: ufloat
    c: sufloat
    a_end: ufloat
    a: ufloat
    theta: float


class param_history_schema(TypedDict):
    theta: float
    R: ufloat
    c: ufloat


class spsa_schema(TypedDict):
    A: ufloat
    alpha: ufloat
    gamma: ufloat
    raw_params: str
    iter: uint
    num_iter: uint
    params: list[param_schema]
    param_history: NotRequired[list[list[param_history_schema]]]


class args_type(TypedDict):
    base_tag: str
    new_tag: str
    base_nets: list[net_name]
    new_nets: list[net_name]
    num_games: Annotated[uint, even]
    tc: tc
    new_tc: tc
    book: epd_file | pgn_file
    book_depth: str_int
    threads: suint
    resolved_base: sha
    resolved_new: sha
    master_sha: sha
    official_master_sha: sha
    msg_base: str
    msg_new: str
    base_options: str
    new_options: str
    info: str
    base_signature: str_int
    new_signature: str_int
    username: username
    tests_repo: Annotated[str, url, skip_first]
    auto_purge: bool
    throughput: ufloat
    itp: ufloat
    priority: float
    adjudication: bool
    sprt: NotRequired[sprt_schema]
    spsa: NotRequired[spsa_schema]


args_schema = Annotated[
    args_type,
    at_most_one_of("sprt", "spsa"),
]


class task_type(TypedDict):
    num_games: Annotated[uint, even]
    active: bool
    last_updated: datetime_utc
    start: uint
    residual: float
    residual_color: NotRequired[str]
    bad: NotRequired[Literal[True]]
    stats: results_schema
    worker_info: worker_info_schema


zero_results: results_type = {
    "wins": 0,
    "draws": 0,
    "losses": 0,
    "crashes": 0,
    "time_losses": 0,
    "pentanomial": 5 * [0],
}

if_bad_then_zero_stats_and_not_active = ifthen(
    keys("bad"), lax({"active": False, "stats": quote(zero_results)})
)

task_schema = Annotated[
    task_type,
    if_bad_then_zero_stats_and_not_active,
]


class bad_task_schema(TypedDict):
    num_games: Annotated[uint, even]
    active: Literal[False]
    last_updated: datetime_utc
    start: uint
    residual: float
    residual_color: str
    bad: Literal[True]
    task_id: uint
    stats: results_schema
    worker_info: worker_info_schema


class results_info_schema(TypedDict):
    style: str
    info: list[str]


class runs_type(TypedDict):
    _id: NotRequired[ObjectId]
    version: uint
    start_time: datetime_utc
    last_updated: datetime_utc
    tc_base: ufloat
    base_same_as_master: bool
    rescheduled_from: NotRequired[run_id]
    approved: bool
    approver: username | Literal[""]
    finished: bool
    deleted: bool
    failed: bool
    is_green: bool
    is_yellow: bool
    workers: uint
    cores: uint
    results: results_schema
    results_info: NotRequired[results_info_schema]
    args: args_schema
    tasks: list[task_schema]
    bad_tasks: NotRequired[list[bad_task_schema]]


def final_results_must_match(run: runs_type) -> bool:
    rr = copy.deepcopy(zero_results)
    for t in run["tasks"]:
        r = t["stats"]
        # mypy does not support variable keys for
        # TypedDict
        rr["wins"] += r["wins"]
        rr["losses"] += r["losses"]
        rr["draws"] += r["draws"]
        rr["crashes"] += r["crashes"]
        rr["time_losses"] += r["time_losses"]
        for i, p in enumerate(r["pentanomial"]):
            rr["pentanomial"][i] += p
    if rr != run["results"]:
        raise Exception(
            f"The final results {run['results']} do not match the computed results {rr}"
        )
    else:
        return True


def cores_must_match(run: runs_type) -> bool:
    cores = 0
    for t in run["tasks"]:
        if t["active"]:
            cores += t["worker_info"]["concurrency"]
    if cores != run["cores"]:
        raise Exception(
            f"Cores mismatch. Cores from tasks: {cores}. Cores from "
            f"run: {run['cores']}"
        )

    return True


def workers_must_match(run: runs_type) -> bool:
    workers = 0
    for t in run["tasks"]:
        if t["active"]:
            workers += 1
    if workers != run["workers"]:
        raise Exception(
            f"Workers mismatch. Workers from tasks: {workers}. Workers from "
            f"run: {run['workers']}"
        )

    return True


valid_aggregated_data = intersect(
    final_results_must_match,
    cores_must_match,
    workers_must_match,
)

runs_schema = Annotated[
    runs_type,
    lax(ifthen({"approved": True}, {"approver": username}, {"approver": ""})),
    lax(ifthen({"is_green": True}, {"is_yellow": False})),
    lax(ifthen({"is_yellow": True}, {"is_green": False})),
    lax(ifthen({"failed": True}, {"finished": True})),
    lax(ifthen({"deleted": True}, {"finished": True})),
    lax(ifthen({"finished": True}, {"workers": 0, "cores": 0})),
    lax(ifthen({"finished": True}, {"tasks": [{"active": False}, ...]})),
    valid_aggregated_data,
]