Getting started
Installation
vtjson is available via pip:
$ pip install vtjson
Tutorial
Here is a simple schema:
book_schema = {
    "title": str,
    "authors": [str, ...],
    "editor?": str,
    "year": int,
}
The following conventions were used:
As in TypeScript, a (string) key ending in ? represents an optional key. The corresponding schema (the item the key points to) is only used for validation when the key is present in the object being validated. A key can also be made optional by wrapping it in vtjson.optional_key().
If the last entry of a list or tuple is ... (ellipsis), the next-to-last entry is repeated zero or more times. In this way generic types can be created. For example, the schema [str, ...] represents a list of strings.
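The two conventions can be made concrete with a toy checker. This is only a sketch of the semantics described above, not vtjson's implementation: the function name matches is hypothetical, it handles only plain types, dicts and lists, and it ignores keys that the schema does not mention.

```python
def matches(schema, obj):
    """Toy structural check illustrating the '?' and ellipsis conventions."""
    if isinstance(schema, dict):
        if not isinstance(obj, dict):
            return False
        for key, sub_schema in schema.items():
            optional = key.endswith("?")
            name = key[:-1] if optional else key
            if name not in obj:
                if optional:
                    continue  # optional key absent: its schema is not used
                return False  # required key missing
            if not matches(sub_schema, obj[name]):
                return False
        return True
    if isinstance(schema, list):
        if not isinstance(obj, list):
            return False
        if len(schema) >= 2 and schema[-1] is ...:
            *fixed, repeated = schema[:-1]
            if len(obj) < len(fixed):
                return False
            # Extend the fixed prefix with the repeated entry, zero or more times.
            expected = fixed + [repeated] * (len(obj) - len(fixed))
        else:
            if len(obj) != len(schema):
                return False
            expected = schema
        return all(matches(s, o) for s, o in zip(expected, obj))
    # Anything else is treated as a plain Python type such as str or int.
    return isinstance(obj, schema)


book_schema = {"title": str, "authors": [str, ...], "editor?": str, "year": int}

print(matches(book_schema, {"title": "T", "authors": [], "year": 1936}))  # True
print(matches(book_schema, {"title": "T", "authors": ["A", "B"], "year": "1936"}))  # False
```

The first object passes even though "editor" is missing and "authors" is empty; the second fails only because "year" is a string.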
Let’s try to validate some book objects:
from vtjson import validate
good_book = {
    "title": "Gone with the Wind",
    "authors": ["Margaret Mitchell"],
    "year": 1936,
}
bad_book = {
    "title": "Gone with the Wind",
    "authors": ["Margaret Mitchell"],
    "year": "1936",
}
validate(book_schema, good_book, name="good_book")
validate(book_schema, bad_book, name="bad_book")
As expected, vtjson raises an exception for the second object:
Traceback (most recent call last):
...
raise ValidationError(message)
vtjson.vtjson.ValidationError: bad_book['year'] (value:'1936') is not of type 'int'
We can turn book_schema into a genuine Python type:
from vtjson import make_type
Book = make_type(book_schema)
print(f"Is good_book an instance of Book? {isinstance(good_book, Book)}!")
print(f"Is bad_book an instance of Book? {isinstance(bad_book, Book)}!")
Is good_book an instance of Book? True!
Is bad_book an instance of Book? False!
We may also rewrite the book_schema as a valid Python type annotation.
from typing import NotRequired, TypedDict
class book_schema(TypedDict):
    title: str
    authors: list[str]
    editor: NotRequired[str]
    year: int
Attempting to validate the bad book raises an exception similar to the previous one:
validate(book_schema, bad_book, name="bad_book")
Traceback (most recent call last):
...
raise ValidationError(message)
vtjson.vtjson.ValidationError: bad_book is not of type 'book_schema': bad_book['year'] (value:'1936') is not of type 'int'
vtjson.safe_cast() functions exactly like typing.cast, except that it also verifies at run time that the given object matches the given schema.
from vtjson import safe_cast
book2 = safe_cast(book_schema, good_book)
book3 = safe_cast(book_schema, bad_book)
The second call raises a similar exception:
Traceback (most recent call last):
...
raise ValidationError(message)
vtjson.vtjson.ValidationError: object is not of type 'book_schema': object['year'] (value:'1936') is not of type 'int'
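For contrast, typing.cast on its own performs no run-time check; it only tells the type checker to treat the value as the given type. A minimal stdlib-only sketch (the class name BookDict is hypothetical):

```python
from typing import TypedDict, cast


class BookDict(TypedDict):
    title: str
    year: int


bad_book = {"title": "Gone with the Wind", "year": "1936"}

# cast() returns its argument unchanged and never inspects it,
# so the wrong type of "year" goes unnoticed at run time.
book = cast(BookDict, bad_book)
print(book is bad_book)  # True
```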
Schemas can of course be more complicated, and in particular they can be nested:
from vtjson import email, ge, intersect, regex, url
person_schema = {
    "name": regex("[a-zA-Z. ]*"),
    "email?": email,
    "website?": url,
}
book_schema = {
    "title": str,
    "authors": [person_schema, ...],
    "editor?": person_schema,
    "year": intersect(int, ge(1900)),
}
regex, email and url are built-in schemas. See Built-in schemas. intersect is a wrapper. See Wrappers. ge is a modifier. See Modifiers. It should be obvious that the schema intersect(int, ge(1900)) represents an integer greater than or equal to 1900.
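In plain Python, the two leaf constraints used above read roughly as follows. This is a sketch, not vtjson code: the helper names year_ok and name_ok are hypothetical, and it assumes that regex requires the whole string to match the pattern.

```python
import re


def year_ok(value):
    # Reading of intersect(int, ge(1900)): both conditions must hold.
    return isinstance(value, int) and value >= 1900


def name_ok(value):
    # Assumption: the pattern must match the entire string.
    return isinstance(value, str) and re.fullmatch(r"[a-zA-Z. ]*", value) is not None


print(year_ok(1936), year_ok(1899), year_ok("1936"))  # True False False
print(name_ok("Margaret Mitchell"), name_ok("M4rgaret"))  # True False
```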
Let’s validate an object not fitting the schema.
bad_book = {
    "title": "Gone with the Wind",
    "authors": [{"name": "Margaret Mitchell", "email": "margaret@gmailcom"}],
    "year": "1936",
}
validate(book_schema, bad_book, name="bad_book")
Traceback (most recent call last):
...
raise ValidationError(message)
vtjson.vtjson.ValidationError: bad_book['authors'][0]['email'] (value:'margaret@gmailcom') is not of type 'email': The part after the @-sign is not valid. It should have a period.
As before, we can rewrite the new book_schema as a valid type annotation:
from typing import Annotated, NotRequired, TypedDict

class person_schema(TypedDict):
    name: Annotated[str, regex("[a-zA-Z. ]*")]
    email: NotRequired[Annotated[str, email]]
    website: NotRequired[Annotated[str, url]]

class book_schema(TypedDict):
    title: str
    authors: list[person_schema]
    editor: NotRequired[person_schema]
    year: Annotated[int, ge(1900)]
Many constraints expressible in vtjson schemas cannot be expressed in the language of type annotations. That’s where typing.Annotated comes in. Consider the following example:
Annotated[str, email]
Type checkers such as mypy only see the str part of this schema, but vtjson sees everything. For more information see Type annotations integration. There is a small caveat here: email in fact already checks that the object is a string. So, as further explained in Type annotations integration, it is more efficient to write:
Annotated[str, email, skip_first]
Here it makes little difference, but the gain in efficiency may be important for larger schemas.
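How the extra arguments of Annotated stay available at run time can be seen with the standard library alone. In this sketch email_marker is a hypothetical stand-in for a schema object such as email, and Person is an illustrative class.

```python
from typing import Annotated, get_args, get_type_hints

email_marker = object()  # hypothetical stand-in for a vtjson schema


class Person:
    email: Annotated[str, email_marker]


# By default the metadata is stripped, much like the view a type checker takes.
print(get_type_hints(Person)["email"] is str)  # True

# With include_extras=True the metadata is preserved for libraries to inspect.
full = get_type_hints(Person, include_extras=True)["email"]
print(get_args(full) == (str, email_marker))  # True
```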
Let’s check that validation also works with type annotations:
validate(book_schema, bad_book, name="bad_book")
Traceback (most recent call last):
...
raise ValidationError(message)
vtjson.vtjson.ValidationError: bad_book is not of type 'book_schema': bad_book['authors'][0] is not of type 'person_schema': bad_book['authors'][0]['email'] (value:'margaret@gmailcom') is not of type 'email': The part after the @-sign is not valid. It should have a period.
Real world examples
Example 1
Below we give the schema of a recent version of the run object in the mongodb database underlying the Fishtest web application https://tests.stockfishchess.org/tests. For the latest version see https://raw.githubusercontent.com/official-stockfish/fishtest/master/server/fishtest/schemas.py. See Example 2 for a version of this example that is compatible with Python type annotations.
import copy
import math
from datetime import datetime, timezone
from bson.objectid import ObjectId
from vtjson import (
    at_most_one_of,
    div,
    fields,
    ge,
    glob,
    gt,
    ifthen,
    intersect,
    ip_address,
    keys,
    lax,
    one_of,
    quote,
    regex,
    set_name,
    union,
    url,
)
username = regex(r"[!-~][ -~]{0,30}[!-~]", name="username")
net_name = regex("nn-[a-f0-9]{12}.nnue", name="net_name")
tc = regex(r"([1-9]\d*/)?\d+(\.\d+)?(\+\d+(\.\d+)?)?", name="tc")
str_int = regex(r"[1-9]\d*", name="str_int")
sha = regex(r"[a-f0-9]{40}", name="sha")
country_code = regex(r"[A-Z][A-Z]", name="country_code")
run_id = set_name(ObjectId.is_valid, "run_id")
uuid = regex(r"[0-9a-zA-Z]{2,}(-[a-f0-9]{4}){3}-[a-f0-9]{12}", name="uuid")
epd_file = glob("*.epd", name="epd_file")
pgn_file = glob("*.pgn", name="pgn_file")
even = div(2, name="even")
datetime_utc = intersect(datetime, fields({"tzinfo": timezone.utc}))
uint = intersect(int, ge(0))
suint = intersect(int, gt(0))
ufloat = intersect(float, ge(0))
sufloat = intersect(float, gt(0))
def valid_results(R):
    l, d, w = R["losses"], R["draws"], R["wins"]
    R = R["pentanomial"]
    return (
        l + d + w == 2 * sum(R)
        and w - l == 2 * R[4] + R[3] - R[1] - 2 * R[0]
        and R[3] + 2 * R[2] + R[1] >= d >= R[3] + R[1]
    )

zero_results = {
    "wins": 0,
    "draws": 0,
    "losses": 0,
    "crashes": 0,
    "time_losses": 0,
    "pentanomial": 5 * [0],
}
if_bad_then_zero_stats_and_not_active = ifthen(
    keys("bad"), lax({"active": False, "stats": quote(zero_results)})
)

def final_results_must_match(run):
    rr = copy.deepcopy(zero_results)
    for t in run["tasks"]:
        r = t["stats"]
        for k in r:
            if k != "pentanomial":
                rr[k] += r[k]
            else:
                for i, p in enumerate(r["pentanomial"]):
                    rr[k][i] += p
    if rr != run["results"]:
        raise Exception(
            f"The final results {run['results']} do not match the computed results {rr}"
        )
    else:
        return True

def cores_must_match(run):
    cores = 0
    for t in run["tasks"]:
        if t["active"]:
            cores += t["worker_info"]["concurrency"]
    if cores != run["cores"]:
        raise Exception(
            f"Cores mismatch. Cores from tasks: {cores}. Cores from "
            f"run: {run['cores']}"
        )
    return True

def workers_must_match(run):
    workers = 0
    for t in run["tasks"]:
        if t["active"]:
            workers += 1
    if workers != run["workers"]:
        raise Exception(
            f"Workers mismatch. Workers from tasks: {workers}. Workers from "
            f"run: {run['workers']}"
        )
    return True

valid_aggregated_data = intersect(
    final_results_must_match,
    cores_must_match,
    workers_must_match,
)
worker_info_schema = {
    "uname": str,
    "architecture": [str, str],
    "concurrency": suint,
    "max_memory": uint,
    "min_threads": suint,
    "username": str,
    "version": uint,
    "python_version": [uint, uint, uint],
    "gcc_version": [uint, uint, uint],
    "compiler": union("clang++", "g++"),
    "unique_key": uuid,
    "modified": bool,
    "ARCH": str,
    "nps": ufloat,
    "near_github_api_limit": bool,
    "remote_addr": ip_address,
    "country_code": union(country_code, "?"),
}
results_schema = intersect(
    {
        "wins": uint,
        "losses": uint,
        "draws": uint,
        "crashes": uint,
        "time_losses": uint,
        "pentanomial": [uint, uint, uint, uint, uint],
    },
    valid_results,
)
runs_schema = intersect(
    {
        "_id?": ObjectId,
        "version": uint,
        "start_time": datetime_utc,
        "last_updated": datetime_utc,
        "tc_base": ufloat,
        "base_same_as_master": bool,
        "rescheduled_from?": run_id,
        "approved": bool,
        "approver": union(username, ""),
        "finished": bool,
        "deleted": bool,
        "failed": bool,
        "is_green": bool,
        "is_yellow": bool,
        "workers": uint,
        "cores": uint,
        "results": results_schema,
        "results_info?": {
            "style": str,
            "info": [str, ...],
        },
        "args": intersect(
            {
                "base_tag": str,
                "new_tag": str,
                "base_nets": [net_name, ...],
                "new_nets": [net_name, ...],
                "num_games": intersect(uint, even),
                "tc": tc,
                "new_tc": tc,
                "book": union(epd_file, pgn_file),
                "book_depth": str_int,
                "threads": suint,
                "resolved_base": sha,
                "resolved_new": sha,
                "master_sha": sha,
                "official_master_sha": sha,
                "msg_base": str,
                "msg_new": str,
                "base_options": str,
                "new_options": str,
                "info": str,
                "base_signature": str_int,
                "new_signature": str_int,
                "username": username,
                "tests_repo": url,
                "auto_purge": bool,
                "throughput": ufloat,
                "itp": ufloat,
                "priority": float,
                "adjudication": bool,
                "sprt?": intersect(
                    {
                        "alpha": 0.05,
                        "beta": 0.05,
                        "elo0": float,
                        "elo1": float,
                        "elo_model": "normalized",
                        "state": union("", "accepted", "rejected"),
                        "llr": float,
                        "batch_size": suint,
                        "lower_bound": -math.log(19),
                        "upper_bound": math.log(19),
                        "lost_samples?": uint,
                        "illegal_update?": uint,
                        "overshoot?": {
                            "last_update": uint,
                            "skipped_updates": uint,
                            "ref0": float,
                            "m0": float,
                            "sq0": ufloat,
                            "ref1": float,
                            "m1": float,
                            "sq1": ufloat,
                        },
                    },
                    one_of("overshoot", "lost_samples"),
                ),
                "spsa?": {
                    "A": ufloat,
                    "alpha": ufloat,
                    "gamma": ufloat,
                    "raw_params": str,
                    "iter": uint,
                    "num_iter": uint,
                    "params": [
                        {
                            "name": str,
                            "start": float,
                            "min": float,
                            "max": float,
                            "c_end": sufloat,
                            "r_end": ufloat,
                            "c": sufloat,
                            "a_end": ufloat,
                            "a": ufloat,
                            "theta": float,
                        },
                        ...,
                    ],
                    "param_history?": [
                        [
                            {"theta": float, "R": ufloat, "c": ufloat},
                            ...,
                        ],
                        ...,
                    ],
                },
            },
            at_most_one_of("sprt", "spsa"),
        ),
        "tasks": [
            intersect(
                {
                    "num_games": intersect(uint, even),
                    "active": bool,
                    "last_updated": datetime_utc,
                    "start": uint,
                    "residual?": float,
                    "residual_color?": str,
                    "bad?": True,
                    "stats": results_schema,
                    "worker_info": worker_info_schema,
                },
                if_bad_then_zero_stats_and_not_active,
            ),
            ...,
        ],
        "bad_tasks?": [
            {
                "num_games": intersect(uint, even),
                "active": False,
                "last_updated": datetime_utc,
                "start": uint,
                "residual": float,
                "residual_color": str,
                "bad": True,
                "task_id": uint,
                "stats": results_schema,
                "worker_info": worker_info_schema,
            },
            ...,
        ],
    },
    lax(ifthen({"approved": True}, {"approver": username}, {"approver": ""})),
    lax(ifthen({"is_green": True}, {"is_yellow": False})),
    lax(ifthen({"is_yellow": True}, {"is_green": False})),
    lax(ifthen({"failed": True}, {"finished": True})),
    lax(ifthen({"deleted": True}, {"finished": True})),
    lax(ifthen({"finished": True}, {"workers": 0, "cores": 0})),
    lax(ifthen({"finished": True}, {"tasks": [{"active": False}, ...]})),
    valid_aggregated_data,
)
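The valid_results check is easier to follow on concrete numbers. The sketch below repeats the function from the schema and applies it to made-up data. It reads the five pentanomial entries as counts of game pairs ordered from worst to best pair score, which is an interpretation of the formulas rather than something the schema states.

```python
def valid_results(R):
    l, d, w = R["losses"], R["draws"], R["wins"]
    R = R["pentanomial"]
    return (
        l + d + w == 2 * sum(R)  # every pair consists of two games
        and w - l == 2 * R[4] + R[3] - R[1] - 2 * R[0]
        and R[3] + 2 * R[2] + R[1] >= d >= R[3] + R[1]
    )


# Made-up statistics: 9 game pairs, hence 18 games.
stats = {
    "wins": 6,
    "draws": 6,
    "losses": 6,
    "crashes": 0,
    "time_losses": 0,
    "pentanomial": [1, 2, 3, 2, 1],
}
print(valid_results(stats))  # True

# One extra win breaks the game count: 19 games cannot form 9 pairs.
print(valid_results({**stats, "wins": 7}))  # False
```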
Example 2
This is a rewrite of Example 1 that is compatible with Python type annotations.
import copy
import math
from datetime import datetime, timezone
from typing import Annotated, Literal, NotRequired, TypedDict
from bson.objectid import ObjectId
from vtjson import (
    at_most_one_of,
    div,
    fields,
    ge,
    glob,
    gt,
    ifthen,
    intersect,
    ip_address,
    keys,
    lax,
    one_of,
    quote,
    regex,
    skip_first,
    url,
)
username = Annotated[str, regex(r"[!-~][ -~]{0,30}[!-~]", name="username"), skip_first]
net_name = Annotated[str, regex("nn-[a-f0-9]{12}.nnue", name="net_name"), skip_first]
tc = Annotated[
    str, regex(r"([1-9]\d*/)?\d+(\.\d+)?(\+\d+(\.\d+)?)?", name="tc"), skip_first
]
str_int = Annotated[str, regex(r"[1-9]\d*", name="str_int"), skip_first]
sha = Annotated[str, regex(r"[a-f0-9]{40}", name="sha"), skip_first]
country_code = Annotated[str, regex(r"[A-Z][A-Z]", name="country_code"), skip_first]
run_id = Annotated[str, ObjectId.is_valid]
uuid = Annotated[
    str,
    regex(r"[0-9a-zA-Z]{2,}(-[a-f0-9]{4}){3}-[a-f0-9]{12}", name="uuid"),
    skip_first,
]
epd_file = Annotated[str, glob("*.epd", name="epd_file"), skip_first]
pgn_file = Annotated[str, glob("*.pgn", name="pgn_file"), skip_first]
even = Annotated[int, div(2, name="even"), skip_first]
datetime_utc = Annotated[datetime, fields({"tzinfo": timezone.utc})]
uint = Annotated[int, ge(0)]
suint = Annotated[int, gt(0)]
ufloat = Annotated[float, ge(0)]
sufloat = Annotated[float, gt(0)]
class results_type(TypedDict):
    wins: uint
    losses: uint
    draws: uint
    crashes: uint
    time_losses: uint
    pentanomial: Annotated[list[int], [uint, uint, uint, uint, uint], skip_first]

def valid_results(R: results_type) -> bool:
    l, d, w = R["losses"], R["draws"], R["wins"]
    Rp = R["pentanomial"]
    return (
        l + d + w == 2 * sum(Rp)
        and w - l == 2 * Rp[4] + Rp[3] - Rp[1] - 2 * Rp[0]
        and Rp[3] + 2 * Rp[2] + Rp[1] >= d >= Rp[3] + Rp[1]
    )

results_schema = Annotated[
    results_type,
    valid_results,
]
class worker_info_schema(TypedDict):
    uname: str
    architecture: Annotated[list[str], [str, str], skip_first]
    concurrency: suint
    max_memory: uint
    min_threads: suint
    username: str
    version: uint
    python_version: Annotated[list[int], [uint, uint, uint], skip_first]
    gcc_version: Annotated[list[int], [uint, uint, uint], skip_first]
    compiler: Literal["clang++", "g++"]
    unique_key: uuid
    modified: bool
    ARCH: str
    nps: ufloat
    near_github_api_limit: bool
    remote_addr: Annotated[str, ip_address]
    country_code: country_code | Literal["?"]

class overshoot_type(TypedDict):
    last_update: uint
    skipped_updates: uint
    ref0: float
    m0: float
    sq0: ufloat
    ref1: float
    m1: float
    sq1: ufloat

class sprt_type(TypedDict):
    alpha: Annotated[float, 0.05, skip_first]
    beta: Annotated[float, 0.05, skip_first]
    elo0: float
    elo1: float
    elo_model: Literal["normalized"]
    state: Literal["", "accepted", "rejected"]
    llr: float
    batch_size: suint
    lower_bound: Annotated[float, -math.log(19), skip_first]
    upper_bound: Annotated[float, math.log(19), skip_first]
    lost_samples: NotRequired[uint]
    illegal_update: NotRequired[uint]
    overshoot: NotRequired[overshoot_type]

sprt_schema = Annotated[
    sprt_type,
    one_of("overshoot", "lost_samples"),
]
class param_schema(TypedDict):
    name: str
    start: float
    min: float
    max: float
    c_end: sufloat
    r_end: ufloat
    c: sufloat
    a_end: ufloat
    a: ufloat
    theta: float

class param_history_schema(TypedDict):
    theta: float
    R: ufloat
    c: ufloat

class spsa_schema(TypedDict):
    A: ufloat
    alpha: ufloat
    gamma: ufloat
    raw_params: str
    iter: uint
    num_iter: uint
    params: list[param_schema]
    param_history: NotRequired[list[list[param_history_schema]]]

class args_type(TypedDict):
    base_tag: str
    new_tag: str
    base_nets: list[net_name]
    new_nets: list[net_name]
    num_games: Annotated[uint, even]
    tc: tc
    new_tc: tc
    book: epd_file | pgn_file
    book_depth: str_int
    threads: suint
    resolved_base: sha
    resolved_new: sha
    master_sha: sha
    official_master_sha: sha
    msg_base: str
    msg_new: str
    base_options: str
    new_options: str
    info: str
    base_signature: str_int
    new_signature: str_int
    username: username
    tests_repo: Annotated[str, url, skip_first]
    auto_purge: bool
    throughput: ufloat
    itp: ufloat
    priority: float
    adjudication: bool
    sprt: NotRequired[sprt_schema]
    spsa: NotRequired[spsa_schema]

args_schema = Annotated[
    args_type,
    at_most_one_of("sprt", "spsa"),
]
class task_type(TypedDict):
    num_games: Annotated[uint, even]
    active: bool
    last_updated: datetime_utc
    start: uint
    residual: NotRequired[float]
    residual_color: NotRequired[str]
    bad: NotRequired[Literal[True]]
    stats: results_schema
    worker_info: worker_info_schema
zero_results: results_type = {
    "wins": 0,
    "draws": 0,
    "losses": 0,
    "crashes": 0,
    "time_losses": 0,
    "pentanomial": 5 * [0],
}
if_bad_then_zero_stats_and_not_active = ifthen(
    keys("bad"), lax({"active": False, "stats": quote(zero_results)})
)
task_schema = Annotated[
    task_type,
    if_bad_then_zero_stats_and_not_active,
]
class bad_task_schema(TypedDict):
    num_games: Annotated[uint, even]
    active: Literal[False]
    last_updated: datetime_utc
    start: uint
    residual: float
    residual_color: str
    bad: Literal[True]
    task_id: uint
    stats: results_schema
    worker_info: worker_info_schema

class results_info_schema(TypedDict):
    style: str
    info: list[str]

class runs_type(TypedDict):
    _id: NotRequired[ObjectId]
    version: uint
    start_time: datetime_utc
    last_updated: datetime_utc
    tc_base: ufloat
    base_same_as_master: bool
    rescheduled_from: NotRequired[run_id]
    approved: bool
    approver: username | Literal[""]
    finished: bool
    deleted: bool
    failed: bool
    is_green: bool
    is_yellow: bool
    workers: uint
    cores: uint
    results: results_schema
    results_info: NotRequired[results_info_schema]
    args: args_schema
    tasks: list[task_schema]
    bad_tasks: NotRequired[list[bad_task_schema]]
def final_results_must_match(run: runs_type) -> bool:
    rr = copy.deepcopy(zero_results)
    for t in run["tasks"]:
        r = t["stats"]
        # mypy does not support variable keys for
        # TypedDict
        rr["wins"] += r["wins"]
        rr["losses"] += r["losses"]
        rr["draws"] += r["draws"]
        rr["crashes"] += r["crashes"]
        rr["time_losses"] += r["time_losses"]
        for i, p in enumerate(r["pentanomial"]):
            rr["pentanomial"][i] += p
    if rr != run["results"]:
        raise Exception(
            f"The final results {run['results']} do not match the computed results {rr}"
        )
    else:
        return True

def cores_must_match(run: runs_type) -> bool:
    cores = 0
    for t in run["tasks"]:
        if t["active"]:
            cores += t["worker_info"]["concurrency"]
    if cores != run["cores"]:
        raise Exception(
            f"Cores mismatch. Cores from tasks: {cores}. Cores from "
            f"run: {run['cores']}"
        )
    return True

def workers_must_match(run: runs_type) -> bool:
    workers = 0
    for t in run["tasks"]:
        if t["active"]:
            workers += 1
    if workers != run["workers"]:
        raise Exception(
            f"Workers mismatch. Workers from tasks: {workers}. Workers from "
            f"run: {run['workers']}"
        )
    return True

valid_aggregated_data = intersect(
    final_results_must_match,
    cores_must_match,
    workers_must_match,
)
runs_schema = Annotated[
    runs_type,
    lax(ifthen({"approved": True}, {"approver": username}, {"approver": ""})),
    lax(ifthen({"is_green": True}, {"is_yellow": False})),
    lax(ifthen({"is_yellow": True}, {"is_green": False})),
    lax(ifthen({"failed": True}, {"finished": True})),
    lax(ifthen({"deleted": True}, {"finished": True})),
    lax(ifthen({"finished": True}, {"workers": 0, "cores": 0})),
    lax(ifthen({"finished": True}, {"tasks": [{"active": False}, ...]})),
    valid_aggregated_data,
]
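The aggregate checks compare run-level totals against values recomputed from the tasks. The sketch below illustrates that idea with the standard library only, on a made-up and heavily trimmed list of tasks. The helper name aggregate is hypothetical, and the dicts contain only the fields needed here.

```python
import copy

zero_results = {
    "wins": 0,
    "draws": 0,
    "losses": 0,
    "crashes": 0,
    "time_losses": 0,
    "pentanomial": 5 * [0],
}


def aggregate(tasks):
    """Sum the per-task statistics field by field."""
    total = copy.deepcopy(zero_results)  # deep copy: the list must not be shared
    for task in tasks:
        stats = task["stats"]
        for key in ("wins", "draws", "losses", "crashes", "time_losses"):
            total[key] += stats[key]
        for i, count in enumerate(stats["pentanomial"]):
            total["pentanomial"][i] += count
    return total


# Made-up tasks: one active, one finished.
tasks = [
    {"active": True, "worker_info": {"concurrency": 4},
     "stats": {"wins": 1, "draws": 2, "losses": 1, "crashes": 0,
               "time_losses": 0, "pentanomial": [0, 1, 0, 1, 0]}},
    {"active": False, "worker_info": {"concurrency": 8},
     "stats": {"wins": 2, "draws": 0, "losses": 0, "crashes": 0,
               "time_losses": 0, "pentanomial": [0, 0, 0, 0, 1]}},
]

print(aggregate(tasks)["pentanomial"])  # [0, 1, 0, 1, 1]

# Only active tasks contribute to the run's core and worker counts.
cores = sum(t["worker_info"]["concurrency"] for t in tasks if t["active"])
workers = sum(1 for t in tasks if t["active"])
print(cores, workers)  # 4 1
```

A run whose "results", "cores" or "workers" fields differ from these recomputed values would be rejected by the corresponding check in valid_aggregated_data.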