Xavier Mathew

A Blog

Dumbledore had a pensieve. I have this.

At least I thought I did, 10 years ago, when I created it. For now, this is a dead blog, getting deader by the day.

Aragog

Data ingester + explorer.

Exploring data needs the ease of Excel and the power of SQL. Aragog lets you quickly load any dataset (CSV, REST API, or anything else supported by Pandas) into a SQL database. It automatically infers the schema and generates Django models and list views for exploring the data.
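To give a feel for the schema inference step, here is a minimal sketch. It is not Aragog's code: it swaps Pandas and Django for the standard library's csv and sqlite3 modules so it runs anywhere, and the names infer_type and load_csv, the sample data and the "narrowest type that fits" rule are all assumptions made for illustration.

```python
import csv
import io
import sqlite3


def infer_type(values):
    """Pick the narrowest SQLite type that fits every non-empty value."""
    for caster, sql_type in ((int, "INTEGER"), (float, "REAL")):
        try:
            for v in values:
                if v != "":
                    caster(v)
            return sql_type
        except ValueError:
            continue
    return "TEXT"


def load_csv(conn, table, text):
    """Create `table` from CSV text with an inferred schema and insert the rows."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    types = [infer_type([r[i] for r in data]) for i in range(len(header))]
    columns = ", ".join(f'"{name}" {t}' for name, t in zip(header, types))
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    # Empty strings are stored as NULL so they don't pollute numeric columns.
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [[v if v != "" else None for v in r] for r in data],
    )
    return dict(zip(header, types))


# Hypothetical sample data, with one missing age.
conn = sqlite3.connect(":memory:")
schema = load_csv(conn, "people", "name,age,height\nAsha,31,1.62\nRavi,,1.80\n")
print(schema)
print(conn.execute("SELECT AVG(age) FROM people").fetchall())
```

Once the table exists with sensible column types, plain SQL aggregates work on it, which is the "power of SQL" half of the pitch.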

Babu

A toy database that uses struct for storage on disk.

This was attempt #1/3 at building a database. I learnt to use struct and mmap, but in hindsight, I shouldn't have spent so much time trying to mimic the Django ORM.
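For the curious, this is roughly what fixed-width records with struct and mmap look like. It is a sketch of the general technique, not Babu's actual on-disk format: the record layout, the file name and the helper names write_records and read_record are made up for the example.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record layout: little-endian unsigned 32-bit id,
# 16-byte name (null-padded), 64-bit float score. No alignment padding with "<".
RECORD = struct.Struct("<I16sd")


def write_records(path, records):
    """Append-free bulk write: pack each record to a fixed-width byte string."""
    with open(path, "wb") as f:
        for rec_id, name, score in records:
            f.write(RECORD.pack(rec_id, name.encode("utf-8"), score))


def read_record(path, index):
    """Random access: jump straight to record `index` without scanning the file."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        rec_id, raw_name, score = RECORD.unpack_from(mm, index * RECORD.size)
    return rec_id, raw_name.rstrip(b"\x00").decode("utf-8"), score


path = os.path.join(tempfile.mkdtemp(), "people.db")
write_records(path, [(1, "asha", 9.5), (2, "ravi", 7.25)])
print(read_record(path, 1))
```

Because every record has the same size, the offset of record N is just N times the record size, so a lookup by position never has to read the rest of the file.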

Lisbeth

Insights from data collected from 10,000+ profiles of men and women on a South-Indian Christian matrimonial website.

What characteristics are families looking for in potential brides and grooms? Does it change with occupation/education/work place? Is there a relationship between salary and how rich the person's family is?

Mnemonic

A "mnemonic" is a tool that helps you remember things. This Mnemonic helps you not forget events involving, and statements made by, politicians, actors, journalists and celebrities, by continuously monitoring news articles and tweets. It helps with questions like:

What topics is a politician talking about? Has it changed over time? Which politicians are raking up divisive issues and which ones are working on problems that matter?
When a celeb tweets about an issue, are they being opportunistic or is this a topic they have always cared about?
Identify bias in newspapers by figuring out which ones are covering, and even more importantly not covering, which topics.
Are people with opinions being harassed online? Are women being threatened with rape? Are these coordinated attacks?

Pythia

Named entity tagging and disambiguation using a Knowledge Graph

Works on top of spaCy's NER by augmenting it with the most relevant entity from a graph of facts. Relevance is calculated as a weighted sum of string similarity (Jaccard index) and a graph score that represents how closely a given set of nodes/concepts is related (calculated as the sum of the lengths of the shortest paths between all candidate nodes). Out-of-vocabulary entities and question tokens like "who", "where", etc. are calculated as the centroid of all the other resolved entities (filtering candidates based on the entity label identified by spaCy).
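Here is a small sketch of the relevance score, under several assumptions that are mine and not necessarily what Pythia does: the Jaccard index is taken over lowercase word tokens, the graph is a plain adjacency dict searched with breadth-first search, the summed path length is mapped to a score with 1 / (1 + sum) so that closer nodes score higher, and both weights are 0.5. It also simplifies the problem to scoring one candidate against already-resolved context nodes, and it leaves out spaCy and the centroid step entirely. The graph and all function names are hypothetical.

```python
from collections import deque


def jaccard(a, b):
    """Jaccard index over lowercase word tokens (the tokenisation is an assumption)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def shortest_path_len(graph, start, goal):
    """Breadth-first search over an adjacency dict; None if `goal` is unreachable."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def graph_score(graph, candidate, context):
    """Map summed path lengths to (0, 1]; 0.0 if any context node is unreachable."""
    lengths = [shortest_path_len(graph, candidate, c) for c in context]
    if any(n is None for n in lengths):
        return 0.0
    return 1.0 / (1.0 + sum(lengths))


def relevance(graph, mention, candidate, context, w_text=0.5, w_graph=0.5):
    """Weighted sum of string similarity and graph closeness (weights are assumed)."""
    return w_text * jaccard(mention, candidate) + w_graph * graph_score(graph, candidate, context)


# Hypothetical toy knowledge graph with two senses of "Paris".
graph = {
    "Paris France": ["Eiffel Tower", "France"],
    "Eiffel Tower": ["Paris France"],
    "France": ["Paris France"],
    "Paris Hilton": ["Hilton Hotels"],
    "Hilton Hotels": ["Paris Hilton"],
}
context = ["Eiffel Tower"]  # entities already resolved in the same text
candidates = ["Paris France", "Paris Hilton"]
best = max(candidates, key=lambda c: relevance(graph, "Paris", c, context))
print(best)
```

Both candidates tie on string similarity, so the graph score is what breaks the tie in favour of the sense that sits next to the resolved context.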

Xedis

Redis-like in-memory data structure store with a REPL

I was bored.

Xostgres

Postgres-like database with support for indexes, sequences, query planning + optimization.

Attempt #2/3 at building a database. This died a quicker death than the Babu project above.

Xpark

A Frankenstein monster of Apache Spark and Snowflake.

This is attempt #3/3 at building a database. This time, we keep things as simple as possible by constraining the problem (assume "write once, read many", bulk-ingest only) and offloading problems to existing tools like Parquet and AWS S3 where possible, while retaining the interesting bits like task graph optimization, indexing, etc. for ourselves.

Hello, I'm Xavier.