A Blog
Dumbledore had a pensieve. I have this.
At least I thought I did, 10 years ago, when I created it. For now, this is a dead blog, getting deader by the day.
I like building things and working with data. This page is an index of some of the things I've worked on. Below is a list of things I'm currently working on.
Data ingester + explorer.
Exploring data needs the ease of Excel and the power of SQL. Aragog lets you quickly load any dataset (CSV, REST API, and anything supported by Pandas) into a SQL database. It automatically infers the schema and generates Django models and list views to explore the data with.
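To make the "infer the schema, then load into SQL" idea concrete, here is a minimal sketch. It is not Aragog's code: it assumes only the standard library (csv and sqlite3 rather than Pandas and Django), and the names infer_type and load_csv are made up for illustration.

```python
import csv
import io
import sqlite3


def infer_type(values):
    """Return the narrowest SQLite type that fits every non-empty value."""
    for sql_type, cast in (("INTEGER", int), ("REAL", float)):
        try:
            for v in values:
                if v != "":
                    cast(v)
            return sql_type
        except ValueError:
            continue
    return "TEXT"


def load_csv(conn, table, text):
    """Create `table` from CSV text, inferring one type per column."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    types = [infer_type([r[i] for r in data]) for i in range(len(header))]
    columns = ", ".join(f'"{name}" {t}' for name, t in zip(header, types))
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        # Empty cells become NULL; SQLite's column affinity converts the rest.
        [[v if v != "" else None for v in r] for r in data],
    )
    return dict(zip(header, types))


conn = sqlite3.connect(":memory:")
schema = load_csv(conn, "people", "name,age,height\nAsha,31,1.62\nRavi,28,1.80\n")
print(schema)
print(conn.execute("SELECT name FROM people WHERE age > 30").fetchall())
```

Trying int, then float, then falling back to text keeps the inference predictable. A real tool would also need to handle dates, booleans and messy headers.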
A toy database that uses struct for storage on disk.
This was Attempt #1/3 at building a database. Learnt to use struct and mmap, but in hindsight, I shouldn't have spent too much time attempting to mimic the Django ORM.
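For a flavour of what struct-based storage looks like, here is a small sketch. The record layout (an int id, a 16-byte name, a float score) and the helper names are assumptions for illustration, not the project's actual format, and it uses a plain temporary file rather than mmap.

```python
import struct
import tempfile

# Hypothetical layout: little-endian int32 id, 16-byte name, float64 score.
RECORD = struct.Struct("<i16sd")


def write_records(f, rows):
    """Append each row as one fixed-width binary record."""
    for rec_id, name, score in rows:
        # The "16s" field pads short names with NUL bytes and truncates long ones.
        f.write(RECORD.pack(rec_id, name.encode("utf-8"), score))


def read_record(f, index):
    """Seek straight to the index-th record; fixed width makes the offset trivial."""
    f.seek(index * RECORD.size)
    rec_id, raw_name, score = RECORD.unpack(f.read(RECORD.size))
    return rec_id, raw_name.rstrip(b"\x00").decode("utf-8"), score


with tempfile.TemporaryFile() as f:
    write_records(f, [(1, "ada", 9.5), (2, "grace", 8.25)])
    second = read_record(f, 1)

print(second)
```

Fixed-width records trade wasted space for simple random access: the byte offset of row n is just n times the record size.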
Insights from data collected from 10,000+ profiles of men and women on a South-Indian Christian matrimonial website.
What characteristics are families looking for in potential brides and grooms? Do they change with occupation, education or workplace? Is there a relationship between a person's salary and how rich their family is?
A "mnemonic" is a tool that helps you remember things. This Mnemonic helps you not forget events involving, and statements made by, politicians, actors, journalists and celebrities, by continuously monitoring news articles and tweets. Helps with situations like:
Named entity tagging and disambiguation using a Knowledge Graph
Works on top of spaCy's NER by augmenting it with the most relevant entity from a graph of facts. Relevance is calculated as a weighted sum of string similarity (Jaccard Index) and a graph score that represents how closely a given set of nodes/concepts is related (calculated as the sum of the lengths of the shortest paths between all candidate nodes). Out-of-vocabulary entities and question tokens like "who", "where", etc. are calculated as the centroid of all the other resolved entities (filtering candidates based on the entity label identified by spaCy).
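The scoring idea can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: the weight alpha, the word-level Jaccard, the 1 / (1 + total) mapping that turns a summed path length into a "higher is better" score, and the tiny example graph are all made up here.

```python
from collections import deque
from itertools import combinations


def jaccard(a, b):
    """Jaccard index over lowercase word tokens."""
    x, y = set(a.lower().split()), set(b.lower().split())
    return len(x & y) / len(x | y) if x | y else 0.0


def shortest_path_length(graph, start, goal):
    """Breadth-first search over an adjacency dict; None if unreachable."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def graph_score(graph, nodes):
    """Sum pairwise shortest-path lengths, then map to (0, 1] (assumed mapping)."""
    total = 0
    for a, b in combinations(nodes, 2):
        d = shortest_path_length(graph, a, b)
        if d is None:
            return 0.0  # a disconnected pair is treated as unrelated
        total += d
    return 1.0 / (1.0 + total)


def relevance(mention, candidate, context_nodes, graph, alpha=0.5):
    """Weighted sum of string similarity and graph relatedness."""
    text = jaccard(mention, candidate)
    related = graph_score(graph, [candidate, *context_nodes])
    return alpha * text + (1 - alpha) * related


# Hypothetical knowledge graph, stored as undirected adjacency lists.
graph = {
    "Paris": ["France"],
    "France": ["Paris", "Eiffel Tower"],
    "Eiffel Tower": ["France"],
    "Paris Hilton": ["Hilton Hotels"],
    "Hilton Hotels": ["Paris Hilton"],
}
candidates = ["Paris", "Paris Hilton"]
best = max(candidates, key=lambda c: relevance("Paris", c, ["Eiffel Tower"], graph))
print(best)
```

With "Eiffel Tower" already resolved in the same sentence, the city wins over the celebrity on both the string score and the graph score.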
Redis-like in-memory data structure store with a REPL
I was bored.
Postgres-like database with support for indexes, sequences, query planning + optimization.
Attempt #2/3 at building a database. This died a quicker death than the Babu project above.
A Frankenstein monster of Apache Spark and Snowflake.
This is attempt #3/3 at building a database. This time, we keep things as simple as possible by constraining the problem (assume "write once, read many", bulk-ingest only) and offloading problems to existing tools like Parquet and AWS S3 where possible, while retaining the interesting bits like task graph optimization, indexing, etc. for ourselves.
Design by Nicolas Meuzard and Sarah Dayan.