Dropping data from beginning of history database

Hi Brian,

I've been updating my backup IBKR minute-bar database every day with collect_history for the entire US stk/etf universe (~10k assets). This has been going smoothly. However, every day, the disk space consumed by the db grows by nearly 1GB, which isn't sustainable long-term.

I only need the last two months of minute data. Is there something equivalent to drop_ticks for realtime dbs, where I can delete the data older than two months in the history db?

All available APIs are documented, so no.

Any tips then for how to write a script that eliminates older data or otherwise manage disk usage?

I've found it necessary to keep a backup for the QuantRocket usstock-minute bundle, but maintaining an IBKR minute bundle on an EC2 instance that's updated daily becomes very expensive. Growing at 1GB a day is equivalent to a cost increase of $1/month, every day (i.e. after a month of updates, the monthly storage cost goes up by $20, into perpetuity).

Without a way to purge earlier data, the only option is to recollect the entire database with a later start date, which can take a week.

I learned a bit more sqlite3 to solve this problem; code below for anyone who needs to free up disk space by deleting earlier data from history databases.

I used this on an IBKR minute-level db, sharded by sid, and reduced disk usage by 10GB by deleting the first 3 weeks of minute data. I'll run it weekly to keep the db size in check. If you use it, make sure to back up your db beforehand and verify your db directory structure in /var/lib/quantrocket. (Also make sure not to collect the same earlier data again by passing the start_date parameter next time you run collect_history.)
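Before running it, a quick way to back up the sharded db directory is a timestamped copy. This is just a sketch; the function name and the default backup location are my own, and you should point it at your actual directory under /var/lib/quantrocket:

```python
import shutil
from datetime import datetime
from pathlib import Path

def backup_db_dir(src_dir, backup_root='/tmp/db_backups'):
    """Copy a db directory to a timestamped folder before modifying it.

    src_dir and backup_root are placeholders -- point them at your own
    /var/lib/quantrocket/... directory and a backup destination with
    enough free space.
    """
    src = Path(src_dir)
    # e.g. /tmp/db_backups/quantrocket.v2.history.mydb.sqlite-20240101-120000
    dest = Path(backup_root) / f'{src.name}-{datetime.now():%Y%m%d-%H%M%S}'
    shutil.copytree(src, dest)
    return dest
```

Restoring is then just copying the timestamped folder back over the original.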

@Brian Please chime in if there are any nuances/gotchas to how history stores data. This code simply loops through all the sqlite db files and removes the queried data from each.

import os
import sqlite3
from sqlite3 import Error

import pandas as pd

def drop_history_data_before_date(db_name, cutoff_date, subdir='sid'):
    """Delete all rows before cutoff_date from every shard of a history db."""

    db_path = f'/var/lib/quantrocket/quantrocket.v2.history.{db_name}.sqlite/{subdir}'
    cutoff_date = pd.to_datetime(cutoff_date).strftime('%Y-%m-%d')

    succ_count = 0
    err_count = 0

    file_list = os.listdir(db_path)

    for counter, file in enumerate(file_list, start=1):

        if not file.endswith('.sqlite'):
            continue

        # Shard filenames end in <sid>.sqlite; pull out the sid for logging
        sid = file.split('.')[-2]
        print(f'Deleting data for sid {sid} ({counter} of {len(file_list)})')

        # Open the shard
        full_fn = os.path.join(db_path, file)
        try:
            conn = sqlite3.connect(full_fn)
        except Error:
            print(f'ERROR opening file: {full_fn}')
            err_count += 1
            continue

        try:
            # Delete data older than the cutoff
            conn.execute('DELETE FROM Price WHERE Date < ?', (cutoff_date,))
            conn.commit()

            # Vacuum the file to actually reclaim the disk space
            conn.execute('VACUUM')
            succ_count += 1
        except Error as e:
            print(f'ERROR deleting data: {e}')
            err_count += 1
        finally:
            conn.close()

    print(f'Finished dropping data for {succ_count} db files with {err_count} error(s)')
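For reference, here is roughly how I invoke it, computing the cutoff as two months before today. The db name is a placeholder for your own database code:

```python
import pandas as pd

# Keep only the last two months of minute bars
cutoff = (pd.Timestamp.now().normalize() - pd.DateOffset(months=2)).strftime('%Y-%m-%d')

# 'my-ibkr-1min' is a placeholder -- substitute your own db code:
# drop_history_data_before_date('my-ibkr-1min', cutoff)
```

Pass the same cutoff as start_date on your next collect_history run so the purged range isn't re-collected.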