I am making a package that reads a binary file and returns data that can be used to initialize a DataFrame. I am now wondering whether it is better to return a dict or two lists (one holding the keys and one holding the values).
The package is not supposed to rely entirely on a DataFrame object, which is why it currently outputs the data as a dict (for easy access). If returning the key and value lists instead would save memory and time (both are paramount for my application, as I am dealing with millions of data points), I would like to do that. These iterables would then be used to initialize a DataFrame.
Here is a simple example:
In [1]: d = {(1,1,1): '111',
   ...:      (2,2,2): '222',
   ...:      (3,3,3): '333',
   ...:      (4,4,4): '444'}
In [2]: keystup=[(1,1,1),(2,2,2),(3,3,3),(4,4,4)]
In [3]: valstup=['111','222','333','444']
In [4]: import pandas as pd
In [5]: dfdict=pd.DataFrame(d.values(), index=pd.MultiIndex.from_tuples(d.keys(), names=['a','b','c']))
In [6]: dfdict
Out[6]:
         0
a b c
3 3 3  333
2 2 2  222
1 1 1  111
4 4 4  444
In [7]: dfpair=pd.DataFrame(valstup, index=pd.MultiIndex.from_tuples(keystup, names=['a','b','c']))
In [8]: dfpair
Out[8]:
         0
a b c
1 1 1  111
2 2 2  222
3 3 3  333
4 4 4  444
It is my understanding that d.values() and d.keys() each create a new copy of the data. If we disregard the fact that a dict takes more memory than a list, does using d.values() and d.keys() lead to more memory usage than the list-pair implementation?
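To make the question concrete, here is a minimal sketch of how the copying behaviour could be checked. It assumes Python 3, where keys() and values() return view objects rather than new lists (in Python 2 they return new lists). The variable names are only illustrative, and sys.getsizeof reports the size of the container itself, not of the objects it references, so the numbers are a rough indication at best.

```python
import sys

d = {(1, 1, 1): '111',
     (2, 2, 2): '222',
     (3, 3, 3): '333',
     (4, 4, 4): '444'}

# Assumption: Python 3. keys() and values() return views onto the dict,
# so nothing is copied at this point.
keys_view = d.keys()
vals_view = d.values()

# Materializing a view builds a new list, but that list only holds
# references to the same key and value objects the dict already owns.
keys_list = list(keys_view)
vals_list = list(vals_view)
same_keys = all(a is b for a, b in zip(keys_list, d))
same_vals = all(a is b for a, b in zip(vals_list, d.values()))

# A view tracks later changes to the dict; a materialized list does not.
d[(5, 5, 5)] = '555'
view_tracks_dict = len(keys_view) == 5
list_is_snapshot = len(keys_list) == 4

# Shallow container sizes only (the referenced objects are not counted).
dict_overhead = sys.getsizeof(d)
lists_overhead = sys.getsizeof(keys_list) + sys.getsizeof(vals_list)

print("lists share objects with dict:", same_keys and same_vals)
print("dict container bytes:", dict_overhead)
print("two-list container bytes:", lists_overhead)
```

If this reasoning holds, any extra memory from the dict route would come from the dict's own hash table and from whatever pandas allocates internally, not from duplicated keys or values.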
dfdict = pd.DataFrame.from_dict(d, orient='index') – jonnat 1 hour ago