I am making a package that reads a binary file and returns data that can be used to initialize a DataFrame. I am now wondering whether it is better to return a dict or two lists (one holding the keys and one holding the values).

The package is not supposed to depend entirely on a DataFrame object, which is why it currently outputs the data as a dict (for easy access). If returning the key and value lists instead would save memory and time (both are paramount for my application, as I am dealing with millions of data points), I would prefer to do that. These iterables would then be used to initialize a DataFrame.

Here is a simple example:

In [1]: d = {(1,1,1): '111',
   ...: (2,2,2): '222',
   ...: (3,3,3): '333',
   ...: (4,4,4): '444'}

In [2]: keystup=[(1,1,1),(2,2,2),(3,3,3),(4,4,4)]

In [3]: valstup=['111','222','333','444']

In [4]: import pandas as pd

In [5]: dfdict=pd.DataFrame(d.values(),  index=pd.MultiIndex.from_tuples(d.keys(), names=['a','b','c']))

In [6]: dfdict
Out[6]: 
         0
a b c     
3 3 3  333
2 2 2  222
1 1 1  111
4 4 4  444

In [7]: dfpair=pd.DataFrame(valstup,  index=pd.MultiIndex.from_tuples(keystup, names=['a','b','c']))

In [8]: dfpair
Out[8]: 
         0
a b c     
1 1 1  111
2 2 2  222
3 3 3  333
4 4 4  444

It is my understanding that d.values() and d.keys() each create a new copy of the data. If we disregard the fact that a dict takes more memory than a list, does using d.values() and d.keys() lead to more memory usage than the list-pair implementation?
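To make the question concrete, here is a minimal sketch of how the two layouts could be compared using only the standard library. It assumes Python 3, where d.keys() and d.values() return views rather than new lists (in Python 2 they return new lists of references). The names keys_list, vals_list and d_live are illustrative only, and sys.getsizeof reports the shallow size of the container, not of the tuples and strings it refers to.

```python
import sys

d = {(1, 1, 1): '111',
     (2, 2, 2): '222',
     (3, 3, 3): '333',
     (4, 4, 4): '444'}

# Materialising the keys/values builds new list objects, but the lists
# only hold references to the existing tuples and strings.
keys_list = list(d.keys())
vals_list = list(d.values())

# Shallow container sizes in bytes (elements themselves are not counted).
dict_size = sys.getsizeof(d)
pair_size = sys.getsizeof(keys_list) + sys.getsizeof(vals_list)
print('dict container:', dict_size, 'bytes')
print('two list containers:', pair_size, 'bytes')

# The list elements are the very same objects as the dict keys (no deep copy).
same_objects = all(k_list is k_dict for k_list, k_dict in zip(keys_list, d))

# In Python 3, keys() is a live view: it reflects later changes to the dict.
d_live = dict(d)
keys_view = d_live.keys()
d_live[(5, 5, 5)] = '555'
view_is_live = (5, 5, 5) in keys_view
```

The printed sizes depend on the interpreter version and the number of entries, so they would need to be measured on the real data rather than on this four-item example.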

    
Why not use numpy arrays instead? They have a much lower memory footprint than both lists and dictionaries – keiv.fly 1 hour ago
    
I am not using numpy since I do not know the size of the data in advance, so I have to populate a list or a dict first and then initialize a numpy array or pandas DataFrame. – snowleopard 1 hour ago
    
I will write a benchmark of memory usage of lists vs dicts – keiv.fly 1 hour ago
    
Doesn't this also depend on the data types (str, int and float)? – Merlin 1 hour ago
    
You can directly convert your dict to a DataFrame with dfdict = pd.DataFrame.from_dict(d, orient='index') – jonnat 1 hour ago
