I have a huge list of text files to tokenize. The following code works for a small dataset, but I am having trouble using the same procedure with a huge dataset. Here is the small-dataset example:
In [1]: text = [["It works"], ["This is not good"]]
In [2]: tokens = [(A.lower().replace('.', '').split(' ') for A in L) for L in text]
In [3]: tokens
Out[3]:
[<generator object <genexpr> at 0x7f67c2a703c0>,
<generator object <genexpr> at 0x7f67c2a70320>]
In [4]: list_tokens = [tokens[i].next() for i in range(len(tokens))]
In [5]: list_tokens
Out[5]:
[['it', 'works'], ['this', 'is', 'not', 'good']]
While everything works well with the small dataset, I run into a problem when I process a huge list of lists of strings (more than 1,000,000 lists of strings) with the same code. The tokenization step in In [3] still succeeds on the huge dataset, but the step in In [4] fails (the process is killed in the terminal). I suspect this is simply because the body of text is too big.
I am therefore looking for suggestions on how to improve the procedure so that I can obtain a list of token lists like the one in In [5].
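For what it is worth, something along these lines is what I imagine a lazier version might look like, where nothing is kept in memory except the item currently being processed (the tokenize helper is just my own placeholder for the logic in In [3]):

def tokenize(sentence):
    # same normalization as in In [3]: lowercase, drop periods, split on spaces
    return sentence.lower().replace('.', '').split(' ')

# one flat, lazy pipeline over all inner lists; nothing is materialized up front
token_stream = (tokenize(s) for L in text for s in L)

for token_list in token_stream:
    # handle one tokenized string at a time here
    pass

I am not sure, however, whether this is the right direction given the end goal described below.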
My actual goal, however, is to count the words in each list. For instance, for the small dataset above, I would end up with something like the following.
[[0, 0, 1, 0, 0, 1], [1, 1, 0, 1, 1, 0]] (note: each integer is the count of one word from the shared vocabulary)
If I can get the desired result (the word counts) without converting the generators to lists at all, that would also be fine.
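For example, something like the following is roughly what I have in mind: consuming the generators from In [3] directly (assuming they have not been exhausted yet) and keeping only per-string counts. collections.Counter is just one way to hold the counts; the exact container does not matter to me.

from collections import Counter

# consume each generator lazily and keep only the word counts per string
per_string_counts = []
for gen in tokens:
    for token_list in gen:
        per_string_counts.append(Counter(token_list))

# for the small example this gives:
# [Counter({'it': 1, 'works': 1}),
#  Counter({'this': 1, 'is': 1, 'not': 1, 'good': 1})]

I suppose I would still need a shared vocabulary to turn these into fixed-length vectors like the ones shown above.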
Please let me know if my question is unclear; I would be happy to clarify as best I can. Thank you.
set() to build the original list of all words in all strings, and then iterate through that with a count to generate the table of values. – beroe 37 mins ago
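If I understand the suggestion correctly, a minimal sketch of that approach (with my own variable names, building the vocabulary in a first pass and the count vectors in a second) would be:

# first pass: tokenize everything and collect the vocabulary
all_token_lists = [s.lower().replace('.', '').split(' ') for L in text for s in L]
vocab = sorted(set(word for token_list in all_token_lists for word in token_list))

# second pass: one count vector per string, ordered by the vocabulary
count_vectors = [[token_list.count(word) for word in vocab]
                 for token_list in all_token_lists]

# with the small example:
# vocab         == ['good', 'is', 'it', 'not', 'this', 'works']
# count_vectors == [[0, 0, 1, 0, 0, 1], [1, 1, 0, 1, 1, 0]]

I realize this still materializes every token list, so whether it can be done in a more memory-friendly way over the huge dataset is exactly what I am unsure about.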