
I have a huge list of text files to tokenize. The following code works for a small dataset, but I am having trouble using the same procedure on a huge one. Here is an example with a small dataset:

In [1]: text = [["It works"], ["This is not good"]]

In [2]: tokens = [(A.lower().replace('.', '').split(' ') for A in L) for L in text]

In [3]: tokens
Out [3]: 
[<generator object <genexpr> at 0x7f67c2a703c0>,
<generator object <genexpr> at 0x7f67c2a70320>]

In [4]: list_tokens = [tokens[i].next() for i in range(len(tokens))]
In [5]: list_tokens
Out [5]:
[['it', 'works'], ['this', 'is', 'not', 'good']]

While everything works well with a small dataset, I run into problems processing a huge list of lists of strings (more than 1,000,000 lists) with the same code. I can still tokenize the strings of the huge dataset as in In [3], but it fails at In [4] (the process is killed in the terminal). I suspect this is simply because the body of text is too big.

I am therefore looking for suggestions on improving the procedure so that I can obtain the lists of strings in a list, as in In [5].

My actual goal, however, is to count the words in each list. For instance, with the small dataset above, I want something like the following:

[[0, 0, 1, 0, 0, 1], [1, 1, 0, 1, 1, 0]] (each integer is the count of one vocabulary word in the corresponding list)

If I don't have to convert the generators to lists to get the desired result (i.e. the word counts), that would be even better.
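
Something along these lines is what I am hoping for (a rough sketch only; the tokenize helper and the two-pass structure are just illustrative, and I have not tried it at full scale): build the vocabulary in one pass, then yield one count row at a time in a second pass, so the full list of token lists is never held in memory.

from collections import Counter

def tokenize(s):
    return s.lower().replace('.', '').split(' ')

text = [["It works"], ["This is not good"]]

# Pass 1: collect the vocabulary without keeping any token lists around.
vocab = set()
for sublist in text:
    for s in sublist:
        vocab.update(tokenize(s))
vocab = sorted(vocab)  # fix an order for the count columns

# Pass 2: re-tokenize on the fly and yield one count row per sublist.
def count_rows(text, vocab):
    for sublist in text:
        counts = Counter()
        for s in sublist:
            counts.update(tokenize(s))
        yield [counts[w] for w in vocab]

for row in count_rows(text, vocab):
    print(row)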

Please let me know if my question is unclear; I am happy to clarify as best I can. Thank you.

I would use a set() to build the original list of all words in all strings, and then iterate through that with a count to generate the table of values. – beroe 37 mins ago
Accepted answer:

You could create a set of unique words, then loop through and count each of those...

#! /usr/bin/env python

text = [["It works works"], ["It is not good this"]]

# split each one-element sublist into its words
SplitList   = [x[0].split(" ") for x in text]
FlattenList = sum(SplitList, [])  # "trick" to flatten a list of lists
UniqueList  = list(set(FlattenList))  # the vocabulary, in a fixed order
# one row per original string: how often each unique word occurs in it
CountMatrix = [[x.count(y) for y in UniqueList] for x in SplitList]

print UniqueList
print CountMatrix

Output is the total list of words, and their counts in each string:

['good', 'this', 'is', 'It', 'not', 'works']
[[0, 0, 0, 1, 0, 2], [1, 1, 1, 1, 1, 0]]
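
One note on the flattening step for inputs of this size: sum(SplitList, []) rebuilds the accumulated list on every addition, so it is quadratic in the total number of words. A sketch of the same pipeline using itertools.chain.from_iterable, which flattens in a single pass (only the flattening changes; the set's ordering may differ between runs):

import itertools

text = [["It works works"], ["It is not good this"]]

SplitList = [x[0].split(" ") for x in text]
# chain.from_iterable walks the sublists once instead of rebuilding
# the accumulated list on every addition as sum(SplitList, []) does
UniqueList = list(set(itertools.chain.from_iterable(SplitList)))
CountMatrix = [[x.count(y) for y in UniqueList] for x in SplitList]

print(UniqueList)
print(CountMatrix)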
Thank you. This seems powerful, and thanks for the tip for flattening a list. What I want to obtain, however, is the word count in each list -- I believe it is [[0, 0, 1, 0, 0, 0], [1, 1, 0, 1, 1, 0]]. What you gave in the first line actually solves the worst problem, though, so thanks a lot! – achimneyswallow 11 mins ago

I am sorry, I did not see that the text you used was different. You are actually right. Thank you a lot!! – achimneyswallow 6 mins ago

Sure. I just changed the text to make sure it worked with repeated words. – beroe 4 mins ago
If you can help more: would you mind letting me know how to remove punctuation and convert to lower case in your code? I had [(A.lower().replace('.', '').split(' ') for A in L) for L in text] because I want only lower-case words with the punctuation removed. Thank you!! – achimneyswallow 2 mins ago
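
The lower-casing and period-stripping from the question's own code can be folded straight into the first line of the answer. A sketch (it only strips periods, as the question's replace('.', '') does; wider punctuation removal would need str.translate or a regular expression):

text = [["It works. It works."], ["This is not good."]]

# lower-case and strip '.' before splitting, as in the question's code
SplitList   = [x[0].lower().replace('.', '').split(' ') for x in text]
UniqueList  = list(set(sum(SplitList, [])))
CountMatrix = [[x.count(y) for y in UniqueList] for x in SplitList]

print(UniqueList)
print(CountMatrix)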
