I have a huge list of text files to tokenize. The following code works for a small dataset, but I am having trouble using the same procedure with a huge dataset. Here is the small-dataset example:
In [1]: text = [["It works"], ["This is not good"]]
In [2]: tokens = [(A.lower().replace('.', '').split(' ') for A in L) for L in text]
In [3]: tokens
Out[3]:
[<generator object <genexpr> at 0x7f67c2a703c0>,
<generator object <genexpr> at 0x7f67c2a70320>]
In [4]: list_tokens = [tokens[i].next() for i in range(len(tokens))]
In [5]: list_tokens
Out[5]:
[['it', 'works'], ['this', 'is', 'not', 'good']]
While everything works well with the small dataset, I run into a problem when I process a huge list of lists of strings (more than 1,000,000 lists of strings) with the same code. The tokenization step in In [3] still succeeds on the huge dataset, but the step in In [4] fails (the process is killed in the terminal). I suspect this is simply because the body of text is too big.
I am therefore looking for suggestions on how to improve the procedure so that I can obtain a list of token lists like the one in In [5].
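For what it is worth, something along these lines is what I imagine a lazier version might look like, where nothing is kept in memory except the item currently being processed (the tokenize helper is just my own placeholder for the logic in In [3]):

def tokenize(sentence):
    # same normalization as in In [3]: lowercase, drop periods, split on spaces
    return sentence.lower().replace('.', '').split(' ')

# one flat, lazy pipeline over all inner lists; nothing is materialized up front
token_stream = (tokenize(s) for L in text for s in L)

for token_list in token_stream:
    # handle one tokenized string at a time here
    pass

I am not sure, however, whether this is the right direction given the end goal described below.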
My actual goal, however, is to count the words in each list. For instance, for the small dataset above, I would end up with something like the following.
[[0, 0, 1, 0, 0, 1], [1, 1, 0, 1, 1, 0]] (note: each integer is the count of one word from the shared vocabulary)
If I can get the desired result (the word counts) without converting the generators to lists at all, that would also be fine.
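For example, something like the following is roughly what I have in mind: consuming the generators from In [3] directly (assuming they have not been exhausted yet) and keeping only per-string counts. collections.Counter is just one way to hold the counts; the exact container does not matter to me.

from collections import Counter

# consume each generator lazily and keep only the word counts per string
per_string_counts = []
for gen in tokens:
    for token_list in gen:
        per_string_counts.append(Counter(token_list))

# for the small example this gives:
# [Counter({'it': 1, 'works': 1}),
#  Counter({'this': 1, 'is': 1, 'not': 1, 'good': 1})]

I suppose I would still need a shared vocabulary to turn these into fixed-length vectors like the ones shown above.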
Please let me know if my question is unclear; I would be happy to clarify as best I can. Thank you.
set() to build the original list of all words in all strings, and then iterate through that with a count to generate the table of values. – beroe 37 mins ago
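If I understand the suggestion correctly, a minimal sketch of that approach (with my own variable names, building the vocabulary in a first pass and the count vectors in a second) would be:

# first pass: tokenize everything and collect the vocabulary
all_token_lists = [s.lower().replace('.', '').split(' ') for L in text for s in L]
vocab = sorted(set(word for token_list in all_token_lists for word in token_list))

# second pass: one count vector per string, ordered by the vocabulary
count_vectors = [[token_list.count(word) for word in vocab]
                 for token_list in all_token_lists]

# with the small example:
# vocab         == ['good', 'is', 'it', 'not', 'this', 'works']
# count_vectors == [[0, 0, 1, 0, 0, 1], [1, 1, 0, 1, 1, 0]]

I realize this still materializes every token list, so whether it can be done in a more memory-friendly way over the huge dataset is exactly what I am unsure about.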