There are a lot of factors that influence the amount of traffic a site receives, as well as how engaged its users are. In this post we’ll take a look at one of the many techniques we use to help understand a site’s traffic: clustering.
What is clustering?
At a high level, clustering is a machine learning technique that puts similar things into the same bucket. This can be done in a supervised or unsupervised fashion. Supervised clustering (more commonly called classification) is like sorting coins by denomination: you already know exactly what your clusters are. In practice, you’re often dealing with dirty or damaged coins, so the denomination isn’t immediately obvious, which is why you need some machine learning. Unsupervised clustering lumps items together automatically based on how similar they are. Typically you have to specify how many clusters you want your algorithm to spit out at the end, and there’s always a possibility that these clusters won’t correspond to anything obvious (for example, your algorithm might say, “Hey, I found a bunch of coins covered in green mud!”).
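To make the unsupervised case concrete, here is a minimal sketch of k-means, one common clustering algorithm. It is an illustration only and not necessarily the algorithm we use. The coin measurements are made up, and the number of clusters is set by how many starting centroids are passed in.

```python
from math import dist


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda k: dist(point, centroids[k]))


def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate between assigning points to their nearest
    centroid and moving each centroid to the mean of its points."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    # Final assignment against the settled centroids.
    return [nearest(p, centroids) for p in points], centroids


# Made-up measurements: (diameter in mm, weight in g) for six coins.
coins = [(19.0, 2.5), (19.1, 2.4), (18.9, 2.6), (24.3, 5.7), (24.2, 5.6), (24.4, 5.8)]

# We ask for two clusters by supplying two starting centroids.
labels, centroids = kmeans(coins, centroids=[coins[0], coins[3]])
print(labels)
```

The algorithm never sees a denomination. It only sees that three coins are small and light and three are large and heavy, and it groups them accordingly.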
We use unsupervised clustering to help us figure out what the topic of a site is, and how that topic influences its traffic.
Understanding Internet Usage Patterns with Site Categories
For any site, we get a raw understanding of its traffic from our data panel, but we need to scale that up to the entire Internet-using population. To understand broader Internet usage patterns, it helps to know what kind of site we’re talking about. We figure out what sites are about by categorizing their topics. For instance, our data might tell us that sites about data science get 10 times as much traffic as sites about beanie babies (I wish). This means that if I see the same number of panelists visiting a data science themed site as a beanie baby themed site, I can confidently say there are many more data science fans in the wild.
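The arithmetic behind that example can be sketched as follows. This assumes, purely for illustration, that scaling works as a single per-category multiplier; the `SCALE` values and the `estimate_visitors` helper are hypothetical and not taken from our models.

```python
# Hypothetical scale factors: estimated real-world visitors per panelist,
# chosen so that data science sites get 10x the traffic of beanie baby sites.
SCALE = {"data science": 5000, "beanie babies": 500}


def estimate_visitors(panel_visitors, category):
    """Scale a raw panel count up to a population estimate for the category."""
    return panel_visitors * SCALE[category]


# The same 40 panelists show up on both sites...
data_science_fans = estimate_visitors(40, "data science")
beanie_baby_fans = estimate_visitors(40, "beanie babies")

# ...but the category tells us the audiences in the wild differ tenfold.
print(data_science_fans, beanie_baby_fans)
```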
While categorizing sites might sound easy (and it is, for a single site), things get difficult when we want to do this for every site on the web. There aren’t enough interns anywhere to do this before the sun burns out.
There are a couple of other things that make categorizing sites by topic tricky. The first is that there are a ton of possible topics. Every word in the dictionary could be a topic, but slicing sites so finely makes it harder to glean useful information (e.g., ‘sports’ is a more useful topic than ‘world series of underwater handstands, 1917’). One way we combat this is clustering, which is the fancy machine learning term for lumping like things together. As an example, the cluster of sports topics would include things like baseball, football, soccer, and underwater handstands (maybe).
That example hints at the second difficulty with categorizing site topics: a site can have multiple topics and they may not be related in a meaningful way. Contrast espn.com with sportsauthority.com. They’re both about sports, but one is a news aggregator (among other things) and the other is a store. We deal with this issue by letting a site belong to multiple clusters. This is like saying sportsauthority.com looks like a store from one side, and like a sports site from another side.


Identifying Clusters
Now let’s circle back to how we actually identify these topic clusters. We’re not necessarily interested in how you’d group sites based only on browsing their content. Instead, we’re interested in sites that have similar traffic patterns, which also gives us information about what the sites are about.
Take a site that we know nothing about, say foobar.com. From my panel I might notice that people who visit foobar.com are much more likely to visit foo.com and bar.com than people who never go to foobar.com. This tells me two things: 1) foobar.com, foo.com and bar.com are probably about something similar, and 2) these sites probably receive comparable amounts and kinds of traffic. That second piece of information is really important. If I knew how much traffic foobar.com actually receives, I could leverage that information to estimate how much traffic foo.com and bar.com receive. A similar statement can be made about links between sites (analyzing links this way is how Google got started years ago).
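One simple way to quantify “much more likely” is to compare two visit rates directly. The sketch below is an assumption about how such a comparison could be computed, not a description of our pipeline; the panel data and the `visit_ratio` helper are invented for illustration.

```python
def visit_ratio(panel, site_a, site_b):
    """How many times more likely visitors of site_a are to visit site_b
    than panelists who never visit site_a."""
    visitors = [sites for sites in panel.values() if site_a in sites]
    others = [sites for sites in panel.values() if site_a not in sites]
    if not visitors or not others:
        raise ValueError("need panelists both with and without site_a")
    rate_visitors = sum(site_b in sites for sites in visitors) / len(visitors)
    rate_others = sum(site_b in sites for sites in others) / len(others)
    if rate_others == 0:
        return float("inf")  # site_b is visited only by site_a visitors
    return rate_visitors / rate_others


# Invented panel: panelist id -> set of sites visited.
panel = {
    "p1": {"foobar.com", "foo.com", "bar.com"},
    "p2": {"foobar.com", "foo.com"},
    "p3": {"foobar.com", "bar.com"},
    "p4": {"foo.com", "news.example"},
    "p5": {"news.example"},
    "p6": {"news.example", "shop.example"},
    "p7": {"shop.example"},
    "p8": {"bar.com", "shop.example"},
}

ratio = visit_ratio(panel, "foobar.com", "foo.com")
print(round(ratio, 2))
```

A ratio well above 1 suggests the two sites share an audience. Sites whose pairwise ratios are all high are natural candidates for the same cluster.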
For our purposes, we generate a lot of clusters from our data sources and then let another layer of machine learning figure out which ones are actually useful. This means that these clusters are just one subset of the features we use to estimate various traffic metrics (“feature” is another fancy machine learning term for a variable or attribute or, typically, a column in a spreadsheet). How we use these features, and how we let an algorithm pick which ones to use, is a topic for another time.
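As a rough picture of what “clusters as features” can look like, the sketch below turns cluster memberships into one 0/1 column per cluster. A site that belongs to several clusters simply gets several 1s. The cluster names and memberships are hypothetical and chosen only to match the espn.com and sportsauthority.com example.

```python
def cluster_features(memberships, clusters):
    """Build one row per site with a 0/1 column for each cluster."""
    return {
        site: [1 if cluster in site_clusters else 0 for cluster in clusters]
        for site, site_clusters in memberships.items()
    }


# Hypothetical cluster names and memberships.
clusters = ["sports", "news", "store"]
memberships = {
    "espn.com": {"sports", "news"},
    "sportsauthority.com": {"sports", "store"},
}

features = cluster_features(memberships, clusters)
for site, row in features.items():
    print(site, row)
```

Each column can then be handed to a downstream model, which is free to ignore the clusters that turn out not to be useful.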
In sum, topic clusters are extremely beneficial in helping us understand broad, Internet-wide usage patterns. They help us determine whether or not a site is the kind that people go to every day to keep up with the latest news or the kind of site they check out once a month. Despite being based solely on people’s browsing behavior, these clusters have distinct subjects like “sports” or “tech news”. We’ll dive into how these clusters and other features get incorporated into our models in future blog posts.
Until then, read more about what it’s like to be a data scientist in our post, “Understanding Data Science and Why It’s So Important.”

Thanks, it is very helpful.
Thanks for your post.
I learned so much.