
Latent Semantic Indexing

Understanding latent semantic indexing in full detail is complex and usually requires a solid mathematical background, but the core ideas are approachable.

There are several methods that can be used to index web pages and retrieve the ones relevant to a user's query.

The obvious method of retrieving relevant pages is to match the words of a search query against the same text found within the available web pages.

The problem with simple word matching is that it is extremely inaccurate, because there are so many different ways for a user to express the concept they are looking for.

This is known as synonymy. The reverse problem also occurs: many words have multiple meanings, which is known as polysemy.

Because of synonymy, a user's query may not actually match the text on the most relevant pages, so those pages are overlooked. Polysemy means the terms in a user's query will often match terms in irrelevant pages.

Latent semantic indexing, or LSI, is an attempt to overcome these problems by looking at the patterns of word distribution across the entire collection of pages. Pages that have many words in common are considered semantically close in meaning.

Pages that have few words in common are considered semantically distant. The result is a similarity value calculated for every content word or phrase.

In response to a query, the LSI database returns the pages it judges most relevant. The LSI algorithm does not understand anything about what the words mean, and it does not require an exact match to return useful web pages.
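
To make the idea concrete, here is a minimal sketch in Python using numpy. The tiny corpus, the term list, and the choice of rank k = 2 are illustrative assumptions, not details from any real search engine:

```python
import numpy as np

# Tiny illustrative term-document matrix (rows: terms, columns: pages).
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [2, 0, 0],   # "car" appears only in page 0
    [0, 2, 0],   # "automobile" appears only in page 1
    [1, 1, 0],   # "engine" co-occurs with both
    [0, 0, 2],   # "flower" appears only in page 2
    [0, 0, 1],   # "petal" appears only in page 2
], dtype=float)

# Singular value decomposition, keeping only the top k dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Each page becomes a point in the k-dimensional latent space.
page_vecs = (np.diag(s_k) @ Vt_k).T

def fold_in(query_terms):
    """Project a bag-of-words query into the same latent space."""
    q = np.array([1.0 if t in query_terms else 0.0 for t in terms])
    return q @ U_k @ np.diag(1.0 / s_k)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

q = fold_in({"car"})
for i, vec in enumerate(page_vecs):
    print(f"page {i}: similarity {cosine(q, vec):+.3f}")
# Page 1 ("automobile engine") scores highly for the query "car" even
# though it never contains that word: "car" and "automobile" share the
# context word "engine", and the reduced space captures that pattern.
```

This is exactly the behavior described above: no word meanings, no exact match, just patterns of co-occurrence.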


How Google Uses Latent Semantic Indexing

Most people who have used the Internet know what Google is, but few know exactly what Google does behind the scenes, or what makes it work the way it does. Google's searches can be as accurate as they are in part because of latent semantic indexing.

Latent semantic indexing allows a search engine to determine what a page is about beyond the one or more keywords entered by the user. It adds an important step to the document indexing process: LSI records which keywords a document contains and also examines the document collection as a whole.

By placing importance on related words, LSI has the net effect of lowering the value of pages that only match one specific term and favoring pages that use a range of related terms.
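
A rough sketch of what "examining the collection as a whole" can mean in practice: counting which words tend to appear together across documents. The four mini-documents below are hypothetical:

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini-collection; in practice this would be the crawled corpus.
docs = [
    "solar panel installation cost",
    "solar energy panel efficiency",
    "install roof panel solar energy",
    "stock market energy prices",
]

# Count how often each pair of distinct terms appears in the same document.
co_occurrence = Counter()
for doc in docs:
    for a, b in combinations(sorted(set(doc.split())), 2):
        co_occurrence[(a, b)] += 1

# Terms that repeatedly co-occur with "solar" get treated as related.
related = {pair: n for pair, n in co_occurrence.items()
           if "solar" in pair and n > 1}
print(related)   # {('panel', 'solar'): 3, ('energy', 'solar'): 2}
```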

Search engines such as Google try to work out phrase relationships when they process keyword queries, which in turn improves the rankings of pages containing related phrases, even when those pages are not tightly focused on the target theme. Pages that are too narrowly focused on a single phrase tend to rank worse than you would expect.

In fact, some are even filtered out for being over-optimized. Pages built around a wider net of related keywords tend to have more stable rankings.

Although the LSI algorithm doesn’t understand anything

about what the words mean, the patterns it notices

make the search engine look extremely intelligent.


Latent Semantic Indexing Explained

If you run a web page that you want many people to visit, or if you are curious how your keyword searches turn up the results they do, then it is worth knowing a little more about latent semantic indexing and how it works.

Latent semantic indexing is a technique that projects queries and documents into a space of latent semantic dimensions. In this latent semantic space, a query and a document can be similar even if they share none of the same terms, as long as their terms are semantically similar.

LSI is thus a similarity metric that serves as an alternative to simple word-overlap measures. The latent space has fewer dimensions than the original term space, so LSI is a method of dimensionality reduction.

There are many possible mappings from the high-dimensional term space to a lower-dimensional one. LSI chooses the mapping that is optimal in the sense that it minimizes the distance between the original representations and their reduced counterparts.

Choosing the number of dimensions is a problem in its own right. A reduction can remove much of the noise, but keeping too few dimensions may lose important information. In reported experiments, LSI performance improves considerably once ten to twenty dimensions are used, peaks at around seventy to one hundred dimensions, and then slowly begins to diminish again. The same pattern of performance has been observed with other datasets as well.
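
That trade-off can be demonstrated numerically. The sketch below builds a synthetic matrix with known low-rank structure plus noise (the sizes and noise level are arbitrary assumptions, not real search data) and shows that the rank-k approximation is closest to the clean signal near the true rank: too few dimensions lose structure, too many re-admit noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a term-document matrix: rank-5 structure + noise.
n_terms, n_docs, true_rank = 200, 100, 5
clean = rng.normal(size=(n_terms, true_rank)) @ rng.normal(size=(true_rank, n_docs))
noisy = clean + 0.5 * rng.normal(size=(n_terms, n_docs))

U, s, Vt = np.linalg.svd(noisy, full_matrices=False)

for k in (1, 2, 5, 20, 100):
    approx = U[:, :k] * s[:k] @ Vt[:k, :]   # rank-k approximation
    err = np.linalg.norm(approx - clean) / np.linalg.norm(clean)
    print(f"k={k:3d}  relative error vs clean signal: {err:.3f}")
# The error is smallest near k=5 and grows again as extra dimensions
# bring the noise back in.
```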

Latent semantic indexing gives us a better gauge of the content of a web page and helps discover its overall theme. It is a more sophisticated measure of what sites and their pages are all about. Webmasters don't necessarily need to redo all of their pages' keywords, but LSI does change what optimization effort means: depth of coverage needs to be a greater consideration.


Latent Semantic Indexing Information

The latent semantic indexing information retrieval model builds on prior research in information retrieval. LSI uses singular value decomposition, or SVD, to reduce the dimensions of the term-document space, attempting to solve the problems that plague automatic information retrieval systems.

LSI represents terms and documents in a rich, high-dimensional space, allowing the underlying, latent semantic relationships between terms and documents to be exploited.

The latent semantic indexing model views the terms in a document as somewhat unreliable indicators of the information within that document: the variability of word choice obscures the semantic structure of the documents involved. When the term-document space is reduced, the underlying semantic relationships are revealed and much of the noise is eliminated.
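
Here is a small numpy sketch of that effect (the terms and their counts are invented): two near-synonyms that never appear in the same document look unrelated in the raw term-document space, but become strongly similar once the space is reduced, because both co-occur with a shared context word.

```python
import numpy as np

# Term-document counts: "buy" and "purchase" never appear in the same
# document, but both co-occur with "shop" (counts are invented).
#               d0 d1 d2 d3
A = np.array([[3, 0, 0, 0],    # "buy"
              [0, 3, 0, 0],    # "purchase"
              [1, 1, 0, 0],    # "shop"
              [0, 0, 4, 3]],   # "zebra"
             dtype=float)

def term_sims(M):
    """Cosine similarity between every pair of term rows."""
    unit = M / np.linalg.norm(M, axis=1, keepdims=True)
    return unit @ unit.T

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_2 = U[:, :2] * s[:2] @ Vt[:2, :]          # rank-2 approximation

print(np.round(term_sims(A), 2))    # raw: buy/purchase similarity = 0.0
print(np.round(term_sims(A_2), 2))  # reduced: buy/purchase jumps to 1.0
```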

Latent semantic indexing differs from other attempts at using reduced-space models for information retrieval. Both terms and documents are represented in the same space, and no attempt is made to assign a meaning to each individual dimension. Earlier vector-space models were, because of their computational demands, limited to relatively small document collections.

LSI is able to represent and manipulate larger data sets, which makes it viable for real-world applications. Compared with other information retrieval techniques, LSI performs quite well: latent semantic indexing has been found to provide thirty percent more related documents than standard word-based retrieval systems.

LSI is also fully automatic and easy to use: it requires no complex expressions or special query syntax. Terms and documents are represented in the same space, and relevance feedback can be integrated into the LSI model.


Latent Semantic Indexing Defined

Latent semantic indexing, by definition, is a mathematical and statistical technique for extracting and representing the similarity of meaning of words and passages through the analysis of large bodies of text.

That definition may be a little difficult to digest, but in essence latent semantic indexing takes the keywords you put into your search engine and goes through the indexed web pages, seeking out the best results for the keywords you entered.

As described above, LSI chooses the optimal mapping from the high-dimensional term space to a low-dimensional one, and its retrieval performance typically peaks at around seventy to one hundred dimensions.

Latent semantic indexing treats pages that have many words in common as close in meaning, sorts them accordingly, and presents them to the searcher. The result is an LSI-indexed database with similarity values calculated for every content word and phrase. In response to a query, the LSI database returns the pages it judges to fit the keywords best.

The algorithm doesn’t understand anything about what

the words mean and does not require an exact match to

return results that are useful to the seeker.


How Latent Semantic Indexing Is Achieved

To understand how latent semantic indexing is achieved, it helps to know some basic high school math, particularly Cartesian coordinates.

Typically, a term-document matrix is created from pages that have been pre-processed so that only the words carrying semantic meaning remain.

All formatting is stripped from the pages, including capitalization, punctuation, and extraneous markup. Conjunctions, common verbs, pronouns, and prepositions are removed as well. Finally, common word endings are removed, and what you have left are the word stems.
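
A toy version of this pipeline in Python; the stop list and suffix rules are deliberately minimal stand-ins (a real system would use a full stop list and a proper stemmer such as Porter's algorithm):

```python
import re

# Minimal illustrative stop list and suffix rules (assumptions, not a
# real search engine's lists).
STOP_WORDS = {"the", "and", "is", "are", "of", "a", "to", "in", "it"}
SUFFIXES = ("ing", "ed", "s")

def preprocess(html_text):
    text = re.sub(r"<[^>]+>", " ", html_text)      # strip markup
    words = re.findall(r"[a-z]+", text.lower())    # drop case, punctuation
    stems = []
    for w in words:
        if w in STOP_WORDS:                        # drop stop words
            continue
        for suf in SUFFIXES:                       # crude suffix stripping
            if w.endswith(suf) and len(w) > len(suf) + 2:
                w = w[: -len(suf)]
                break
        stems.append(w)
    return stems

print(preprocess("<p>The engines are indexing the linked pages.</p>"))
# ['engine', 'index', 'link', 'page']
```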

To plot the position of a web page, think of the page as a point in a three-dimensional space, using three words as the axes instead of three lines. The set of positions of every page that contains these three words is known as a term space. Each page forms a vector in that space, and the vector's direction and magnitude are determined by how many times the three words appear in the page.
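
For instance, with the three words "cat", "dog", and "pet" as axes (a purely hypothetical choice), each page's vector is just its three counts:

```python
# A hypothetical three-word term space: each page becomes a vector of
# the counts of "cat", "dog", and "pet" in that page.
AXES = ("cat", "dog", "pet")

pages = {
    "page_a": "cat cat pet",
    "page_b": "dog pet pet dog",
    "page_c": "cat dog pet",
}

vectors = {name: tuple(text.split().count(w) for w in AXES)
           for name, text in pages.items()}
print(vectors)
# {'page_a': (2, 0, 1), 'page_b': (0, 2, 2), 'page_c': (1, 1, 1)}
```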

With three words it is easy to imagine the resulting form, and the resulting query would turn up a good number of correct results. If every word on every page were represented instead, the number of dimensions would be enormous, and it is neither practical nor possible to plot every web page in existence this way.



An Overview of Latent Semantic Indexing

As outlined earlier, latent semantic indexing projects queries and documents into a space of latent semantic dimensions, in which a query and a document can be similar even if they share no terms, provided their terms are semantically similar. Because the latent space has fewer dimensions than the original term space, LSI is a method of dimensionality reduction.

This reduction takes a set of objects that exist in a high-dimensional space and re-represents them in a lower-dimensional space, often in just two or three dimensions for the purpose of visualization. The mathematical technique LSI applies is known as singular value decomposition.

The projection into the LSI space is chosen so that the original representations are changed as little as possible, as measured by the sum of the squares of the differences.
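
In standard matrix notation (the symbols here are assumed, since the article itself stays informal), this is the familiar optimality property of the truncated singular value decomposition: the rank-k factorization

```latex
A_k = U_k \Sigma_k V_k^{\top}
\qquad\text{minimizes}\qquad
\lVert A - X \rVert_F^2 \;=\; \sum_{i,j} \bigl(A_{ij} - X_{ij}\bigr)^2
```

over all matrices X of rank at most k, where A is the original term-document matrix and U_k, Σ_k, V_k are the truncated SVD factors.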

As noted in the earlier sections, LSI chooses the mapping that minimizes this distance, and retrieval performance typically improves up to around seventy to one hundred dimensions before slowly diminishing again.


Indexing By Latent Semantic Analysis

Indexing by latent semantic analysis is a natural language processing technique of vectorial semantics that analyzes the relationships between a set of documents and the terms they contain, producing a set of concepts related to those documents and terms.

The new concept space produced by latent semantic analysis can be used to compare documents with one another, which supports data clustering and document classification.
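
A minimal sketch of comparing documents in a reduced concept space, assuming the two-dimensional concept vectors below have already been produced by an LSI reduction (the numbers are invented):

```python
import numpy as np

# Hypothetical document-concept vectors after an LSI reduction to k = 2.
doc_concepts = np.array([
    [0.90, 0.10],   # doc 0
    [0.80, 0.20],   # doc 1
    [0.10, 0.95],   # doc 2
])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Naive threshold clustering: a document joins the first cluster whose
# seed it resembles closely enough, otherwise it starts a new cluster.
threshold = 0.9
clusters = []
for i, vec in enumerate(doc_concepts):
    for cluster in clusters:
        if cosine(vec, doc_concepts[cluster[0]]) > threshold:
            cluster.append(i)
            break
    else:
        clusters.append([i])

print(clusters)   # [[0, 1], [2]]
```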

It can also be used to find similar documents across languages, which is called cross-language retrieval, and to find relations between terms, namely synonymy and polysemy.

Given a query of terms, LSI can translate it into the concept space and find matching documents. This is commonly known as information retrieval.

Synonymy and polysemy are fundamental problems in natural language processing. Synonymy is where different words describe the same idea.

A query in a search engine may fail to retrieve a relevant document simply because the document does not contain the words appearing in the query. So even when words have the same meaning, the search query may not turn up all related articles.

Polysemy is where the same word has multiple meanings. When a query is made, the search may return irrelevant documents containing the desired word in the wrong meaning.

LSI adds an important step to the indexing process: it records which keywords a document contains and also examines the whole collection to see which other documents contain those same keywords.


Probabilistic Latent Semantic Indexing

Probabilistic latent semantic indexing is an automated document indexing technique based on a statistical latent class model for factor analysis of count data. It is an approach to automatic indexing and information retrieval that overcomes some of LSI's problems by mapping documents and terms into a probabilistic latent semantic space.

Although LSI has been applied with much success in different domains, it has a number of deficits, largely because it lacks a sound statistical foundation.

One typical scenario of human-machine interaction in information retrieval is the natural language query: the user provides a number of keywords and expects the system to pull up all the relevant articles or pages that relate to them.

But such systems are not infallible. Most search engines come up with a large number of unrelated results, usually because a keyword has more than one meaning, or because the same idea can be expressed with many different words. These problems are called polysemy and synonymy, and many of the newer latent semantic indexing systems have eliminated much of this unwanted noise from their results.

Many retrieval methods are based on simple word matching, and it is well known that literal term matching has severe drawbacks. Newer latent semantic approaches are more precise in their matching and do a much better job than the old word-matching queries did.

The standard procedure for maximum likelihood estimation in latent variable models of this kind is the Expectation Maximization (EM) algorithm.
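
For the curious, here is a minimal sketch of that EM procedure for the probabilistic model P(d, w) = Σ_z P(z) P(d|z) P(w|z), run on an invented 4x4 count matrix with two assumed topics:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented document-word count matrix n(d, w): two visible "topics".
n = np.array([[4, 3, 0, 0],
              [3, 4, 1, 0],
              [0, 1, 4, 3],
              [0, 0, 3, 4]], dtype=float)
D, W, Z = 4, 4, 2   # documents, words, assumed latent topics

# Random initialization of P(z), P(d|z), P(w|z).
p_z = np.full(Z, 1.0 / Z)
p_d_z = rng.dirichlet(np.ones(D), size=Z)   # shape (Z, D)
p_w_z = rng.dirichlet(np.ones(W), size=Z)   # shape (Z, W)

for _ in range(100):
    # E-step: posterior P(z | d, w) for every (document, word) pair.
    joint = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
    post = joint / joint.sum(axis=0, keepdims=True)    # shape (Z, D, W)

    # M-step: re-estimate the parameters from expected counts.
    expected = post * n[None, :, :]
    p_d_z = expected.sum(axis=2)
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_w_z = expected.sum(axis=1)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z = expected.sum(axis=(1, 2))
    p_z /= p_z.sum()

print(np.round(p_w_z, 2))   # each topic's word distribution concentrates
                            # on one block of words
```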


Dimensions Of Latent Semantic Indexing

Latent semantic indexing is commonly used to match web search queries to documents in retrieval applications. It has improved retrieval performance for some, but not all, collections when compared with traditional vector space retrieval, or VSR.

As discussed earlier, LSI lets a search engine determine what a page is about beyond the keywords the user selects: it records which keywords a document contains, examines the document collection as a whole, and lowers the value of pages that only match one specific term.

As described in the overview, LSI reduces the original term space to one with far fewer dimensions, re-representing the objects in a lower-dimensional space, sometimes in just two or three dimensions for visualization.

LSI applies the mathematical technique known as singular value decomposition, and even after reduction the number of dimensions needed is typically large. As in the three-word term space described earlier, each page forms a vector whose direction and magnitude are determined by how often the chosen words appear in it, and the number of dimensions retained has direct implications for indexing run time, query run time, and the amount of memory required.
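
To make the memory point concrete, a back-of-envelope calculation (the collection size and float width are assumptions):

```python
# Storage for dense LSI document vectors: n_docs x k float64 values.
n_docs = 10_000_000          # hypothetical ten-million-page index
for k in (50, 100, 300):
    gigabytes = n_docs * k * 8 / 1e9
    print(f"k={k:3d}: {gigabytes:5.1f} GB for the document vectors alone")
# k= 50:   4.0 GB, k=100:   8.0 GB, k=300:  24.0 GB
```

Tripling the number of dimensions triples the storage (and roughly the query-time arithmetic), which is why the choice of k matters in practice as well as in theory.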