Wednesday, December 12, 2007
`We are drowning in information but starved for knowledge.'
What is data mining?
Data mining sits at the interface between statistics, computer science, artificial intelligence, pattern recognition, machine learning, database management and data visualisation (to name some of the fields).
Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately comprehensible patterns or models in data in order to make crucial decisions. Data mining is not a product that can be bought. Data mining is a discipline and a process that must be mastered - a whole problem-solving cycle.
The main part of data mining is concerned with the analysis of data and the use of software techniques for finding patterns and regularities in sets of data. The idea is that it is possible to strike gold in unexpected places, as the data mining software extracts patterns that were not previously discernible, or that were so obvious that no-one had noticed them before. The analysis process starts with a set of data and uses a methodology to develop an optimal representation of the structure of the data, during which knowledge is acquired. Once knowledge has been acquired, it can be extended to larger sets of data, working on the assumption that the larger data set has a structure similar to the sample data. This is analogous to a mining operation in which large amounts of low-grade material are sifted through in order to find something of value.
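The sample-then-extend idea can be sketched in a few lines of Python (a toy illustration with synthetic numbers, not a real mining workflow): the "structure" learned from the sample is just a mean and standard deviation, which is then used to sift a larger data set for unusual values.

```python
import random
import statistics

random.seed(0)

# Sample data: learn the structure (here, just mean and spread) from a subset.
sample = [random.gauss(100, 15) for _ in range(200)]
mean = statistics.mean(sample)
stdev = statistics.stdev(sample)

# Extend to a larger data set, assuming it shares the sample's structure:
# flag values more than three standard deviations from the sample mean.
larger = [random.gauss(100, 15) for _ in range(10_000)] + [500.0]
nuggets = [x for x in larger if abs(x - mean) > 3 * stdev]

print(f"learned mean={mean:.1f}, stdev={stdev:.1f}")
print(f"{len(nuggets)} unusual value(s) found among {len(larger)} records")
```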
Is data mining `statistical déjà vu'?
Whereas statistical analysis traditionally concerns itself with analysing primary data that has been collected to check specific research hypotheses (`primary data analysis'), data mining can also concern itself with secondary data collected for other reasons (`secondary data analysis'). Furthermore, statistical data is often experimental (perhaps the result of an experiment that randomly allocates the statistical units to different kinds of treatment), whereas in data mining the data is typically observational.
Data warehousing provides the enterprise with a memory
Companies are collecting data on seemingly everything. For example, a customer-focused enterprise regards every record of an interaction with a client or prospect (e.g. each call to customer support, each point-of-sale transaction, each catalogue order, each visit to a company web site) as a learning opportunity. But learning requires more than simply gathering data. In fact, many companies gather hundreds of gigabytes of data without learning anything. For example, data are gathered because they are needed for some operational purpose, such as inventory control or billing. Once the data have served that purpose, they languish on tape or get discarded. The data's hidden value has largely gone untapped. For learning to take place, data from many sources (e.g. billing records, scanner data, registration forms, applications, call records, coupon redemption, surveys, manufacturing data) must first be gathered together and organised in a consistent and useful way - in a way that facilitates the retrieval of information for analytic purposes. This is called data warehousing. Data warehousing allows the enterprise to remember what it has noticed about its customers. Data warehousing provides the enterprise with a memory.
Data mining provides the enterprise with intelligence
Memory is of little use without intelligence. That is where data mining comes in. Intelligence allows us to comb through our memories noticing patterns, devising rules, coming up with new ideas to try, and making predictions about the future. The data must be analysed, understood and turned into actionable information. Using several data mining tools and techniques that add intelligence to the data warehouse, you will be able to exploit the vast mountains of data, for example, generated by interactions with your customers and prospects in order to get to know them better. Typical customer-focused business questions are:
What customers are most likely to respond to a mailing?
Are there groups (or segments) of customers with similar characteristics or behavior?
Are there interesting relationships between customer characteristics?
Who is likely to remain a loyal customer and who is likely to jump ship?
Where should the next branch be located?
What is the next product or service this customer will want?
Answers to questions like these lie buried in your corporate data, but it takes powerful data mining tools to get at them, i.e. to dig for the gold in your user data. Data mining provides the enterprise with intelligence. Companies can use data mining findings for more profitable, proactive decision making and competitive advantage.
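As a toy illustration of the segmentation question above (all customer names and figures below are invented), a crude rule-based sketch in Python might look like this; real data mining would derive such segments from the data rather than hard-code them:

```python
from dataclasses import dataclass

@dataclass
class Customer:
    name: str
    orders_per_year: int    # frequency of purchases
    months_since_last: int  # recency of last purchase

# Hypothetical customer records, invented for illustration.
customers = [
    Customer("Alice", 12, 1),
    Customer("Bob", 1, 18),
    Customer("Carol", 8, 2),
    Customer("Dave", 2, 14),
]

def segment(c: Customer) -> str:
    """A crude rule-based segmentation: loyal, lapsed, or at-risk."""
    if c.orders_per_year >= 6 and c.months_since_last <= 3:
        return "loyal"
    if c.months_since_last > 12:
        return "lapsed"
    return "at-risk"

for c in customers:
    print(c.name, "->", segment(c))
```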
With data mining, companies can, for example, analyze customers' past behaviors in order to make strategic decisions for the future. Keep in mind, however, that the data mining techniques and tools are equally applicable in fields ranging from law enforcement to radio astronomy, medicine, and industrial process control.
Please contact us today in order to discuss how data mining can be applied to your field or work. Get Statooed.
Data mining myths versus realities
A great deal of what is said about data mining is incomplete, exaggerated, or wrong. Data mining has taken the business world by storm, but as with many new technologies, there seems to be a direct relationship between its potential benefits and the quantity of (often) contradictory claims, or myths, about its capabilities and weaknesses. When you undertake a data mining project, avoid a cycle of unrealistic expectations followed by disappointment. Understand the facts instead, and your data mining efforts will (hopefully) be successful. You will find a list of the most common data mining myths versus realities here.
Data mining can not be ignored - the data is there, the methods are numerous, and the advantages that knowledge discovery brings are tremendous. Companies whose data mining efforts are guided by `mythology' will find themselves at a serious competitive disadvantage to those organizations taking a measured, rational approach based on facts.
Tuesday, December 11, 2007
Saturday, December 8, 2007
Don’t get too excited - it will take a while for everything to settle - but it looks like it is Google PageRank update time again. This means your blog could shift around in search results, leading to more or less traffic. Those of you getting higher page ranks might be able to leverage them for some more money; those of you dropping in page rank will probably not be able to make as much money through things like banner and text link advertising.
Good luck to everyone, and may your page ranks be high.
There is a great post up on Read/WriteWeb about bloggers, the types that are currently out there, and where the blogosphere may be heading. It is a bit long, but very enjoyable, especially if you are interested in blogging for a variety of reasons.
Here is a snippet from the article:
It was a good conference and we had several interesting conversations, but I walked away with a strange feeling. Somehow it seemed that blogging just isn’t that hot anymore. The feeling has been exacerbated by the latest slow down in news. My feeds just do not update that often these days. Can it be that the digestion phase applies to blogs just as it applies to startups? In this post we’ll investigate whether the blogosphere is going through a digestion phase.
Definitely worth a read though I hope we are not in a digestion phase. I still like the crazy unbridled growth we’ve seen over the past two or three years.
Tuesday, November 20, 2007
There are several styles of search query syntax that vary in strictness. Whereas some text search engines require users to enter two or three words separated by white space, other search engines may enable users to specify entire documents, pictures, sounds, and various forms of natural language. Some search engines apply improvements to search queries to increase the likelihood of providing a quality set of items through a process known as query expansion.
The list of items that meet the criteria specified by the query is typically sorted, or ranked, in some regard so as to place the most relevant items first. Ranking items by relevance (from highest to lowest) reduces the time required to find the desired information. Probabilistic search engines rank items based on measures of similarity and sometimes popularity or authority. Boolean search engines typically only return items which match exactly, without regard to order.
To provide a set of matching items quickly, a search engine will typically collect metadata about the group of items under consideration beforehand through a process referred to as indexing. The index typically requires a smaller amount of computer storage, and provides a basis for the search engine to calculate item relevance. The search engine may store a copy of each item in a cache so that users can see the state of the item at the time it was indexed, for archive purposes, or to make repetitive processes work more efficiently and quickly.
Notably, some search engines do not store an index. Crawler- or spider-type search engines may collect and assess items at the time of the search query. Meta search engines simply reuse the index or results of one or more other search engines.
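The indexing and Boolean-matching ideas above can be sketched in a few lines of Python (a toy inverted index over invented documents; real engines are far more elaborate):

```python
from collections import defaultdict

# A toy document collection; contents invented for illustration.
documents = {
    1: "data mining finds patterns in data",
    2: "search engines rank items by relevance",
    3: "an index speeds up search over data",
}

# Indexing: collect metadata (here, word -> document ids) ahead of query time.
index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def search(query: str) -> set:
    """Boolean AND search: return documents containing every query word."""
    words = query.lower().split()
    if not words:
        return set()
    results = index.get(words[0], set()).copy()
    for word in words[1:]:
        results &= index.get(word, set())
    return results

print(search("search data"))  # documents containing both words
```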
Sunday, November 4, 2007
Wednesday, October 31, 2007
While StatCounter only registers the last 500 hits, the Motigo counter saves all hits and therefore becomes interesting when one is searching for historical information. The longer a site is online, the more interesting the statistical data gets, because one can make prognoses. Here is an overview of the countries visitors come from. It seems they now also come from abroad. There are even some dozens of hits from www.google.com itself (in California, USA), so they are reading this too.....
"Our mission is to provide high performance real-time widgets for the blogging community that are free and easy to use. FEEDJIT is founded by two serial entrepreneurs whose previous businesses include a search engine and a blogging platform. If you'd like to contact us, please email"
<firstname.lastname@example.org> with any questions, feedback or bug reports.
Monday, October 29, 2007
The above text comes from their own site.
Details on: http://feedjit.com/
I am testing this new application now on this statistical blog as well. It is a widget that shows visitors how others got to your site. Other statistical programs have this information as well, but mostly it isn't accessible to visitors. Visitors can follow the same links as previous visitors.
Analysing data from your website is important. Perception is often vastly different from reality. Website statistics can be misleading if not interpreted properly. A basic analysis can be done using statistics programs provided by your hosting company. Using these statistics, you should be able to evaluate:
Return on investment in SEO. Search engine optimisation companies who charge thousands of dollars for their services often base their claims on getting a handful of top search terms, but it may be that only a handful of visitors actually find your website by typing in those search terms. Are you spending hundreds of dollars a month on hosting, management, search engine optimisation and copywriting for a website that no-one visits?
Are your premium Google listings, Google AdWords, and advertising costs from other search engines and websites providing a return on investment, or are you spending thousands of dollars for a handful of visitors?
Where are your visitors coming from?
What are the popular search terms and phrases used to find your website?
Are marketing campaigns like letter box drops, competitions, raffles, etc bringing more visitors to your website?
Website data, like all data collection, has limitations. Currently, all website statistical programs and web services on the internet assume that a unique user is equal to a unique IP address. Users may access your website from different locations and/or may not have a static IP. Users of dial-up internet will not have a static IP, so the website data will count them as different visitors. Understanding these limitations is important when interpreting data.
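A small Python sketch (using invented, documentation-range IP addresses) shows how counting "unique visitors" by IP over-counts a dial-up user whose address changes between connections:

```python
# Hypothetical log entries: (ip_address, page). The same person on dial-up
# may appear under several IPs, so counting unique IPs over-counts visitors.
log = [
    ("203.0.113.7", "/home"),
    ("203.0.113.7", "/products"),
    ("198.51.100.4", "/home"),
    ("198.51.100.9", "/home"),  # same dial-up user, new IP after reconnecting
]

unique_ips = {ip for ip, _page in log}
print(f"{len(log)} page loads, {len(unique_ips)} 'unique' visitors by IP")
```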
Many people believe that a popular website gets lots of hits. The number of hits is equal to the number of file requests on a webpage, so if you have lots of images on your website, you will have lots of hits. The number of hits tells you nothing about the popularity of your website or whether your website is working for your business.
All the log analyzers and statistics programs available use simple mathematics. The problem is that averages can lie if the proper mathematical model is not used. A statistician knows that averages and medians are meaningless without discussing variance and standard deviations, testing hypotheses, choosing the correct distribution, and removing outliers.
Finding meaning in statistics
Statistics is not flawless but it will paint a picture and a good analysis of website usage will benchmark website performance against your desired outcomes and provide options for better website design.
Some questions statistical analysis may provide answers to include -
How many visitors are coming to the website?
Where do visitors come from? If they come from search engines, what key words and phrases are they using to find your website?
How long do visitors stay on your website?
What are the click paths of various types of visitors to your website?
What linking sites do your visitors come from?
Do the statistics give an indication of demographics?
What do people mainly come to the website to do, read, or buy?
However, if you want to measure improvement, you need to do a proper analysis and calculate the standard deviation. Otherwise, you could prematurely conclude that there had been an improvement when perhaps only variance or a seasonal factor influenced the result.
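The point can be made concrete with a small Python sketch using invented weekly visitor counts: the averages differ, but the difference sits inside normal week-to-week variation.

```python
import statistics

# Hypothetical weekly visitor counts before and after a site redesign.
before = [120, 135, 110, 128, 142, 118, 125, 131]
after = [133, 127, 145, 130]

mean_b = statistics.mean(before)
stdev_b = statistics.stdev(before)
mean_a = statistics.mean(after)

# A naive comparison of averages suggests an improvement...
print(f"before: mean={mean_b:.1f}, stdev={stdev_b:.1f}; after: mean={mean_a:.1f}")

# ...but if the new mean lies within about one standard deviation of the
# old one, ordinary week-to-week variance could explain the difference.
if abs(mean_a - mean_b) < stdev_b:
    print("Difference is within normal variation; no clear improvement.")
```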
Using raw logs
Proper statistical analysis can only be done using raw data, by importing the data into mathematical programs or log-analysing software and eliminating outliers, e.g. removing the IP addresses of staff members from website usage data. Obviously this is more expensive, and the scope and purpose of the website statistical analysis must be determined. It is also necessary to compare various tools, as the way each program manipulates the data will vary, and some log analyzers may be programmed incorrectly. Just because a tool spits out an answer or a pie graph does not mean that the answer is correct.
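Filtering staff traffic out of raw logs can be sketched like this in Python (the log lines and addresses are invented for illustration):

```python
# Hypothetical raw log lines: "ip path status". Staff IPs are excluded
# before analysis so internal traffic does not inflate the figures.
raw_log = [
    "203.0.113.7 /home 200",
    "192.0.2.10 /home 200",      # staff member (hypothetical address)
    "198.51.100.4 /products 200",
    "192.0.2.10 /admin 200",     # staff member again
]

STAFF_IPS = {"192.0.2.10"}

cleaned = [line for line in raw_log if line.split()[0] not in STAFF_IPS]
print(f"{len(raw_log)} raw entries, {len(cleaned)} after removing staff hits")
```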
Friday, October 26, 2007
Thursday, October 25, 2007
This overview is always interesting. Do visitors come back after a first visit? Judging by these graphically presented results from www.statcounter.com, that seems to be the case here.
From your project log of the last x number of pageloads, we extract the total number of unique visitors present in it. Each unique visitor has a cookie, which is incremented each time they return to visit your website (a couple of hours between visits is needed, depending on your settings). From this information we can show you how often visitors return to see your website again and again.
The best and most successful websites are the ones with a very high return frequency. If you have a low or non-existent return frequency, you may want to change your website to encourage your visitors to come again and again.
You will quickly see which pages are the most heavily visited and which ones are being left well enough alone. If some pages are being overlooked, it could be a good idea to improve them or make the navigation to those pages more obvious. Or entice your visitors to find out more about these pages.
Drill down the data - show all your visitors that visited this particular page during their visit!
Common problem with Popular Page Stats
If you only install the StatCounter code on one page of your website we can only track one page of your website. It is highly recommended to install the same code on all pages of your website you want to track.
Friday, October 19, 2007
Running total of visits to the above URL since 1 Oct 2007: 141. Total since archive, i.e. 1 Oct 2007 - present: 141 (not necessarily all displayed - see below). Visits on previous 'day': 1.
Additional notes about totals and map updates (for the full guide see Map Key): The map shows individual visits to the web site shown at the top of the page, clustered within a given distance. The location of each visit is based on the IP address of the computer used. Update frequency variation: in order for your map to be 'updated' (whether daily, weekly, or monthly), the number of visitors shown must also have grown by a certain percentage since the last update.
This percentage may have changed recently, and is explained more fully in our FAQ about update frequency. Total/subtotal discrepancies above are typically caused either by IP addresses not currently being in the database or by the gap between counter tallies (continuous) and map updates (which may be daily, weekly, or monthly).
See FAQ for additional explanations.
Friday, October 12, 2007
Tuesday, October 9, 2007
The free StatCounter tool only measures the last 500 visits - not that much for a test. For access to the complete features one has to pay.
Monday, October 8, 2007
The obvious question is have you tested your website in the browsers your visitors are using? A website can look great in one particular browser and not work in any other. It is always recommended to code websites using the standards maintained by W3C.
Saturday, October 6, 2007
StatCounter gives you a list of keywords people searched for before finding your site. Here you can see how people came to this actual Statistical Data site. It shows the interests of your visitors.
This is a great way to see, up to the second, how the latest visitors to your website are using the search engines to find it, and then to zoom in and see how they used your website.
Why does the same search term appear again and again with the same ip address?
This is how your visitors are using your website! They are clicking onto your website from a search engine result, then a little later they are doing it again, and sometimes again and again. We track exactly what your visitors are doing, so don't be surprised by some of their very strange behaviour. Learn from it!
Choosing the Best Keywords
Effective search engine optimization depends on choosing the best keywords and I recommend using a combination of the Google Keyword Analysis and Overture Keyword Selector Tool.
Friday, October 5, 2007
Wednesday, October 3, 2007
If something like I mentioned above happens, I will write you a report on how I use the installed statistical programs to find out the details. With the right tools one can find out quite interesting details about a site.
Monday, October 1, 2007
Other statistical programs have this option too. But Clustrmap shows the image straight away on the site for all to see.
Sunday, September 30, 2007
Saturday, September 29, 2007
Returning Visitors - Based purely on a cookie: if this person returns to your website for another visit an hour or more later, they are counted as a returning visitor.
First Time Visitors - Based purely on a cookie: if this person has no cookie, then this is considered their first time at your website.
Unique Visitor - Based purely on a cookie, this is the total of returning visitors and first time visitors - all your visitors.
Page Load - The number of times your page has been visited.
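A minimal Python sketch of these cookie-based definitions (the cookie values are invented; a real tracker would also set and increment the cookie on each visit):

```python
def classify_visit(cookie_count):
    """Classify one visit from a hypothetical visit-count cookie.

    None means the browser sent no cookie: a first time visitor.
    Any existing count means the visitor has been here before.
    """
    return "first time" if cookie_count is None else "returning"

# Hypothetical stream of visits: the cookie value each browser presents.
visits = [None, None, 3, None, 1]

first_time = sum(1 for c in visits if c is None)
returning = len(visits) - first_time
unique = first_time + returning  # as defined above: all your visitors
print(f"page loads: {len(visits)}, first time: {first_time}, "
      f"returning: {returning}, unique: {unique}")
```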
What can the Summary Stats tell me?
There are two dimensions to the stats for a 'Standard StatCounter project': the 'Summary Stats' and the 'Detailed Log Analysis'. The 'Summary Stats' provide a lifetime daily count of the totals of visitors to your website each day, and allow you to run reports since the day you started the project! After a few months of tracking, it is absolutely superb to look back and see the daily, weekly and monthly trends of your visitors. Does your website have a weekday rush and a weekend slump? Did your traffic take a surge leading up to a holiday season? Is your website in general growing or stagnating? It is a wonderful tool to quickly assess the current success of your website.
How do the Summary Stats work?
The 'Summary Stats' determine whether a visitor has been to your website before by using a cookie. So if a user has cookies disabled, we have no way of knowing whether they are unique or not, and they will by default be considered unique. However, the majority of visitors have cookies enabled.
To make up for relying on cookies in the summary stats, the rest of the stats are based on your detailed log analysis of the last xxx number of pageloads. The uniqueness in this case is based on your visitors' IP addresses. This method works very well for the majority, but yet again there is an exception: AOL users, and visitors who use what is known as a 'dynamic web proxy', which changes each time they access a webpage. So if a single AOL user visits 7 webpages on your website, they will likely come up as 7 different IP addresses!
Both cookies and IP addresses have their strengths and weaknesses for determining the uniqueness of a visitor. It is impossible to be 100% accurate the entire time, but with the Standard StatCounter Project you get the best of both worlds. Cookies for the 'Summary Stats' and IP addresses for the 'Detailed Log Analysis'!
The Advanced StatCounter Project combines the best of both worlds to use an almost fool-proof system (the only problem is AOL visitors who have cookies disabled - not many!), but it is very server intensive, and it won't be possible to provide it as a free service until the cost of hardware and dedicated servers comes down.
Certain procedures which input one or more numbers and output a number are called numerical operations. Unary operations input a single number and output a single number. For example, the successor operation adds one to an integer: the successor of 4 is 5. More common are binary operations which input two numbers and output a single number. Examples of binary operations include addition, subtraction, multiplication, division, and exponentiation. The study of numerical operations is called arithmetic.
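These definitions translate directly into code; a quick Python sketch of one unary and one binary operation:

```python
# Unary operation: one number in, one number out.
def successor(n: int) -> int:
    return n + 1

# Binary operation: two numbers in, one number out.
# (Exponentiation written out as repeated multiplication.)
def power(base: int, exponent: int) -> int:
    result = 1
    for _ in range(exponent):
        result *= base
    return result

print(successor(4))  # the successor of 4 is 5
print(power(2, 10))
```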
The branch of mathematics that studies abstract number systems such as groups, rings and fields is called abstract algebra.
Friday, September 28, 2007
Thursday, September 27, 2007
The statistics from eXTReMe Tracker today of the countries that have visited this site. As you can see, they even include the flags - a small gadget. eXTReMe Tracker records data historically, so I will use their data to give an idea of how the visitor information changes over a longer period. Of course, in the beginning a lot of the hits were by myself because of the testing, but you can see that this is now only 50% of the hits. The blog is being found, also through search engines. And that brings hits as well.
Wednesday, September 26, 2007
Motigo has registered these hits on this site. It separates the returning hits from the individual visitors. As you can see, the hits have dropped because there are now only a few changes being made to the site.