We live in a world that contains an enormous amount of data, which grows every day and is at our disposal, whether it comes from social media, the internet, or file records. It would be unwise not to find ways to manipulate and use this data so that we can extract patterns and information from it. One way to extract patterns and use data intelligently is data mining.
Data mining has evolved over the years into an immensely powerful tool, and researchers use it to extract patterns and information from data. Its algorithms fall into two main classes: classification algorithms and clustering procedures. When work is done through data mining, the accuracies of the results are sometimes published as well. In this research paper we analyze the accuracies that researchers publish, attempt to obtain better results by conducting similar tests on samples from a well-known repository of machine learning datasets, and hypothesize why one algorithm performs better than another. For other papers, we simply give recommendations based on our reading of the researchers' processes, in the hope that they can improve their work by implementing some of our suggestions.
Frequently asked questions
What is the main topic of the analysis presented?
The analysis focuses on checking the accuracies and performance of various data-mining algorithms using the KNIME open-source platform, comparing results against published research papers on cancer prediction, human resources, sentiment analysis, and other data mining applications.
What datasets were used in the analysis?
The analysis utilized data from well-known machine learning repositories, including datasets related to breast cancer research, human resources data for employee churn prediction, movie reviews from the IMDB database for sentiment analysis, and others.
What data mining algorithms were tested?
Several classification and clustering algorithms were tested, including decision trees, random forests, gradient boosted trees, k-nearest neighbors, support vector machines (SVM), and Naive Bayes.
How was the data partitioned for training and testing?
The data was typically split into a 70/30 ratio, where 70% of the data was used for training the algorithms and 30% was used for testing their predictive performance.
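The 70/30 partition described above can be sketched in plain Python. This is only an illustration under stated assumptions: the analysis itself was carried out in KNIME, and the function name `split_70_30`, the fixed seed, and the toy row list are hypothetical and not taken from the paper.

```python
import random


def split_70_30(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and split it into training and test partitions."""
    shuffled = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for a repeatable split
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Toy data: 100 row identifiers standing in for dataset records (assumption).
rows = list(range(100))
train, test = split_70_30(rows)
print(len(train), len(test))
```

Shuffling before cutting keeps any ordering in the source file, such as records sorted by class, from leaking into only one of the two partitions.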
What evaluation metrics were used to assess the performance of the algorithms?
The performance of the algorithms was evaluated using confusion matrices and accuracy tables, generated by a scorer attached to the end of each algorithm's workflow in KNIME.
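As a rough illustration of what such a scorer reports, the sketch below builds a confusion matrix and an accuracy value from actual and predicted class labels. It is not the KNIME scorer itself; the function names and the five example labels are made up for demonstration.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Return the sorted labels and a nested dict: matrix[actual][predicted] = count."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = {a: {p: counts[(a, p)] for p in labels} for a in labels}
    return labels, matrix


def accuracy(actual, predicted):
    """Fraction of records whose predicted label equals the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical labels for five test records (not data from the paper).
actual = ["benign", "benign", "malignant", "malignant", "benign"]
predicted = ["benign", "malignant", "malignant", "malignant", "benign"]

labels, matrix = confusion_matrix(actual, predicted)
for a in labels:
    print(a, [matrix[a][p] for p in labels])
print("accuracy:", accuracy(actual, predicted))
```

The diagonal of the matrix holds the correctly classified records, so accuracy is the sum of the diagonal divided by the total number of test records.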
What were the key findings in the cancer prediction analysis?
The analysis found that Support Vector Machines (SVM) generally provided the highest accuracy (97.6%), attributed to their suitability for numerical data. Random Forests also performed well with high accuracy. Probabilistic and pattern-based classifiers were less effective.
What were the key findings in the human resources churn prediction analysis?
Random Forest performed the best, with 98.6% accuracy, in predicting employee churn (whether employees stay or leave), as random forests are very effective at modelling patterns. Naive Bayes had the lowest score, at 83.2%.
What were the key findings in the sentiment analysis?
A C4.5 tree resulted in an accuracy of 62%. Ensemble classifiers such as Random Forests are recommended because of how they deal with incorrect classifications. Naive Bayes performed poorly.
What recommendation was given for Shazam?
The addition of text input using lyric data to increase the accuracy of the service.
What was recommended for privacy-protecting data mining?
Using data encryption, a well-established cyber security practice, may be better than changing the true value of the data. Data can be obscured using randomization and range-based input.
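The two obscuring techniques mentioned here, randomization and range-based input, can be sketched as follows. This is a minimal interpretation, assuming randomization means adding bounded random noise and range-based input means replacing a value with the interval that contains it; the function names, the spread, and the bucket width are hypothetical choices rather than details from the paper.

```python
import random


def randomize(value, spread, rng):
    """Obscure a numeric value by adding uniform noise in [-spread, +spread]."""
    return value + rng.uniform(-spread, spread)


def to_range(value, width):
    """Replace a value with the half-open interval [low, low + width) containing it."""
    low = (value // width) * width
    return (low, low + width)


rng = random.Random(0)          # seeded only so the sketch is repeatable
noisy_age = randomize(50, 5, rng)
age_range = to_range(37, 10)
print(age_range)
```

Both approaches change the stored value, which is the trade-off the answer above contrasts with encryption: encrypted data can be restored to its true value, whereas randomized or binned data cannot.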
How does the analysis compare to published research papers?
In most cases, the analysis found that the tested algorithms performed better than those reported in the research papers, potentially due to variations in the data sets used or the specific implementation of the algorithms.
What role does big data play in data mining?
Big data analysis requires storage systems that can hold all of the information, such as NoSQL databases or distributed frameworks such as Hadoop. This data can then be used in machine learning, predictive analytics, and data mining to predict future events.
What recommendations were given for visualization data mining?
Word length should be used to measure data when looking at hyperlinks. A computer screen offers roughly a million pixels, and each pixel can be allocated to a single data field.
What recommendations were given for web usage mining?
Social computing and sentiment analysis should be integrated, and the resulting data should be used in recommender systems, which suggest to users what to buy.
- Junaid Khan (Author), Wajeeh Ahmed (Co-author), 2017, Data Mining Algorithms, its Functions and Structure, Munich, GRIN Verlag, https://www.hausarbeiten.de/document/373667