Lazy Prices - Natural Language Processing

Lazy Prices

What is “Lazy Prices”?

“Lazy Prices” refers to the idea that changes in the language of a company's financial reports point to a loss in return on investment (ROI). Companies generate their 10-K and 10-Q financial reports through a repeatable process and tend to default to the language used in previous filings. Firms are classified by document similarity into two groups: Changers and Non-Changers.

Objective

The objective of this project is to analyze 10-K filings from the past 20 years using S&P 500 data by highlighting text changes that deviate from the norm using Natural Language Processing (NLP), generating similarity and sentiment scores, and performing descriptive and predictive analysis. In addition, the project tests the hypothesis that large text changes in 10-Ks signal lower future return on investment.

The master data set consists of 68 attributes; below are the most significant attributes used for analysis.

Understanding the data

The word count of 10-Ks increased significantly from 2000 to 2010. However, word counts have been consistent over the last 10 years.

Two methods are used to measure document similarity:

1) Cosine Similarity

■ measures the cosine of the angle between two vectors projected in a multi-dimensional space

2) Jaccard Similarity

■ compares the members of two sets to see which elements are shared and which are distinct
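The two measures above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's actual pipeline: the `tokenize` helper, the use of raw term counts as vectors, and the two sample sentences are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the two term-frequency vectors."""
    counts_a, counts_b = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    dot = sum(counts_a[term] * counts_b[term] for term in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


def jaccard_similarity(doc_a, doc_b):
    """Number of shared distinct terms divided by all distinct terms."""
    set_a, set_b = set(tokenize(doc_a)), set(tokenize(doc_b))
    if not set_a and not set_b:
        return 1.0  # two empty documents are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical excerpts standing in for two consecutive annual filings.
last_year = "We face risks related to competition and regulation"
this_year = "We face risks related to competition and litigation"

cos_score = cosine_similarity(last_year, this_year)
jac_score = jaccard_similarity(last_year, this_year)
print(round(cos_score, 3), round(jac_score, 3))
```

Cosine similarity takes word frequency into account, while Jaccard similarity only looks at which distinct words appear, so the two scores usually differ for the same pair of documents.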

Similarity Score Quantiles

The distribution of “changers” (Q1) and “non-changers” (Q5), based on the cosine similarity score

This figure shows the distribution of “changers” and “non-changers”. The cosine similarity distribution is very left-skewed. Using a bin size of 50, it has a mean of 0.93 and a median of 0.94. Q1 contains documents with a cosine similarity score between 0.419 and 0.91, while Q5 contains documents with a score between 0.95 and 0.99.

The graph below gives a visual description of document similarity. It shows that most documents have a high level of similarity, and the same trend can be seen with both cosine and Jaccard similarity. It is safe to conclude that most 10-Ks are highly similar to the previous year's filing.
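Splitting filings into five quantiles by similarity score can be sketched as follows. The scores below are made-up values, and `assign_quintiles` is a hypothetical helper; the cut points it produces will not match the boundaries reported above, which come from the full data set.

```python
from bisect import bisect_left
from statistics import quantiles


def assign_quintiles(scores):
    """Return a quintile label (1 to 5) for each score; 1 is the least similar."""
    # Four boundaries that split the scores into five groups.
    cut_points = quantiles(scores, n=5, method="inclusive")
    # A score equal to a boundary falls into the lower quintile.
    return [bisect_left(cut_points, score) + 1 for score in scores]


# Hypothetical year-over-year cosine similarity scores, one per filing.
scores = [0.42, 0.80, 0.90, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.99]
labels = assign_quintiles(scores)

changers = [s for s, q in zip(scores, labels) if q == 1]      # Q1
non_changers = [s for s, q in zip(scores, labels) if q == 5]  # Q5
print(changers, non_changers)
```

Because Q1 holds the lowest similarity scores, it captures the filings whose text changed the most relative to the prior year.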

Sentiment Data Comparison Statistics

Figures A, B and C below provide a comparison of sentiment statistics between Q1 (“changers”) and Q5 (“non-changers”). Firms’ reporting changes are concentrated in the management discussion (MD&A) and risk factor (RF) sections of the 10-K filings, as seen in Figure A. Similarly, changes in risk factor (RF) sentiment and polarity are more significant in the Q1 category than in the Q5 category, as seen in Figures B and C.

Top 10 Q1 (“changers”) Companies with Downtrend

As a result of our analysis, Figure 6 shows the top 10 companies with a downtrend in the Q1 (“changers”) category. These companies exhibited negative 3-, 6- and 12-month returns after filing their 10-K.

Top 10 Q5 (“non-changers”) Companies with Uptrend

The figure below shows the top 10 companies with an uptrend in the Q5 (“non-changers”) category. These companies exhibited positive 3-, 6- and 12-month returns after filing their 10-K. Mettler Toledo International Inc. had 7 years of positive returns in the past 20 years.

Predictive Modeling

Logistic regression, linear discriminant analysis and support vector machines are used and compared to predict the direction (positive/negative) of stock returns over the 3-, 6- and 12-month periods.

The models are first trained with default parameters. The results show that the training and testing accuracies are close to each other, so there is no apparent overfitting or underfitting issue, and no technique shows an advantage over the others.

After tuning hyperparameters, all three models achieved an accuracy between 62% and 70%. Overall, predicting the direction of returns over the 12-month period achieved better accuracy than over the 3- and 6-month periods.
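To make the train-versus-test comparison concrete, here is a minimal sketch of direction prediction with logistic regression, written from scratch with batch gradient descent instead of a machine learning library. The single feature, the labels and every function name are hypothetical, and the toy data is perfectly separable, so the accuracies it prints say nothing about the results reported above.

```python
import math


def sigmoid(z):
    """Map a real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(features, labels, learning_rate=0.5, epochs=200):
    """Fit a weight and bias for one feature by batch gradient descent."""
    weight, bias = 0.0, 0.0
    n = len(features)
    for _ in range(epochs):
        # Prediction error for each sample under the current parameters.
        errors = [sigmoid(weight * x + bias) - y for x, y in zip(features, labels)]
        weight -= learning_rate * sum(e * x for e, x in zip(errors, features)) / n
        bias -= learning_rate * sum(errors) / n
    return weight, bias


def predict_direction(features, weight, bias):
    """Return 1 for a predicted positive return, 0 for a negative one."""
    return [1 if sigmoid(weight * x + bias) >= 0.5 else 0 for x in features]


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical data: one sentiment-style feature per filing,
# label 1 if the following return was positive, 0 if negative.
train_x = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
train_y = [0, 0, 0, 1, 1, 1]
test_x = [-1.5, -0.2, 0.3, 1.5]
test_y = [0, 0, 1, 1]

weight, bias = fit_logistic(train_x, train_y)
train_acc = accuracy(predict_direction(train_x, weight, bias), train_y)
test_acc = accuracy(predict_direction(test_x, weight, bias), test_y)
print(train_acc, test_acc)
```

A large gap between `train_acc` and `test_acc` would suggest overfitting, while similar values on both sets indicate the model generalizes about as well as it fits.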

Confusion Matrix

In machine learning, and specifically in statistical classification, a confusion matrix allows visualization of the performance of an algorithm, typically a supervised learning one. It is a handy presentation of the accuracy of a model with two or more classes. The table presents predictions on the x-axis and actual outcomes on the y-axis. Each cell of the table holds the number of predictions made by the algorithm for that combination of predicted and actual class.
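The layout described above, with actual outcomes as rows and predictions as columns, can be built with a short function. The labels below are invented for illustration (1 for a positive return, 0 for a negative one), and `confusion_matrix` is a hypothetical helper, not a library call.

```python
def confusion_matrix(actual, predicted, labels=(0, 1)):
    """Count predictions per class: rows are actual, columns are predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


# Hypothetical return directions: 1 = positive, 0 = negative.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

matrix = confusion_matrix(actual, predicted)
# Correct predictions sit on the diagonal of the matrix.
overall_accuracy = sum(matrix[i][i] for i in range(len(matrix))) / len(actual)

for row in matrix:
    print(row)
print(overall_accuracy)
```

The off-diagonal cells show which kind of mistake the model makes, for example predicting a positive return when the actual return was negative.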

Conclusion

This study shows that the sentiment polarity score is a strong predictor of a company’s future return. The data was divided into five quantiles based on cosine similarity. Quantile 1 (Q1) was classified as “Changers” and quantile 5 (Q5) as “Non-Changers”. From the analysis of Q1 (“Changers”) and Q5 (“Non-Changers”), the study concludes that the management discussion and analysis (MD&A) and risk factor (RF) polarity scores show the most significant relationship with the company’s future return. “Changers” with negative sentiment exhibit lower investment returns over the 12-month period following the release of their 10-K reports, whereas “Non-Changers” tend to have higher future returns.
