Modeling Litigation in USPTO Patent Claims Using Doc2Vec

“The key to artificial intelligence has always been the representation.” —Jeff Hawkins

Zachary D. White, Feb 2019

Capstone

How not to do a capstone

Guidelines

1. Business Application

2. Applied Natural Language Processing

3. Valuable to Large Companies

The Conversation

What, Why, How & Who

What is Patent Scope?

§ Legalese Warning

    It is used in the general lexicon to refer to the breadth of the patent, or how much intellectual space resides within the metes and bounds of the patent claims.

§ Laymen Terms please

    A general definition may well be that the scope of a patent is the boundaries (or limits) of the invention protected by the patent.

§ In practical terms this means what??

    This definition manifests in practice as the universe of inventions that infringe on the patent.

    “Like other general legal terms—such as negligence or fraud—the practical use of a definition of the scope of a patent comes only in its application to specific cases of infringement."

"INTOLERABLY AMBIGUOUS!"

2 stars

-A friend

Why is Patent Scope Important?

How Do You Measure Scope?

Problem statement


New Product + Lawsuit = Bad

 

Search + Patent  

WHAT IS PATENT INFRINGEMENT?

Simulations

What is a Patent?

Patent Claim

Terms

ICC - Independent Claim Count is the number of independent claims made by a patent.

ICL - Independent Claim Length is the number of words in a given independent claim.

PC - Primary Claim is the first independent claim listed in a patent.

 

My method

Unsupervised Learning

Supervised Learning

New Ideas

Umm, Data... Hello

USPTO Data

import pandas as pd

df = pd.read_csv('claim_2000_2014_v001.csv')
df.head()


 	pat_no 	    claim_no 	    claim_txt
0 	8697278 	17 	17. Battery comprising an interior of the batt...
1 	7385756 	81 	81. A catadioptric projection objective for im...
2 	7387146 	1 	1. A heavy duty tire comprising a tread portio...
3 	7387253 	43 	43. A system comprising: (a) a optical reader ...
4 	7387278 	17 	17. A parachute ripcord pin for holding a para...

df.info()

Int64Index: 8216031 entries, 0 to 8216149      <- number of claims (rows) in the dataset
Data columns (total 3 columns):
pat_no       object     <- the patent ID number of a particular claim
claim_no     int64      <- the claim number within a particular patent
claim_txt    object     <- the text contained in the claim
dtypes: int64(1), object(2)
memory usage: 250.7+ MB

Clean Data & Create Features

     pat_no 	claim_no      claim_txt 	                                ICL    ICC   litigation
     8697278    17        battery comprising interior battery active ele...     106    2     0
     8697278    1         battery cell casing comprising first casing el...     97     2     0
     7385756    81        catadioptric projection objective imaging patt...     108    33    0
     7385756    94        catadioptric projection objective imaging patt...     116    33    0
     7385756    79        catadioptric projection objective imaging patt...     103    33    0
  • Text Cleaning
    • Removed legal stop words
    • Lower cased
    • Standard text cleaning
  • Computed
    • Independent Claim Length (ICL)
    • Independent Claim Count (ICC)
  • Merged Patent-wise Litigation Data from Stanford
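
A minimal sketch of these cleaning and feature steps (not the author's exact pipeline; the stop-word list and the Stanford litigation file name below are placeholders):

import re
import pandas as pd

# Placeholder stop-word list; the exact legal/standard stop words removed are not listed in the deck
STOP_WORDS = {'a', 'an', 'the', 'of', 'to', 'and', 'in', 'for', 'is', 'are',
              'with', 'that', 'this', 'on', 'by', 'or', 'at', 'from', 'as', 'be'}

def clean_claim(text):
    text = text.lower()                               # lower case
    text = re.sub(r'^\d+\.\s*', '', text)             # drop the leading claim number, e.g. "17. "
    text = re.sub(r'[^a-z0-9\s]', ' ', text)          # standard cleaning: strip punctuation
    return ' '.join(w for w in text.split() if w not in STOP_WORDS)

df['claim_txt'] = df['claim_txt'].apply(clean_claim)
df['ICL'] = df['claim_txt'].str.split().str.len()                 # Independent Claim Length
df['ICC'] = df.groupby('pat_no')['claim_no'].transform('count')   # Independent Claim Count (assumes one row per independent claim)

# Merge the patent-wise litigation flags (hypothetical file name for the Stanford data)
litig = pd.read_csv('stanford_litigation.csv')                    # assumed columns: pat_no, litigation
df = df.merge(litig, on='pat_no', how='left').fillna({'litigation': 0})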

Closer Look

df['litigation'].value_counts()

0    8185679
1      30352
Name: litigation, dtype: int64

0 = number of non-litigated claims
1 = number of litigated claims
df['litigation'].value_counts(normalize=True)

Non-Litigated Claims  99.63%
Litigated Claims       0.37%
Name: type, dtype: float64 

All Claims

df['litigation'].value_counts()

0    197975
1      7526
Name: litigation, dtype: int64

0 = number of non-litigated claims
1 = number of litigated claims
df['litigation'].value_counts(normalize=True)

Non-Litigated Claims  96.34%
Litigated Claims       3.66%
Name: type, dtype: float64 

Primary Claims only

Descriptive Statistics

All Claims 
df.loc[:,['ICL','litigation']].groupby(['litigation']).agg(['count','min','max','mean','std'])

           ICL
           count      min    max      mean         std
litigation
0          8185679    1      18433    104.827044   73.717400
1          30352      1      1509     94.007380    62.130354

# Two-Tailed Independent T-Test
       statistic     pvalue
ICL    30.260286     3.155879e-198
Primary Claims Only
df_gbd.loc[:,['ICC','ICL','litigation']].groupby(['litigation']).agg(['count','min','max','mean','std'])


           ICC
           count      min    max    mean       std
litigation
0          2907222    1      276    2.814694   2.326064
1          7526       1      358    4.030162   7.705218


           ICL
           count      min    max      mean         std
litigation
0          2907222    1      18433    106.592626   79.092973
1          7526       2      1509     99.652139    63.708417

# Two-Tailed Independent T-Test results
       statistic      pvalue
ICL    9.432147       5.235061e-21
ICC    -13.683259     4.067142e-42
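
These t-test rows could be reproduced along these lines (a sketch assuming scipy's two-sample t-test; the author's exact call isn't shown):

from scipy import stats
import pandas as pd

def ttest_by_litigation(data, cols):
    """Two-tailed independent t-test of each feature, non-litigated vs. litigated claims."""
    results = {}
    for col in cols:
        non_lit = data.loc[data['litigation'] == 0, col]
        lit = data.loc[data['litigation'] == 1, col]
        results[col] = stats.ttest_ind(non_lit, lit)  # two-tailed by default
    return pd.DataFrame(results, index=['statistic', 'pvalue']).T

ttest_by_litigation(df, ['ICL'])              # all claims
ttest_by_litigation(df_gbd, ['ICL', 'ICC'])   # primary claims only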

Modeling Techniques

Document Vectorization

Gensim Doc2Vec Language Model

XGBoost Logistic Regression

Doc2vec

## Sample of a tagged document
next(corpus_for_doc2vec.__iter__())

>TaggedDocument(words=['battery', 'comprising', 'interior', 'battery', 'active', 'elements', 'battery',
 'cell', 'casing', 'said', 'cell', 'casing', 'comprising', 'first', 'casing', 'element', 'first', 'contact', 
'surface', 'second', 'casing', 'element', 'second', 'contact', 'surface', 'wherein', 'assembled', 'position',
 'first', 'second', 'contact', 'surfaces', 'contact', 'first', 'second', 'casing', 'elements', 'encase',
 'active', 'materials', 'battery', 'cell', 'interior', 'space', 'wherein', 'least', 'one', 'gas', 'tight',
 'seal', 'layer', 'arranged', 'first', 'second', 'contact', 'surfaces', 'seal', 'interior', 'space', 
'characterized', 'one', 'first', 'second', 'contact', 'surfaces', 'comprises', 'electrically', 'insulating',
 'void', 'volume', 'layer', 'first', 'second', 'contact', 'surfaces', 'comprises', 'formable', 'material', 
'layer', 'fills', 'voids', 'surface', 'void', 'volume', 'layer', 'hermetically', 'assembled', 'position',
 'form', 'seal', 'layer'], tags=['8697278-17'])

Build a Language Model

import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
assert gensim.models.doc2vec.FAST_VERSION > -1 #parallelize doc2vec
from gensim.utils import simple_preprocess


class MyDataframeCorpus(object):
    def __init__(self, source_df, text_col, tag_col):
        self.source_df = source_df
        self.text_col = text_col
        self.tag_col = tag_col
      
    def __iter__(self):
        for i, row in self.source_df.iterrows():
            yield TaggedDocument(words=simple_preprocess(row[self.text_col]), 
                                 tags=[row[self.tag_col]])

#Model 1
corpus_for_doc2vec = MyDataframeCorpus(df, 'claim_txt', 'claim_no')

#Model 2
corpus_for_doc2vec = MyDataframeCorpus(df, 'claim_txt', 'litigation')
%%time

import multiprocessing
workers = multiprocessing.cpu_count()  # use all available cores

#Params
model_smple = Doc2Vec(vector_size=100, # 100 should be fine based on the standards
                window=8, #number of context words
                alpha=.025, #initial learning rate
                min_alpha=0.00025, #learning rate drops linearly to this
                min_count=2, #ignores all words with total frequency lower than this
                dm=1, #algorithm: 1=distributed memory (PV-DM), 0=distributed bag of words (PV-DBOW)
                epochs=30,
                workers=workers) #cores to use


model_smple.build_vocab(corpus_for_doc2vec,progress_per=500000)

## Build vocab from streaming tagged document Show progress every 500k rows
## Should take ~ 30 minutes

# Train Language Model
model_smple.train(corpus_for_doc2vec, total_examples=model_smple.corpus_count, epochs=model_smple.epochs)


>>Wall time: 22h 02min 24s
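
Given the ~22-hour wall time, it is worth persisting the trained model; a minimal sketch using gensim's built-in save/load (the file name here is arbitrary):

# Save the trained Doc2Vec model so the 22-hour training run doesn't have to be repeated
model_smple.save('doc2vec_claims_100d.model')  # arbitrary file name

# Reload in a later session
from gensim.models.doc2vec import Doc2Vec
model_smple = Doc2Vec.load('doc2vec_claims_100d.model')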

Model In Action

import numpy as np

def model_infer_test(df,model):
    #grab a random claim from random_claim function
    tagged_claim = next(random_claim(df))
    
    #Infer a vector from that claim
    inferred_vector = model.infer_vector(tagged_claim.words)
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs)) 
    
    #Print top 5 similar claims
    print(sims[:5])

    # Compare and print the most/median/least similar documents from the train corpus
    print('Test Document ({}): «{}»\n'.format(tagged_claim.tags, ' '.join(tagged_claim.words)))
    print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
    for label, index in [('MOST', 0),('Second', 1) ,('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
        
        print(label,index)
        #print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(df[sims[index][0]].words)))
        print(u'%s %s: «%s»\n' % (label, sims[index], df.loc[df['claim_no']==sims[index][0],['claim_txt']].values))

# This will let you get random tagged documents
def random_claim(df):
    doc_index =  np.random.randint(0,len(df))
    #row = [df.iloc[doc_index,:]['claim_txt'],df.iloc[doc_index,:]['claim_no']]
    yield TaggedDocument(words=simple_preprocess(df.iloc[doc_index,:]['claim_txt']), 
                             tags=df.iloc[doc_index,:]['claim_no'])
model_infer_test(df,model_smple)

[('6208475-7', 0.9243185520172119), ('6208475-3', 0.883671224117279), ('6680491-1', 0.6106572151184082),
 ('8913211-14', 0.5911911725997925), ('8913211-1', 0.5858227610588074)]

Test Document (6208475-7): «optical member inspection apparatus obtaining image data used inspections applying
 illumination light optical member one side thereof photographing optical member side comprising holder 
comprising frame shaped base portion plurality spaced optical member holding portions provided base portion
 enclose space supporting optical member formed base portion said optical member holding portions receiving 
face outer margin optical member mounted restriction wall restrict movement optical member mounted receiving
 face contacting outer margin optical member said receiving face contacting outer margin face optical member
 light passes operation optical member said restricting wall located said moving face»


SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d100,n5,w5,mc2,s0.001,t16):

MOST 0

MOST ('6208475-7', 0.9243185520172119): «[['optical-member inspection apparatus obtaining image data used 
inspections applying illumination light optical member one side thereof photographing optical member side,
 comprising:a holder comprising frame shaped base portion plurality spaced optical-member holding portions
 provided base portion enclose space supporting optical member formed base portion, said optical-member 
holding portions receiving face outer margin optical member mounted restriction wall restrict movement 
optical member mounted receiving face contacting outer margin optical member, said receiving face contacting
 outer margin face optical member light passes operation optical member, said restricting wall located said 
moving face.']]»

Second 1

Second ('6208475-3', 0.883671224117279): «[['holder holding outer margin optical member, comprising:a frame
 shaped base portion; anda plurality spaced optical-member holding portions provided base portion enclose 
space supporting optical member formed base portion, said optical-member holding portions receiving face 
outer margin optical member mounted restriction wall restrict movement optical member mounted receiving face
 contacting outer margin optical member, said receiving face contacting outer margin face optical member 
light passes operation optical member, said restricting wall located said receiving face.']]»

MEDIAN 4108015

MEDIAN ('6592902-1', 0.08781707286834717): «[['pharmaceutical composition comprising (a) particulate 
eplerenone d90 particle size 25 400 microns, amount 10 mg 1000 mg, (b) one pharmaceutically acceptable 
carrier materials; said composition controlled release oral dosage form wherein 50% said eplerenone 
dissolved vitro least 1.5 hours 1% sodium dodecyl sulfate solution 37° c.']]»

LEAST 8216030

LEAST ('6823942-1', -0.3702787756919861): «[['tree saver apparatus preventing excess pressure christmas tree
 well, wherein christmas tree comprises tubing, master valve, top valve, second master valve, wing valve, 
wherein apparatus comprises:a. hydraulic system comprising: i. piston connected piston rod; ii. cylinder 
connected piston; iii. upper plunger connected piston; iv. hand wheel connected upper plunger; b. connection
 least 3 outlets connected cylinder; c. frac wing valve connected connection least 3 outlets; d. frac valve
 first port second port, wherein first port connected connection least 3 outlets; e. mandrel first end 
second end, wherein first end connected piston rod; f. landing bowl connected top valve.']]»

Data transformation

 	ICL 	ICC 	litigation 	0 	1 	2 	3 	4 	5 	6 	... 	90 	91 	92 	93 	94 	95 	96 	97 	98 	99
claim_no 																					
8697278-17 	106 	2 	0 	-0.021915 	0.124170 	-0.467477 	1.725972 	1.353391 	-0.197958 	1.234415 	... 	-1.201108 	-0.272672 	0.649778 	-1.023098 	0.146432 	0.122571 	0.215042 	-1.076661 	-0.508896 	-1.690101
8697278-1 	97 	2 	0 	-0.271349 	0.260269 	-0.315466 	1.746054 	0.874254 	-0.225060 	0.971804 	... 	-1.190130 	-0.417508 	0.587534 	-1.157797 	0.314950 	0.189521 	0.474607 	-1.119020 	-0.376173 	-1.699343
7385756-81 	108 	33 	0 	1.429105 	0.936555 	-0.296551 	-1.073354 	-0.421059 	-0.114407 	-0.881726 	... 	0.885844 	0.934247 	0.678008 	-0.752642 	-0.121980 	-0.623260 	-2.384536 	0.471751 	0.874406 	1.251178
7385756-94 	116 	33 	0 	1.027478 	0.484034 	-0.535233 	-1.124252 	-0.382812 	-0.101879 	-0.348340 	... 	0.772912 	0.951951 	0.901475 	-0.721144 	-0.304449 	-0.278908 	-2.655865 	0.323664 	0.943390 	0.687369
7385756-79 	103 	33 	0 	1.396978 	0.419591 	-0.072397 	-0.642861 	-0.482944 	0.496666 	-1.178749 	... 	0.074441 	1.187986 	0.909484 	0.278953 	-0.483475 	-0.028059 	-2.268470 	0.986993 	0.039141 	0.374451
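
One plausible way to build the table above (a sketch, not necessarily the author's exact code, assuming the claim_no column holds the pat_no-claim tags used during training): look up each claim's trained vector by its tag and join it to ICL, ICC, and the litigation flag.

import pandas as pd

# Pull each claim's 100-dimension document vector out of the trained model by tag,
# e.g. '8697278-17'; assumes every claim_no was seen during training.
vectors = pd.DataFrame([model_smple.docvecs[tag] for tag in df['claim_no']],
                       index=df['claim_no'], columns=range(100))

claim_features = df.set_index('claim_no')[['ICL', 'ICC', 'litigation']].join(vectors)
claim_features.to_csv('claims_vectors_litigation.csv')  # consumed by the XGBoost step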

Modeling Techniques

Document Vectorization

Gensim Doc2Vec Language Model

XGBoost Logistic Regression

XGBoost

Model Set-up: Test Train Split (TTS)

### Test Train Split
from sklearn.model_selection import train_test_split
import xgboost as xgb

y = df['litigation']
X = df.drop(['litigation','claim_no'],axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,random_state=42)

#free up memory
del df,y,X

#Class balance test and train
y_test.value_counts(normalize=True)

0    0.996355
1    0.003645
Name: litigation, dtype: float64

y_train.value_counts(normalize=True)
0    0.996285
1    0.003715
Name: litigation, dtype: float64

# Convert input data from numpy to XGBoost format
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

#again to conserve memory
del X_train, X_test, y_train, y_test


## Testing Features against language vectors

# Model ICL_100F
df= pd.read_csv('claims_vectors_litigation.csv',
                usecols=lambda x : x not in ['claim_txt','ICC'])
# Model ICC_100F
df= pd.read_csv('claims_vectors_litigation.csv',
                usecols=lambda x : x not in ['claim_txt','ICL'])
# Model ICL_ICC
df= pd.read_csv('claims_vectors_litigation.csv',
                usecols=['claim_no','ICL','ICC','litigation'])
#Model ICL_ICF_100F
df= pd.read_csv('claims_vectors_litigation.csv',
                usecols=lambda x : x not in ['claim_txt'])
# Model 100F
df= pd.read_csv('claims_vectors_litigation.csv',
                usecols=lambda x : x not in ['ICL','ICC','claim_txt'])


Training the Model

import time

# Specify sufficient boosting iterations to reach a minimum
num_round = 5000

param = {'objective'   : 'binary:logistic', # Specify binary classification
         'eta'         : .5,
         'max_depth'   : 8,
         'predictor'   : 'gpu_predictor',
         #'verbosity'   : 3,
         'tree_method' : 'gpu_hist', # Use GPU accelerated algorithm
         'random_state': 42
        }

gpu_res = {} # Store accuracy result
tmp = time.time()
# Train model
tst_model = xgb.train(param, dtrain, num_round, evals=[(dtest, 'test')], evals_result=gpu_res)
print("GPU Training Time: %s seconds" % (str(time.time() - tmp)))

INITIAL Results

Turning water into wine

model          % non-lit claims   % lit claims   RMSE       % predicted   # predicted
ICC_100F       99.6355            0.3645         0.003138   0.0507        15
ICL_ICC_100F   99.6355            0.3645         0.003142   0.0503        15
ICL_100F       99.6355            0.3645         0.003267   0.0378        11
100F           99.6355            0.3645         0.003283   0.0362        11
ICL_ICC        99.6355            0.3645         0.003490   0.0155        5

What do our results mean?


Politely laugh at the image below

Why Recall?

Test model on Less data!

.. A LOT LESS DATA

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19895884 entries, 0 to 3432364
Data columns (total 12 columns):
step                       int64
action                     object
amount                     float64
nameOrig                   object
oldBalanceOrig             float64
newBalanceOrig             float64
nameDest                   object
oldBalanceDest             float64
newBalanceDest             float64
isFraud                    int64
isFlaggedFraud             int64
isUnauthorizedOverdraft    int64
dtypes: float64(5), int64(4), object(3)
memory usage: 1.9+ GB
Accuracy by admission status
col_0           0     1       All
isFraud                          
0        19887444    54  19887498
1            3763  4623      8386
All      19891207  4677  19895884

 Percentage accuracy
0.9998081512739017

Percentage Fraud caught
0.5512759360839494
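
The "accuracy by admission status" tables in these results read like pandas crosstabs of actual vs. predicted labels with margins; a sketch (the predictions array and column names are assumptions):

import pandas as pd

# Cross-tabulate actual fraud labels against model predictions, with 'All' margins
confusion = pd.crosstab(df['isFraud'], predictions, margins=True)
print(confusion)

accuracy = (predictions == df['isFraud']).mean()        # "Percentage accuracy"
fraud_caught = predictions[df['isFraud'] == 1].mean()   # "Percentage Fraud caught"
print(accuracy, fraud_caught)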

Train a new model

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
import xgboost as xgb
import time

def split_search_pred(df):
    # must change this to check if data is already in correct format.
    # change / reshape data
    df.drop(['isFlaggedFraud','isUnauthorizedOverdraft'], axis=1, inplace=True)
    df.rename(columns={'action':'type','oldBalanceOrig':'oldbalanceOrg','newBalanceOrig':'newbalanceOrig','oldBalanceDest':'oldbalanceDest','newBalanceDest':'newbalanceDest'}, inplace=True)
    
    #encode type
    le = LabelEncoder()
    le.fit(df['type'])
    df['type'] = le.transform(df['type'])
    
    #split into dependent and independent variable(s)
    # T T S assign
    X = df.drop(['isFraud','nameOrig','nameDest'], axis=1)
    Y = df['isFraud']
    
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, stratify=Y, random_state=7351)
    
    # NOTE: 'loss', 'min_samples_split', 'min_weight_fraction_leaf' and 'tol' are
    # GradientBoostingClassifier parameters; XGBClassifier does not use them, so only
    # learning_rate, n_estimators, subsample, max_depth and random_state actually vary the model.
    params = {'loss': ['deviance','exponential'],
              'learning_rate': [.01, 0.1],
              'n_estimators': [800],
              'subsample': [.8],
              'min_samples_split': [2],
              'min_weight_fraction_leaf': [0, .1],
              'max_depth': [2],
              'tol': [0.00001],
              'random_state': [7351]
              }
    
    xgb_model = xgb.XGBClassifier()

    skf = StratifiedKFold(n_splits=10)
    xgb_grid = GridSearchCV(xgb_model, params, cv=skf, scoring='recall', verbose=10)
    
    start_time = time.time()
    
    #Fit the Data
    xgb_grid.fit(X_train, y_train)
    print("--- %s seconds ---" % (time.time() - start_time))
    
    return xgb_grid

Test and evaluate

df1 = pd.read_csv('Run2.csv')
cost_df1 = model_data(df1,xgb_grid_opt)



from sklearn import preprocessing as pp

def model_data(df, model):
    # must change this to check if data is already in correct format.
    # change / reshape data
    df.drop(['isFlaggedFraud','isUnauthorizedOverdraft'], axis=1, inplace=True)
    df.rename(columns={'action':'type','oldBalanceOrig':'oldbalanceOrg','newBalanceOrig':'newbalanceOrig','oldBalanceDest':'oldbalanceDest','newBalanceDest':'newbalanceDest'}, inplace=True)
    
    #encode type
    le = pp.LabelEncoder()
    le.fit(df['type'])
    df['type'] = le.transform(df['type'])
    
    #split into dependent and independent variable(s)
    X = df.drop(['isFraud','nameOrig','nameDest'], axis=1)
    Y = df['isFraud']
    
    print('Starting prediction')
    
    #use the model to predict; model_evaluate is the author's helper (not shown here)
    cost = model_evaluate(model, X, Y)
    cost['nameOrig'] = df['nameOrig']
    cost['nameDest'] = df['nameDest']
    return cost

Test set accuracy:
Percent Type I errors: 0.0002624220326785749
Percent Type II errors: 1.8851297672463193e-05

 Accuracy by admission status
col_0          0     1      All
isFraud                        
0        3339648   877  3340525
1             63  1357     1420
All      3339711  2234  3341945
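
A worked check of the Type I / Type II rates read straight off that crosstab:

total = 3341945           # all transactions in the evaluation set
false_pos = 877           # actual 0, predicted 1
false_neg = 63            # actual 1, predicted 0

type_1 = false_pos / total   # 0.0002624... -> "Percent Type I errors"
type_2 = false_neg / total   # 1.885e-05    -> "Percent Type II errors"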

Evaluation Contd

What do our results mean

Test set accuracy:
Percent Type I errors: 0.0002624220326785749
Percent Type II errors: 1.8851297672463193e-05

 Accuracy by admission status
col_0          0     1      All
isFraud                        
0        3339648   877  3340525
1             63  1357     1420
All      3339711  2234  3341945

Type 1 and Type 2 Errors represent waste

 

Represents the justification for keeping your job

Projected fraud department "case-load"

What do we do now?


cost_df1['isRefund'] = 0 #comment this out later, just to reset the refund column inbetween runs
cheat = cost_df1[((cost_df1['type']==1) | (cost_df1['type']==4)) & (cost_df1['fraud']==1)]
cheat.reset_index(inplace=True)
cheat.head(12)



Calculating the Refund

def adj_confmat(df):

    for i in range(len(df.index)):
        if i+1==(len(df.index)):
            break
        row = df.iloc[i,:]
        # look for transfer transactions correctly predicted as fraud
        if (row['type']==4) and (row['fraud']==1) and (row['prediction']==1):
            # flag the transfer as being refunded
            df.at[i,'isRefund']=1
            #look at subsequent cash_out if it was "not caught" assume that it was because the transfer was caught
            #ensure that the amounts are the same "not a realistic assumption insitu"
            if (df.iloc[i+1,:]['type']==1) and (df.iloc[i+1,:]['fraud']==1) and (df.iloc[i+1,:]['prediction']==0) and (row['amount']==df.iloc[i+1,:]['amount']):
                df.at[i+1,'prediction'] = 1
        # look for cash_outs that were predicted, but the previous transfer was not caught
        if (row['type']==1) and (row['fraud']==1) and (row['prediction']==1):
            #look at previous transfer and set its prediction to 1, since catching the cash_out implies the transfer was caught
            if (df.iloc[i-1,:]['type']==4) and (df.iloc[i-1,:]['fraud']==1) and (df.iloc[i-1,:]['prediction']==0) and (row['amount']==df.iloc[i-1,:]['amount']):
                df.at[i-1,'prediction']=1
                #print(df.iloc[i-1,:])
                df.at[i,'isRefund']=1
                #print("we found 1")
    return df

test = adj_confmat(cheat)

 Accuracy by admission status
col_0          0     1      All
isFraud                        
0        3339648   877  3340525
1             18  1402     1420
All      3339666  2279  3341945


test[test['isRefund']==1]['amount'].sum()

Dollars Refunded to Customers
$2,851,462,501.44

Percentage of Fraud Caught

98.73%

Decision Threshold

ARE WE DONE NOW? Is the answer ever yes?



df1_adj_eval = adj_confmat(df1_fraud)

 Accuracy by admission status
col_0          0     1      All
isFraud                        
0        3339648   877  3340525
1             18  1402     1420
All      3339666  2279  3341945


df1_adj_eval[df1_adj_eval['isRefund']==1]['amount'].sum()

Dollars Refunded to Customers
$2,851,462,501.44

Percentage of Fraud Caught

98.73%
    pred_pos  pred_neg
pos   3340300       225
neg       102      1318

Adjusted for refunded transactions
Accuracy by admission status
col_0          0     1      All
isFraud                        
0        3340300   225  3340525
1             24  1396     1420
All      3340324  1621  3341945

# Dollars Refunded
$2,849,545,091.61

#Cost Difference between 
#Decision Thresh = .75 and .5

$1,917,409.83

# Predicted Case Load difference

~ 610 Fewer cases
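
The 0.75 vs. 0.5 comparison above comes from re-thresholding the predicted fraud probabilities rather than retraining; a sketch with assumed variable names:

# Probability of the fraud class from the fitted grid-search model
probs = xgb_grid_opt.predict_proba(X)[:, 1]

pred_050 = (probs >= 0.50).astype(int)   # default decision threshold
pred_075 = (probs >= 0.75).astype(int)   # stricter threshold

# A stricter threshold flags fewer transactions: a smaller projected fraud-department
# case load, traded against a small drop in dollars refunded.
print('case-load difference:', int(pred_050.sum() - pred_075.sum()))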

Decision Threshold

With Refund

Final words

Lessons learned

pip install joblib

from joblib import dump, load
xgb_grid_opt = load('grid_opt_df1')
# T T S assign
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.25, stratify=Y, random_state=7351)


....


skf = StratifiedKFold(n_splits=10)
grid = GridSearchCV(clf, params, cv=skf,scoring='recall',verbose=10)

Future work / Sources

Thank you! :)

How does it work?

This is a stock image to look at while I talk at you.
